CONFORMALIZED POST-SELECTION INFERENCE AND STRUCTURED PREDICTION

by Ziyi Liang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (APPLIED MATHEMATICS)

May 2024

Copyright 2024 Ziyi Liang

Table of Contents

List of Tables
List of Figures
Abstract

Chapter 1: Introduction
1.1 Motivation and thesis outline

Chapter 2: Out-of-distribution testing with model selection
2.1 Introduction
2.2 Integrative conformal p-values: generic strategies and data-driven algorithms
2.2.1 Integrative conformal p-values with data-driven weighting
2.2.2 Integrative conformal p-values with automatic model selection and tuning
2.2.3 Integrative p-values with transductive cross-validation+
2.3 FDR control via conditional calibration
2.4 Asymptotic power analysis
2.5 Numerical experiments with synthetic data
2.6 Numerical experiments with real data

Chapter 3: Conformal inference for neural networks trained with early stopping
3.1 Introduction
3.2 Conformalized early stopping
3.2.1 Existing benchmarks
3.2.2 CES for Outlier Detection
3.2.3 CES for Classification
3.2.4 CES for Regression
3.2.5 Implementation Details and Computational Cost
3.3 Numerical Experiments with real data

Chapter 4: Structured multiple testing
4.1 Introduction
4.2 Locally Adaptive Structure Learning Algorithms
4.2.1 Oracle FDR procedure
4.2.2 A data-driven procedure
4.2.3 Oracle-assisted weights
4.2.4 Oracle-assisted weights vs sparsity-adaptive weights
4.2.5 The LASLA procedure
4.3 Theoretical Analysis
4.3.1 Asymptotic validity of the LASLA procedure
4.3.2 Asymptotic power analysis
4.4 Numerical experiments with synthetic data
4.5 Numerical experiments with real data

Chapter 5: Conclusions

Bibliography

Appendix A: Appendix to Chapter 2
A1 FDR control with TCV+ integrative conformal p-values
A2 Analysis of computational complexity
A3 TCV+ beyond conformal out-of-distribution testing
A4 Asymptotic power analysis
A4.1 Simplified analysis under independence and a mixture working model
A4.2 The discrete beta distribution
A4.3 A heuristic rule of thumb for deciding when to weight?
A4.4 Method of moments estimation of ν in Section A4.3
A5 Additional numerical experiments with synthetic data
A5.1 Pairwise correlation between integrative conformal p-values
A5.2 Validation of the asymptotic power analysis
A6 Mathematical proofs
A7 Supplementary figures

Appendix B: Appendix to Chapter 3
B1 Naive Early Stopping Benchmarks
B2 Classification with Marginal Coverage
B3 Efficient Computation of the Lower Envelope
B4 Review on related literature
B4.1 Avoiding empty predictions
B4.2 Implementation of the Naive Benchmark
B5 Mathematical Proofs

Appendix C: Appendix to Chapter 4
C1 Additional applications
C2 Forming local neighborhoods: illustrations
C3 Details on sparsity-adaptive weights
C4 Alternative weight construction
C5 Implementation Details and Additional Numerical Results
C6 Proof of Main Results
C7 Asymptotic theories under weak dependence
C7.1 Oracle weight
C7.2 Consistency of the weight estimator
C7.3 FDP control under weak dependence
C8 Proof of the theoretical results under dependency

List of Tables

4.1 Number of rejections by different methods
4.2 Number of rejections by Bonferroni Correction
A1 Computational costs of computing conformal p-values using different methods. These results are derived in Sections A2–A2.

List of Figures

2.1 Performance of conformal p-values for out-of-distribution testing with image data, in terms of power under false discovery rate control within a multiple testing setting. The results are shown as a function of the number of labeled inliers and outliers. The oracles are impractical but useful to approximate the power of inductive conformal p-values based on the optimal one-class or binary classifier from a rich machine learning toolbox. The integrative p-values often perform almost as well as those of the best oracle, and can sometimes outperform both of them.
2.2 Schematic representation of Algorithm 2, which computes integrative conformal p-values with automatic model selection and tuning. The dashed lines represent the flow of information utilized for the evaluation of the machine-learning models, but not for training. The dotted lines indicate the flow of information utilized for model selection and tuning.
2.3 Performance of Storey's BH with conformal p-values based on different methods, on simulated data. The results are shown as a function of the sample size, for calibration data with different proportions of inliers. All methods control the FDR below 10%.
2.4 Performance of integrative conformal p-values and naive benchmarks on simulated data, as a function of the number of machine learning models considered. The performance is measured in terms of true positive rate (TPR) and false positive rate (FPR). The horizontal dashed line corresponds to the nominal 10% FPR level. Other details are as in Figure 2.3.
2.5 Performance on simulated data of Storey's BH applied to integrative p-values computed with either TCV+ or sample splitting. To serve as benchmarks, inductive conformal p-values based on one-class or binary classification models tuned by an ideal oracle are computed using cross-validation+. All methods control the FDR below 10%. Other details are as in Figure 2.3.
2.6 Performance on simulated data of different methods for FDR control with integrative conformal p-values, as a function of the sample size. Other details are as in Figure 2.3.
2.7 Power of Storey's BH with conformal p-values based on different machine-learning models, on animal image classification data. Integrative p-values are compared to inductive p-values based on six different binary classification models. Other details are as in Figure 2.1.
3.1 Schematic visualization of rigorous conformal inferences for models trained with early stopping. (a) Conventional pipeline requiring a three-way sample split. (b) Conformalized early stopping, requiring only a two-way split.
3.2 Schematic visualization of heuristic conformal inferences based on a naive benchmark that utilizes the same hold-out data twice.
3.3 Average performance, as a function of the sample size, of conformal inferences based on neural networks trained and calibrated with different methods, on the bio regression data [95]. Ideally, the coverage of the conformal prediction intervals should be close to 90% and their width should be small. All methods shown here guarantee 90% marginal coverage.
3.4 Squared-error loss on test-augmented hold-out data for three alternative regression models M1, M2 and M3, as a function of the place-holder outcome y for the test point. The CES method utilizes the best model for each possible value of y, which is identified by the lower envelope of these three parabolas. In this case, the lower envelope has two finite knots at k1 and k3.
3.5 Pinball loss functions on test-augmented hold-out data for three alternative regression models, M1, M2 and M3, as a function of the place-holder outcome y for the test point. The CES method utilizes the best model for each possible value of y, which is identified by the lower envelope of these three pinball loss functions. In this case, the lower envelope has a single finite knot at k2.
3.6 Average performance, as a function of the sample size, of conformal inferences for outlier detection based on neural networks trained and calibrated with different methods, on the CIFAR10 data [67]. Ideally, the TPR should be as large as possible while maintaining the FPR below 0.1.
3.7 Average performance, as a function of the sample size, of conformal prediction sets for multi-class classification based on neural networks trained and calibrated with different methods, on the CIFAR10 data [67]. Ideally, the coverage should be close to 90% and the cardinality should be small.
3.8 Average performance, as a function of the sample size, of conformal prediction sets for quantile regression based on neural networks trained and calibrated with different methods, on the homes data [55]. The marginal coverage is theoretically guaranteed to be above 90%. Ideally, the conditional coverage should be high and the prediction intervals should be tight.
4.1 Auxiliary linkage data provide structural information that can be utilized to identify significant SNPs.
4.2 Empirical FDR and power comparison for oracle LASLA, LAWS and PV (unweighted p-value) with significance level α = 0.05. (a): Example 1; (b): Example 2.
4.3 Empirical FDR and power comparison for data-driven LASLA and BH. (a): Setting 1; (b): Setting 2.
4.4 Sub-network identified by LASLA (all nodes). BH only detects the blue nodes.
A1 Empirical demonstration of the need to adaptively tune the sign of conformity scores calculated by one-class classifiers, as discussed in Section 2.2.2. Left: the inlier data are generated from the Gaussian mixture model described in Section 2.5, with p = 100 and a = 2, while the outlier data are generated with a = 1. Right: the data are generated from the same model as on the left, swapping the inlier/outlier labels. Top: first two principal components of the data. Bottom: distributions of log(ŝ0(Xi)) for inlier and outlier data points Xi.
A2 Schematic representation of Algorithm 3, computing integrative conformal p-values via TCV+ instead of data splitting. Other details are as in Figure 2.2.
A3 Pairwise correlation between integrative conformal p-values for independent test points, on synthetic data from two different distributions. The results are shown as a function of the size of the shared calibration data set. The dotted line indicates the theoretical pairwise correlation between standard one-class classification (OCC) conformal p-values.
A4 Schematic representation of the conditional FDR calibration method with integrative p-values described in Section 2.3. The operations in the green nodes are not computationally expensive because they utilize pre-computed conformity scores. The dashed lines represent the flow of information utilized for the evaluation of the machine-learning models, but not for training.
A5 Schematic representation of the conditional FDR calibration method with integrative TCV+ p-values described in Section A1. The red nodes are computationally expensive. The dashed lines represent the flow of information utilized for the evaluation of the machine-learning models, but not for training.
A6 Performance of conformal p-values computed using different methods, in the experiments with simulated data of Figure 2.3. The results are shown as a function of the sample size, for calibration data with different proportions of inliers. Top: power obtained with Storey's BH procedure. Center: FDR; the dashed line corresponds to the nominal 10% level. Bottom: "informativeness" ratio Ξ defined in (2.10) and its empirical estimate Ξ̂ calculated using (A16). These results are consistent with the theoretical analysis of Section 2.4, which predicts that the weighting approach of integrative conformal p-values should lead to higher power if Ξ ≥ 1, even though the technical assumptions of Section 2.4 may not hold exactly. Further, these results show that the empirical estimate Ξ̂ of Ξ described in Section A4.3 is quite accurate as long as the outliers are exchangeable and the sample size is sufficiently large.
A7 Performance of Storey's BH applied to conformal p-values computed with different methods, on simulated data. The results are shown as a function of the fraction of outliers in the labeled data, for different sample sizes. Other details are as in Figure 2.3.
A8 Performance of conformal p-values computed with different methods, on simulated data. The performance is measured in terms of true positive rate (TPR) and false positive rate (FPR). The horizontal dashed line corresponds to the nominal 10% FPR level. Other details are as in Figure 2.3.
A9 Performance of conformal p-values computed with different methods, on simulated data. The performance is measured in terms of true positive rate (TPR) and false positive rate (FPR). The horizontal dashed line corresponds to the nominal 10% FPR level. Other details are as in Figure A7.
A10 QQ-plot comparing the empirical distribution of conformal p-values, computed with different methods on synthetic data, to the theoretical uniform distribution, separately for true inliers (left) and outliers (right). In the underlying experiments, the labeled sample size is 500 and the proportion of data points which are outliers is 50%. Other details are as in Figure 2.3.
A11 Performance of Storey's BH with conformal p-values based on different methods, on simulated data with non-exchangeable outliers. The results are shown as a function of the sample size, for data sets experiencing different amounts of distribution shift between the labeled and test outliers. The proportion of inliers in the labeled data is 50%. If we denote the outlier shift parameter as δa, the outliers in the test set are generated using the same model described at the beginning of Section 2.5 with the signal amplitude a replaced by (1 − δa)a. Note that the case with δa = 0 matches the results shown in Figure 2.3. Values of δa > 0 intrinsically make the problem easier because they result in test outliers that are more easily recognizable from the inliers, while δa < 0 makes the problem harder. Other details are as in Figure 2.3.
A12 Performance of Storey's BH applied to conformal p-values computed with different methods, on simulated data. The data are simulated from a Gaussian mixture model as in Section 2.5, but with parameter a = 1.25 instead of a = 0.7. Other details are as in Figure 2.3.
A13 Performance of Storey's BH applied to conformal p-values based on different underlying machine-learning models, on simulated data. Integrative conformal p-values without weighting (ensemble method) are compared to standard conformal p-values based on two different one-class classification models (IF and LOF), as well as to those based on the most powerful one-class classification model selected by an imaginary oracle. Other details are as in Figure A12.
A14 Performance of Storey's BH applied to conformal p-values computed with different methods, on simulated data. The data are simulated from a logistic regression model with random parameters. Other details are as in Figure 2.3.
A15 Performance of conformal p-values computed with different methods, on simulated data. The performance is measured in terms of true positive rate (TPR) and false positive rate (FPR). The data are simulated from a logistic regression model with random parameters. Other details are as in Figure A8.
A16 Performance on simulated data of Storey's BH applied to integrative conformal p-values computed with either TCV+ or sample splitting. To serve as benchmarks, standard conformal p-values based on one-class or binary classification models tuned by an ideal oracle are computed using cross-validation+. Other details are as in Figure 2.3. Other results are as in Figure 2.5.
A17 Performance on simulated data of integrative conformal p-values computed with either TCV+ or sample splitting. The performance is measured in terms of true positive rate (TPR) and false positive rate (FPR). The horizontal dashed line corresponds to the nominal 10% FPR level. Other details are as in Figure 2.5.
A18 Performance on simulated data of Storey's BH applied to integrative conformal p-values computed with either TCV+ or sample splitting. The data are simulated from a logistic regression model with random parameters. Other details are as in Figure 2.5.
A19 Performance on simulated data of integrative conformal p-values computed with either TCV+ or sample splitting. The performance is measured in terms of true positive rate (TPR) and false positive rate (FPR). The horizontal dashed line corresponds to the nominal 10% FPR level. The data are simulated from a logistic regression model with random parameters. Other details are as in Figure A18.
A20 Empirical validation, using synthetic data with exchangeable outliers, of the power analysis for integrative p-values discussed in Section 2.4. Top: power of BH applied to weighted and unweighted integrative conformal p-values, as a function of the scale parameter of the underlying support vector machine. Bottom: "informativeness" ratio Ξ defined in (2.10) and its empirical estimate Ξ̂_exch calculated using (A16). These results are consistent with the theoretical analysis of Section 2.4, which predicts that the weighting approach of integrative conformal p-values should lead to higher power if Ξ ≥ 1, even though the technical assumptions of Section 2.4 may not hold exactly.
A21 Empirical validation, using synthetic data with non-exchangeable outliers, of the power analysis for integrative p-values discussed in Section 2.4. The data are generated from the same model utilized in Figure A11. The simulations corresponding to zero outlier shift are based on exchangeable outlier data, as in Figure A20. Top: power and FDR of BH applied to weighted and unweighted integrative conformal p-values, as a function of the scale parameter of the underlying support vector machine. Bottom: "informativeness" ratio Ξ defined in (2.10) and its empirical estimate Ξ̂ calculated either using (A16)—assuming outlier exchangeability—or using (A17)—without assuming exchangeability. These results are consistent with the theoretical analysis of Section 2.4, which predicts that the weighting approach of integrative conformal p-values should lead to higher power if Ξ ≥ 1, even though the technical assumptions of Section 2.4 may not hold exactly. Note that the values of Ξ̂ estimated using (A17) are not shown in the panels corresponding to negative outlier shift because the method-of-moments estimator ν̂ described in Section A4.4 does not converge to a non-trivial solution in those cases.
A22 Performance of conformal p-values for out-of-distribution testing with animal image data, for the same experiments shown in Figure 2.1. Top: power and FDR of BH applied to different types of conformal p-values. Bottom: "informativeness" ratio Ξ defined in (2.10) and its empirical estimate Ξ̂ calculated using (A16). These results are not inconsistent with the theoretical analysis of Section 2.4, because the latter predicts that the weighting approach of integrative conformal p-values should lead to higher power if Ξ ≥ 1, but does not imply the converse. Interestingly, the left-hand-side panels show that weighting can sometimes lead to higher power even when Ξ < 1. Note that the estimator Ξ̂ of Ξ is not accurate if the number of labeled outliers is very small.
A23 Power of Storey's BH with conformal p-values based on different machine-learning models, on animal image classification data. Integrative p-values are compared to standard conformal p-values based on six different one-class classification models. Other details are as in Figure 2.7. Top: power and FDR of BH applied to integrative and standard conformal p-values, as a function of the scale parameter of the underlying support vector machine. Bottom: estimated "informativeness" ratio Ξ̂ for the p-values calculated based on (A17). Values of Ξ̂ ≥ 1 anticipate that integrative p-values may be more powerful.
A24 Power of conformal p-values under FDR control with image data, as a function of the distribution shift between the labeled and test outliers. The number of inliers in the labeled data is 1000. (a): If the outlier shift is 1, the species of all labeled outliers belongs to S = {chimpanzee, orangutan}. (b): If the outlier shift is 1, the species of all labeled outliers belongs to S = {lynx, cheetah, jaguar, cat}. Other details are as in Figure 2.1, which corresponds to no outlier shift.
A25 Performance of conformal p-values for out-of-distribution testing with image data, as a function of the distribution shift between the outliers in the labeled data and those in the test data. The number of inliers in the labeled data is 7824. (a): If the outlier shift is 1, the species of all labeled outliers belongs to S = {chimpanzee, orangutan}. (b): If the outlier shift is 1, the species of all labeled outliers belongs to S = {lynx, cheetah, jaguar, cat}. Other details are as in Figure A24. Note that the results in (b) include an additional method in the comparison: integrative conformal p-values based on a single one-class classifier (support vector machine with sigmoid kernel) instead of an ensemble of different one-class and binary classifiers. This helps differentiate the relative sensitivity to outlier shift of two separate components of our method: data-driven p-value weighting and automatic model selection.
A26 Performance of Storey's BH applied to conformal p-values based on different machine-learning models, on flower classification data. Other details are as in Figure 2.1.
A27 Performance of Storey's BH applied to conformal p-values based on different underlying machine-learning models, on flower image classification data. Integrative conformal p-values are compared to standard conformal p-values based on six different one-class classification models (a) and six different binary classification models (b). Other details are as in Figure A26.
A28 Performance on flower classification data of Storey's BH applied to integrative conformal p-values computed with either TCV+ or sample splitting. To serve as benchmarks, standard conformal p-values based on one-class or binary classification models tuned by an ideal oracle are computed using cross-validation+. Other details are as in Figure A26.
A29 Performance of Storey's BH applied to conformal p-values based on different underlying machine-learning models, on car image classification data. Other details are as in Figure A26.
A30 Performance of Storey's BH with conformal p-values based on different machine-learning models, on two medical diagnostics data sets. Other details are as in Figure 2.1.
B1 Numerical comparison of different theoretical lower bounds for the marginal coverage of conformal prediction sets computed with a naive early stopping benchmark (e.g., Algorithm B2). Left: lower bounds for the marginal coverage as a function of the number of candidate models T, when α = 0.1 and n_es-cal = 8000. Right: lower bounds for the marginal coverage as a function of the number of hold-out data points, n_es-cal, when α = 0.1 and T = 100. Higher values correspond to tighter bounds.
B2 Numerical comparison of different theoretical lower bounds for the marginal coverage of conformal prediction sets computed with a naive early stopping benchmark (e.g., Algorithm B2), as a function of the nominal significance level α. Left: lower bounds for the marginal coverage as a function of α, when T = 1000 and n_es-cal = 1000. Right: theoretically corrected significance level needed to achieve the marginal coverage guarantees expected at the nominal α level, as a function of α when T = 1000 and n_es-cal = 1000. The dashed grey lines indicate the ideal values corresponding to standard conformal inferences based on calibration data that are independent of those used for early stopping. Higher values correspond to tighter bounds.
B3 Visualization of the pinball loss function defined in (B12).
C1 Empirical FDR and power comparison. (a): Regression Setting 1; (b): Regression Setting 2.
C2 Empirical FDR and power comparison. (a): Latent variable Setting 1; (b): Latent variable Setting 2.
C3 Empirical FDR and power comparison. (a): Multiple auxiliary samples Setting 1; (b): Multiple auxiliary samples Setting 2.
Abstract

This thesis explores the development of computationally efficient algorithms to extract powerful and reliable inferences from large-scale datasets, focusing on conformalized post-selection inference and structured multiple testing. Conformalized post-selection inference utilizes conformal inference techniques to address the challenge of providing valid inferences after model or parameter selection. The problem is especially challenging when the same hold-out dataset is used for both selection and inferential purposes, due to potential selection biases. I present innovative conformalized methods that address this selection-bias dilemma with provable finite-sample guarantees in the outlier detection, multi-class classification, and regression settings. In the area of structured multiple testing, I present a weighted hypothesis testing framework that tackles the dual challenges of leveraging side information to enhance test power and controlling error rates amidst data dependencies. The presented framework effectively utilizes generic side information, with particular emphasis on applications to genome-wide association studies.

Chapter 1
Introduction

1.1 Motivation and thesis outline

The field of machine learning has witnessed remarkable growth in recent decades, with its applications spreading across various domains. This expansion, however, has raised concerns about the reliability and explainability of complex machine learning models. In response to these challenges, my research is centered on extracting meaningful and dependable inferences from large-scale datasets, contributing to the ongoing discussion on trustworthy AI. More specifically, I am interested in the development of computationally efficient algorithms that harness the potential of powerful machine learning models while providing robust statistical guarantees. This overarching interest is delineated into two broad areas: conformalized post-selection inference and structured multiple testing. These two research areas form the main body of the thesis.

Conformalized post-selection inference

Post-selection inference (PSI) generally refers to providing valid inference after model or parameter selection [17, 38, 69]. Classic examples include selecting the covariates for significance testing in a regression model [106, 71]. The challenge in post-selection inference arises when the same dataset is utilized for both the selection process and the inference procedure, leading to potential issues like "data snooping" or selection bias, which often render the inference results invalid. Conformalized post-selection inference refers to a class of methods utilizing conformal inference techniques to provide robust statistical guarantees for the PSI problem [140, 62]. Chapter 2 is based on my paper Integrative conformal p-values for out-of-distribution testing with labeled outliers [82], which proposes an innovative method for outlier detection and multi-class classification that efficiently performs model selection and conformal calibration using the same hold-out samples, thus obtaining rigorous predictive inferences at no additional data cost compared to the standard approach. Further, my paper Conformal inference is (almost) free for neural networks trained with early stopping [83], detailed in Chapter 3, extends these ideas to the regression setting, providing finite-sample guarantees for deep neural network models trained with early stopping [98].
Structured multiple testing

Multiple testing, a classical field of statistics with an abundant literature, faces two important challenges: leveraging side information to boost testing power and controlling error rates (e.g., the FDR) under data dependencies. While pioneering methods such as the BH procedure handle independent test units without side information, later advancements in structured multiple testing primarily focus on incorporating side information U ∈ R^m that has the same dimension as the primary statistics T ∈ R^m [59, 7, 21]. However, the exploration of more general side information, such as matrix-valued side information or multiple auxiliary sequences, remains limited. Chapter 4 is based on my paper Locally Adaptive Algorithms for Multiple Testing with Network Structure, with Application to Genome-Wide Association Studies [81], which addresses these gaps by developing an innovative multiple testing procedure capable of utilizing generic side information within an effective and principled framework.

Chapter 2
Out-of-distribution testing with model selection

2.1 Introduction

This chapter considers the problem of testing whether new observations follow the same unknown distribution as a reference data set [46], allowing for high-dimensional features and avoiding parametric assumptions. This problem—known as novelty detection, testing for outliers, or out-of-distribution testing—has a range of applications in medical diagnostics, security monitoring, fraud detection, and natural language processing [96]. Modern machine learning algorithms, including random forests [32], deep neural networks [89], nearest neighbor classifiers [51], and support vector machines [27], have become increasingly popular for performing this task, due to their flexibility and high power in diverse settings, but they lack rigorous uncertainty quantification.

Conformal prediction can produce rigorous inferences in finite samples for the output of any machine learning model, assuming only that the data are exchangeable [136, 135]. In the context of out-of-distribution testing, the conformal p-value [136, 10] can be constructed to test the null hypothesis that the new data point is an inlier. The standard (inductive) procedure relies on data splitting; it trains a one-class classifier [63] on a random subset of the reference inliers, and then calculates the p-value based on the relative rank of one-class conformity scores among the hold-out inliers. Although [10] assumed that only inlier data are available, the idea can be easily extended to include labeled outliers. For example, a plausible-seeming solution is to replace the one-class model with a binary classifier trained on all outliers and a subset of inliers [137, 50], and then proceed as above with binary conformity scores. However, as we will explain next, both one-class and binary classification strategies can be suboptimal.

Labeled outlier data: to use or not to use?

Labeled outliers are often available and may contain valuable information. For example, financial institutions likely have numerous records of confirmed illicit transactions. Even if fraudulent activities are relatively rare, many of them may share similar patterns that could help flag future suspicious transactions. While labeled outliers may be informative, the straightforward solution of computing conformal p-values based on binary classification scores can be inferior to the one-class approach.
The issue is that off-the-shelf one-class classifiers can be more effective than binary classifiers at separating outliers, especially if the data are high-dimensional and the classes are imbalanced. We visualize this phenomenon in Figure 2.1, with detailed explanations postponed to Section 2.5. The main purposes of this illustration are (i) to highlight some limitations of binary classifiers, and (ii) to reveal that these issues cannot be easily mitigated by switching to alternative models.

[Figure 2.1 here: two panels ("Inliers: 1000" and "Inliers: 7824") plotting power against the number of labeled outliers for the Integrative, OCC (oracle), and Binary (oracle) methods.]

Figure 2.1: Performance of conformal p-values for out-of-distribution testing with image data, in terms of power under false discovery rate control within a multiple testing setting. The results are shown as a function of the number of labeled inliers and outliers. The oracles are impractical but useful to approximate the power of inductive conformal p-values based on the optimal one-class or binary classifier from a rich machine learning toolbox. The integrative p-values often perform almost as well as those of the best oracle, and can sometimes outperform both of them.

Figure 2.1 considers two imaginary oracles respectively built upon two different families of classifiers: one-class classification (OCC) and binary classification (Binary) models. The oracles are given the unfair advantage of knowing the ground truth and can select the model that maximizes power separately in each setting, picking from a rich collection of one-class or binary classification algorithms. While such oracles are impractical, they provide useful benchmarks regarding the highest power achievable by inductive conformal p-values at a fixed type-I error rate. We can see from Figure 2.1 that the relative strength of OCC vs Binary critically depends on the degree of imbalance in the labeled data: the left panel illustrates that binary classifiers tend to work well if the outliers are abundant, whereas the right panel indicates that OCC is a clear winner when the inliers are dominant.

This dilemma—if and how to use labeled outliers—is effectively resolved by our proposal of a novel integrative method for calculating conformal p-values. Our solution automatically combines the individual strengths of the OCC and Binary approaches, leveraging side information in a principled way through suitably calibrated adaptive p-value weights. This method has robust power performance across a wide range of settings without requiring any unrealistic oracle knowledge, and it can sometimes even simultaneously outperform both the OCC and Binary methods.

Outline of this chapter

Section 2.2 describes the main methodology. It begins by defining useful notation and reviewing the relevant conformal inference background. Section 2.2.1 introduces our integrative method while focusing on leveraging a pair of one-class classifiers separately trained on labeled inliers and outliers. We establish the finite-sample validity of our approach by developing an innovative conformal calibration technique that goes beyond the inductive recipe based on fixed conformity score functions [135, 10]. Section 2.2.2 extends the integrative method to leverage an entire machine learning toolbox of one-class and binary classifiers, automatically selecting the model that approximately maximizes the power. Section 2.2.3 presents a more expensive but also more powerful version of the algorithm based on cross-validation.
Section 2.3 studies how to control the false discovery rate (FDR) [14] in multiple testing based on integrative p-values. This is not straightforward due to the correlation structure between integrative p-values, which prevents us from applying the theories of [16] and [10] about the validity of classical FDR procedures. We overcome this challenge by adopting the conditional FDR calibration strategy of [37]. Section 2.4 contains a theoretical power analysis supporting our observations that integrative p-values often lead to higher power compared to inductive p-values. Sections 2.5 and 2.6 explore the empirical performance of integrative p-values through simulations with synthetic and real data. The Supplementary Material (Appendix A) contains all technical proofs and additional details.

Related work

This chapter builds upon conformal p-values [136, 135, 70, 10]. Our main innovations are to leverage in a principled way the information contained in labeled outlier data, and to improve power by gathering strength from a toolbox of classification models. Our work is inspired by prior research on weighted hypothesis testing [15, 43, 105], but it departs from that framework as it accommodates weighting based on non-independent side information. Further, our solution differs from adaptive sequential testing with side information [79, 76] as it can produce valid inferences for one test point at a time. Two versions of our method are developed: one based on sample splitting [72] and one on cross-validation [5, 104]. We adapt the latter approach to address the novel challenges arising from our data-driven weighting scheme, re-purposing relevant ideas from transductive (or full) conformal inference [134]. The latter is a computationally expensive approach in general but becomes more attractive in our context.

Conformal inference was developed to calibrate a pre-trained model [135], but related techniques have also been recently utilized to improve the underlying training algorithms [28, 35]. We similarly move beyond relying on a fixed model, but we study how to perform model selection and tuning with exact guarantees as opposed to training black-box classifiers. Several works have utilized model ensembles to improve conformal predictors [85, 11, 84, 64, 49, 36] or out-of-distribution tests [139]. However, those focused on aggregating the output of simpler models and evaluated conformity scores based on out-of-bag predictions, while we aim to select and tune the best performing model from any toolbox. Other works have strengthened the standard marginal validity guarantees for conformal inferences, conditioning on the hold-out calibration data [133, 10] or on some features of the test point [40, 102]. Their results could in the future be combined with our method.

Finally, there exist other statistical frameworks for obtaining rigorous inferences from the output of complex machine learning models in the context of testing for outliers, including in particular the Neyman–Pearson classification paradigm [100, 132, 131, 143]. However, the latter takes a different perspective compared to conformal inference as it focuses on controlling both type-I and type-II errors, and it is typically restricted to specific algorithms [100], assumes large sample sizes [132], or requires other assumptions such as low data dimensions [131] or feature independence [143].
2.2 Integrative conformal p-values: generic strategies and data-driven algorithms

Setup and review of inductive conformal p-values

Consider n + 1 data points, (X_i, Y_i) for i ∈ [n + 1] = {1, ..., n + 1}, where X_i ∈ R^d is a feature vector for the i-th sample with label Y_i ∈ {0, 1}. The samples with Y_i = 0 (respectively, Y_i = 1) are referred to as inliers (respectively, outliers). We assume the inliers in D_0 = {i ∈ [n + 1] : Y_i = 0} are randomly sampled from some unknown distribution P_{X|Y=0}, exchangeably conditional on the outliers. However, no assumptions are made about the data in D_1 = {i ∈ [n] : Y_i = 1}, which may even include some mislabeled inliers with true Y_i = 0. Having observed (X_i, Y_i) for all i ∈ [n] as well as X_{n+1}, the goal is to test the null hypothesis H_0 : Y_{n+1} = 0; i.e., whether X_{n+1} is an inlier sampled exchangeably with those in D_0.

The conformal p-value, which promises rigorous finite-sample error rate control under the exchangeability assumption, is a fundamental concept for hypothesis testing based on complex machine learning algorithms. Assuming the n labeled samples contain only inliers, the standard sample-splitting approach randomly partitions D_0 into two disjoint subsets: D_0^train and D_0^cal. The data in D_0^train are utilized to train a one-class classifier [96], learning a conformity score function ŝ_0 : R^d → R. By convention, a smaller value of ŝ_0(X) provides evidence that X may be an outlier. For simplicity, we assume throughout the chapter that all conformity scores have no ties, which can always be achieved in practice by adding a small amount of independent noise. The conformal p-value û_0(X_{n+1}) for testing H_0 is defined as the normalized rank of ŝ_0(X_{n+1}) among the scores ŝ_0(X_i) for the hold-out samples indexed by i ∈ D_0^cal; that is,

\[
\hat{u}_0(X_{n+1}) = \frac{1 + |\{ i \in \mathcal{D}_0^{\mathrm{cal}} : \hat{s}_0(X_i) \le \hat{s}_0(X_{n+1}) \}|}{1 + |\mathcal{D}_0^{\mathrm{cal}}|}. \tag{2.1}
\]

Proposition 1 ([137]). Suppose X_{n+1} is an inlier that is exchangeable with the inliers in D_0^cal conditional on D_0^train and D_1. Then, for all α ∈ (0, 1), P[û_0(X_{n+1}) ≤ α | Y_{n+1} = 0] ≤ α.

The question is: if some labeled outliers are available, how should they be utilized to increase power? An intuitive solution is to include them in the training set, which thus becomes D_0^train ∪ D_1, and fit a binary classification model instead of a one-class classifier. Conformal p-values can then be constructed by applying (2.1) with conformity scores obtained from the binary model, but this approach does not always work well, as previewed in Figure 2.1. This motivates the new integrative method presented below.

2.2.1 Integrative conformal p-values with data-driven weighting

This section introduces a first version of our integrative method based on two one-class classifiers separately trained on the labeled inlier and outlier data. We begin by randomly splitting the labeled inliers into D_0^train and D_0^cal, and the labeled outliers into D_1^train and D_1^cal. The inliers in D_0^train and D_0^cal are utilized to compute a preliminary inductive conformal p-value û_0(X_{n+1}) for the test point X_{n+1} via (2.1). Similarly, the outliers in D_1^train and D_1^cal are utilized to train a score function ŝ_1 and then compute û_1(X_{n+1}), where the function û_1 is defined for any x ∈ R^d as:

\[
\hat{u}_1(x) = \frac{1 + |\{ j \in \mathcal{D}_1^{\mathrm{cal}} : \hat{s}_1(X_j) \le \hat{s}_1(x) \}|}{1 + |\mathcal{D}_1^{\mathrm{cal}}|}. \tag{2.2}
\]

One may think of û_1(X_{n+1}) as an approximate conformal p-value for testing whether Y_{n+1} = 1, bearing in mind that it is not valid for that purpose unless the outliers are exchangeable—an assumption we do not make.
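Both preliminary quantities are plain normalized ranks, so they are cheap to compute once the conformity scores are available. The following minimal numpy sketch illustrates this rank calculation; it is not the implementation accompanying the thesis, and the random score vectors are placeholders standing in for the outputs of one-class classifiers trained on D_0^train and D_1^train.

```python
import numpy as np

def conformal_rank_pvalue(cal_scores, test_score):
    """Normalized rank as in (2.1)-(2.2):
    (1 + #{calibration scores <= test score}) / (1 + n_cal).
    Smaller conformity scores are interpreted as evidence of being an outlier."""
    cal_scores = np.asarray(cal_scores)
    return (1.0 + np.sum(cal_scores <= test_score)) / (1.0 + len(cal_scores))

# Toy usage with made-up scores; in practice they would come from one-class
# classifiers fitted on D0_train and D1_train (e.g., score_samples in scikit-learn).
rng = np.random.default_rng(0)
s0_cal, s1_cal = rng.normal(size=500), rng.normal(size=200)  # scores on D0_cal and D1_cal
s0_test, s1_test = rng.normal(), rng.normal()                # scores of the test point X_{n+1}
u0_test = conformal_rank_pvalue(s0_cal, s0_test)             # u0(X_{n+1}), equation (2.1)
u1_test = conformal_rank_pvalue(s1_cal, s1_test)             # u1(X_{n+1}), equation (2.2)
```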
In any case, û_1(X_{n+1}) has two merits. First, by pooling side information from D_1, it is useful to strengthen or weaken the evidence in û_0(X_{n+1}). Second, it takes values between 0 and 1, making it a principled measure of statistical evidence comparable across different models. Having distilled separate sources of information into û_0(X_{n+1}) and û_1(X_{n+1}), we propose to integrate the relevant evidence through the following combination function:

\[
\hat{r}(x) = \hat{u}_0(x) / \hat{u}_1(x). \tag{2.3}
\]

This choice is inspired by classical p-value weighting [43, 56], but r̂(X_{n+1}) cannot be directly analyzed as a legitimate weighted p-value [8, 58] because û_0(X_{n+1}) and û_1(X_{n+1}) are not independent. Our strategy is to employ r̂(x) as a score function to re-calibrate a new conformal p-value. As discussed in more detail later, this approach does not require new assumptions and introduces only relatively modest additional computational costs.

More precisely, let us go back to (2.1) and think of û_0(X_{n+1}) as the value corresponding to input X_{n+1} of a function û_0 defined, for any x ∈ R^d, as:

\[
\hat{u}_0(x) = \frac{|\{ i \in \{n+1\} \cup \mathcal{D}_0^{\mathrm{cal}} : \hat{s}_0(X_i) \le \hat{s}_0(x) \}|}{1 + |\mathcal{D}_0^{\mathrm{cal}}|}. \tag{2.4}
\]

This perspective allows us to also calculate û_0(X_i) by applying (2.4) after swapping X_{n+1} with X_i, for all i ∈ D_0^cal. Similarly, we also go back to (2.2) and calculate û_1(X_i) by applying (2.2) to X_i instead of X_{n+1}, for all i ∈ D_0^cal. Next, we can apply (2.3) to define r̂(X_i) for all i ∈ D_0^cal. Finally, the same conformalization strategy as in (2.1) is deployed to re-calibrate an integrative conformal p-value û(X_{n+1}):

\[
\hat{u}(X_{n+1}) = \frac{1 + |\{ i \in \mathcal{D}_0^{\mathrm{cal}} : \hat{r}(X_i) \le \hat{r}(X_{n+1}) \}|}{1 + |\mathcal{D}_0^{\mathrm{cal}}|}. \tag{2.5}
\]

This method, summarized in Algorithm 1, yields a valid conformal p-value, as established below. Further, if the weights û_1(X_{n+1}) are sufficiently informative, integrative p-values can lead to higher power, as we shall see in Sections 2.5 and 2.6; otherwise, they typically do not incur much loss, consistently with the robustness of classical strategies for weighted hypothesis testing [43, 101].

Theorem 1. Suppose X_{n+1} is an inlier that is exchangeable with the inliers in D_0^cal conditional on D_0^train and D_1. Then, P[û(X_{n+1}) ≤ α | Y_{n+1} = 0] ≤ α, for all α ∈ (0, 1), where û(X_{n+1}) is the output of Algorithm 1.

Algorithm 1 Integrative conformal p-values
1: Input: inliers D_0, outliers D_1, one-class classifiers A_0, A_1, and test point X_{n+1}.
2: Randomly split D_0 into two disjoint subsets, D_0^train and D_0^cal.
3: Randomly split D_1 into two disjoint subsets, D_1^train and D_1^cal.
4: Train A_0 on D_0^train and A_1 on D_1^train.
5: Calculate conformity scores ŝ_0(X_i) based on A_0, for all i ∈ {n + 1} ∪ D_0^cal.
6: Compute inductive conformal p-values û_0(X_i) with (2.4), for all i ∈ {n + 1} ∪ D_0^cal.
7: Calculate conformity scores ŝ_1(X_i) based on A_1, for all i ∈ {n + 1} ∪ D_0^cal ∪ D_1^cal.
8: Compute inductive conformal p-values û_1(X_i) with (2.2), for all i ∈ {n + 1} ∪ D_0^cal.
9: Combine û_0(X_i) and û_1(X_i) into r̂(X_i) with (2.3), for all i ∈ {n + 1} ∪ D_0^cal.
10: Output: conformal p-value û(X_{n+1}) computed with (2.5).

Theorem 1 does not follow directly from Proposition 1. The key novelty involves dealing with the statistics r̂(X_i) in (2.5), which are not typical conformity scores because r̂ depends not only on D_0^train but also on D_0^cal ∪ {X_{n+1}}. This situation shares similarities with transductive conformal inference [134], but our approach has the advantage of not requiring a new model to be trained for each test point.
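To make the calibration steps of Algorithm 1 concrete, here is a compact numpy sketch assuming the two classifiers have already been trained once (on D_0^train and D_1^train) and their conformity scores have been evaluated on the relevant points; the function name and argument layout are illustrative assumptions rather than the reference implementation.

```python
import numpy as np

def integrative_pvalue(s0_cal, s0_test, s1_cal_on_d0, s1_test, s1_on_d1cal):
    """Calibration steps of Algorithm 1, given pre-computed conformity scores.
    s0_cal[i] and s1_cal_on_d0[i] must refer to the same point X_i in D0_cal;
    s1_on_d1cal holds the scores of the hold-out outliers in D1_cal."""
    n0, n1 = len(s0_cal), len(s1_on_d1cal)
    pooled0 = np.append(s0_cal, s0_test)       # s0 scores on {n+1} ∪ D0_cal, test point last
    s1_eval = np.append(s1_cal_on_d0, s1_test) # s1 scores on the same points, same order

    # Equation (2.4): rank of s0(x) within the pooled set {n+1} ∪ D0_cal.
    u0 = np.array([np.sum(pooled0 <= s) for s in pooled0]) / (1.0 + n0)
    # Equation (2.2): rank of s1(x) among the hold-out outlier scores.
    u1 = np.array([1.0 + np.sum(s1_on_d1cal <= s) for s in s1_eval]) / (1.0 + n1)

    r = u0 / u1                                # ratio statistics, equation (2.3)
    r_cal, r_test = r[:-1], r[-1]
    return (1.0 + np.sum(r_cal <= r_test)) / (1.0 + n0)   # re-calibration, equation (2.5)
```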
In fact, Algorithm 1 only needs to train ŝ_0 and ŝ_1 once, without looking at X_{n+1}. Further, Algorithm 1 can be implemented efficiently by streamlining redundant operations via appropriate pre-sorting and then evaluating û_0 and û_1 with fast binary searches, as detailed in Section A2. In summary, Algorithm 1 can produce a p-value for a single test point at cost O(T + n_c log n_c), where T = T(n) is the cost of training a classifier and n_c = max{|D_0^cal|, |D_1^cal|}, assuming |D_0^cal| and |D_1^cal| are both of order O(n_c). This is similar to the O(T + n_c) cost of an inductive p-value. Further, applying Algorithm 1 to m test points costs O(T + (m + min{m n_c, n_c^2}) log n_c). This cost is higher than the O(T + (m + n_c) log n_c) of inductive inference [135], but typically lower than the O(mT + m n_c) of classical transductive (or full) approaches. In fact, T is the main bottleneck when dealing with complex machine learning models, and n_c does not necessarily need to be very large in practice. As explained in [10], unless extremely small p-values are needed, moderate choices of n_c (e.g., n_c ≤ 1000) may often be sufficient to obtain reliable inferences, even if n is much larger. Thus, integrative p-values achieve a reasonable trade-off between computational costs and power.

2.2.2 Integrative conformal p-values with automatic model selection and tuning

This section presents a refined method that endows integrative p-values with automatic model selection and tuning capabilities. This can boost power without invalidating the type-I error guarantee of Theorem 1. The idea is to train a toolbox of different classifiers on D_0^train and D_1^train, and then evaluate integrative p-values through (2.3) and (2.5) based on the most promising model. Our method can deal with any model toolbox, which may include various families of classifiers (e.g., random forests, neural networks, etc.), as well as multiple instances of a model trained with different hyper-parameters.

For each of the M_0 models trained on D_0^train, we evaluate ŝ_0(X_i) for all i ∈ D_1^cal and for all i ∈ {n + 1} ∪ D_0^cal; then, we pick the model maximizing the median difference in ŝ_0(X_i) across the two groups. The intuition is that a powerful ŝ_0 must separate inliers from outliers well. The model trained on D_1^train is selected similarly among M_1 candidates. This approach works well in practice, but our method could also be implemented under any alternative criterion for identifying the most promising model. The key to retaining the validity of the output p-values is to utilize a model selection criterion that is invariant to permutations of {n + 1} ∪ D_0^cal. Further, any labeled outliers can be incorporated into the computation of ŝ_0 based on a binary classifier trained on the combined data D_0^train ∪ D_1^train, eliminating the dilemma of whether one should rely on one-class or binary algorithms. This methodology is outlined in Algorithm 2 and sketched in Figure 2.2. The validity of this approach is established in Theorem 2. As explained in Section A2, the computational cost of Algorithm 2 is not much higher than that of Algorithm 1, aside from the unavoidable necessity of training M_0 + M_1 classifiers instead of only two.

Theorem 2. Suppose X_{n+1} is an inlier that is exchangeable with the inliers in D_0^cal conditional on D_0^train and D_1. Then, P[û(X_{n+1}) ≤ α | Y_{n+1} = 0] ≤ α for all α ∈ (0, 1), where û(X_{n+1}) is the output of Algorithm 2.
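The selection criterion above admits a very short sketch. The snippet below is only an illustration and reads the criterion as a difference of group medians, which is one natural interpretation of the median difference described in the text; the helper name select_model and the interface of the candidate score functions are our own assumptions. Because the criterion depends on {n + 1} ∪ D_0^cal only through the pooled set of scores, it is invariant to permutations of those points, as required for Theorem 2.

```python
import numpy as np

def select_model(score_fns, X_pool, X_out):
    """Model selection criterion of Algorithm 2 (steps 5-6 and 8-9), read as a
    difference of group medians: prefer the score function that best separates
    the pooled set {n+1} ∪ D0_cal (scores expected to be large) from D1_cal
    (scores expected to be small)."""
    gaps = [np.median(fn(X_pool)) - np.median(fn(X_out)) for fn in score_fns]
    return int(np.argmax(gaps))

# Toy usage: each candidate exposes a score function, for example
#   score_fns = [IsolationForest().fit(X0_train).score_samples,
#                OneClassSVM().fit(X0_train).score_samples]
# with scikit-learn models; hyper-parameter variants count as separate candidates.
```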
We present an intuitive demonstration of the advantages of our automated model selection scheme. Recall from Section 2.2 that it is typically assumed that a one-class classifier trained on inlier data should produce smaller scores ŝ_0(X) if X is an outlier. However, this is not always the case; for instance, Figure A1 in the Supplementary Material shows a scenario where a one-class classifier effectively separates outliers from inliers but assigns smaller scores to the latter. This "reverse direction" of the scores results in non-informative and powerless p-values. Algorithm 2 avoids this problem because it can adaptively determine whether ŝ_0 (or ŝ_1) should be transformed into −ŝ_0 (or −ŝ_1) prior to computing û_0 (or û_1). This tuning trick will be applied throughout the chapter.

[Figure 2.2 here: flow chart from the inlier data D_0 and outlier data D_1, through data splitting, training of the classifier toolboxes {A_0^m} and {A_1^m}, evaluation of conformity scores on D_0^cal, D_1^cal, and X_{n+1}, model selection, preliminary p-values û_0 and û_1, ratio statistics r̂, and finally the integrative p-value û(X_{n+1}).]

Figure 2.2: Schematic representation of Algorithm 2, which computes integrative conformal p-values with automatic model selection and tuning. The dashed lines represent the flow of information utilized for the evaluation of the machine-learning models, but not for training. The dotted lines indicate the flow of information utilized for model selection and tuning.

Algorithm 2 Integrative conformal p-values with automatic model selection and tuning
1: Input: inlier data D_0, outlier data D_1, one-class and binary classifiers {A_0^m}_{m∈[M_0]}, one-class classifiers {A_1^m}_{m∈[M_1]}, and test point X_{n+1}.
2: Randomly split D_0 into two subsets, D_0^train and D_0^cal.
3: Randomly split D_1 into two subsets, D_1^train and D_1^cal.
4: For each m ∈ [M_0], train A_0^m on D_0^train and, if the model is binary, also on D_1^train.
5: Calculate scores ŝ_0^m(X_i) based on A_0^m for all i ∈ {n + 1} ∪ D_0^cal ∪ D_1^cal and m ∈ [M_0].
6: Select the model m_0^* that maximizes the median difference between the conformity scores ŝ_0^m(X_i) evaluated on {n + 1} ∪ D_0^cal and those evaluated on D_1^cal.
7: For each m ∈ [M_1], train A_1^m on D_1^train.
8: Calculate scores ŝ_1^m(X_i) based on A_1^m for all i ∈ {n + 1} ∪ D_0^cal ∪ D_1^cal.
9: Select the model m_1^* that maximizes the median difference between the conformity scores ŝ_1^m(X_i) evaluated on {n + 1} ∪ D_0^cal and those evaluated on D_1^cal.
10: Compute û_0(X_i) with (2.4) for all i ∈ {n + 1} ∪ D_0^cal, using model m_0^*.
11: Compute û_1(X_i) with (2.2) for all i ∈ {n + 1} ∪ D_0^cal, using model m_1^*.
12: Combine û_0(X_i) and û_1(X_i) into r̂(X_i) with (2.3) for all i ∈ {n + 1} ∪ D_0^cal.
13: Output: conformal p-value û(X_{n+1}) computed with (2.5).

2.2.3 Integrative p-values with transductive cross-validation+

Algorithms 1 and 2 are computationally efficient because they require training the classifiers only once, even if conformal p-values are needed for many test points.
We address these issues by describing an alternative approach to constructing integrative p-values that is more computationally expensive but has the advantage of utilizing the available data more parsimoniously. Our solution is inspired by cross-validation+ [5] and transductive conformal inference [134], but it involves some novelties relative to both approaches. To facilitate the exposition, we begin by focusing on modifying Algorithm 1. Given a number of folds K1, randomly split the outliers in D1 into K1 disjoint subsets: D1 1 , . . . , D K1 1 . For any j ∈ D1, define k1(j) as the fold to which Xj is assigned, and for any k ∈ [K1], let sˆ k 1 denote the conformity score function computed by the one-class classifier trained on the data in D1 \ Dk 1 . Then, for any possible feature vector x, define uˆ1(x) = 1 + |{j ∈ D1 : ˆs k1(j) 1 (Xj ) ≤ sˆ k1(j) 1 (x)}| 1 + |D1| . (2.6) The main difference between (2.2) and (2.6) is that the latter can leverage models trained on more data, and it assesses the ranks of the conformity scores based on 1 + |D1| observations instead of 1 + |Dcal 1 |. As in Section 2.2, uˆ1(X) looks like a conformal p-value for testing H1 : Y = 1 but it does not need to be exactly valid for that purpose. Our approach to computing uˆ0 is similar but more closely inspired by transductive inference [134]. Given a number of folds K0, which may differ from K1, randomly partition D0∪{n+1} into D1 0 , . . . , D K0 0 . For any j ∈ D0 ∪ {n + 1}, define k0(j) as the fold to which Xj is assigned, and for any k ∈ [K0] let sˆ k 0 denote the score function obtained with the classifier trained on D0 \ Dk 0 . Then, for any x ∈ R d and l ∈ [K0], define uˆ0(x; l) = |j ∈ {n + 1} ∪ D0 : ˆs k0(j) 0 (Xj ) ≤ sˆ l 0 (x)| 1 + |D0| . (2.7) 14 Thus, uˆ0(Xn+1; k0(n+1)) may be seen as a cross-validation+ [5] type of conformal p-value for testing H0 : Y = 0. But there is an important distinction: we treat Xn+1 as a training point for K0 − 1 out of K0 models, while [5] do not use Xn+1 for training. Although our approach is more computationally expensive, it has the advantage of being symmetric with respect to the data in D0∪{n+1}. This will allow us to carry out data-driven p-value weighting and automatic model tuning while guaranteeing type-I error control in finite samples, as explained below. We pause for a moment to further elaborate on how (2.7) departs from full conformal inference and extends the transductive framework of [134]. Full conformal inference, in its most commonly used form, similarly utilizes Xn+1 for training, but it evaluates all conformity scores within-sample [72], which makes it more susceptible to over-fitting. Out-of-sample evaluation is desirable when using flexible models—e.g., deep neural networks—that can achieve zero training error while giving accurate out-of-sample predictions [142]. Overfitting can make full conformal inferences powerless, as zero training error tends to correspond to randomly ordered conformity scores for all in-sample data points, but it does not similarly impact our method. The approach proposed in this chapter is closer to the leave-one-out transductive framework of [134], which is recovered if we utilize folds of size one (K0 = |D0|); in that special case, the validity of our method follows from existing transductive conformal prediction theory [134]. 
In general, however, our method’s ability to operate with smaller values of K0 (e.g., K0 ∈ {5, 10}) increases computational efficiency, but the additional cross-validation randomness slightly complicates the proof of our theoretical result. Having defined and motivated (2.6)–(2.7), we proceed to integrate the information contained in uˆ0[Xn+1; k0(n+ 1)] and uˆ1(Xn+1) similarly to Section 2.2.1. First, we compute uˆ1(Xi) and uˆ0[Xi ; k0(i)] via (2.6) and (2.7), respectively, for all i ∈ D0∪{n+1}. Then, for each i ∈ D0∪{n+1}, we combine uˆ1(Xi) and uˆ0[Xi ; k0(i)] with rˆ(Xi) = ˆu0(Xi)/uˆ1(Xi), as in (2.3). Finally, we output an integrative p-value uˆ(Xn+1) based on (2.1), using D0 instead of Dcal 0 . This method, which we call transductive cross-validation+ (TCV+), is outlined in 15 Algorithm 3 and schematized in Figure A2. If the classifiers utilized to compute sˆ0 and sˆ1 are invariant to the order of the training data, uˆ(Xn+1) is a valid p-value. As explained in Section A2, the calibration components of Algorithm 3 can be implemented quite efficiently, but the training costs are higher than those of Algorithm 1. Theorem 3. Assume Xn+1 is an inlier and is exchangeable with the inliers in D0, and uˆ(Xn+1) is output by Algorithm 3 based on training algorithms that are invariant to the order of the input data. Then, P [ˆu(Xn+1) ≤ α | Yn+1 = 0] ≤ α for all α ∈ (0, 1). Algorithm 3 Integrative conformal p-values via transductive cross-validation+ 1: Input: inlier data D0, outlier data D1, permutation-invariant one-class classifiers A0, A1, test point Xn+1, numbers of folds K0 and K1. 2: Split D1 into K1 folds, D1 1 , . . . , D K1 1 . 3: for k = 1, . . . , K1 do 4: Fit a model Ak 1 by training A1 on D1 \ Dk 1 ; this yields a score function sˆ k 1 . 5: end for 6: Split D0 ∪ {n + 1} into K0 folds, D1 0 , . . . , D K0 0 ; let k0(i) denote the fold for i. 7: for k = 1, . . . , K0 do 8: Fit a model Ak 0 by training A0 on D0 \ Dk 0 ; this yields a score function sˆ k 0 . 9: end for 10: For all i ∈ D0 ∪ {n + 1}, compute uˆ1(Xi) with (2.6) and uˆ0(Xi ; k0(i)) with (2.7). 11: For all i ∈ D0 ∪ {n + 1}, compute rˆ(Xi) = ˆu0[Xi ; k0(i)]/uˆ1(Xi). 12: Output: conformal p-value uˆ(Xn+1) computed via (2.5), with D0 instead of Dcal 0 . We conclude this section with several remarks. First, TCV+ can also accommodate binary classifiers as well as automatic model tuning based on any machine learning toolbox, although that would make it even more expensive to train the underlying classifiers. The key idea is that model tuning does not invalidate the integrative p-values if the tuning criterion is invariant to permutations of the data in D0 ∪ {n + 1}. In particular, one can implement Algorithm 3 by training with cross-validation a toolbox of classifiers, as in Algorithm 2, and then computing the integrative p-value based on the most promising pair of classifiers. Second, TCV+ can also be applied if no labeled outliers are available. This still leads to conformal pvalues satisfying the desired super-uniformity property of Proposition 1, in contrast with standard crossvalidation+ which in theory requires an additional conservative correction [5]; see Section A3. Similarly, 16 TCV+ p-values may also be useful beyond the scope of testing for outliers, including for example in the context of multi-class classification, as discussed in Section A3. 2.3 FDR control via conditional calibration Consider a multiple testing problem with unlabeled data points Xn+1, Xn+2, . . . , Xn+m. 
Our goal is to discover outliers with finite-sample FDR control. Classical FDR methods, including the Benjamini-Hochberg (BH) procedure, assume the p-values are independent or positively dependent according to a specific PRDS property [16]. However, conformal p-values for different test points are generally dependent on one another if computed using the same calibration data. While standard (inductive) conformal p-values satisfy the PRDS property [10], and can hence still be applied with BH to control the FDR [16], it is not clear whether the same result extends to integrative p-values. The issue is that the function rˆ in (2.5) depends not only on the training data but also on observations in Dcal 0 ∪ {Xn+1}. Integrative p-values for different test points are thus based on different random score functions, which makes their dependence structure more complicated compared to that studied by [10]. This issue is investigated further in Section A5.1 and Figure A3, which show that the pairwise correlations between integrative p-values may also be affected by the distribution of the data. Consequently, it remains unclear whether it is theoretically safe to utilize BH with integrative p-values. This section presents a solution that theoretically guarantees FDR control with integrative p-values while maintaining relatively high power, based on the conditional calibration framework of [37]. Conditional FDR calibration is a versatile theoretical blueprint that may be computationally costly in general but has been implemented successfully in several cases [37]. Interestingly, this framework is practical and intuitive to implement with integrative conformal p-values. The conditional FDR calibration involves three steps, the high-level ideas of which are as follows. In the first step, one computes a data-adaptive rejection threshold τi separately for each test point i ∈ Dtest = 17 {n+1, . . . , n+m}, in such a way as to ensure that τi only depends on the data through a suitable sufficient statistic Φi . The choice of Φi is delicate and has to follow two key principles: (1) the p-value for the i-th test point must be super-uniform conditional on Φi ; (2) the total number of rejections obtained by thresholding each p-value u(Xi) at level τi should be bounded from below by a known function of Φi . Further, to achieve high power, the bound should hold if u(Xi) ≤ τi and be as tight as possible. In the second step, one applies the thresholds τi separately to each p-value and verifies whether the aforementioned lower-bound constraints are satisfied. If all constraints are satisfied, the set of rejections controls the FDR. Otherwise the rejection set must be randomly thinned (or pruned) through a final third step, which can decrease power. There are many ways of computing the thresholds τi , and it is generally unclear how to best translate the sequence of high-level ideas described above into a practical algorithm. This makes it non-trivial to apply conditional FDR calibration to integrative p-values. Here, we focus on addressing the challenge for the p-values output by Algorithm 1. The three steps of our solution are outlined below and schematized in Figure A4. The same idea can be directly extended to Algorithm 2 and could also be adapted to Algorithm 3, as explained in Section A1 and Figure A5 in the Supplementary Material. Step 1: Calibration. 
Construct a sufficient statistic Φi containing information on all scores calculated by Algorithm 1, up to a random permutation of those involving the test point i and the labeled calibration inliers in Dcal 0. Concretely, for all i ∈ Dtest, define

Φi = ( sˆ0(Xj), sˆ1(Xj), for all j ∈ Dtest \ {i};  sˆ1(Xj), for all j ∈ Dcal 1;  the unordered set {(ˆs0(Xj), sˆ1(Xj)) : j ∈ Dcal 0 ∪ {i}} ).   (2.8)

Determining a threshold τi involves evaluating the FDR contribution of the i-th test point conditional on Φi. To this end, we compute for each i ∈ Dtest a vector of m variables u˜i(Xj), for all j ∈ Dtest \ {i}, by applying Algorithm 1 with Dcal 0 replaced by Dcal 0 ∪ {i}. Note that u˜i(Xj) is not a proper conformal p-value, because it is based on a perturbed calibration set, Dcal 0 ∪ {i}, which may contain an outlier; however, it does not need to be valid, and it is a reasonable approximation of uˆ(Xj) if Dcal 0 is large. Next, we apply BH at level α to {u˜i(Xn+1), . . . , u˜i(Xi−1), 0, u˜i(Xi+1), . . . , u˜i(Xn+m)}, obtaining R˜i ≥ 1 rejections.

Step 2: Preliminary rejection. Define the preliminary rejection set R+ = {i ∈ Dtest : uˆ(Xi) ≤ αR˜i/m}, and let R˜+ = |R+|. This means τi = αR˜i/m. If R˜+ ≥ R˜i for all i ∈ R+, return the final rejection set R = R+. Otherwise, proceed to the next step.

Step 3: Pruning. Generate independent standard uniform random variables ϵi for each i ∈ R+. Let R = max{ r ∈ [R˜+] : |{i ∈ R+ : ϵi ≤ r/R˜i}| ≥ r }. The pruned rejection set R contains the R indices i ∈ R+ such that ϵi < R/R˜i.

Conditional FDR calibration may be less powerful than BH, but it has the advantage of rigorously controlling the FDR without being as conservative as the classical procedure of [16]. Further, it is not computationally prohibitive in our context. In fact, the algorithm described above only operates on precomputed scores, and thus its cost is typically smaller than that of training the classifiers.

Theorem 4. Assume Xi is an inlier and is exchangeable with the data in Dcal 0, conditional on Dtrain 0 and D1, for all i ∈ Dtest with Yi = 0. Then the expected proportion of inliers in the set R computed as above is smaller than αm0/m, where m = |Dtest| and m0 is the number of inliers in Dtest.

2.4 Asymptotic power analysis

This section presents an asymptotic power analysis that provides some theoretical insight into the competitive practical performance of integrative conformal p-values. For simplicity, we focus on studying Algorithm 1, which weights the inductive p-values based on uˆ1(Xi) without performing automatic model tuning. The first part of this section shows that the weighted p-value procedure outperforms the unweighted benchmark asymptotically, under some generic conditions; the second part simplifies those conditions by utilizing the weight structure of our integrative conformal p-values.

Generic power analysis under FDR control

Our analysis is inspired by the rich literature on multiple testing with side information and weighted FDR analysis. For instance, [43] perform a power analysis for a BH-type algorithm with generic p-value weights, while [21] and [81] show that the power of unweighted FDR procedures can be improved with carefully constructed data-driven weights. As detailed below, this section requires some additional modeling and technical assumptions, and it begins from a convenient simplification that allows rˆ(Xi) to be seen as an inductive conformal p-value weighted by uˆ1(Xi). Specifically, imagine the ranking of the p-values output by Algorithm 1 is the same as that of rˆ(Xi) = ˆu0(Xi)/uˆ1(Xi).
This may not always be true, as the values of uˆ0(Xj ) for all j ∈ Dcal 0 are dependent on the test point. However, the simplification is reasonable if both Dcal 0 and Dcal 1 are large, and the difference between the two rankings is usually negligible. Consistently with the literature on weighted hypothesis testing, imagine the test data are sampled from a mixture model under which Y is also random; i.e., for all i ∈ Dtest , Yi ∼ Bernoulli(1 − π0), Xi | Yi = 0 exch. ∼ PX|Y =0, Xi | Yi = 1 ∼ PX|Y =1. Above, π0 ∈ [0, 1] is the (unknown) expected proportion of test inliers. Note that this assumes the test inliers are exchangeable with the labeled inliers, while allowing the outliers to be non-exchangeable. Despite several similarities, this model is stronger than that followed from Section 2.2 until now because it also assumes Y is random. Our power analysis departs from the prior literature in the way it handles the weights, as uˆ1(Xi) may depend on uˆ0(Xi) even under the null hypothesis. Our approach is inspired by [81], whose techniques can be repurposed to accommodate dependent weights. Without loss of generality, let us think of the one-class classifier A1 utilized to compute sˆ1(Xi) in Algorithm 1 as learning from the data in Dtrain 1 a function that maps Xi to a latent low-dimensional representation Z1,i, for each i ∈ Dtest. In the following, I will treat 20 as fixed the data in Dtrain 0 and Dtrain 1 , as well as the functions sˆ0 and sˆ1. With this premise, one may view Z1,i as the “side information” from which I extract the weight uˆ1(Xi) accompanying the p-value uˆ0(Xi). Define F as the collection of random variables {Z1,i}i∈Dtest , D0, and D1, and let F1,i(t) be the cumulative distribution function of uˆ0(Xi) conditional on (Yi = 1, F). Note that uˆ1(Xi) is measurable with respect to F as it depends on Xi only through Z1,i, while uˆ0(Xi)remains random conditional on F because it leverages a distinct classifier based on different features. Thus, F separates the randomness in uˆ0(Xi) from that in uˆ1(Xi). Throughout this section, all probabilities and expected values will be understood to be defined conditional on F, even when not written explicitly. Let δ(t; v) = {δi(t; vi) : n + 1 ≤ i ≤ n + m} be a family of rejection rules for m hypotheses with p-values uˆ0(Xn+1), . . . , uˆ0(Xn+m) and generic weights v = (vn+1, . . . , vn+m). By definition, δi(t; vi) = 1 if and only if pi/vi ≤ t. It follows from Theorem 2 in [21], under relatively mild weak dependence conditions [123], that the FDR of δ(t; v) becomes in the large-m limit: FDR{δ(t; v)} = Q(t; v) + o(1), where Q(t; v) = P i∈Dtest vit · P [Yi = 0 | Z1,i] P i∈Dtest vit · P [Yi = 0 | Z1,i] + P i∈Dtest F1,i(vit | Z1,i) · P [Yi = 1 | Z1,i] . Define Ψ(t; v) be the expected number of true discoveries conditional on F: Ψ(t; v) := E[ P i∈Dtest Yi · δi(t; vi) | F]. Further, for any v = (v1, . . . , vm), define the oracle threshold t α oracle(v) as the largest possible threshold t controlling Q(t; v) below a fixed α: t α oracle(v) = sup{t : Q(t; v) ≤ α}. Below, the FDR and the power of the oracle rule δ(t α oracle(v); v) will be compared under two choices of weights: v = {uˆ1(Xi) : i ∈ Dtest}, which corresponds to the integrative method, and v = (1, . . . , 1), which corresponds to inductive p-values. The following result is based on two generic assumptions, but it 21 will be recast into a more specific, interpretable and verifiable form for analysis with integrative p-values in the next section. Assumption 1. 
The weights uˆ1(Xi), for i ∈ Dtest, satisfy P i∈Dtest P [Yi = 0 | Z1,i] P i∈Dtest P [Yi = 0 | Z1,i] ˆu1(Xi) · P i∈Dtest P [Yi = 1 | Z1,i] P i∈Dtest P [Yi = 1 | Z1,i] ˆu1(Xi)−1 ≥ 1. Assumption 2. The cumulative distribution function F1,i(t) satisfies: X i∈Dtest aiF1,i(t/xi) ≥ X i∈Dtest aiF1,i t Pm j=1 aj Pm j=1 ajxj ! , for any 0 ≤ ai ≤ 1, min1≤i≤m w˜ −1 i ≤ xi ≤ max1≤i≤m w˜ −1 i , and t > 0, where w˜i is: w˜i = ˆu1(Xi) · P j∈Dtest P [Yj = 0 | Zj ] P j∈Dtest wj · P [Yj = 0 | Zj ] . (2.9) Assumption 1 is similar to assumptions utilized by [43] and [81]. It can be intuitively understood as saying that uˆ1(Xi) should be sufficiently informative; i.e., it should be large for the true outliers and small for the inliers. In Section 2.4 and Section A4 of the Supplementary Material, Assumption 1 will be made quantitatively more precise, as well as easier to interpret and verify, by introducing a useful working model for the distribution of the test data, and by leveraging the structure of the weights computed by our integrative method. Assumption 2 is a more technical condition on the shape of the alternative distribution of uˆ0(Xi) conditional on Z1,i. If these distributions are homogeneous, i.e., F1,i(t) = F1,n+1(t) for all i ∈ Dtest, it reduces to saying that the function x → F1(t/x) is convex [56, 81]. Theorem 5. Under Assumptions 1–2, we have Ψ {t α oracle(ˆu1); ˆu1} ≥ Ψ {t α oracle(1); 1} . 22 Recall that t α oracle(ˆu1) and t α oracle(1) are the most powerful thresholds (approximately) controlling the FDR below α in the two respective cases. The takeaway message is that FDR procedures based on integrative p-values asymptotically dominate those based on unweighted p-values, as long as the weights uˆ1 are sufficiently informative (Assumption 1) and some regularity conditions hold (Assumption 2). Simplified analysis under independence and oracle separability The above power analysis is not yet specific to Algorithm 1, but the structure of our weights uˆ1(Xi) allows recasting Assumption 1 into a more easily interpretable form under some additional simplifying assumptions. Assumption 3. The test data are independent of one another and of the labeled data. Assumption 4. The test outliers can be separated from the inliers based on the latent features Z1,i used by the machine-learning algorithm A1; that is, P [Yi = 1 | Z1,i] = Yi . Assumption 3 may often be justified, but Assumption 4 is stronger, as it requires the existence of an oracle classifier that can perfectly separate outliers from inliers. The latter, which may be a reasonable approximation in low-noise applications (e.g., in computer vision), is useful because it simplifies Assumption 1 without making the power analysis trivial. In fact, Assumption 4 does not imply that the specific model used to compute our conformity scores is a perfectly accurate classifier. Instead, it simply says that the weights uˆ1(Xi) are a function of the same latent sufficient statistics Z1,i employed by the imaginary oracle, without specifying how informative these weights might actually be. In any case, it is also possible to derive a qualitatively similar, but technically more involved, power analysis under a weaker version of Assumption 4 that does not require perfect oracle separability; see Section A4.1. 23 Under Assumptions 3 and 4, for m = |Dtest| large, Assumption 1 becomes approximately equivalent to Ξ ≥ 1, where the informativeness ratio Ξ is defined as: Ξ := (E [1/uˆ1(X) | Y = 1, D])−1 E [ˆu1(X) | Y = 0, D] . 
(2.10) Above, the conditioning on all labeled data in D = D0 ∪ D1 signifies that the expectation is taken with respect to the test point X, while the function uˆ1 is fixed. Combined with Theorem 5 and under Assumptions 2–4, this makes more precise the intuition that the integrative p-values output by Algorithm 1 can be more powerful than inductive conformal p-values if uˆ1(Xi) is sufficiently small for the true inliers in the sense of (2.10). To better understand when it may be the case that Ξ ≥ 1, consider a working model saying that the conditional distribution of uˆ1(X) given Y = 1 and D is a discrete Beta distribution on {1/(n1+1), 2/(n1+ 1), . . . , 1}, with shape parameters (ν, 1) for some ν > 0, where n1 = |Dcal 1 |. This distribution, which is motivated below, takes the form P [ˆu1(X) = k/(n1 + 1) | Y = 1, D] = k −(1−ν) /H(1−ν) n1+1 , (2.11) for any k ∈ {1, . . . , n1 + 1}. Above, H (r) n := Pn k=1 k −r is the n-th generalized harmonic number of order r, for n ∈ N and r ≥ 0. If ν = 1, this is simply the uniform distribution on {1/(n1+1), 2/(n1+1), . . . , 1}, which is what one would expect if n1 is large and the test outliers are exchangeable with the labeled outliers. By contrast, if ν < 1 (or ν > 1), this distribution can describe more general situations in which the outliers are not exchangeable and uˆ1(X) | Y = 1 is stochastically smaller (or larger) than uniform. 24 As explained in Section A4.2, it follows from this working model that Ξ can be approximated, for n1 sufficiently large, as: Ξ ≈ 1 E [ˆu1(X) | Y = 0, D] · 1 ν·ζ(2−ν) · 1 (n1+1)1−ν , if ν < 1, 1 γ+log(n1+1) , if ν = 1, ν−1 ν − (ν−1)2 ·ζ(2−ν) ν·(n1+1)ν−1 , if ν > 1, (2.12) where γ ≈ 0.577 is Euler’s constant and ζ is the Riemann zeta function. As the second term in (2.12) dominates the smallest possible value of E [ˆu1(X) | Y = 0, D], namely 1/(n1 + 1), the result suggests that weighting can be beneficial regardless of whether the outlier are exchangeable. This intuition will be verified empirically in Sections 2.5–2.6. Of course, testing for outliers tends to be more challenging if the latter are non-exchangeable and ν < 1, as in that case even the p-values of true inliers may receive large weights. One may wonder whether the above power analysis can have direct practical implications, perhaps by inspiring a data-driven rule-of-thumb for automatically determining when weighting should be applied. While the ratio Ξ defined in (2.10) is generally unknown, it may be tempting to replace it with an empirical estimate Ξˆ based on a parametric expression similar to (2.12) if n1 is large, as explained in Section A4.3. Then, one could think of applying weighting if and only if Ξˆ ≥ 1. However, our simulation studies suggest that such heuristic is not reliable because it often underestimates the benefits of weighting. This can be understood by noting that the power analysis is one-directional and does not preclude the possibility that weighting may lead to higher power even if Ξ < 1. Further, the simplifying assumptions introduced in this section may not hold in practice and are not necessarily needed for our integrative method to be useful. 25 In brief, it is not advisable to rely on the simplified conditions as a practical rule-of-thumb. Instead, the value of this simplified power analysis is to offer a clearer understanding of the more complex general framework presented in Section 2.4, while also providing a precise and tangible interpretation of the informativeness criterion (2.10). 
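As a small numerical complement to this discussion, the informativeness ratio Ξ in (2.10) can be evaluated exactly under the working model (2.11) by summing over its discrete support, avoiding the asymptotic approximation (2.12) altogether. The sketch below does so for a user-specified value of E[uˆ1(X) | Y = 0, D], which is an assumption made purely for illustration.

```python
import numpy as np

def informativeness_ratio(nu, n1, mean_u1_inlier):
    """
    Informativeness ratio Xi of (2.10) under the working model (2.11), i.e.,
    P[u1 = k/(n1+1) | Y=1, D] proportional to k^{-(1-nu)} for k = 1, ..., n1+1.
    mean_u1_inlier plays the role of E[u1(X) | Y=0, D] (supplied by the user).
    """
    k = np.arange(1, n1 + 2)                       # support points k = 1, ..., n1+1
    probs = k ** (-(1.0 - nu))
    probs /= probs.sum()                           # normalization by the harmonic sum
    mean_inv_u1_outlier = np.sum((n1 + 1.0) / k * probs)   # E[1/u1 | Y=1, D]
    return 1.0 / (mean_inv_u1_outlier * mean_u1_inlier)

# Example: exchangeable outliers (nu = 1), n1 = 999 labeled outliers, and weights
# that are as small as possible for inliers, E[u1 | Y=0, D] = 1/(n1 + 1).
n1 = 999
print(informativeness_ratio(nu=1.0, n1=n1, mean_u1_inlier=1.0 / (n1 + 1)))
# Here E[1/u1 | Y=1, D] is the harmonic number H_{n1+1} ≈ 7.5, so Xi ≈ 133 >> 1,
# matching the nu = 1 case of the approximation (2.12).
```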
2.5 Numerical experiments with synthetic data Setup We compute integrative p-values with Algorithm 2 and compare their performance on synthetic data to that of inductive p-values. The latter are calculated based on two families of classifiers: one-class classification models (which look only at the labeled inliers) and binary classification models (which also utilize the labeled outliers). Six one-class models are considered: a support vector machine with four alternative choices of kernel (linear, radial basis function, sigmoid, and polynomial of degree 3), an isolation forest, and a “local outlier factor” nearest neighbor method. These models are taken off-the-shelf, as implemented in the Python package scikit-learn. Further, six binary models are considered: a random forest classifier, a K-nearest neighbor classifier, a support vector classifier, a Gaussian naive Bayes model, a quadratic discriminant analysis model, and a multi-layer perceptron. These are also applied as implemented in scikitlearn, with default parameters. The integrative method has access to all models and automatically selects on a case-by-case basis the most promising one as detailed in Section 2.2.2. The performance of integrative p-values is compared to that of inductive p-values, after giving to the latter the unfair advantage of leveraging the most powerful model from each family, selected by an imaginary oracle. Of course, such oracle is unavailable in real applications, but it provides an informative and challenging benchmark for our experiments. 26 To highlight the distinct practical benefits of the two key ideas of integrative p-values (data-driven weighting and model tuning or selection), a simpler variation of our method is also considered as an additional benchmark. This utilizes the labeled outlier data only to automatically select the most promising model from the available toolbox, as well as to tune any relevant hyper-parameters, but it does not weigh the p-values. In other words, this ensemble benchmark consists of applying Algorithm 2 with rˆ(Xi) = uˆ0(Xi) instead of rˆ(Xi) = ˆu0(Xi)/uˆ1(Xi), and it may be thought of as a practical and theoretically valid approximation of the imaginary oracle. The results will show that the ensemble benchmark is not as powerful as the full integrative method, demonstrating the advantage of leveraging side information by weighting the p-values. By contrast, without the ensemble benchmark, it would unclear whether the power gains obtained with our method can be fully explained by its ability to automatically select the most effective model from the machine learning toolbox. To emphasize the relative advantages of different families of classification models under different data distributions, the ensemble benchmark and the imaginary oracle are typically applied separately with the one-class and binary models. Finally, note that there is an additional subtle but sometimes practically important difference between the imaginary oracle and the ensemble benchmark. The former takes as input the raw conformity scores output by a one-class classifier, while the latter also has the ability to automatically flip their signs to increase power, as explained in Section 2.2.2. Integrative p-values We begin by simulating synthetic data from a model analogous to that utilized by [10]. Each sample Xi ∈ R 1000 is generated from a multivariate Gaussian mixture P a X, such that Xi = √ a Vi + Wi , for some constant a ≥ 0 and appropriate random vectors Vi , Wi ∈ R 1000. 
The inlier samples are obtained with a = 1, and the outliers with a ̸= 1. The vector Vi has independent standard Gaussian components, and each coordinate of Wi is independent and uniformly distributed on a discrete set W ⊆ R 1000 with 27 cardinality |W| = 1000. The vectors in W are sampled independently from the uniform distribution on [−3, 3]1000 prior to the first experiment. The number of labeled inliers and outliers in the simulated data set is varied as a control parameter. All methods are applied to calculate conformal p-values for an independent test set containing 1000 data points, half of which are outliers. Note that, under this setup, the labeled outliers are exchangeable with the test outliers, although this is not required in theory by our method. Analogous results obtained with non-exchangeable outliers will be discussed later. Unless specified otherwise, the performance of each method is evaluated in terms of the empirical FDR and power obtained by applying to the corresponding p-values the BH procedure with Storey’s correction [122, 123], at the nominal 10% level. The results are averaged over 100 independent experiments. All methods are applied using half of the labeled inliers for training and half for calibration. The methods leveraging binary classification models also utilize half of the labeled outliers for training. Inliers: 0.50 Inliers: 0.75 Inliers: 0.90 Power 30 300 3000 30 300 3000 30 300 3000 0.0 0.2 0.4 0.6 0.8 Sample size Method Integrative OCC (ensemble) Binary (ensemble) OCC (oracle) Binary (oracle) Figure 2.3: Performance of Storey’s BH with conformal p-values based on different methods, on simulated data. The results are shown as a function of the sample size, for calibration data with different proportions of inliers. All methods control the FDR below 10%. Figure 2.3 shows the empirical power obtained with conformal p-values computed by different methods, as a function of the sample size and of the fraction of inliers in the data. The data are generated from the model described above with parameter a = 0.7. The integrative and one-class ensemble approaches achieve the highest power, with the former being slightly more powerful when the proportion of outliers in the data is large. The results suggest that the adaptive weighting scheme used to compute integrative p-values tends to be beneficial. This is consistent with the supplementary results reported in Figure A6, 28 which show that the informativeness ratio Ξ in (2.10) is typically greater than one in these experiments. Thus, Figure A6 provides some empirical validation for our theoretical power analysis, even though the technical assumptions of Section 2.4 may not hold exactly in these experiments. Further, Figure A6 also shows that the estimate Ξˆ of Ξ described in Section A4.3 can sometimes be quite accurate, although it is not always so, as we shall see later. Note that all methods considered here produce valid p-values, but FDR control is not theoretically guaranteed by applying (Storey’s) BH to integrative p-values, as explained in Section 2.3. Nonetheless, the results reported in Figure A6 also confirm that all methods compared in Figure 2.3 empirically achieve valid FDR control, consistently with the general practical robustness of BH. It is interesting to observe that, under this data distribution, the one-class classifier separates the two classes effectively but the outliers tend to receive less extreme scores than the inliers. 
This occurs because the model learns a low-dimensional representation of the data that captures well the distribution of inliers but results in unusually concentrated scores for the outliers. Inductive conformal p-values thus become powerless, which explains the poor performance of the one-class oracle. The integrative p-values improve the OCC oracle via an automatic model tuning scheme (cf. the last paragraph of Section 2.2.2). Moreover, it should not be too surprising that all binary classifiers have no power here unless the sample size is large and the class imbalance is small, because the data are high dimensional and intrinsically difficult to separate. The integrative method, which is built upon a pair of OCCs, outperforms the Binary oracle by leveraging the effectiveness of OCC in dealing with high-dimensional imbalanced data. Figure A7 reports on analogous experiments with different sample sizes and proportions of outliers. Figures A8–A9 show results from the same experiments as in Figures 2.3 and A7, respectively, with the difference that now possible outliers in the test data are identified by rejecting the null hypotheses for which the p-values are below the fixed threshold 0.1, instead of seeking to control the FDR. Figure A10 visualizes the empirical distribution on test data of conformal p-values computed with the alternative methods, separately for inliers and outliers, under the same experimental setting as in Figure 2.3. These 29 results indicate all methods produce uniformly distributed p-values for the inliers, as expected, but the integrative method leads to smaller p-values for the outliers. Figure A11 reports on experiments similar to those in Figure 2.3, but based on data in which the outliers in the test set are generated from models with different values of a, and they are thus not exchangeable with the labeled outliers. The results are qualitatively equivalent to those discussed above: all methods control the type-I errors below the nominal level and integrative conformal p-values achieve the highest power, even when the outliers are far from being exchangeable. Figure A12 reports on experiments similar to those in Figure 2.3, with the difference that now the data are simulated with parameter a = 1.25 instead of a = 0.7. Under this distribution, the outlier detection problem becomes intrinsically easier and the outliers naturally tend to receive more extreme one-class conformity scores compared to the inliers, as one would typically expect. Consequently, the one-class oracle benchmark is able to achieve much higher power compared to the previous experiments, although its performance does not surpass that of the integrative method unless the sample size is very small. Figure A13 compares under the same setting as in Figure A12 the performance of integrative conformal p-values to that of inductive conformal p-values based on two alternative one-class classification models, as well as to those based on the most powerful model selected by an imaginary oracle. These results show distinct models may perform very differently depending on the data, which strengthens the practical motivation for our integrative approach and highlights the unfair advantage of the oracle benchmarks. Figures A14–A15 report on similar experiments with data from a different distribution. For each observation, X ∈ R 100 is sampled from a standard multivariate Gaussian with diagonal covariance. 
The conditional distribution of the outlier label Y ∈ {0, 1} given X = x is binomial with weights wk(x) = zk(x)/ P1 k ′=0 zk ′(x), for k ∈ {0, 1}, where zk(x) = exp(β¯+x T βk) and each βk ∈ R p is sampled from an independent centered Gaussian distribution with variance 3. The intercept β¯ is determined by an independent Monte Carlo simulation with binary search to achieve the desired expected proportion of 30 outliers. Under this setting, binary classification models tend to perform better than one-class models, but the integrative method remains competitive with the oracle binary benchmark throughout a wide range of sample sizes and proportions of labeled outliers. Naive model selection yields invalid p-values The integrative method is capable of leveraging outlier data to automatically select the most powerful model from any machine learning toolbox without losing its theoretical guarantees. In fact, different models (e.g., a neural network vs. a random forest, or variations thereof based on diverse hyper-parameters) may exhibit inconsistent relative performances on distinct data sets, and there is no practical way of knowing in advance which one will work best for a particular outlier testing application. Further, flexibility without clear guidance tends to be confusing, and it may even involuntarily encourage dangerous “data snooping” heuristics, such as naively trying different models and then picking the one leading to the discovery of the largest number of likely out-of-distribution samples, for example. Unlike the aforementioned heuristic, which we will demonstrate empirically to yield severely inflated type-I errors, our Algorithm 2 can select near-optimal models while remaining theoretically valid in finite samples regardless of the size or variety of the toolbox of models considered. The strength of our integrative method is emphasized here by the following experiments that demonstrate empirically the invalidity of a naive alternative approach, which consists of computing inductive conformal p-values separately for each model and greedily picking the model reporting the largest number of potential outliers in the test set. These simulations are conducted using independent data samples of size 1000 generated from the same Gaussian mixture model as in Section 2.5. Conformity scores are calculated using the same families of one-class and binary classification models as in Section 2.5, with the only difference that now several independent instances of the isolation forest and random forest classifier models are utilized with different random seeds; the number of models in each family is thus varied as a 31 control parameter and increased up to 100. Figure 2.4 compares the performance of integrative conformal p-values to that of the aforementioned naive benchmark based on one-class or binary models, as a function of the number of models considered. In this experiment, possible outliers in the test data are identified by rejecting the null hypotheses for which the conformal p-values are below the fixed threshold 0.1, and the performance is measured in terms of true positive rate and false positive rate (i.e., the probabilities of rejecting the null hypothesis for a true outlier or a true inlier, respectively). The results demonstrate that the integrative method is more powerful and always controls the type-I errors, unlike the naive benchmark. 
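For concreteness, the Gaussian mixture generator used throughout these simulations (and introduced at the beginning of Section 2.5) can be sketched as follows. The extracted description leaves the construction of Wi slightly ambiguous; the sketch assumes each Wi is drawn uniformly at random from a fixed dictionary of 1000 vectors, itself sampled once from the uniform distribution on [−3, 3]^1000 before the first experiment.

```python
import numpy as np

def sample_gaussian_mixture(n, a, dictionary, rng):
    """
    Synthetic model of Section 2.5: X_i = sqrt(a) * V_i + W_i, with V_i standard
    Gaussian and W_i drawn uniformly from a fixed dictionary of vectors.
    Inliers use a = 1; outliers use a != 1 (e.g., a = 0.7 or a = 1.25).
    """
    p = dictionary.shape[1]
    V = rng.standard_normal(size=(n, p))
    W = dictionary[rng.integers(0, dictionary.shape[0], size=n)]
    return np.sqrt(a) * V + W

rng = np.random.default_rng(0)
p = 1000
dictionary = rng.uniform(-3.0, 3.0, size=(1000, p))   # fixed prior to all experiments
X_inliers = sample_gaussian_mixture(500, a=1.0, dictionary=dictionary, rng=rng)
X_outliers = sample_gaussian_mixture(500, a=0.7, dictionary=dictionary, rng=rng)
```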
Inliers: 0.50 Inliers: 0.75 Inliers: 0.90 TPR FPR 1 10 100 1 10 100 1 10 100 0.0 0.2 0.4 0.6 0.8 0.10 0.12 0.14 Number of models Method Integrative OCC (naive) Binary (naive) Figure 2.4: Performance of integrative conformal p-values and naive benchmarks on simulated data, as a function of the number of machine learning models considered. The performance is measured in terms of true positive rate (TPR) and false positive rate (FPR). The horizontal dashed line corresponds to the nominal 10% FPR level. Other details are as in Figure 2.3. Integrative conformal p-values via TCV+ We investigate the performance of TCV+ integrative p-values using the same Gaussian mixture data and the same families of one-class and binary classifiers as in Section 2.5. The performance of integrative pvalues calculated via 5-fold TCV+ is compared to that of integrative p-values based on sample splitting, and to that of p-values produced by the imaginary oracles described in Section 2.2.3. Each method is evaluated in terms of FDR and power obtained by applying Storey’s BH to the corresponding p-values. Figure 2.5 reports the average results of 100 independent experiments, as a function of the sample size and 32 of the fraction of outliers in the labeled data. TCV+ leads to higher power compared to sample splitting, consistently with its higher data parsimony. Figure A16 confirms that all methods lead to FDR below the nominal 10% level. Inliers: 0.50 Inliers: 0.75 Inliers: 0.90 Power 30 100 300 1000 30 100 300 1000 30 100 300 1000 0.0 0.2 0.4 Sample size Method Integrative (TCV+) Integrative One−Class (oracle, CV+) Binary (oracle, CV+) Figure 2.5: Performance on simulated data of Storey’s BH applied to integrative p-values computed with either TCV+ or sample splitting. To serve as benchmarks, inductive conformal p-values based on one-class or binary classification models tuned by an ideal oracle are computed using cross-validation+. All methods control the FDR below 10%. Other details are as in Figure 2.3. Additional results are in the Supplementary Material. Figure A17 shows results analogous to those in Figure 2.5, although here the performances of different methods are quantified in terms of true positive rate and false positive rate after thresholding the p-values at the fixed level 0.1. Figures A18–A19 report on results similar to those in Figures 2.5 and A17, respectively, from experiments in which the data are generated from the binomial model described in Section 2.2.3 instead of the Gaussian mixture model. FDR control via conditional calibration This section investigates the performance of conditional calibration for FDR control with integrative conformal p-values, as discussed in Section 2.3. The experiments are based on the same data-generating models and the same families of one-class or binary classifiers as in Section 2.5. Integrative conformal p-values are calibrated via sample splitting as in Section 2.5. Then, we aim to control the FDR over a test set of size 10, containing 5 inliers and 5 outliers on average, utilizing three alternative methods: BH [14], the Benjamini-Yekutieli procedure [16], and the conditional calibration method described in Section 2.3. Figure 2.6 reports on these experiments as a function of the number of labeled data points. All methods 33 empirically control the FDR, and BH appears to be the most powerful one, especially if the sample size is moderate. Conditional calibration is almost as powerful as the BH benchmark, and typically outperforms the Benjamini-Yekutieli method. 
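To make the procedure compared here more tangible, the sketch below implements Steps 2 and 3 of the conditional calibration method from Section 2.3, taking as given the integrative p-values and the auxiliary rejection counts R˜i produced by Step 1 (which requires re-running Algorithm 1 with an augmented calibration set and is omitted). Variable names are illustrative.

```python
import numpy as np

def conditional_calibration_reject(pvals, R_tilde, alpha, rng):
    """
    Steps 2-3 of conditional calibration (Section 2.3).
    pvals[i]: integrative p-value of the i-th test point.
    R_tilde[i]: number of BH rejections from Step 1 for the i-th test point.
    Returns the indices of the final rejection set.
    """
    pvals, R_tilde = np.asarray(pvals, dtype=float), np.asarray(R_tilde, dtype=float)
    m = len(pvals)
    # Step 2: preliminary rejections with test-specific thresholds tau_i = alpha * R_tilde_i / m.
    R_plus = np.where(pvals <= alpha * R_tilde / m)[0]
    if all(len(R_plus) >= R_tilde[i] for i in R_plus):
        return R_plus
    # Step 3: prune with independent standard uniform variables.
    eps = rng.uniform(size=m)
    R = 0
    for r in range(1, len(R_plus) + 1):
        if np.sum(eps[R_plus] <= r / R_tilde[R_plus]) >= r:
            R = r
    return R_plus[eps[R_plus] < R / R_tilde[R_plus]]
```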
Figure 2.6: Performance on simulated data (Gaussian mixture and binomial models) of different methods for FDR control with integrative conformal p-values (Benjamini-Hochberg, Benjamini-Yekutieli, and conditional calibration), as a function of the sample size. Other details are as in Figure 2.3.

Validation of the asymptotic power analysis

Section A5.2 and Figures A20–A21 in the Supplementary Material focus further on validating the theoretical power analysis presented in Section 2.4. The results confirm that weighting tends to lead to higher power when the estimated ratio Ξ defined in (2.10) is larger than one. Further, Figures A20–A21 show that the empirical estimate Ξˆ discussed in Section A4.3 can provide a reasonable approximation of Ξ, both when the outliers are exchangeable and when they are not. While these results are encouraging, they should be interpreted with care, bearing in mind that our asymptotic power analysis relies on several technical assumptions that are unlikely to hold exactly in practice. Further, our power analysis does not say that weighting cannot increase power even if Ξ < 1, and indeed this sometimes occurs, as we will observe later on real data.

2.6 Numerical experiments with real data

Animal classification from image data

We consider the Animal-10N resource [117], a data set containing 50,000 images of 10 different species of animals: cat, lynx, jaguar, cheetah, wolf, coyote, chimpanzee, orangutan, hamster, and guinea pig. The images of hamsters and guinea pigs are taken to be the inliers, while the other animals are treated as outliers. Our goal is to compute a p-value for testing the null hypothesis that a new image is an inlier. To avoid the high computational cost of fitting a complex classification model from scratch, we pre-process the 50,000 raw images with a standard pre-trained ResNet50 deep neural network [52], available through the TensorFlow Python package. Each image in Animal-10N is fed to the ResNet50 model and the predicted values in the final hidden layer of the network are utilized as input features for the subsequent analysis; this gives a vector of 2048 pre-processed features for each image. To further reduce the data dimensions, we retain only the top 512 principal components for each image. Note that ResNet50 was trained on over 14 million images in the separate ImageNet [31] data set, which we assume to be independent of Animal-10N. After such pre-processing, the experiments are carried out as in Section 2.5. A subset of 1000 inliers and 1000 outliers is chosen at random to serve as the test set, while the remaining images are randomly assigned to the training and calibration sets. The outlier data are thus fully exchangeable here. The training and calibration sets have the same cardinality and contain equal proportions of inliers; both of the aforementioned quantities are varied. Integrative p-values and inductive conformal p-values tuned by an imaginary oracle are calculated based on the same six one-class and six binary classifiers as in Section 2.5. The performance of the p-values produced by each method is quantified in terms of the power and FDR obtained by applying BH with Storey's correction, at the nominal 10% level. All results are averaged over 100 experiments based on independent choices of the training, calibration, and test sets.
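The image pre-processing described above can be reproduced along the following lines. The exact ResNet50 layer used for feature extraction is not specified here beyond "the final hidden layer", so the sketch assumes the standard Keras backbone with global average pooling, which also yields 2048-dimensional features, followed by PCA with 512 components.

```python
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from sklearn.decomposition import PCA

# Pre-trained ResNet50 without its classification head; global average pooling
# returns a 2048-dimensional feature vector per image.
backbone = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def extract_features(images):
    """images: array of shape (n, 224, 224, 3) with pixel values in [0, 255]."""
    return backbone.predict(preprocess_input(images.astype("float32")), verbose=0)

def reduce_dimension(features, n_components=512):
    """Keep only the top principal components of the (n, 2048) feature matrix."""
    return PCA(n_components=n_components).fit_transform(features)
```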
35 Figure 2.1, previewed in Section 2.1, compares the power achieved by the integrative method to that of the imaginary oracles which compute inductive p-values based on the best-performing model from each class, as a function of the number of inliers and outliers in the training/calibration data. The results show that inductive p-values based on one-class classifiers tend to be more powerful compared to the alternative approaches when the number of labeled outliers is very small, and their absolute performance further improves when the number of inliers is large, as expected. Further, inductive p-values based on binary classifiers perform poorly when the labeled inliers greatly outnumber the outliers, but otherwise they can be preferable to their one-class counterparts. Integrative p-values are able to gather strength from both approaches: their power is relatively high under all settings considered here, and they sometimes outperform both alternatives. The more detailed results reported in Figure A22 confirm that all methods considered in Figure 2.1 lead to FDR below the nominal level. Further, Figure A22 compares the power of integrative p-values to that of the unweighted ensemble benchmark described in Section 2.5. Recall that the ensemble approach applies Algorithm 2 with rˆ(Xi) = ˆu0(Xi) instead of rˆ(Xi) = ˆu0(Xi)/uˆ1(Xi). This comparison thus helps distinguish between the two benefits of integrative p-values, data-driven weighting and automatic model selection. While such results demonstrate that weighting tends to be beneficial, it is interesting to note that the informativeness ratio Ξ defined in (2.10) is always smaller than one in Figure A22. This is an important warning about the limited direct applicability of the theoretical power analysis from Section 2.4, which appears to be sometimes too pessimistic as our weighting method can be beneficial even when Ξ < 1. The competitive practical performance of the integrative method is especially remarkable considering that the imaginary oracle benchmarks enjoy the unfair advantage of leveraging unrealistic knowledge to pick the most powerful model for each data set. The ability to carry out adaptive model selection in a practical way, automatically gathering strength from a broad collection of classifiers, is especially useful given that 36 inductive p-values based on different models sometimes perform very differently, as shown in Figures 2.7 and A23. Inliers: 1000 Inliers: 7824 10 100 1000 10 100 1000 0.0 0.2 0.4 0.6 Number of outliers Power BC Integrative KNN MLP NB QDA RF SVC Figure 2.7: Power of Storey’s BH with conformal p-values based on different machine-learning models, on animal image classification data. Integrative p-values are compared to inductive p-values based on six different binary classification models. Other details are as in Figure 2.1. Software availability Software implementing the methods and analyses described in this chapter is available online at https : //github.com/ZiyiLiang/weighted_conformal_pvalues. 37 Chapter 3 Conformal inference for neural networks trained with early stopping 3.1 Introduction Deep neural networks can detect complex data patterns and leverage them to make accurate predictions in many applications, including computer vision, natural language processing, and speech recognition, to name a few examples. These models can sometimes even outperform skilled humans [115], but they still make mistakes. 
Unfortunately, the severity of these mistakes is compounded by the fact that the predictions computed by neural networks are often overconfident [48], partly due to overfitting [129, 93]. Several training strategies have been developed to mitigate overfitting, including dropout [119], batch normalization [60], weight normalization [108], data augmentation [114], and early stopping [98]; the latter is the focus of this chapter. Early stopping consists of continuously evaluating after each batch of stochastic gradient updates (or epoch) the predictive performance of the current model on hold-out independent data. After a large number of gradient updates, only the intermediate model achieving the best performance on the hold-out data is utilized to make predictions. This strategy is often effective at mitigating overfitting and can produce relatively accurate predictions compared to fully trained models, but it does not fully resolve overconfidence because it does not lead to models with finite-sample guarantees. 38 A general framework for quantifying the predictive uncertainty of any black-box machine learning model is that of conformal inference [135]. The idea is to apply a pre-trained model to a calibration set of hold-out observations drawn at random from the target population. If the calibration data are exchangeable with the test point of interest, the model performance on the calibration set can be translated into statistically rigorous predictive inferences. This framework is flexible and can accommodate different learning tasks, including out-of-distribution testing [116], classification [137], and regression [135]. For example, in the context of classification, conformal inference can give prediction sets that contain the correct label for a new data point with high probability. In theory, the quality of the trained model has no consequence on the average validity of conformal inferences, but it does affect their reliability and usefulness on a caseby-case level. In particular, conformal uncertainty estimates obtained after calibrating an overconfident model may be too conservative for some test cases and too optimistic for others [104]. The goal of this chapter is to combine conformal calibration with standard early stopping training techniques as efficiently as possible, in order to produce more reliable predictive inferences with a finite amount of available data. Achieving high accuracy with deep learning often requires large training sets [88], and conformal inference makes the overall pipeline even more data-intensive. As high-quality observations can be expensive to collect, in some situations practitioners may naturally wonder whether the advantage of having principled uncertainty estimates is worth a possible reduction in predictive accuracy due to fewer available training samples. This concern is relevant because the size of the calibration set cannot be too small if one wants stable and reliable conformal inferences [133, 111]. In fact, very large calibration sets may be necessary to obtain stronger conformal inferences that are valid not only on average but also conditionally on some important individual features; see [137, 102, 40]. This chapter resolves the above dilemma by showing that conformal inferences for deep learning models trained with early stopping can be obtained almost “for free”—without spending more precious data. 
More precisely, we present an innovative method that blends model training with early stopping and conformal calibration using the same hold-out samples, essentially obtaining rigorous predictive inferences at no additional data cost compared to standard early stopping. It is worth emphasizing that this result is not trivial. In fact, naively applying existing conformal calibration methods using the same hold-out samples utilized for early stopping would not lead to theoretically valid inferences, at least not without resorting to very conservative corrections.

The chapter is organized as follows. Section 3.2 develops our conformalized early stopping (CES) method, starting from outlier detection and classification, then addressing regression. Section 3.3 demonstrates the advantages of CES through numerical experiments. Section B1 in the Supplementary Material studies a naive benchmark method which essentially utilizes the same data for early stopping and conformal calibration, simply ignoring the exchangeability issues that our method is designed to overcome. In Section B1, we also derive a theoretical correction that restores the validity of this naive approach, although it turns out to be overly conservative in practice. Additional details and results, including a theoretical analysis of the naive benchmark mentioned above, can be found in the Appendices, along with all mathematical proofs.

Related Work

Conformal inference [109, 136, 135] has become a very rich and active area of research [74, 75, 72, 6]. Many prior works have studied the computation of efficient conformal inferences starting from pre-trained black-box models, including for example in the context of outlier detection [116, 47, 10, 82], classification [137, 53, 104, 3, 9], and regression [135, 75, 103]. Other works have studied the general robustness of conformal inferences to distribution shifts [130, 112] and, more broadly, to failures of the data exchangeability assumption [4, 45]. Our research is orthogonal, as we look inside the black-box model and develop a novel early-stopping training technique that is naturally integrated with conformal calibration. Nonetheless, the proposed method could be combined with those described in the aforementioned works. Other recent research has explored different ways of bringing conformal inference into the learning algorithms [28, 12, 125, 35], and some of those works apply standard early stopping techniques, but they do not address our problem.

This chapter is related to [141], which proposed a general theoretical adjustment for conformal inferences computed after model selection. That method could be utilized to account for early stopping without further data splits, as explained in Section 3.2.1. However, we will demonstrate that even an improved version of such an analysis remains overly conservative in the context of model selection via early stopping, and the alternative method developed in this chapter performs much better in practice. Our solution is inspired by Mondrian conformal inference [137] as well as by the integrative conformal method of [82]. The latter deals with the problem of selecting the best model from an arbitrary machine learning toolbox to obtain the most powerful conformal p-values for outlier testing. The idea of [82] extends naturally to the early stopping problem in the special cases of outlier detection and classification, but the regression setting requires substantial technical innovations.
The work of [82] is also related to [87], although the latter is more distant from this chapter because it focuses on theoretically controlling the false discovery rate [14] in multiple testing problems. Finally, this chapter draws inspiration from [64], which shows that models trained with bootstrap (or bagging) techniques can also lead to valid conformal inferences essentially for free. 3.2 Conformalized early stopping 3.2.1 Existing benchmarks Consider n data points, Zi for i ∈ D = [n] = {1, . . . , n}, sampled exchangeably (e.g., i.i.d.) from an unknown distribution PZ with support on some space Z. Consider also an additional test sample, Zn+1. In the context of outlier detection, one wishes to test whether Zn+1 was sampled exchangeably from PZ. 41 In classification or regression, one can write Zi = (Xi , Yi), where Xi is a feature vector while Yi is a discrete category or a continuous response, and the goal is to predict the unobserved value of Yn+1 given Xn+1 and the data in D. The standard pipeline begins by randomly splitting the data in D into three disjoint subsets: Dtrain, Des, Dcal ⊂ [n]. The samples in Dtrain are utilized to train a model M via stochastic gradient descent, in such a way as to (approximately) minimize the desired loss L, while the observations in Des and Dcal are held out. We denote by Mt the model learnt after t epochs of stochastic gradient descent, for any t ∈ [t max], where t max is a pre-determined maximum number of epochs. For simplicity, L is assumed to be an additive loss, in the sense that its value calculated on the training data after t epochs is Ltrain(Mt) = P i∈Dtrain ℓ(Mt ;Zi), for some appropriate function ℓ. For example, a typical choice for regression would be the squared-error loss: ℓ(Mt ;Zi) = [Yi − µˆ(Xi ; Mt)]2 , where µˆ(Xi ; Mt) indicates the value of the regression function at Xi , as estimated by Mt . Similarly, the loss evaluated on Des is denoted as Les(Mt) = P i∈Des ℓ(Mt ;Zi). After training for t max epochs, early stopping selects the model Mˆ es that minimizes the loss on Des: Mˆ es = arg minMt : 0≤t≤tmax Les(Mt). Conformal calibration of Mˆ es is then conducted using the independent hold-out data set Dcal, as sketched in Figure 3.1 (a). This pipeline requires a three-way data split because: (i) Dtrain and Des must be disjoint to ensure the early stopping criterion is effective at mitigating overfitting; and (ii) Dcal must be disjoint from Dtrain ∪ Des to ensure the performance of the selected model Mˆ es on the calibration data gives us an unbiased preview of its future performance at test time, enabling valid conformal inferences. The Limitations of a Naive Benchmark An intuitive alternative to the standard approach described above is to naively perform standard conformal inference using the same hold-out samples utilized for early stopping, as sketched in Figure 3.2. This approach, which we call the naive benchmark, may seem appealing as it avoids a three-way data split, 42 data set D Dtrain candidate models Des Dcal selected model test point conformal inference (a) Conformal inference after early stopping. data set D Dtrain candidate models Des-cal selected model test point conformal inference (b) Conformalized early stopping (CES). Figure 3.1: Schematic visualization of rigorous conformal inferences for models trained with early stopping. (a) Conventional pipeline requiring a three-way sample split. (b) Conformalized early stopping, requiring only a two-way split. but it does not provide rigorous inferences. 
In fact, the necessary exchangeability with the test point is broken if the same hold-out data are used twice—first to evaluate the early stopping criterion and then to perform conformal calibration. In principle, the issue could be corrected by applying a conservative adjustment to the nominal significance level of the conformal inferences, as studied by [141] and reviewed in Appendix B1. However, this leads to overly conservative inferences in practice when applied with the required theoretical correction, as demonstrated by the numerical experiments summarized in Figure 3.3, even if a tighter adjustment developed in Appendix B1 is utilized instead of that of [141]. Intuitively, the problem is that there tend to be complicated dependencies among the candidate models provided to the early stopping algorithm, but the available analyses are not equipped to handle such intricacies and must therefore take a pessimistic viewpoint of the model selection process. Thus, the naive benchmark remains unsatisfactory, although it can serve as an informative benchmark for the novel method developed in this chapter. Interestingly, we will see empirically that the naive benchmark applied without the required theoretical corrections often performs similarly to the rigorous method proposed in this chapter, especially for large data sets. 43 data set D Dtrain candidate models Des-cal selected model test point conformal inference Figure 3.2: Schematic visualization of heuristic conformal inferences based on a naive benchmark that utilizes the same hold-out data twice. Coverage (marginal) Coverage (conditional) Width 200 500 1000 2000 200 500 1000 2000 200 500 1000 2000 16 17 18 19 20 21 0.7 0.8 0.9 1.0 0.7 0.8 0.9 1.0 Sample size Method CES Naive + theory Data splitting Full training Figure 3.3: Average performance, as a function of the sample size, of conformal inferences based on neural networks trained and calibrated with different methods, on the bio regression data [95]. Ideally, the coverage of the conformal prediction intervals should be close to 90% and their width should be small. All methods shown here guarantee 90% marginal coverage. 3.2.2 CES for Outlier Detection Consider testing whether Zn+1, is an inlier, in the sense that it was sampled from PZ exchangeably with the data in D. Following the notation of Section 3.2.1, consider a partition of D into two subsets, Dtrain and Des-cal, chosen at random independently of everything else, such that D = Dtrain ∪ Des-cal. The first step of CES consists of training a deep one-class classifier M using the data in Dtrain via stochastic gradient descent for t max epochs, storing all parameters characterizing the intermediate model after each τ epochs. We refer to τ ∈ [t max] as the storage period, a parameter pre-defined by the user. Intuitively, a smaller τ increases the memory cost of CES but may also lead to the selection of a more accurate model. While the memory cost of this approach is higher compared to that of standard early-stopping training techniques, which only require storing one model at a time, it is not prohibitively expensive. In fact, the candidate 44 models do not need to be kept in precious RAM memory but can be stored on a relatively cheap hard drive. As reasonable choices of τ may typically be in the order of T = ⌊t max/τ ⌋ ≈ 100, the cost of CES is not excessive in many real-world situations. 
For example, it takes approximately 100 MB to store a pretrained standard ResNet50 computer vision model, implying that CES would require approximately 10 GB of storage in such applications—today this costs less than $0.25/month in the cloud. After pre-training and storing T candidate models, namely Mt1 , . . . , MtT for some sub-sequence (t1, . . . , tT ) of [t max], the next step is to select the appropriate early-stopped model based on the hold-out data in Des-cal as well as the test point Zn+1. Following the notation of Section 3.2.1, define the value of the one-class classification loss L for model Mt , for any t ∈ [T], evaluated on Des-cal as: Les-cal(Mt) = P i∈Des-cal ℓ(Mt ;Zi). Further, for any z ∈ Z, define also L +1 es-cal(Mt , z) as: L +1 es-cal(Mt , z) = Les-cal(Mt) + ℓ(Mt ; z). (3.1) Therefore, L +1 es-cal(Mt , Zn+1) can be interpreted as the cumulative value of the loss function calculated on an augmented hold-out data set including also Zn+1. Then, we select the model Mˆ ces(Zn+1) minimizing L +1 es-cal(Mt , Zn+1): Mˆ ces(Zn+1) = arg min Mtj : 1≤j≤T L +1 es-cal(Mtj , Zn+1). (3.2) Note that the computational cost of evaluating (3.2) is negligible compared to that of training the models. Next, the selected model Mˆ ces(Zn+1) is utilized to compute a conformal p-value [10] to test whether Zn+1 is an inlier. In particular, Mˆ ces(Zn+1) is utilized to compute nonconformity scores Sˆ i(Zn+1) for all samples i ∈ Des-cal ∪ {n + 1}. These scores rank the observations in Des-cal ∪ {n + 1} based on how the one-class classifier Mˆ ces(Zn+1) perceives them to be similar to the training data; by convention, a smaller value of Sˆ i(Zn+1) suggests Zi is more likely to be an outlier. Suitable scores are typically included in the 45 output of standard one-class classification models, such as those provided by the Python library PyTorch. For simplicity, we assume all scores are almost-surely distinct; otherwise, ties can be broken at random by adding a small amount of independent noise. Then, the conformal p-value uˆ0(Zn+1) is given by the usual formula: uˆ0(Zn+1) = 1 + |i ∈ Des-cal : Sˆ i ≤ Sˆ n+1| 1 + |Des-cal| , (3.3) making the dependence of Sˆ i on Zn+1 implicit in the interest of space. This method, outlined by Algorithm 4, gives p-values that are exactly valid in finite samples, in the sense that they are stochastically dominated by the uniform distribution under the null hypothesis. Algorithm 4 Conformalized early stopping for outlier detection 1: Input: Exchangeable data points Z1, . . . , Zn; test point Zn+1. 2: Input: Maximum number of training epochs t max; storage period hyper-parameter τ . 3: Input: One-class classifier trainable via (stochastic) gradient descent. 4: Randomly split the exchangeable data points into Dtrain and Des-cal. 5: Train the one-class classifier for t max epochs and save the intermediate models Mt1 , . . . , MtT . 6: Pick the most promising model Mˆ ces(Zn+1) based on (3.2), using the data in Des-cal ∪ {n + 1}. 7: Compute nonconformity scores Sˆ i(Zn+1) for all i ∈ Des-cal ∪ {n + 1} using model Mˆ ces(Zn+1). 8: Output: Conformal p-value uˆ0(Zn+1) given by (3.3). Theorem 6. Assume Z1, . . . , Zn, Zn+1 are exchangeable random samples, and let uˆ0(Zn+1) be the output of Algorithm 4, as given in (3.3). Then, P [ˆu0(Zn+1) ≤ α] ≤ α for any α ∈ (0, 1). 46 3.2.3 CES for Classification The above CES method will now be extended to deal with K-class classification problems, for any K ≥ 2. 
Consider n exchangeable pairs of observations (Xi , Yi), for i ∈ D = [n], and a test point (Xn+1, Yn+1) whose label Yn+1 ∈ [K] has not yet been observed. The goal is to construct an informative prediction set for Yn+1 given the observed features Xn+1 and the rest of the data, assuming (Xn+1, Yn+1)is exchangeable with the observations indexed by D. An ideal goal would be to construct the smallest possible prediction set with guaranteed feature-conditional coverage at level 1−α, for any fixed α ∈ (0, 1). Formally, a prediction set Cˆ α(Xn+1) ⊆ [K] has feature-conditional coverage at level 1−α if P[Yn+1 ∈ Cˆ α(Xn+1) | Xn+1 = x] ≥ 1 − α, for any x ∈ X , where X is the feature space. Unfortunately, perfect feature-conditional coverage is extremely difficult to achieve unless the feature space X is very small [40]. Therefore, in practice, one must be satisfied with obtaining relatively weaker guarantees, such as label-conditional coverage and marginal coverage. Formally, Cˆ α(Xn+1) has 1 − α label-conditional coverage if P[Yn+1 ∈ Cˆ α(Xn+1) | Yn+1 = y] ≥ 1 − α, for any y ∈ [K], while marginal coverage corresponds to P[Yn+1 ∈ Cˆ α(Xn+1)] ≥ 1 − α. Label-conditional coverage is stronger than marginal coverage, but both criteria are useful because the latter is easier to achieve with smaller (and hence more informative) prediction sets. We begin by focusing on label-conditional coverage, as this follows most easily from the results of Section 3.2.2. This solution will be extended in Appendix B2 to target marginal coverage. The first step of CES consists of randomly splitting D into two subsets, Dtrain and Des-cal, as in Section 3.2.2. The samples in Des-cal are further divided into subsets D y es-cal with homogeneous labels; that is, D y es-cal = {i ∈ Des-cal : Yi = y} for each y ∈ [K]. The data in Dtrain are utilized to train a neural network classifier via stochastic gradient descent, storing the intermediate candidate models Mt after each τ epochs. This is essentially the same approach as in Section 3.2.2, with the only difference being that the neural network is now designed to perform K-class classification rather than one-class classification. Therefore, this neural network should have a soft-max layer with K nodes near its output, whose values corresponding to an input data point 47 with features x are denoted as πˆy(x), for all y ∈ [K]. Intuitively, we will interpret πˆy(x) as approximating (possibly inaccurately) the true conditional data-generating distribution; i.e., πˆy(x) ≈ P [Y = y | X = x]. For any model Mt , any x ∈ X , and any y ∈ [K], define the augmented loss L +1 es-cal(Mt , x, y) as: L +1 es-cal(Mt , x, y) = Les-cal(Mt) + ℓ(Mt ; x, y). (3.4) Concretely, a typical choice for ℓ is the cross-entropy loss: ℓ(Mt ; x, y) = − log ˆπ t y (x), where πˆ t denotes the soft-max probability distribution estimated by model Mt . Intuitively, L +1 es-cal(Mt , x, y) is the cumulative value of the loss function calculated on an augmented hold-out data set including also the imaginary test sample (x, y). Then, for any y ∈ [K], CES selects the model Mˆ ces(Xn+1, y) minimizing L +1 es-cal(Mt , Xn+1, y) among the T stored models: Mˆ ces(Xn+1, y) = arg min Mtj : 1≤j≤T L +1 es-cal(Mtj , Xn+1, y). (3.5) The selected model Mˆ ces(Xn+1, y) is then utilized to compute a conformal p-value for testing whether Yn+1 = y. In particular, we compute nonconformity scores Sˆy i (Xn+1) for all i ∈ Dy es-cal ∪ {n + 1}, imagining that Yn+1 = y. 
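Abstracting away the model-selection step for a moment, this label-conditional calibration can be sketched in a few lines of code. The snippet is only an illustration: the estimated class probabilities are synthetic stand-ins for soft-max outputs, and it uses the simple score given by the estimated probability of the candidate label (so that smaller scores are less conforming) rather than the more refined scores discussed next.

import numpy as np

rng = np.random.default_rng(4)
K, alpha, n_cal = 4, 0.1, 400

# Synthetic hold-out labels and (hypothetical) estimated class probabilities, playing the role of
# the soft-max outputs of the model selected for each candidate label y.
Y_cal = rng.integers(0, K, size=n_cal)
prob_cal = rng.dirichlet(np.ones(K), size=n_cal)   # estimated probabilities at the hold-out points
prob_test = rng.dirichlet(np.ones(K))              # estimated probabilities at the test point X_{n+1}

prediction_set = []
for y in range(K):
    idx = np.where(Y_cal == y)[0]                  # D^y_es-cal: hold-out points with label y
    # Simple scores: the estimated probability of label y, so smaller values are less conforming.
    s_cal = prob_cal[idx, y]
    s_test = prob_test[y]                          # score of the test point, imagining Y_{n+1} = y
    p_y = (1.0 + np.sum(s_cal <= s_test)) / (1.0 + len(idx))   # conformal p-value as in (3.6)
    if p_y >= alpha:                               # include y in the set (3.7) if not rejected
        prediction_set.append(y)

print("prediction set:", prediction_set)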
Different types of nonconformity scores can be easily accommodated, but in this chapter, we follow the adaptive strategy of [104]. The computation of these nonconformity scores based on the selected model Mˆ ces is reviewed in Appendix B4. Here, we simply note the p-value is given by: uˆy(Xn+1) = 1 + |i ∈ Dy es-cal : Sˆy i ≤ Sˆy n+1| 1 + |Dy es-cal| , (3.6) 48 again making the dependence of Sˆy i on Xn+1 implicit. Finally, the prediction set Cˆ α(Xn+1) is constructed by including all possible labels for which the corresponding null hypothesis cannot be rejected at level α: Cˆ α(Xn+1) = {y ∈ [K] : ˆuy(Xn+1) ≥ α} . (3.7) This method, outlined by Algorithm 5, guarantees label-conditional coverage at level 1 − α. Algorithm 5 Conformalized early stopping for multi-class classification 1: Input: Exchangeable data points (X1, Y1), . . . ,(Xn, Yn) with labels Yi ∈ [K]. 2: Input: Test point with features Xn+1. Desired coverage level 1 − α. 3: Input: Maximum number of training epochs t max; storage period hyper-parameter τ . 4: Input: K-class classifier trainable via (stochastic) gradient descent. 5: Randomly split the exchangeable data points into Dtrain and Des-cal. 6: Train the K-class classifier for t max epochs and save the intermediate models Mt1 , . . . , MtT . 7: for y ∈ [K] do 8: Define D y es-cal = {i ∈ Des-cal : Yi = y} and imagine Yn+1 = y. 9: Pick the model Mˆ ces(Xn+1, y) according to (3.5), using the data in Des-cal ∪ {n + 1}. 10: Compute scores Sˆy i (Xn+1), ∀i ∈ Dy es-cal ∪ {n + 1}, using Mˆ ces(Xn+1, y); see Appendix B4. 11: Compute the conformal p-value uˆy(Xn+1) according to (3.6). 12: end for 13: Output: Prediction set Cˆ α(Xn+1) given by (3.7). Theorem 7. Assume (X1, Y1), . . . ,(Xn+1, Yn+1) are exchangeable, and let Cˆ α(Xn+1) be the output of Algorithm 5, as given in (3.7), for any given α ∈ (0, 1). Then, P[Yn+1 ∈ Cˆ α(Xn+1) | Yn+1 = y] ≥ 1 − α for any y ∈ [K]. 49 3.2.4 CES for Regression This section extends CES to regression problems with a continuous outcome. As in the previous sections, consider a data set containing n exchangeable observations (Xi , Yi), for i ∈ D = [n], and a test point (Xn+1, Yn+1) with a latent label Yn+1 ∈ R. The goal is to construct a reasonably narrow prediction interval Cˆ α(Xn+1) for Yn+1 that is guaranteed to have marginal coverage above some level 1 − α, i.e., P[Yn+1 ∈ Cˆ α(Xn+1)] ≥ 1 − α, and can also practically achieve reasonably high feature-conditional coverage. Developing a CES method for this problem is more difficult compared to the classification case studied in Section 3.2.3 due to the infinite number of possible values for Yn+1. In fact, a naive extension of Algorithm 5 would be computationally unfeasible in the regression setting, for the same reason why full-conformal prediction [135] is generally impractical. The novel solution described below is designed to leverage the particular structure of an early stopping criterion based on the squared-error loss evaluated on hold-out data. Focusing on the squared-error loss makes CES easier to implement and explain using classical absolute residual nonconformity scores [135, 73]. However, similar ideas could also be repurposed to accommodate other scores, such as those based on quantile regression [103], conditional distributions [26, 61], or conditional histograms [113]. As usual, we randomly split D into Dtrain and Des-cal. The data in Dtrain are utilized to train a neural network via stochastic gradient descent, storing the intermediate models Mt after each τ epoch. 
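This checkpointing step, which is shared by all versions of CES, can be implemented with a few lines of PyTorch. The following is a minimal sketch with a toy model and synthetic data; the checkpoint directory and file names are of course arbitrary.

import os
import torch
from torch import nn

torch.manual_seed(0)
X_train = torch.randn(500, 5)                      # toy training features
Y_train = X_train[:, 0] + 0.5 * torch.randn(500)   # toy training responses

model = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

t_max, tau = 1000, 10             # train for t_max epochs, storing a checkpoint every tau epochs
ckpt_dir = "ces_checkpoints"      # arbitrary folder holding the T = t_max / tau candidate models
os.makedirs(ckpt_dir, exist_ok=True)

checkpoints = []
for epoch in range(1, t_max + 1):
    optimizer.zero_grad()
    loss = ((model(X_train).squeeze() - Y_train) ** 2).mean()
    loss.backward()
    optimizer.step()
    if epoch % tau == 0:
        path = os.path.join(ckpt_dir, f"model_epoch_{epoch}.pt")
        torch.save(model.state_dict(), path)   # candidate model M_t, kept on disk rather than in RAM
        checkpoints.append(path)

print(f"stored T = {len(checkpoints)} candidate models in {ckpt_dir}/")

# Later, any stored candidate can be reloaded for the test-point-dependent selection step.
model.load_state_dict(torch.load(checkpoints[0]))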
The approach is similar to those in Sections 3.2.2–3.2.3, although now the output of a model Mt applied to a sample with features x is denoted by µˆt(x) and is designed to approximate (possibly inaccurately) the conditional mean of the unknown data-generating distribution; i.e., µˆt(x) ≈ E [Y | X = x]. (Note that we will omit the superscript t unless necessary to avoid ambiguity). For any model Mt and any x ∈ X , y ∈ R, define L +1 es-cal(Mt , x, y) = Les-cal(Mt) + [y − µˆt(x)]2 . (3.8) 50 Consider now the following optimization problem, Mˆ ces(Xn+1, y) = arg min Mtj : 1≤j≤T L +1 es-cal(Mtj , Xn+1, y), (3.9) which can be solved simultaneously for all y ∈ R thanks to the amenable form of (3.8). In fact, each L +1 es-cal(Mt , x, y)is a simple quadratic function of y; see the sketch in Figure 3.4. This implies Mˆ ces(Xn+1, y) is a step function, whose parameters can be computed at cost O(T log T) with an efficient divide-andconquer algorithm designed to find the lower envelope of a family of parabolas [33, 92]; see Appendix B3. y L+1 es-cal(Mt,x,y) M1 M2 M3 �! �" �# Figure 3.4: Squared-error loss on test-augmented hold-out data for three alternative regression models M1, M2 and M3, as a function of the place-holder outcome y for the test point. The CES method utilizes the best model for each possible value of y, which is identified by the lower envelope of these three parabolas. In this case, the lower envelope has two finite knots at k1 and k3. Therefore, Mˆ ces(Xn+1, y) has L distinct steps, for some L = O(T log T) that may depend on Xn+1, and it can be written as a function of y as: Mˆ ces(Xn+1, y) = X L l=1 ml(Xn+1)Iy ∈ (kl−1, kl ], (3.10) where ml(Xn+1) ∈ [T]represents the best model selected within the interval(kl−1, kl ]such that ml(Xn+1) ̸= ml−1(Xn+1) for all l ∈ [L]. Above, k1 ≤ k2 ≤ · · · ≤ kL denote the knots of Mˆ ces(Xn+1, y), which also 51 depend on Xn+1 and are defined as the boundaries in the domain of y between each consecutive pair of steps, with the understanding that k0 = −∞ and kL+1 = +∞. Then, for each step l ∈ [L], let Bl indicate the interval Bl = (kl−1, kl ] and, for all i ∈ Des-cal, evaluate the nonconformity score Sˆ i(Xn+1, Bl) for observation (Xi , Yi) based on the regression model indicated by ml(Xn+1); i.e., Sˆ i(Xn+1, Bl) = |Yi − µˆml(Xn+1) (Xi)|. (3.11) Let Qˆ 1−α(Xn+1, Bl) denote the ⌈(1 − α)(1 + |Des-cal|)⌉-th smallest value among all nonconformity scores Sˆ i(Xn+1, Bl), assuming for simplicity that there are no ties; otherwise, ties can be broken at random. Then, define the interval Cˆ α(Xn+1, Bl) as that obtained by applying the standard conformal prediction method with absolute residual scores based on the regression model µˆml(Xn+1) (Xn+1): Cˆ α(Xn+1, Bl) = ˆµml(Xn+1) (Xn+1) ± Qˆ 1−α(Xn+1, Bl). (3.12) Finally, the prediction interval Cˆ α(Xn+1) is given by: Cˆ α(Xn+1) = Convex ∪ L l=1{Bl ∩ Cˆ α(Xn+1, Bl)} , (3.13) where Convex(·) denotes the convex hull of a set. This procedure is summarized in Algorithm 6 and it is guaranteed to produce prediction sets with valid marginal coverage. 52 Algorithm 6 Conformalized early stopping for regression 1: Input: Exchangeable data points (X1, Y1), . . . ,(Xn, Yn) with outcomes Yi ∈ R. 2: Input: Test point with features Xn+1. Desired coverage level 1 − α. 3: Input: Maximum number of training epochs t max; storage period hyper-parameter τ . 4: Input: Regression model trainable via (stochastic) gradient descent. 5: Randomly split the exchangeable data points into Dtrain and Des-cal. 
6: Train the regression model for t max epochs and save the intermediate models Mt1 , . . . , MtT . 7: Evaluate Mˆ ces(Xn+1, y) as in (3.10), using Algorithm B5. 8: Partition the domain of Y into L intervals Bl , for l ∈ [L], based on knots of Mˆ ces(Xn+1, y). 9: for l ∈ [L] do 10: Evaluate nonconformity scores Sˆ i(Xn+1, Bl) for all i ∈ Des-cal as in (3.11). 11: Compute Qˆ 1−α(Xn+1, Bl): the ⌈(1 − α)(1 + |Des-cal|)⌉-th smallest value among Sˆ i(Xn+1, Bl). 12: Construct the interval Cˆ α(Xn+1, Bl) according to (3.12). 13: end for 14: Output: Prediction interval Cˆ α(Xn+1) given as a function of {Cˆ α(Xn+1, Bl)} L l=1 by (3.13). Theorem 8. Assume (X1, Y1), . . . ,(Xn+1, Yn+1) are exchangeable, and let Cˆ α(Xn+1) be the output of Algorithm 6, as given by (3.13), for any given α ∈ (0, 1). Then, P[Yn+1 ∈ Cˆ α(Xn+1)] ≥ 1 − α. The intuition behind the above method is as follows. Each intermediate interval Cˆ α(Xn+1, Bl), for l ∈ [L], may be thought of as being computed by applying, under the null hypothesis that Yn+1 ∈ Bl , the classification method from Section 3.2.3 for a discretized version of our problem based on the partition {Bl} L l=1. Then, leveraging the classical duality between confidence intervals and p-values, it becomes clear that taking the intersection of Bl and Cˆ α(Xn+1, Bl) essentially amounts to including the “label” Bl in the output prediction if the null hypothesis Yn+1 ∈ Bl cannot be rejected. The purpose of the final convex hull operation is to generate a contiguous prediction interval, which is what we originally stated to seek. 53 One may intuitively be concerned that this method may output excessively wide prediction interval if the location of {Bl ∩ Cˆ α(Xn+1, Bl)} is extremely large in absolute value. However, our numerical experiments will demonstrate that, as long as the number of calibration data points is not too small, the selected models in general provide reasonably concentrated predictions around the true test response regardless of the placeholder value y. Therefore, the interval Cˆ α(Xn+1, Bl) tends to be close to the true y even if Bl is far away, in which case Cˆ α(Xn+1, Bl) ∩ Bl = ∅ does not expand the final prediction interval Cˆ α(Xn+1). Although it is unlikely, Algorithm 6 may sometimes produce an empty set, which is an uninformative and potentially confusing output. A simple solution consists of replacing any empty output with the naive conformal prediction interval computed by Algorithm B3 in Appendix B1, which leverages an earlystopped model selected by looking at the original calibration data set without the test point. This approach is outlined by Algorithm B7 in Appendix B4.1. As the intervals given by Algorithm B7 always contain those output by Algorithm 6, it follows that Algorithm B7 also enjoys guaranteed coverage; see Corollary B3. CES for Quantile Regression The CES method for regression described in Section 3.2.4 relies on classical nonconformity scores [135, 73] that are not designed to deal efficiently with heteroscedastic data [103, 111]. However, the idea can be extended to accommodate other nonconformity scores, including those based on quantile regression [103], conditional distributions [61, 26], or conditional histograms [113]. The reason why we have so far focused on the classical absolute residual scores is that they are more intuitive to apply in conjunction with an early stopping criterion based on the squared-error loss. In this section, we extend CES to the conformalized quantile regression (CQR) method of [103]. 
A review of the CQR method is provided in Appendix B4. As in the previous section, consider a data set containing n exchangeable observations (Xi , Yi), for i ∈ D = [n], and a test point (Xn+1, Yn+1) with a latent label Yn+1 ∈ R. We first randomly split D into two subsets, Dtrain and Des-cal. The data in Dtrain are utilized to train a neural network quantile regression 54 model [128] by seeking to minimize the pinball loss instead of the squared error loss, for each target level β = βlow and β = βhigh (e.g., βlow = α/2 and βhigh = 1 − α/2). Note that the same neural network, with two separate output nodes, can be utilized to estimate conditional quantiles at two different levels; e.g., as in [103]. For any t ∈ [t max], let Mβ,t denote the intermediate neural network model stored after t epochs of stochastic gradient descent, following the same notation as in Section 3.2.4. For each target level β and any x ∈ X , let qˆβ,t(x) denote the approximate β−th conditional quantile of the unknown conditional distribution P(Y | X = x) estimated by Mβ,t. Similarly to Section 3.2.4, for any model Mβ,t and any x ∈ X , y ∈ R, define the augmented loss evaluated on the calibration data including also a dummy test point (x, y) as: L +1 es-cal(Mβ,t, x, y) = Les-cal(Mβ,t) + L(Mβ,t, x, y) = X i∈Des-cal ρβ(Yi , qˆβ,t(Xi)) + ρβ(y, qˆβ,t(x)), (3.14) where ρβ denotes the pinball loss function defined in (B12). For any model Mβ,t, the augmented loss is equal to a constant plus a convex function of y, namely ρβ(y, qˆβ,t(x)). Therefore, for any fixed x, the quantity in (3.14) can be sketched as a function of Mβ,t and y as shown in Figure 3.5. This is analogous to Figure 3.4 from Section 3.2.4, with the difference that now the quadratic functions have been replaced by piece-wise linear “pinball” functions. After pre-training and storing T candidate models, namely Mβ,t1 , . . . , Mβ,tT for some sub-sequence (t1, . . . , tT ) of [t max], consider the following optimization problem, Mˆ β,ces(Xn+1, y) = arg min Mβ,tj : 1≤j≤T L +1 es-cal(Mβ,tj , Xn+1, y). (3.15) This problem is equivalent to identifying the lower envelope of a family of shifted pinball loss functions, similarly to Section 3.2.4; see Figure 3.5 for a schematic visualization. Again, this lower envelope can 55 y L+1 es-cal(M,t,x,y) M,1 M,2 M,3 !" !! Figure 3.5: Pinball loss functions on test-augmented hold-out data for three alternative regression models, M1, M2 and M3, as a function of the place-holder outcome y for the test point. The CES method utilizes the best model for each possible value of y, which is identified by the lower envelope of these three pinball loss functions. In this case, the lower envelope has a single finite knot at k2. be found at computational cost O(T log T), with the same divide-and-conquer algorithm described in Appendix B3. In particular, Mˆ β,ces(Xn+1, y) is a step function with respect to y with L distinct steps, for some L = O(T log T), and it can be written as: Mˆ β,ces(Xn+1, y) = X L l=1 mβ,l(Xn+1)Iy ∈ (k β l−1 , kβ l ], (3.16) where mβ,l(Xn+1) ∈ [T] represents the best model selected within the interval (k β l−1 , kβ l ] such that mβ,l(Xn+1) ̸= mβ,l−1(Xn+1) for all l ∈ [L]. Above, k β 1 ≤ k β 2 ≤ · · · ≤ k β L denote the knots of Mˆ β,ces(Xn+1, y), which also depend on Xn+1 and are defined as the boundaries in the domain of y between each consecutive pair of steps, with the understanding that k β 0 = −∞ and k β L+1 = +∞; see Figure 3.5 for a schematic visualization. 
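A simple way to visualize the selection rule (3.15)-(3.16), and, with parabolas in place of pinball functions, the analogous rule (3.9)-(3.10), is to evaluate the augmented losses on a fine grid of placeholder values y and record where the minimizing model changes. The sketch below does this with made-up numbers for T = 3 candidate models; it only approximates the knots, unlike the exact divide-and-conquer algorithm of Appendix B3.

import numpy as np

def pinball(y, q, beta):
    """Pinball loss rho_beta(y, q) of a placeholder outcome y against a predicted quantile q."""
    diff = y - q
    return np.where(diff >= 0, beta * diff, (beta - 1.0) * diff)

# Hypothetical ingredients for T = 3 stored models at level beta = 0.05:
beta = 0.05
base_loss = np.array([10.3, 10.0, 10.6])   # L_es-cal(M_beta,t): loss already accumulated on D_es-cal
q_test = np.array([-2.0, 0.0, 2.0])        # each model's predicted quantile at the test point X_{n+1}

# Augmented loss (3.14) as a function of the placeholder outcome y, on a fine grid:
y_grid = np.linspace(-10.0, 10.0, 20001)
aug_loss = base_loss[:, None] + pinball(y_grid[None, :], q_test[:, None], beta)

best = np.argmin(aug_loss, axis=0)         # index of the selected model for every placeholder y

# Approximate knots: grid points where the selected model changes, giving the step function (3.16).
change = np.nonzero(np.diff(best))[0]
print("models on the lower envelope:", list(dict.fromkeys(best.tolist())))
print("approximate knots:", np.round(y_grid[change], 3))

With these particular numbers the lower envelope has a single finite knot, as in the situation sketched in Figure 3.5.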
After computing Mˆ β,ces(Xn+1, y) in (3.15) for both βlow and βhigh, we concatenate the respective knots k low 1 , . . . , klow L1 , khigh 1 , . . . , khigh L2 and sort them into k1 ≤ k2 ≤ kL1+L2 , so that within each interval Bl = (kl−1, kl ] for step l ∈ [L1 + L2], there exist exactly one best model for βlow and exactly one best model for 56 βhigh. Then, for each interval Bl = (kl−1, kl ] associated with step l ∈ [L1+L2], evaluate the nonconformity score Eˆ i(Xn+1, Bl) for all i ∈ Des-cal, based on the regression model indicated by mβlow,l(Xn+1) and mβhigh,l(Xn+1); i.e., Eˆ i(Xn+1, Bl) = max n qˆmβlow,l(Xn+1) (Xi) − Yi , Yi − qˆmβhigh,l(Xn+1) (Xi) o . (3.17) Let Qˆ 1−α(Xn+1, Bl) denote the ⌈(1 − α)(1 + |Des-cal|)⌉-th smallest value among all nonconformity scores Eˆ i(Xn+1, Bl), assuming for simplicity that there are no ties; otherwise, ties can be broken at random. Then, define the interval Cˆ α(Xn+1, Bl) as that obtained by applying the conformal prediction method of [103] with nonconformity scores (3.17) based on the estimated conditional quantiles qˆmβlow,l(Xn+1) (Xn+1) and qˆmβhigh,l(Xn+1) (Xn+1); that is, Cˆ α(Xn+1, Bl) = [ˆqmβlow,l(Xn+1) (Xn+1) − Qˆ 1−α(Xn+1, Bl), qˆmβhigh,l(Xn+1) (Xn+1) + Qˆ 1−α(Xn+1, Bl)]. (3.18) Finally, the output prediction interval Cˆ α(Xn+1) is given by: Cˆ α(Xn+1) = Convex ∪ L l=1{Bl ∩ Cˆ α(Xn+1, Bl)} . (3.19) This procedure, summarized in Algorithm 7, guarantees valid marginal coverage. 57 Algorithm 7 Conformalized early stopping for quantile regression 1: Input: Exchangeable data points (X1, Y1), . . . ,(Xn, Yn) with outcomes Yi ∈ R. 2: Input: Test point with features Xn+1. Desired coverage level 1 − α. 3: Input: Maximum number of training epochs t max; storage period hyper-parameter τ . 4: Input: Trainable quantile regression model with target quantiles [βlow, βhigh]. 5: Randomly split the exchangeable data points into Dtrain and Des-cal. 6: Train for t max epochs and save the intermediate models Mβlow,t1 , . . . , Mβlow,tT , Mβhigh,t1 , . . . , Mβhigh,tT . 7: Evaluate Mˆ βlow,ces(Xn+1, y) and Mˆ βhigh,ces(Xn+1, y) as in (3.16), using Algorithm B6. 8: Partition the domain of Y into L1 + L2 intervals Bl , for l ∈ [L1 + L2], based on the knots of Mˆ βlow,ces(Xn+1, y) and Mˆ βhigh,ces(Xn+1, y). 9: for l ∈ [L1 + L2] do 10: Evaluate nonconformity scores Eˆ i(Xn+1, Bl) for all i ∈ Des-cal as in (3.17). 11: Compute Qˆ 1−α(Xn+1, Bl): the ⌈(1 − α)(1 + |Des-cal|)⌉-th smallest value among Eˆ i(Xn+1, Bl). 12: Construct the interval Cˆ α(Xn+1, Bl) according to (3.18). 13: end for 14: Output: Prediction interval Cˆ α(Xn+1) given as a function of {Cˆ α(Xn+1, Bl)} L l=1 by (3.19). Theorem 9. Assume (X1, Y1), . . . ,(Xn+1, Yn+1) are exchangeable, and let Cˆ α(Xn+1) be the output of Algorithm 7, as given by (3.19), for any given α ∈ (0, 1). Then, P[Yn+1 ∈ Cˆ α(Xn+1)] ≥ 1 − α. Similarly to Section 3.2.4, it is possible (although unlikely) that Algorithm 7 may sometimes produce an empty prediction set. Therefore, we present Algorithm B8 in Appendix B4.1, which extends Algorithm 7 in such a way as to explicitly avoid returning empty prediction intervals. As the intervals given by Algorithm B8 always contain those output by Algorithm 7, it follows from Theorem 9 that Algorithm B8 also enjoys guaranteed coverage; see Corollary B4. 
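To make the last calibration steps of Algorithm 7 concrete, the sketch below implements the per-bin scores (3.17), the per-bin intervals (3.18), and the final convex hull (3.19) with NumPy. It assumes the per-bin selected quantile models have already been evaluated on the hold-out points and at the test point, and all numerical inputs are synthetic placeholders rather than outputs of a trained network.

import numpy as np

rng = np.random.default_rng(0)
alpha, n_cal = 0.1, 300

# Hypothetical hold-out responses and, for each of L bins, the low/high conditional quantiles
# predicted by the models selected within that bin, on the hold-out points and at X_{n+1}.
Y_cal = rng.normal(loc=5.0, scale=1.0, size=n_cal)
bins = [(-np.inf, 4.0), (4.0, 6.0), (6.0, np.inf)]             # intervals B_l between consecutive knots
q_lo_cal = [np.full(n_cal, 3.4), np.full(n_cal, 3.5), np.full(n_cal, 3.3)]
q_hi_cal = [np.full(n_cal, 6.4), np.full(n_cal, 6.5), np.full(n_cal, 6.6)]
q_lo_test = [3.4, 3.5, 3.3]
q_hi_test = [6.4, 6.5, 6.6]

pieces = []
for l, (lo, hi) in enumerate(bins):
    # CQR nonconformity scores (3.17) for the models selected within bin B_l.
    scores = np.maximum(q_lo_cal[l] - Y_cal, Y_cal - q_hi_cal[l])
    k = int(np.ceil((1 - alpha) * (n_cal + 1)))                 # ceil((1 - alpha)(1 + |D_es-cal|))
    q_hat = np.sort(scores)[k - 1]
    # Per-bin prediction interval (3.18), then its intersection with B_l.
    c_lo, c_hi = q_lo_test[l] - q_hat, q_hi_test[l] + q_hat
    piece = (max(c_lo, lo), min(c_hi, hi))
    if piece[0] <= piece[1]:
        pieces.append(piece)

# Final prediction interval (3.19): convex hull of the non-empty intersections.
C_hat = (min(p[0] for p in pieces), max(p[1] for p in pieces)) if pieces else None
print("prediction interval:", C_hat)   # an empty output would be handled as in Algorithm B8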
58 3.2.5 Implementation Details and Computational Cost Beyond the cost of training the neural network (which is relatively expensive but does not need to be repeated for different test points) and the storage cost associated with saving the candidate models (which we have argued to be feasible in many applications), CES is quite computationally efficient. Firstly, CES treats all test points individually and could process them in parallel, although many operations do not need to be repeated. In particular, one can recycle the evaluation of the calibration loss across different test points; e.g., see (3.4). Thus, the model selection component can be easily implemented at cost O((nes-cal + ntest) · T + ntest · nes-cal) for classification (of which outlier detection is a special case) and O((nes-cal + ntest) · T · log T + ntest · nes-cal) for regression, where nes-cal = |Des-cal| and T is the number of candidate models. Note that the T · log T dependence in the regression setting comes from the divide-and-conquer algorithm explained in Appendix B3. It is possible that the cost of CES may become a barrier in some applications, particularly if T is very large, despite the slightly more than linear scaling. Hence, we recommend employing moderate values of T (e.g., 100 or 1000). 3.3 Numerical Experiments with real data Outlier Detection The use of CES for outlier detection is demonstrated using the CIFAR10 data set [67], a collection of 60,000 32-by-32 RGB images from 10 classes including common objects and animals. A convolutional neural network with ReLU activation functions is trained on a subset of the data to minimize the cross-entropy loss. The maximum number of epochs is set to be equal to 50. The trained classification model is then utilized to compute conformity scores for outlier detection with the convention that cats are inliers and the other classes are outliers. In particular, a nonconformity score for each Zn+1 is defined as 1 minus 59 the output of the soft-max layer corresponding to the label “cat”. This can be interpreted as an estimated probability of Zn+1 being an outlier. After translating these scores into a conformal p-value uˆ0(Zn+1), the null hypothesis that Zn+1 is a cat is rejected if uˆ0(Zn+1) ≤ α = 0.1. The total number of samples utilized for training, early stopping, and conformal calibration is varied between 500 and 2000. In each case, CES is applied using 75% of the samples for training and 25% for early stopping and calibration. Note that the calibration step only utilizes inliers, while the other data subsets also contain outliers. The empirical performance of CES is measured in terms of the probability of falsely rejecting a true null hypothesis—the false positive rate (FPR)—and the probability of correctly rejecting a false null hypothesis—the true positive rate (TPR). The CES method is compared to three benchmarks. The first benchmark is naive early stopping with the best (hybrid) theoretical correction for the nominal coverage level described in Appendix B1. The second benchmark is early stopping based on data splitting, which utilizes 50% of the available samples for training, 25% for early stopping, and 25% for calibration. The third benchmark is full training without early stopping, which simply selects the model obtained after the last epoch. The test set consists of 100 independent test images, half of which are outliers. All results are averaged over 100 trials based on independent data subsets. 
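For reference, this evaluation can be emulated in a few lines of NumPy starting from pre-computed scores. In the sketch below the scores are synthetic stand-ins: each score plays the role of the estimated probability that an image is an inlier, so that smaller values look more outlier-like and formula (3.3) applies directly.

import numpy as np

rng = np.random.default_rng(1)
alpha = 0.1

# Synthetic conformity scores: smaller values look more outlier-like, matching the convention of (3.3).
cal_scores = rng.beta(5, 2, size=250)                        # calibration scores (inliers only)
test_inlier = rng.beta(5, 2, size=50)                        # test scores for 50 true inliers
test_outlier = rng.beta(2, 5, size=50)                       # test scores for 50 true outliers
test_scores = np.concatenate([test_inlier, test_outlier])
is_outlier = np.concatenate([np.zeros(50, bool), np.ones(50, bool)])

def conformal_pvalue(s_test, s_cal):
    """Conformal p-value (3.3): (1 + #{calibration scores <= test score}) / (1 + n_cal)."""
    return (1.0 + np.sum(s_cal <= s_test)) / (1.0 + len(s_cal))

pvals = np.array([conformal_pvalue(s, cal_scores) for s in test_scores])
reject = pvals <= alpha            # reject the null hypothesis that Z_{n+1} is an inlier

fpr = reject[~is_outlier].mean()   # false positive rate: inliers incorrectly flagged
tpr = reject[is_outlier].mean()    # true positive rate: outliers correctly flagged
print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")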
Figure 3.6: Average performance, as a function of the sample size, of conformal inferences for outlier detection based on neural networks trained and calibrated with different methods, on the CIFAR10 data [67]. Ideally, the TPR should be as large as possible while maintaining the FPR below 0.1. Figure 3.6 summarizes the performance of the four methods as a function of the total sample size. All methods control the FPR below 10%, as expected, but CES achieves the highest TPR. The increased power of CES compared to data splitting is not surprising, as the latter relies on a less accurate model trained on less data. By contrast, the naive benchmark trains a model more similar to that of CES, but its TPR is not as high because the theoretical correction for the naive conformal p-values is overly pessimistic. Finally, full training is the least powerful competitor for large sample sizes because its underlying model becomes more and more overconfident as the training set grows. Multi-class Classification The same CIFAR10 data [67] are utilized to demonstrate the performance of CES for a 10-class classification task. These experiments are conducted similarly to those in Section 3.3, with the difference that now the soft-max output of the convolutional neural network is translated into conformal prediction sets, as explained in Appendix B4, instead of conformal p-values. The CES method is compared to the same three benchmarks from Section 3.3. All prediction sets guarantee 90% marginal coverage, and their performances are evaluated based on average cardinality. Figure 3.7: Average performance, as a function of the sample size, of conformal prediction sets for multi-class classification based on neural networks trained and calibrated with different methods, on the CIFAR10 data [67]. Ideally, the coverage should be close to 90% and the cardinality should be small. Figure 3.7 summarizes the results averaged over 100 independent realizations of these experiments. While all approaches always achieve the nominal coverage level, the CES method is able to do so with the smallest, and hence most informative, prediction sets. As before, the more disappointing performance of the data splitting benchmark can be explained by the more limited amount of data available for training, that of the naive benchmark by the excessive conservativeness of its theoretical correction, and that of the full training benchmark by overfitting. Regression We now apply the CES method from Section 3.2.4 to the following three public-domain regression data sets from the UCI Machine Learning repository [97]: physicochemical properties of protein tertiary structure (bio) [95], hourly and daily counts of rental bikes (bike) [18], and concrete compressive strength (concrete) [30]. These data sets were previously also considered by [103], to which we refer for further details. As in the previous sections, we compare CES to the usual three benchmarks: naive early stopping with the hybrid theoretical correction for the nominal coverage level, early stopping based on data splitting, and full model training without early stopping.
All methods utilize the same neural network with two hidden layers of width 128 and ReLU activation functions, trained for up to 1000 epochs. The models are calibrated in such a way as to produce conformal prediction sets with guaranteed 90% marginal coverage for a test set of 100 independent data points. The total sample size available for training, early stopping and calibration is varied between 200 and 2000. These data are allocated for specific training, early-stopping, and calibration operations as in Sections 3.3–3.3. The performance of each method is measured in terms of marginal coverage, worst-slab conditional coverage [25]—estimated as described in [111]—and average width of the prediction intervals. All results are averaged over 100 independent experiments, each based on a different random sample from the original raw data sets. Figure 3.3 summarizes the performance of the four alternative methods on the bio data, as a function of the total sample size. These results show that all methods reach 90% marginal coverage in practice, as anticipated by the mathematical guarantees, although the theoretical correction for the naive early stopping method appears to be overly conservative. The CES method clearly performs best, in the sense that it leads to the shortest prediction intervals while also achieving approximately valid conditional coverage. By contrast, the conformal prediction intervals obtained without early stopping have significantly lower conditional coverage, which is consistent with the prior intuition that fully trained neural networks can sometimes suffer from overfitting. 62 Finally, it must be noted that the widths of the prediction intervals output by the CES method in these experiments are very similar to those of the corresponding intervals produced by naively applying early stopping without data splitting and without the theoretical correction described in Appendix B1. This naive approach was not taken as a benchmark because it does not guarantee valid coverage, unlike the other methods. Nonetheless, it is interesting to note that the rigorous theoretical properties of the CES method do not come at the expense of a significant loss of power compared to this very aggressive heuristic, and in this sense, one may say that the conformal inferences computed by CES are “almost free”. Quantile Regression We apply the CES quantile regression method from Section 3.2.4 to the following publicly available and commonly investigated regression data sets from the UCI Machine Learning repository [97]: medical expenditure panel survey number 21 (MEPS_21) [90]; blog feedback (blog_data) [19]; Tennessee’s student teacher achievement ratio (STAR) [2]; community and crimes (community) [29]; physicochemical properties of protein tertiary structure (bio) [95]; house sales in King County (homes) [55]; and hourly and daily counts of rental bikes (bike) [18]. These data sets were previously also considered by [103]. As in the previous sections, we compare CES to the usual three benchmarks, now implemented based on quantile regression: naive early stopping with the hybrid theoretical correction for the nominal coverage level, early stopping based on data splitting and full model training without early stopping. We follow the same model architecture and data preprocessing steps as in [103]. To be specific, the input features are standardized to have zero mean and unit variance, and the response values are rescaled by diving the absolute mean of the training responses. 
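Concretely, our reading of this preprocessing recipe from [103] is sketched below, with synthetic arrays in place of the real data: features are standardized using training statistics only, and responses are divided by the mean absolute value of the training responses.

import numpy as np

rng = np.random.default_rng(2)
X_train, X_test = rng.normal(size=(800, 10)), rng.normal(size=(200, 10))
y_train, y_test = rng.normal(loc=3.0, size=800), rng.normal(loc=3.0, size=200)

# Standardize features to zero mean and unit variance, using training statistics only.
mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
sd[sd == 0] = 1.0                        # guard against constant features
X_train, X_test = (X_train - mu) / sd, (X_test - mu) / sd

# Rescale responses by the mean absolute value of the training responses.
scale = np.mean(np.abs(y_train))
y_train, y_test = y_train / scale, y_test / scale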
All methods utilize the same neural network with three hidden layers and ReLU activation functions between layers, trained for up to 2000 epochs. The parameters are trained by minimizing the pinball loss function (see Appendix B4) with the Adam optimizer [65], minibatches of size 25, no weight decay or dropout, and a fixed learning rate (0.001 for STAR, homes, bike, and bio, 0.0001 for community, and 0.00005 for MEPS_21 and blog_data). The models are calibrated in such a way as to produce conformal prediction sets with guaranteed 90% marginal coverage for a test set of 1000 independent data points. The total sample size available for training, early stopping and calibration is varied between 200 and 2000 (200 and 1000 for small data sets such as community and STAR). These data are allocated to the specific training, early-stopping, and calibration operations as in the previous experiments. Again, the performance of each method is measured in terms of marginal coverage, worst-slab conditional coverage [25], and average width of the prediction intervals. All results are averaged over 25 independent experiments, each based on a different random sample from the original raw data sets. Figure 3.8 summarizes the performance of the four alternative methods on the homes data, as a function of the total sample size; error bars corresponding to standard errors are plotted around each data point. These results show that all methods reach 90% marginal coverage in practice, as anticipated by the mathematical guarantees, although the theoretical correction for the naive early stopping method appears to be overly conservative. Full training produces the narrowest prediction bands but suffers from very low conditional coverage, indicating that fully trained neural network models can overfit; it is therefore not appealing. Data splitting outperforms full training, as it achieves higher estimated conditional coverage, and CES in turn outperforms data splitting in terms of conditional coverage while producing prediction intervals of similar width. Software Availability Software implementing the algorithms and data experiments is available online at https://github.com/ZiyiLiang/Conformalized_early_stopping. Figure 3.8: Average performance, as a function of the sample size, of conformal prediction sets for quantile regression based on neural networks trained and calibrated with different methods, on the homes data [55]. The marginal coverage is theoretically guaranteed to be above 90%. Ideally, the conditional coverage should be high and the prediction intervals should be tight. Chapter 4 Structured multiple testing 4.1 Introduction Multiple testing for network data Statistical analysis of network-structured data is an important topic with a wide range of applications. Our study is motivated by genome-wide association studies (GWAS), where a primary objective is to identify disease-associated single-nucleotide polymorphisms (SNPs) across diverse populations. Recent studies have indicated that linkage analysis can provide insights into the genetic basis of complex diseases; in particular, SNPs in linkage disequilibrium (LD) can jointly contribute to the representation of the disease phenotype [110].
However, existing research in GWAS has often neglected the LD network information, representing a significant limitation. This article aims to address this gap by developing integrative analytical tools that combine GWAS association data with LD network information for more efficient identification of disease-associated SNPs. Let m denote the number of SNPs and [m] = {1, . . . , m}. The primary data, as provided by GWAS, consists of a list of test statistics T = {Ti : i ∈ [m]}, which assess the association strength of individual SNPs with a phenotype of interest. Let P = {Pi : i ∈ [m]} denote the corresponding p-values. The auxiliary data, as provided by linkage analysis, is a matrix S = (rij : i, j ∈ [m]) comprising pairwise 66 correlations, where rij measures the LD correlation between SNP i and SNP j. To illustrate, we construct an undirected LD graph based on pairwise LD correlation in Figure 4.1, where each node represents a SNP and an edge is drawn to connect two SNPs if their correlation exceeds a pre-determined cutoff. Figure 4.1: Auxiliary linkage data provide structural information that can be utilized to identify significant SNPs. Incorporating LD patterns into inference is desirable in GWAS as it can improve the power and interpretability of analysis. However, developing a principled approach that cross-utilizes data from different sources is a challenging task. Firstly, the primary data T and the auxiliary data S are usually collected from different populations. For instance, in our analysis detailed in Section 4.5, the target population in GWAS consists of 77,418 individuals with Type 2 diabetes, while the auxiliary data for linkage analysis are collected from the general population [1000 Genomes (1000G) Phase 3 Database]. Secondly, the dimensions of T and S do not match coordinate by coordinate as in the conventional settings. Specifically, T is a vector of statistics while S is an LD matrix. Motivated by the observation that SNPs in the same sub-network can exhibit similar distributional characteristics and tend to work together in GWAS [110], this article develops a class of locally adaptive 67 structure learning algorithms (LASLA) that is capable of incorporating side information in different dimensions and from diverse populations. LASLA reflects this structural knowledge by computing a set of locally adaptive weights {wi : i ∈ [m]} to adjust the p-values {Pi : i ∈ [m]}. In summary, LASLA follows a two-step strategy: Step 1: Learn the relational or structural knowledge of the high-dimensional parameter through auxiliary data. Step 2: Apply the structural knowledge by adaptively placing differential weights wi on p-values Pi , for i ∈ [m]. Related work Structured multiple testing has gained increasing attention in recent years [86, 22, 80, 24, 99]. The fundamental idea is to supplement the sequence of primary statistics T = {Ti : i = 1, 2, . . .}, such as z-values, with an auxiliary sequence U = {Ui : i = 1, 2, . . .} to enhance the power of inference. The auxiliary data can be collected in various ways, including (a) externally from prior studies and secondary data sources ([59, 7]); (b) internally within the same data using carefully constructed independent sequences ([22, 138]); and (c) intrinsically through patterns associated with the data, such as the natural order of data streams ([39, 86]) or spatial locations ([78, 21]). 
Aforementioned studies primarily focus on an auxiliary sequence U that has the same dimension as T , while the exploration of more generalized side information such as LD matrices or multiple auxiliary sequences remains limited. Moreover, existing structure-adaptive methods ([22, 80, 138]) have dominantly relied on leveraging the sparsity structure encoded in auxiliary data, while recent research has revealed that other forms of structural information, including signal amplitudes, heteroscedasticity, and hierarchical structures, can also provide valuable insights ([94, 77, 41]). In the context of GWAS, where SNPs from 68 different subnetworks may exhibit distinct distributional characteristics, this broader range of structural information becomes particularly relevant. 4.2 Locally Adaptive Structure Learning Algorithms Suppose we are interested in testing m hypotheses: H0,i : θi = 0 vs. H1,i : θi = 1, i ∈ [m], (4.1) where θ = (θi : i ∈ [m]) is a vector of binary random variables indicating the existence or absence of signals at the testing locations. A multiple testing rule can be represented by a binary vector δ = (δi : i ∈ [m]) ∈ {0, 1} m, where δi = 1 indicates that we reject H0,i and δi = 0 otherwise. This article focuses on problem (4.1) with the goal of controlling the FDP and FDR [13] defined respectively by: FDP(δ) = Pm i=1(1 − θi)δi max(Pm i=1 δi , 1), and FDR = E {FDP(δ)} . An ideal procedure should maximize power, which is interpreted as the expected number of true positives: ETP = Ψ(δ) = E (Xm i=1 θiδi ) . Our approach involves the calculation of a distance matrix D = (1 − r 2 ij : i, j ∈ [m]) based on the LD matrix S = (r 2 ij : i, j ∈ [m]), where rij ∈ [0, 1] represents the Pearson’s correlation coefficient that measures the linkage disequilibrium between SNPs i and j. Note that the distance matrix D contains identical information asS but it facilitates our definition of the local neighborhood and subsequent data-driven algorithms. Specifically, SNPs with higher correlation rij are considered “closer” in distance. Clusters of 69 SNPs with small pairwise distance form a local neighborhood which may exhibit distinct distributional patterns compared to the far-away neighborhoods. The derivation of a distance matrix D from the network auxiliary data S is straightforward. However, it becomes more challenging when dealing with generalized side information. An extension on how to define a proper distance matrix for general side information is discussed in Section C2 of the Supplementary Material. Intuitively, leveraging the local heterogeneity can enhance the interpretability and power of multiple testing procedures. This motivates us to propose LASLA discussed in the next a few sections. 4.2.1 Oracle FDR procedure The weighting strategy provides an effective approach to incorporating relevant domain knowledge in multiple testing and has been widely used in the literature ([44, 127, 7]; among others). The proposed LASLA builds upon this strategy. We first introduce an oracle procedure in this section under an ideal setting. Then in Sections 4.2.2–4.2.3, we develop a data-driven algorithm that emulates the oracle and leverages the distance matrix to construct informative weights. Assume the primary test statistics {Ti : i ∈ [m]} follow a hierarchical model: θi ind∼ Bernoulli(π ∗ i ), P(Ti ≤ t|θi) = (1 − θi)F0(t) + θiF ∗ 1i (t). (4.2) We would like to make two remarks to clarify our notations. Firstly, (4.2) is not intended to exactly correspond to the true data-generating process. 
Instead, it serves as a hypothetical model to motivate the use of our data-driven weights. Secondly, we use π ∗ i and F ∗ 1i (t) to account for the potential heterogeneity in the distribution across various testing units in light of full auxiliary information Di that will be specified in Section 4.2.2. By contrast, the null distribution F0(t) is known and identical for all testing units. Denote by F ∗ i (t) = (1 − π ∗ i )F0(t) + π ∗ i F ∗ 1i (t) (4.3) 70 the mixture distribution and f ∗ i (t) the corresponding density. The optimal testing rule, which has the largest ETP among all valid marginal FDR/FDR procedures ([126, 22, 54]), is a thresholding rule based on the local false discovery rate (lfdr): L ∗ i = P(θi = 0|Ti) = (1 − π ∗ i )f0(Ti)/f ∗ i (Ti). A step-wise algorithm (e.g. [126]) can be used to determine a data-driven threshold along the L ∗ i ranking. Specifically, denote by L ∗ (1) ≤ . . . ≤ L ∗ (m) the sorted L ∗ i values, and H(1), . . . , H(m) the corresponding hypotheses. Let k∗ = max n j : j −1 Pj i=1 L ∗ (i) ≤ α o . Then the stepwise algorithm rejects H(1), . . . , H(k∗) . 4.2.2 A data-driven procedure Note that, π ∗ i and f ∗ i (t) are hypothetical and non-accessible oracle quantities that cannot be directly utilized in the data-driven algorithms. Therefore, we use kernel estimation to approximate the oracle quantities. By convention, throughout we assume that θj does not affect the null distribution of Ti for i ̸= j. Let Di be the ith column of D. Adapting the fixed-domain asymptotics in [120], we define Di as a continuous∗ finite domain (w.r.t. coordinate i) in R with positive measure, where each d ∈ Di is a distance and 0 ∈ Di . The two sets Di and Di can be viewed as distances measured from a partial/full network, respectively, and Di ⊂ Di . The details on the fixed-domain asymptotics, i.e., Di → Di as m → ∞, are relegated to Section C7 in the Supplementary Material. The approximation of π ∗ i is inspired by the screening process in [21]. We reiterate the essence here for completeness. Let Dij be the distance between entries i and j from the full network. A key assumption ∗The assumption on the continuity of Di can be relaxed. For example, our theory can be easily extended to the case where Di is the union of disjoint continuous domains. 71 is that π ∗ i is close to π ∗ j if Dij is small, which enables us to resemble π ∗ i using neighboring points. Since a direct evaluation of π ∗ i is challenging, we consider an intermediate quantity: π τ i = 1 − P(Pi > τ | Di) 1 − τ , 0 < τ < 1, which offers a good approximation of π ∗ i with a suitably chosen τ . Intuitively, when τ is large, null pvalues are dominant in the tail area [τ, 1]. Hence P(Pi > τ | Di)/(1 − τ ) approximates the overall null proportion. See [21] for more discussion. Let K: R → R be a positive, bounded, and symmetric kernel function that satisfies the following conditions: Z R K(x)dx = 1, Z R xK(x)dx = 0, Z R x 2K(x)dx = σ 2 K < ∞. (4.4) Define Kh(x) = h −1K(x/h), where h is the bandwidth, and Vh(i, j) = Kh(Dij )/Kh(0). We can view mi = P j̸=i Vh(i, j) as the total “mass” in the vicinity of i. Suppose we want to count how many null p-values are greater than τ among the mi observation around i. The empirical count is given by P j̸=i Vh(i, j)I(Pj > τ ), whereas the expected count can be approximated by mi(1 − π τ i )(1 − τ ). Setting equal the expected and empirical counts, we construct a data-driven quantity that resemble π τ i , i ∈ [m] by : πi = 1 − P j̸=i [Vh(i, j)I{Pj > τ}] (1 − τ ) P j̸=i Vh(i, j) . 
(4.5) Similarly, we introduce a data-driven quantity that resemble the hypothetical density f ∗ i (t) for i ∈ [m] by: fi(t) = P j̸=i [Vh(i, j)Kh(Tj − t)] P j̸=i Vh(i, j) , (4.6) 72 which describes the likelihood of Ti taking a value in the vicinity of t. Consequently, the data-driven lfdr is given by: Li = {1 − πi}f0(Ti)/fi(Ti). (4.7) Let L(1) ≤ . . . ≤ L(m) denote the sorted values of Li . Following the approach in Section 4.2.1, we choose L(k) to be the threshold, where k = max{j : j −1 Pj i=1 L(i) ≤ α}. 4.2.3 Oracle-assisted weights Assume for now that Ti is independent of T −i = {Tj , j ∈ [m], j ̸= i}. This assumption is only used to motivate the methodology and is not required for theoretical analysis. We assume that f0 is symmetric about zero; otherwise we can always transform the primary statistics into z-statistics/t-statistics, etc. We now introduce the weights mimicking the oracle procedure above. Intuitively, the thresholding rule Li < t is approximately equivalent to the following rule: Ti < t− i or Ti > t+ i . (4.8) Here t − i ≤ 0 and t + i ≥ 0 are coordinate-specific thresholds that lead to asymmetric rejection regions, which are useful for capturing structures of the alternative distribution, such as unequal proportions of positive and negative signals [126, 80, 41]. The next step is to derive weights that can emulate the rule (4.8). We first consider the case of Ti ≥ 0. Let t + i = ∞ if (1 − πi)f0(t)/fi(t) > L(k) for all t ≥ 0, else we let t + i = inf t ≥ 0 : {(1 − πi)f0(t)/fi(t)} ≤ L(k) (4.9) 73 The rejection rule Ti > t+ i is equivalent to the p-value rule Pi < 1 − F0(t + i ), where Pi is the one-sided p-value and F0 is the null distribution of Ti . Define the weighted p-values as {P w i = Pi/wi : i ∈ [m]}. If we choose a universal threshold for all P w i , then using weight wi = 1 − F0(t + i ) can effectively emulate Ti > t+ i . For the case of Ti < 0, let t − i = −∞ if (1 − πi)f0(t)/fi(t) > L(k) for all t ≤ 0, else we let t − i = sup t ≤ 0 : {(1 − πi)f0(t)/fi(t)} ≤ L(k) (4.10) The corresponding weight is given by wi = F0(t − i ). To ensure the robustness of the algorithm, we set wi = max{wi , ξ} and wi = min{wi , 1 − ξ}, where 0 < ξ < 1 is a small constant. In all numerical studies, we set ξ = 10−5 . To expedite the calculation and facilitate our theoretical analysis, we propose to use only a subset of p-values in the neighborhood to obtain πi and fi(t) for i ∈ [m]. Specifically, we define Ni = {j ̸= i : Dij ≤ aϵ} as a neighborhood set which only contains indices with distance to i smaller than aϵ and satisfies |Ni | = m1−ϵ for some small constant ϵ > 0. Moreover, we require that Dij1 ≤ Dij2 for any j1 ∈ Ni , j2 ∈ N / i , and j2 ̸= i. This approach has minimal impact on numerical performance, as p-values that are far apart contribute little. Furthermore, for each i ∈ [m], we propose to compute an index-dependent oracle L i (k) which only relies on |Ni | statistics as opposed to the universal threshold L(k) defined in Section 4.2.2. This modification arises from the subtle fact that, though Li in (4.7) is estimated without using Ti itself, wi still correlates with Ti since the determination of k in the universal threshold relies on all primary statistics. This intricate dependence structure poses challenges for the theoretical analysis. Therefore, we compute the index-dependent threshold L i (k) to ensure that wi is independent of Ti under the null hypothesis at the cost of computational efficiency. 
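The estimators (4.5)-(4.7) are straightforward to code. The snippet below is a simplified sketch: it uses a Gaussian kernel, a standard normal null density f0, separate bandwidths for the distances and for the test statistics, and all indices j ≠ i rather than the sub-neighborhood Ni of Algorithm 8, with synthetic statistics and distances in place of real GWAS data.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
m, tau = 500, 0.5
h_d, h_t = 0.05, 0.3   # bandwidths for the distances and for the test statistics, respectively

# Synthetic example: testing units with latent "locations", signals clustered on the left.
loc = np.sort(rng.uniform(size=m))
theta = loc < 0.1                                   # true (unknown) signal indicators
T = rng.normal(size=m) + 3.0 * theta                # primary z-statistics
P = 2.0 * norm.sf(np.abs(T))                        # two-sided p-values
D = np.abs(loc[:, None] - loc[None, :])             # distance matrix D_ij

def gauss(x, h):
    """Gaussian kernel K_h(x) = K(x / h) / h."""
    return np.exp(-0.5 * (x / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

V = gauss(D, h_d) / gauss(0.0, h_d)                 # V_h(i, j) = K_h(D_ij) / K_h(0)
np.fill_diagonal(V, 0.0)                            # the sums in (4.5)-(4.6) exclude j = i

# Local sparsity estimate (4.5), truncated to [0, 1] for numerical stability.
pi_hat = 1.0 - (V @ (P > tau).astype(float)) / ((1.0 - tau) * V.sum(axis=1))
pi_hat = np.clip(pi_hat, 0.0, 1.0)

# Local density estimate (4.6), evaluated at t = T_i for each i.
f_hat = np.array([np.sum(V[i] * gauss(T - T[i], h_t)) / V[i].sum() for i in range(m)])

# Data-driven lfdr statistic (4.7), with f0 the standard normal density.
lfdr = (1.0 - pi_hat) * norm.pdf(T) / f_hat
print("average lfdr statistic on signals:    ", round(lfdr[theta].mean(), 3))
print("average lfdr statistic on non-signals:", round(lfdr[~theta].mean(), 3))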
We summarize the proposed method in Algorithm 8, and prove that the corresponding weighted multiple testing strategy is asymptotically valid in Section 4.3.1. 74 Algorithm 8 Oracle-assisted weights 1: Input: FDR level α, ϵ specifying the size of the sub-neighborhood, kernel function K(·). 2: Input: Primary statistics T = {Ti : i ∈ [m]} and distance matrix D = (Dij : i, j ∈ [m]). 3: For i ∈ [m] do: 4: For j ∈ Ni do: 5: Estimate π i j by (4.5) with the summation taken over Ni . 6: Estimate f i j (t) by (4.6) with the summation taken over Ni . 7: Compute L i j as given by (4.7) with πj replaced by π i j ; fj (t) by f i j (t). 8: Denote the sorted statistics by L i (1) ≤ . . . ≤ L i (|Ni|) . 9: Choose L i (k) to be the oracle threshold, where k = max{j ∈ [|Ni |] : j −1 Pj l=1 L i (l) ≤ α}. 10: If Ti ≥ 0 then: 11: Compute t + i as given by (4.9), with L(k) replaced by L i (k) ; 12: Compute the weight by wi = 1 − F0(t + i ). 13: If Ti < 0 then: 14: Compute t − i as given by (4.10), with L(k) replaced by L i (k) . 15: Compute the weight by wi = F0(t − i ). 16: Output: Oracle-assisted weights {wi : i ∈ [m]}. Alternatively, a computationally much more efficient procedure that utilizes the universal oracle threshold L(k) for all i ∈ [m] is introduced in Algorithm C10 of Section C4 in the Supplementary Material. We remark on the distinctions between these two algorithms. Theoretically, the weighted multiple testing procedure based on Algorithm 8 can achieve asymptotic FDR and FDP control with minimal assumptions as shown in Section 4.3.1, but the validity of the alternative procedure based on Algorithm C10 requires more technical assumptions, as discussed in Section C7 of the Supplementary Material. Practically, weights obtained through both methods are very similar, therefore leading to nearly identical empirical FDR and power performance. Hence, we recommend using the alternative Algorithm C10 for practical applications due to its computation efficiency. Note that throughout the chapter, the weights from Algorithm 8 are only employed for Theorems 10 and 11. 4.2.4 Oracle-assisted weights vs sparsity-adaptive weights This section presents a comparison between the oracle-assisted weights and the sparsity-adaptive weights described in Section C3 of the Supplementary Material. In the case where neighborhood information 75 is provided by spatial locations, the sparsity-adaptive weights are equivalent to the weights employed by the LAWS procedure ([21]). Intuitive examples are provided to illustrate potential information loss associated with the sparsity-adaptive weights (referred to as LAWS weights), and the advantages of the newly proposed weights (referred to as LASLA weights) are highlighted. We focus on scenarios where π ∗ i is homogeneous, but it’s important to note that the inadequacy of sparsity-adaptive weights extends to cases where π ∗ i is heterogeneous. This shortcoming arises from the fact that the sparsity-adaptive weights ignore structural information in alternative distributions. In the following comparisons, the oracle LAWS weights are defined as w laws i = π ∗ i /(1−π ∗ i ) for i ∈ [m], as reviewed in Supplement C3; and the oracle LASLA weights are obtained via Algorithm C10, with πi and fi replaced by their oracle counterparts π ∗ i and f ∗ i (t). We implement both weights with oracle quantities to emphasize the methodological differences. Consider two examples where {Ti : i ∈ [m]} follow the mixture distribution in (4.3) with m = 1000 and π ∗ i = 0.1. Example 1. 
4.2.4 Oracle-assisted weights vs sparsity-adaptive weights

This section presents a comparison between the oracle-assisted weights and the sparsity-adaptive weights described in Section C3 of the Supplementary Material. In the case where the neighborhood information is provided by spatial locations, the sparsity-adaptive weights are equivalent to the weights employed by the LAWS procedure ([21]). Intuitive examples are provided to illustrate the potential information loss associated with the sparsity-adaptive weights (referred to as LAWS weights), and the advantages of the newly proposed weights (referred to as LASLA weights) are highlighted. We focus on scenarios where $\pi^*_i$ is homogeneous, but it is important to note that the inadequacy of the sparsity-adaptive weights extends to cases where $\pi^*_i$ is heterogeneous. This shortcoming arises from the fact that the sparsity-adaptive weights ignore structural information in the alternative distributions.

In the following comparisons, the oracle LAWS weights are defined as $w^{\mathrm{laws}}_i = \pi^*_i / (1 - \pi^*_i)$ for $i \in [m]$, as reviewed in Supplement C3; the oracle LASLA weights are obtained via Algorithm C10, with $\pi_i$ and $f_i$ replaced by their oracle counterparts $\pi^*_i$ and $f^*_i(t)$. We implement both weights with oracle quantities to emphasize the methodological differences. Consider two examples where $\{T_i : i \in [m]\}$ follow the mixture distribution in (4.3) with $m = 1000$ and $\pi^*_i = 0.1$.

Example 1. Set $F^*_{1i}(t) = \gamma N(3, 1) + (1 - \gamma) N(-3, 1)$, where $\gamma$ controls the relative proportions of positive and negative signals. We vary $\gamma$ from 0.5 to 1; as $\gamma$ approaches 1, the level of asymmetry increases.

Example 2. Set $F^*_{1i}(t) = 0.5\, N(3, \sigma^2_i) + 0.5\, N(-3, \sigma^2_i)$, where $\sigma_i$ controls the shape of the alternative distribution, with $P(\sigma_i = 1) = 0.5$ and $P(\sigma_i = \sigma) = 0.5$. We vary $\sigma$ from 0.2 to 1; the heterogeneity is most pronounced when $\sigma = 0.2$.

In both examples, $\pi^*_i$ remains constant for all $i \in [m]$, and so does $w^{\mathrm{laws}}_i$. Consequently, the oracle LAWS procedure reduces to the unweighted p-value procedure [42]. In contrast, the LASLA weights are able to capture the signal asymmetry in Example 1 and the shape heterogeneity in Example 2, resulting in notable power gains in both settings.

Figure 4.2: Empirical FDR and power comparison for oracle LASLA, LAWS and PV (unweighted p-value) with significance level α = 0.05. (a): Example 1; (b): Example 2.

4.2.5 The LASLA procedure

We now introduce the final step of LASLA: p-value thresholding. Let $\{w_i : i \in [m]\}$ be the data-driven LASLA weights. Recall that the weighted p-values are $\{P^w_i : i \in [m]\}$ with $P^w_i := P_i / w_i$. Sort the weighted p-values from smallest to largest as $P^w_{(1)}, \ldots, P^w_{(m)}$. To motivate the analysis, we assume that $P_i$ is independent of $w_i$, for $i \in [m]$. It follows that the expected number of false positives (EFP), in light of the full network information $D_i$, $i \in [m]$, for a threshold $t^w$ can be calculated as
$$
\mathrm{EFP} = \sum_{i=1}^{m} P\{P^w_i \leq t^w, \theta_i = 0 \mid D_i\} = t^w \sum_{i=1}^{m} w_i\, P(\theta_i = 0 \mid D_i) = t^w \sum_{i=1}^{m} w_i (1 - \pi^*_i), \qquad (4.11)
$$
where the second equality follows from the standard assumption that $P_i$ is uniformly distributed given $\theta_i = 0$, regardless of the side information. Suppose we reject $j$ hypotheses along the ranking provided by $P^w_{(i)}$, $i \in [m]$; namely, we set the threshold as $t^w = P^w_{(j)}$. Then the FDP can be estimated as $\widehat{\mathrm{FDP}} = P^w_{(j)} \sum_{i=1}^{m} w_i (1 - \pi_i) / j$. We choose the largest possible threshold such that $\widehat{\mathrm{FDP}}$ is less than the nominal level $\alpha$. Specifically, we find
$$
k^w = \max\Bigl\{ j : P^w_{(j)} \sum_{i=1}^{m} w_i (1 - \pi_i) / j \leq \alpha \Bigr\}, \qquad (4.12)
$$
and reject $H_{(1)}, \ldots, H_{(k^w)}$, which are the hypotheses with weighted p-values no larger than $P^w_{(k^w)}$.

Remark. We can substitute other thresholding rules for (4.12), such as the weighted BH (WBH) procedure [44]. However, numerical studies in Section C5 show that WBH can be overly conservative.

4.3 Theoretical Analysis

4.3.1 Asymptotic validity of the LASLA procedure

In this section we establish the asymptotic validity of LASLA when the p-values are marginally independent. The dependent cases are studied in Section C7 of the Supplementary Material. We also show that LASLA is asymptotically more powerful than the BH procedure [13] under mild conditions. By convention, we assume that the auxiliary variables only affect the alternative distribution, but not the null distribution, of the primary statistics. We start by analyzing the data-driven estimator $\pi_i$, as its consistency is needed for valid FDR control. The next assumption requires that the distributional quantities vary smoothly in the vicinity of location $i$.

(A1) For all $i, j$, $P(P_j > \tau \mid D_j, D_{ij} = x)$ is continuous in $x$, and has bounded first and second derivatives.
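Stepping back briefly to the procedure itself, the step-wise thresholding rule (4.12) of Section 4.2.5 admits a short sketch. The snippet below is illustrative Python (hypothetical names, with the weights $w_i$ and estimates $\pi_i$ taken as given) and is not the packaged implementation.

```python
import numpy as np

def lasla_rejections(P, w, pi, alpha):
    """Step-wise rule (4.12) applied to the weighted p-values P_i / w_i; returns rejected indices."""
    P, w, pi = map(np.asarray, (P, w, pi))
    Pw = P / w
    order = np.argsort(Pw)                        # ranking by weighted p-values
    Pw_sorted = Pw[order]
    c = np.sum(w * (1 - pi))                      # sum_i w_i (1 - pi_i)
    j = np.arange(1, len(P) + 1)
    fdp_hat = Pw_sorted * c / j                   # estimated FDP at each candidate cutoff
    below = np.nonzero(fdp_hat <= alpha)[0]
    if below.size == 0:
        return np.array([], dtype=int)            # no rejections
    k = below[-1]                                 # largest j with estimated FDP <= alpha
    return order[: k + 1]                         # hypotheses with P^w no larger than P^w_(k)
```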
It is important to note that marginally independent p-values will become dependent conditional on auxiliary data. The next assumption generalizes the commonly used “weak dependence” notion in [124]. It requires that most of the neighborhood p-values (conditional on auxiliary data) do not exhibit strong pairwise dependence. 78 (A2) Var P j∈Ni Kh(Dij )I{Pj > τ}|D ≤ C P j∈Ni Var(Kh(Dij )I{Pj > τ}|Dj ) for some constant C > 1, for all i ∈ [m]. The next proposition establishes the convergence of πi to π τ i . Proposition 2. Recall that |Ni | = m1−ϵ . Under Assumptions (A1) and (A2), if m−1 ≪ h ≪ m−ϵ , we have, uniformly for all i ∈ [m], E [(πi − π τ i )|Di ] 2 → 0, as Di → Di . Define Zi = Φ−1 (1 − Pi/2) and denote by Z = (Z1, . . . , Zm) T . We collect below several regularity conditions for proving the asymptotic validity of LASLA. (A3) Assume that Pm i=1 P(θi = 0|Di) ≥ cm for some constant c > 0 and that Var[ Pm i=1 I{θi = 0}|D] = O(m1+ζ ) for some 0 ≤ ζ < 1, where D = {Di}. (A4) Define Sρ = i : 1 ≤ i ≤ m, |µi | ≥ (log m) (1+ρ)/2 , where µi = E(Zi). For some ρ > 0 and δ > 0, |Sρ| ≥ [1/(π 1/2α) + δ](log m) 1/2 , where π ≈ 3.14 is a constant. Remark. Condition (A3) assumes that the model is sparse and that {θi} m i=1 are not perfectly correlated (conditional on auxiliary data). Condition (A4) requires that there are a slowly growing number of signals having magnitude of the order {(log m)/n} (1+ρ)/2 if n samples are employed to obtain the p-value. Let t w = P w (kw) , where k w is calculated based on the step-wise algorithm (4.12) with weights from Algorithm 8. Denote by δ w ≡ δ w(t w) = {δ w i (t w) : i ∈ [m]} the set of decision rules, where δ w i (t w) = I{P w i ≤ t w}. Then the FDP and FDR of LASLA are respectively given by FDP(δ w) = Pm i=1(1 − θi)I{P w i ≤ t w} max{ Pm i=1 I{P w i ≤ t w}, 1} , and FDR = E {FDP(δ w)} . The next theorem states that LASLA controls both the FDP and FDR at the nominal level asymptotically. Theorem 10. Assume that the p-values are marginally independent. Under the conditions in Proposition 2 and (A3) and (A4), lim Di→Di,∀i FDR ≤ α, and lim Di→Di,∀i P(FDP(δ w) ≤ α + ε) = 1, for any ε > 0. 79 4.3.2 Asymptotic power analysis This section provides a theoretical analysis to demonstrate the benefit of the proposed weighting strategy. To simplify the analysis, we assume that the distance matrix D of the full network is known. Denote by δ v (t) = {δ v i (t) : i ∈ [m]} a class of testing rules based on weighted p-values, where δ v i (t) = I{P v i ≤ t}, P v i = Pi/vi and vi is the weight that is independent of Pi . It can be shown that (e.g. Proposition 2 of [21]) under mild conditions, the FDR of δ v (t) can be written as FDR{δ v (t)} = Qv (t|D) + o(1), where Q v (t|D) = Pm i=1(1 − π ∗ i )vit Pm i=1(1 − π ∗ i )vit + Pm i=1 π ∗ i F ∗ 1i (vit|Di) corresponds to the limiting value of the FDR, in which π ∗ i = P(θi = 1|Di). The power of δ v (t) can be evaluated using the expected number of true positives Ψv (t) = E { Pm i=1 δ v i (t)θi} = Pm i=1 π ∗ i F ∗ 1i (vit). Let w˜i = wi Pm j=1(1 − π ∗ j )/ Pm j=1{(1 − π ∗ j )wj}, where wi are the oracle-assisted weights from Algorithm 8. It is easy to see that LASLA and δ w˜ (t) share the same ranking of hypotheses. The goal is to compare the LASLA weights (vi = ˜wi : i ∈ [m]) with the naive weights {vi = 1 : i ∈ [m]}. Denote by t v o = sup{t : Qv (t) ≤ α} the oracle threshold for the p-values with generic weights {vi : i ∈ [m]}. 
The oracle procedure with LASLA weights and the (unweighted) oracle p-value procedure [42] are denoted by $\delta^{\tilde{w}}(t^{\tilde{w}}_o)$ and $\delta^{1}(t^{1}_o)$, respectively. Next, we discuss some assumptions needed in our power analysis. The first condition states that the weights should be "informative", in the sense that the $w_i$ are constructed so that, on average, small/large $\pi^*_i$ correspond to small/large $w_i$. A similar assumption has been used in [44].

(A5) The oracle-assisted weights satisfy
$$
\frac{\sum_{i=1}^{m} (1 - \pi^*_i)}{\sum_{i=1}^{m} (1 - \pi^*_i) w_i} \cdot \frac{\sum_{i=1}^{m} \pi^*_i}{\sum_{i=1}^{m} \pi^*_i w^{-1}_i} \geq 1.
$$

The second condition concerns the shape of the alternative p-value distributions. When the densities are homogeneous, i.e. $F^*_{1i}(t) \equiv F^*_1(t)$, it reduces to the condition that $x \mapsto F^*_1(t/x)$ is a convex function, which is satisfied by commonly used density functions [57, 21].

(A6) $\sum_{i=1}^{m} a_i F^*_{1i}(t/x_i) \geq \sum_{i=1}^{m} a_i F^*_{1i}\bigl( \sum_{j=1}^{m} a_j t \,/\, \sum_{j=1}^{m} a_j x_j \bigr)$ for any $0 \leq a_i \leq 1$, $\min_{i \in [m]} \tilde{w}^{-1}_i \leq x_i \leq \max_{i \in [m]} \tilde{w}^{-1}_i$, and $t^1_o / \min_{i \in [m]} \tilde{w}^{-1}_i \leq 1$.

The above two conditions are mild in many practical settings. For example, we checked that both are easily fulfilled in all our simulation studies with the proposed LASLA weights. The next theorem provides insight into why the weighting strategy used in LASLA yields the power gain that we shall see in our numerical studies.

Theorem 11. Assume that the p-values are marginally independent and that Conditions (A5) and (A6) hold. Then (a) $Q^{\tilde{w}}(t^1_o) \leq Q^{1}(t^1_o) \leq \alpha$; and (b) $\Psi^{\tilde{w}}(t^{\tilde{w}}_o) \geq \Psi^{\tilde{w}}(t^1_o) \geq \Psi^{1}(t^1_o)$.

The theorem implies that (a) if the same threshold $t^1_o$ is used, then $\delta^{\tilde{w}}(t^1_o)$ has smaller FDR and larger power than $\delta^{1}(t^1_o)$; and (b) the thresholds satisfy $t^{\tilde{w}}_o \geq t^1_o$. Since $\Psi^{v}(t)$ is non-decreasing in $t$, we conclude that $\delta^{\tilde{w}}(t^{\tilde{w}}_o)$ (the oracle procedure with LASLA weights) dominates $\delta^{\tilde{w}}(t^1_o)$, and hence $\delta^{1}(t^1_o)$ (the unweighted oracle p-value procedure), in power.

4.4 Numerical experiments with synthetic data

This section considers a model that mimics the network structure in the GWAS application of Section 4.5. The implementation details and additional simulation results for high-dimensional regression, latent variable models and multiple auxiliary samples are relegated to Section C5 of the Supplementary Material. Software implementing the algorithms and replicating all data experiments is available online at https://github.com/ZiyiLiang/r-lasla.

For $i \in [m]$, let $\theta_i \sim \mathrm{Bernoulli}(0.1)$ denote the existence or absence of the signal at index $i$. The primary data $T = (T_i : i \in [m])$ are generated as $T_i \sim (1 - \theta_i) N(0, 1) + \theta_i N(\mu_1, 1)$, with $\mu_1$ controlling the signal strength. The distance matrix $D = (D_{ij})_{1 \leq i, j \leq m}$ follows $D_{ij} \sim \mathbb{I}\{\theta_i = \theta_j\}\, |N(\mu_2, 0.5)| + \mathbb{I}\{\theta_i \neq \theta_j\}\, |N(1, 0.5)|$, where $0 \leq \mu_2 \leq 1$ controls the informativeness of the distance matrix. Intuitively, if $\theta_i = \theta_j$, then $D_{ij}$ should be relatively small. We investigate two settings. Setting 1: fix $\mu_2 = 0$, $m = 1200$, and vary $\mu_1$ from 2.5 to 3 by 0.1. Setting 2: fix $\mu_1 = 3$, $m = 1200$, and vary $\mu_2$ from 0 to 1 by 0.2; $D$ becomes less informative as $\mu_2$ gets closer to 1. The nominal level is set at $\alpha = 0.05$. Existing methods on structured multiple testing are not applicable due to the dimension mismatch between $T$ and $D$. Hence, we only compare the data-driven LASLA (LASLA.DD) with the vanilla BH method, which discards the auxiliary information.
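A minimal sketch of this data-generating process is given below, together with a helper computing the empirical FDP and power used to summarize the experiments. The sketch reads the 0.5 in $|N(\mu_2, 0.5)|$ as a variance, and the symmetrization and zero diagonal of $D$ are illustrative choices not spelled out in the text; all names are hypothetical.

```python
import numpy as np

def simulate_setting(m=1200, mu1=3.0, mu2=0.0, seed=None):
    """Draw (theta, T, D) following the simulation model of this section."""
    rng = np.random.default_rng(seed)
    theta = rng.binomial(1, 0.1, size=m)                   # theta_i ~ Bernoulli(0.1)
    T = rng.normal(np.where(theta == 1, mu1, 0.0), 1.0)    # (1-theta)N(0,1) + theta N(mu1,1)
    same = theta[:, None] == theta[None, :]
    mean = np.where(same, mu2, 1.0)                        # mean mu2 if theta_i = theta_j, else 1
    D = np.abs(rng.normal(mean, np.sqrt(0.5)))             # |N(., 0.5)|, 0.5 read as a variance
    D = np.triu(D, 1)
    D = D + D.T                                            # symmetric, zero diagonal (illustrative)
    return theta, T, D

def fdp_and_power(rejected, theta):
    """Empirical false discovery proportion and power of a set of rejected indices."""
    rejected = np.asarray(rejected, dtype=int)
    false_rej = int(np.sum(theta[rejected] == 0))
    fdp = false_rej / max(len(rejected), 1)
    power = (len(rejected) - false_rej) / max(int(np.sum(theta)), 1)
    return fdp, power
```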
The simulation results, averaged over 100 randomized data sets, are summarized in Figure 4.3. We can see that both methods control the FDR at the nominal level, and LASLA is conservative. In both settings, the power gain over BH is substantial. In Setting 2, we examine the performance of LASLA as the usefulness of $D$ varies. The power gap between the two methods becomes smaller as $\mu_2$ approaches 1. Note that even when $D$ becomes completely non-informative ($\mu_2 = 1$), LASLA still outperforms BH. This is because LASLA captures the asymmetry within the alternative distribution of the primary statistics, which is consistent with the finding in [126] that the lfdr procedure ([34]) dominates BH in power.

Figure 4.3: Empirical FDR and power comparison for data-driven LASLA and BH. (a): Setting 1; (b): Setting 2.

4.5 Numerical experiments with real data

This section focuses on conducting association studies of Type 2 diabetes (T2D), a prevalent metabolic disease with strong genetic links. Our primary goal is to identify SNPs associated with T2D in diverse populations. We construct the distance matrix from LD information to gain valuable insights into the genetic basis of complex diseases.

Table 4.1: Number of rejections by different methods
FDR          0.001   0.01   0.05   0.1
BH              35     61    128   179
LASLA           36    101    184   271

Table 4.2: Number of rejections by Bonferroni correction
FWER         0.001   0.01   0.05   0.1
Bonferroni      21     29     35    43

[118] performs a meta-analysis combining 23 studies on a total of 77,418 individuals with T2D and 356,122 controls. For illustration purposes, we randomly choose m = 5000 SNPs from Chromosome 6 as the target of inference. The primary statistics are the z-values provided in [118], and the auxiliary LD matrix is constructed with the genetic analysis tool Plink from the 1000 Genomes (1000G) Phase 3 Database. It is important to note that the primary and auxiliary data are collected from different populations and are not matched in dimension.

We apply BH and LASLA at different FDR levels and compare them with the Bonferroni correction, which is commonly used in GWAS to control the family-wise error rate (FWER). The numbers of rejections by the different methods are summarized in Tables 4.1 and 4.2. Both BH and LASLA are more powerful than the Bonferroni method. Moreover, at the same FDR level, LASLA makes notably more rejections than BH, and the discrepancy becomes even larger as the nominal FDR level increases.

To illustrate the power gain of LASLA over BH, we visualize the rejected hypotheses in Figure 4.4. Red nodes in the figure represent SNPs detected by LASLA but not by BH at FDR level 0.05. Nodes connected by an edge are in linkage disequilibrium. The graph highlights LASLA's ability to leverage the network structure of the LD matrix for inference, leading to the identification of clusters of SNPs in LD. In contrast, BH could potentially miss important variants. Notably, LASLA detects T2D-risk variants within the gene CCHCR1, a new candidate gene for T2D reported in a recent study by [20].

Figure 4.4: Sub-network identified by LASLA (all nodes). BH only detects the blue nodes.

Chapter 5

Conclusions

This thesis investigates the development of methodologies that efficiently utilize advanced machine learning models while maintaining statistical robustness. It is structured into two principal areas, conformalized post-selection inference and structured multiple testing, which form the thesis's foundation.
Chapter 2 merges concepts from several disciplines, such as conformal inference, hypothesis testing, and False Discovery Rate (FDR) calibration, to propose a methodology for computing integrative conformal p-values for out-of-distribution testing. This method leverages outlier data to weigh standard conformal p-values as well as to select and tune the most effective model from any toolbox of black-box classifiers. The method is practical for a wide range of applications and frequently outperforms existing benchmarks without necessitating large datasets. Chapter 3 integrates early stopping with conformal calibration to address overfitting and reduce overconfidence in predictions, respectively. Building on the insights from Chapter 2, this approach is extended to the complex regression scenarios, demonstrating a strategic combination of techniques to enhance model reliability and prediction accuracy. Chapter 4 introduces LASLA, a structured multiple testing procedure that utilizes network side information through a distance matrix to enhance testing power. By employing sparsity and distributional 85 characteristics of data to calculate weights, LASLA surpasses existing methods in performance. Additionally, it offers theoretical guarantees for controlling the False Discovery Rate (FDR) and False Discovery Proportion (FDP) under various conditions, providing evidence of its power advantage under mild assumptions. In summary, this thesis contributes to the dialogue on reliable machine learning by formulating methodologies that combine the efficiency of contemporary machine learning algorithms with robust statistical guarantees. It is hoped that the approaches discussed herein will inspire further investigation and innovation in the field of statistical learning. 86 Bibliography [1] Milton Abramowitz and Irene A. Stegun. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. ninth Dover printing, tenth GPO printing. Dover, 1964. [2] C.M. Achilles, Helen Pate Bain, Fred Bellott, Jayne Boyd-Zaharias, Jeremy Finn, John Folger, John Johnston, and Elizabeth Word. Tennessee’s Student Teacher Achievement Ratio (STAR) project. Version V1. 2008. doi: 10.7910/DVN/SIWH9F. [3] Anastasios Angelopoulos, Stephen Bates, Jitendra Malik, and Michael I Jordan. “Uncertainty sets for image classifiers using conformal prediction”. In: preprint arXiv:2009.14193 (2020). [4] Rina Foygel Barber, Emmanuel J Candès, Aaditya Ramdas, and Ryan J Tibshirani. “Conformal prediction beyond exchangeability”. In: preprint arXiv:2202.13415 (2022). [5] Rina Foygel Barber, Emmanuel J Candès, Aaditya Ramdas, Ryan J Tibshirani, et al. “Predictive inference with the jackknife+”. In: Ann. Stat. 49.1 (2021), pp. 486–507. [6] Rina Foygel Barber, Emmanuel J Candès, Aaditya Ramdas, Ryan J Tibshirani, et al. “Predictive inference with the jackknife+”. In: Ann. Stat. 49.1 (2021), pp. 486–507. [7] Pallavi Basu, T Tony Cai, Kiranmoy Das, and Wenguang Sun. “Weighted False Discovery Rate Control in Large-Scale Multiple Testing”. In: J. Am. Statist. Assoc. 113.523 (2018), pp. 1172–1183. [8] Pallavi Basu, T Tony Cai, Kiranmoy Das, and Wenguang Sun. “Weighted false discovery rate control in large-scale multiple testing”. In: J. Am. Statist. Assoc. 113.523 (2018), pp. 1172–1183. [9] Stephen Bates, Anastasios Angelopoulos, Lihua Lei, Jitendra Malik, and Michael Jordan. “Distribution-free, risk-controlling prediction sets”. In: Journal of the ACM (JACM) 68.6 (2021), pp. 1–34. 
[10] Stephen Bates, Emmanuel Candès, Lihua Lei, Yaniv Romano, and Matteo Sesia. “Testing for outliers with conformal p-values”. In: Ann. Stat. 51.1 (2023), pp. 149–178. [11] Dorian Beganovic and Evgueni Smirnov. “Ensemble cross-conformal prediction”. In: 2018 IEEE International Conference on Data Mining Workshops. IEEE. 2018, pp. 870–877. 87 [12] Anthony Bellotti. “Optimized conformal classification using gradient descent approximation”. In: arXiv preprint arXiv:2105.11255 (2021). [13] Yoav Benjamini and Yosef Hochberg. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing”. In: J. Roy. Statist. Soc. B 57.1 (1995), pp. 289–300. issn: 00359246. [14] Yoav Benjamini and Yosef Hochberg. “Controlling the false discovery rate: a practical and powerful approach to multiple testing”. In: J. R. Stat. Soc. Series B 57.1 (1995), pp. 289–300. [15] Yoav Benjamini and Yosef Hochberg. “Multiple Hypotheses Testing with Weights”. In: Scandinavian Journal of Statistics 24.3 (1997), pp. 407–418. (Visited on 08/22/2022). [16] Yoav Benjamini and Daniel Yekutieli. “The control of the false discovery rate in multiple testing under dependency”. In: Ann. Stat. (2001), pp. 1165–1188. [17] Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao. “Valid post-selection inference”. In: The Annals of Statistics 41.2 (2013), pp. 802–837. doi: 10.1214/12 − AOS1077. [18] Bike sharing dataset. https : //archive.ics.uci.edu/ml/datasets/bike + sharing + dataset. Accessed: July, 2019. [19] BlogFeedback dataset. https : //github.com/xinbinhuang/feature − selection_blogfeedback. Accessed: Mar, 2023. [20] Laura N et al Brenner. “Analysis of Glucocorticoid-Related Genes Reveal CCHCR1 as a New Candidate Gene for Type 2 Diabetes”. In: J. Endocr. Soc. 4.11 (2020), bvaa121. [21] T Tony Cai, Wenguang Sun, and Yin Xia. “LAWS: A Locally Adaptive Weighting and Screening Approach To Spatial Multiple Testing”. In: J. Am. Statist. Assoc. (2021), pp. 1–30. [22] T. Tony Cai, Wenguang Sun, and Weinan Wang. “CARS: Covariate assisted ranking and screening for large-scale two-sample inference (with discussion)”. In: J. Roy. Statist. Soc. B 81 (2019), pp. 187–234. [23] Tianxi Cai, T Tony Cai, and Anru Zhang. “Structured matrix completion with applications to genomic data integration”. In: J. Am. Statist. Assoc. 111.514 (2016), pp. 621–633. [24] Ismaël Castillo and Étienne Roquain. “On spike and slab empirical Bayes multiple testing”. In: Ann. Statist. 48.5 (2020), pp. 2548–2574. [25] Maxime Cauchois, Suyash Gupta, and John Duchi. “Knowing what You Know: valid and validated confidence sets in multiclass and multilabel prediction.” In: J. Mach. Learn. Res. 22 (2021), pp. 81–1. [26] Victor Chernozhukov, Kaspar Wüthrich, and Yinchu Zhu. “Distributional conformal prediction”. In: Proceedings of the National Academy of Sciences 118.48 (2021), e2107794118. [27] Lei Clifton, David A Clifton, Yang Zhang, Peter Watkinson, Lionel Tarassenko, and Hujun Yin. “Probabilistic novelty detection with support vector machines”. In: IEEE Transactions on Reliability 63.2 (2014), pp. 455–467. 88 [28] Nicolo Colombo and Vladimir Vovk. “Training conformal predictors”. In: Conformal and Probabilistic Prediction and Applications. PMLR. 2020, pp. 55–64. [29] Communities and crime dataset. https : //github.com/vbordalo/Communities − Crime. Accessed: Mar, 2023. [30] Concrete compressive strength data set. http : //archive.ics.uci.edu/ml/datasets/concrete + compressive + strength. Accessed: July, 2019. 
[31] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “Imagenet: A large-scale hierarchical image database”. In: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Ieee. 2009, pp. 248–255. [32] Chesner Désir, Simon Bernard, Caroline Petitjean, and Laurent Heutte. “One class random forests”. In: Pattern Recognition 46.12 (2013), pp. 3490–3506. [33] Olivier Devillers and Mordecai J Golin. “Incremental algorithms for finding the convex hulls of circles and the lower envelopes of parabolas”. In: Information Processing Letters 56.3 (1995), pp. 157–164. [34] Bradley Efron, Robert Tibshirani, John D. Storey, and Virginia Tusher. “Empirical Bayes analysis of a microarray experiment”. In: J. Amer. Statist. Assoc. 96 (2001), pp. 1151–1160. issn: 0162-1459. [35] Bat-Sheva Einbinder, Yaniv Romano, Matteo Sesia, and Yanfei Zhou. “Training Uncertainty-Aware Classifiers with Conformalized Deep Learning”. In: Adv. Neural Inf. Process. Syst. Vol. 35. 2022. [36] Clara Fannjiang, Stephen Bates, Anastasios N Angelopoulos, Jennifer Listgarten, and Michael I Jordan. “Conformal prediction under feedback covariate shift for biomolecular design”. In: Proceedings of the National Academy of Sciences 119.43 (2022), e2204569119. [37] William Fithian and Lihua Lei. “Conditional calibration for false discovery rate control under dependence”. In: Ann. Stat. 50.6 (2022), pp. 3091–3118. [38] William Fithian, Dennis Sun, and Jonathan Taylor. Optimal Inference After Model Selection. 2017. arXiv: 1410.2597 [math.ST]. [39] Dean P Foster and Robert A Stine. “α-investing: a procedure for sequential control of expected false discoveries”. In: J. R. Stat. Soc. B 70.2 (2008), pp. 429–444. [40] Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. “The limits of distribution-free conditional predictive inference”. In: Information and Inference 10.2 (2021), pp. 455–482. [41] Luella Fu, Bowen Gang, Gareth M James, and Wenguang Sun. “Heteroscedasticity-adjusted ranking and thresholding for large-scale multiple testing”. In: J. Am. Statist. Assoc. 117.538 (2022), pp. 1028–1040. [42] Christopher Genovese and Larry Wasserman. “Operating characteristics and extensions of the false discovery rate procedure”. In: J. R. Stat. Soc. B 64 (2002), pp. 499–517. issn: 1369-7412. 89 [43] Christopher R Genovese, Kathryn Roeder, and Larry Wasserman. “False discovery control with p-value weighting”. In: Biometrika 93.3 (2006), pp. 509–524. [44] Christopher R Genovese, Kathryn Roeder, and Larry Wasserman. “False discovery control with p-value weighting”. In: Biometrika 93.3 (2006), pp. 509–524. [45] Isaac Gibbs and Emmanuel Candès. “Conformal inference for online prediction with arbitrary distribution shifts”. In: arXiv preprint arXiv:2208.08401 (2022). [46] Frank E Grubbs. “Procedures for detecting outlying observations in samples”. In: Technometrics 11.1 (1969), pp. 1–21. [47] Leying Guan and Robert Tibshirani. “Prediction and outlier detection in classification problems”. In: J. R. Stat. Soc. Series B 84.2 (2022), p. 524. [48] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. “On calibration of modern neural networks”. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org. 2017, pp. 1321–1330. [49] Chirag Gupta, Arun K Kuchibhotla, and Aaditya Ramdas. “Nested conformal prediction and quantile out-of-bag ensemble methods”. In: Pattern Recognition 127 (2022), p. 108496. [50] Matan Haroush, Tzviel Frostig, Ruth Heller, and Daniel Soudry. 
“A Statistical Framework for Efficient Out of Distribution Detection in Deep Neural Networks”. In: ICLR. 2021. [51] Ville Hautamaki, Ismo Karkkainen, and Pasi Franti. “Outlier detection using k-nearest neighbour graph”. In: Proc. Int. Conf. Pattern Recognit. Vol. 3. IEEE. 2004, pp. 430–433. [52] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image recognition”. In: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 2016, pp. 770–778. [53] Yotam Hechtlinger, Barnabás Póczos, and Larry Wasserman. Cautious Deep Learning. arXiv:1805.09460. 2018. arXiv: 1805.09460 [stat.ML]. [54] Ruth Heller and Saharon Rosset. “Optimal control of false discovery criteria in the two-group model”. In: J. Roy. Statist. Soc. B 83.1 (2021), pp. 133–155. [55] House prices from King County dataset. https : //www.kaggle.com/datasets/shivachandel/kc − house − data?select = kc_house_data.csv. Accessed: Mar, 2023. [56] James X Hu, Hongyu Zhao, and Harrison H Zhou. “False discovery rate control with groups”. In: J. Am. Stat. Assoc. 105.491 (2010), pp. 1215–1227. [57] James X Hu, Hongyu Zhao, and Harrison H Zhou. “False discovery rate control with groups”. In: J. Am. Statist. Assoc. 105 (2010), pp. 1215–1227. [58] Nikolaos Ignatiadis and Wolfgang Huber. “Covariate powered cross-weighted multiple testing”. In: J. R. Stat. Soc. Series B 83.4 (2021), pp. 720–751. 90 [59] Nikolaos Ignatiadis, Bernd Klaus, Judith B Zaugg, and Wolfgang Huber. “Data-driven hypothesis weighting increases detection power in genome-scale multiple testing”. In: Nat. Methods 13.7 (2016), p. 577. [60] Sergey Ioffe and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift”. In: International conference on machine learning. PMLR. 2015, pp. 448–456. [61] Rafael Izbicki, Gilson Shimizu, and Rafael Stern. “Flexible distribution-free conditional predictive bands using density estimators”. In: AISTATS. PMLR. 2020, pp. 3068–3077. [62] Ying Jin and Emmanuel J Candès. “Selection by prediction with conformal p-values”. In: Journal of Machine Learning Research 24.244 (2023), pp. 1–41. [63] Shehroz S Khan and Michael G Madden. “One-class classification: taxonomy of study and review of techniques”. In: The Knowledge Engineering Review 29.3 (2014), pp. 345–374. [64] Byol Kim, Chen Xu, and Rina Barber. “Predictive inference is free with the jackknife+-after-bootstrap”. In: Adv. Neural Inf. Process. Syst. 33 (2020), pp. 4138–4149. [65] Diederik P. Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: preprint at arXiv:1412.6980 (2014). [66] Roger Koenker and Gilbert Bassett. “Regression Quantiles”. In: Econometrica 46.1 (1978), pp. 33–50. (Visited on 03/24/2023). [67] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. 2009. url: https : //www.cs.toronto.edu/~kriz/cifar.html. [68] E Krusińska. “A valuation of state of object based on weighted Mahalanobis distance”. In: Pattern Recognit. 20.4 (1987), pp. 413–418. [69] Arun K. Kuchibhotla, John E. Kolassa, and Todd A. Kuffner. “Post-Selection Inference”. In: Annual Review of Statistics and Its Application 9.1 (2022), pp. 505–527. doi: 10.1146/annurev − statistics − 100421 − 044639. [70] Rikard Laxhammar and Göran Falkman. “Inductive conformal anomaly detection for sequential detection of anomalous sub-trajectories”. In: Ann. Math. Artif. Intell. 74.1-2 (2015), pp. 67–94. [71] Jason D. Lee, Dennis L. Sun, Yuekai Sun, and Jonathan E. Taylor. 
“Exact post-selection inference, with application to the lasso”. In: The Annals of Statistics 44.3 (2016), pp. 907–927. doi: 10.1214/15 − AOS1371. [72] Jing Lei, Max G’Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. “Distribution-Free Predictive Inference for Regression”. In: J. Am. Stat. Assoc. 113.523 (2018), pp. 1094–1111. doi: 10.1080/01621459.2017.1307116.. 91 [73] Jing Lei, Max G’Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. “Distribution-free predictive inference for regression”. In: J. Am. Stat. Assoc. 113.523 (2018), pp. 1094–1111. [74] Jing Lei, James Robins, and Larry Wasserman. “Distribution-Free Prediction Sets”. In: J. Am. Stat. Assoc. 108.501 (2013), pp. 278–287. doi: 10.1080/01621459.2012.751873.. [75] Jing Lei and Larry Wasserman. “Distribution-free prediction bands for non-parametric regression”. In: J. R. Stat. Soc. Series B 76.1 (2014), pp. 71–96. doi: 10.1111/rssb.12021.. [76] Lihua Lei and William Fithian. “AdaPT: an interactive procedure for multiple testing with side information”. In: J. R. Stat. Soc. Series B 80.4 (2018), pp. 649–679. [77] Lihua Lei and William Fithian. “AdaPT: an interactive procedure for multiple testing with side information”. In: J. R. Stat. Soc. B 80.4 (2018), pp. 649–679. [78] Lihua Lei, Aaditya Ramdas, and William Fithian. “STAR: A general interactive framework for FDR control under structural constraints”. In: arXiv preprint arXiv:1710.02776 (2017). [79] Ang Li and Rina Foygel Barber. “Accumulation tests for FDR control in ordered hypothesis testing”. In: J. Am. Stat. Assoc. 112.518 (2017), pp. 837–849. [80] Ang Li and Rina Foygel Barber. “Multiple testing with the structure-adaptive Benjamini–Hochberg algorithm”. In: J. R. Stat. Soc. B 81.1 (2019), pp. 45–74. [81] Ziyi Liang, T. Tony Cai, Wenguang Sun, and Yin Xia. Locally Adaptive Algorithms for Multiple Testing with Network Structure, with Application to Genome-Wide Association Studies. 2023. arXiv: 2203.11461 [stat.ME]. [82] Ziyi Liang, Matteo Sesia, and Wenguang Sun. “Integrative conformal p-values for out-of-distribution testing with labelled outliers”. In: Journal of the Royal Statistical Society Series B: Statistical Methodology (Jan. 2024), qkad138. issn: 1369-7412. doi: 10.1093/jrsssb/qkad138. eprint: https : //academic.oup.com/jrsssb/advance − article − pdf/doi/10.1093/jrsssb/qkad138/55379173/qkad138.pdf. [83] Ziyi Liang, Yanfei Zhou, and Matteo Sesia. Conformal inference is (almost) free for neural networks trained with early stopping. 2023. arXiv: 2301.11556 [stat.ML]. [84] Henrik Linusson, Ulf Johansson, and Henrik Boström. “Efficient conformal predictor ensembles”. In: Neurocomputing 397 (2020), pp. 266–278. [85] Tuve Löfström, Ulf Johansson, and Henrik Boström. “Effective utilization of data in inductive conformal prediction”. In: Proc. Int. Jt. Conf. Neural Netw. IEEE. 2013. [86] Gavin Lynch, Wenge Guo, Sanat K Sarkar, Helmut Finner, et al. “The control of the false discovery rate in fixed sequence multiple testing”. In: Electron. J. Stat. 11.2 (2017), pp. 4649–4673. [87] Ariane Marandon, Lihua Lei, David Mary, and Etienne Roquain. “Machine learning meets false discovery rate”. In: preprint arXiv:2208.06685 (2022). 92 [88] Gary Marcus. “Deep learning: A critical appraisal”. In: arXiv preprint arXiv:1801.00631 (2018). [89] Markos Markou and Sameer Singh. “Novelty detection: a review—part 1: statistical approaches”. In: Signal Process. 83.12 (2003), pp. 2481–2497. [90] Medical Expenditure Panel Survey, Panel 21. 
https : //meps.ahrq.gov/mepsweb/data_stats/download_data_files_detail.jsp?cboPufNumber = HC − 192. Accessed: January, 2019. [91] Ignacio Medina, José Carbonell, Luis Pulido, Sara Madeira, Stefan Götz, Ana Conesa, Joaquín Tárraga, Alberto Pascual-Montano, Ruben Nogales-Cadenas, Javier Santoyo-Lopez, Francisco García-García, Martina Marba, David Montaner, and Joaquin Dopazo. “Babelomics: An integrative platform for the analysis of transcriptomics, proteomics and genomic data with advanced functional profiling”. In: Nucleic acids research 38 (July 2010), W210–3. doi: 10.1093/nar/gkq388. [92] Franck Nielsen and Mariette Yvinec. “An output-sensitive convex hull algorithm for planar objects”. In: International Journal of Computational Geometry & Applications 8.01 (1998), pp. 39–65. [93] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. “Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift”. In: Advances in Neural Information Processing Systems 32 (2019). [94] Edsel A Peña, Joshua D Habiger, and Wensong Wu. “Power-enhanced multiple decision functions controlling family-wise error and false discovery rates”. In: Ann. Statist. 39.1 (2011), p. 556. [95] Physicochemical properties of protein tertiary structure data set. https : //archive.ics.uci.edu/ml/datasets/Physicochemical + Properties + of + Protein + Tertiary + Structure. Accessed: July, 2019. [96] Marco AF Pimentel, David A Clifton, Lei Clifton, and Lionel Tarassenko. “A review of novelty detection”. In: Signal Process. 99 (2014), pp. 215–249. [97] Heysem Kaya Pınar Tüfekci. UCI Machine Learning Repository. 2012. url: https : //archive.ics.uci.edu/ml/datasets/combined + cycle + power + plant. [98] Lutz Prechelt. “Automatic early stopping using cross validation: quantifying the criteria”. In: Neural networks 11.4 (1998), pp. 761–767. [99] Zhimei Ren and Emmanuel Candès. “Knockoffs with side information”. In: Ann. Appl. Stat. 17.2 (2023), pp. 1152–1174. [100] Philippe Rigollet and Xin Tong. “Neyman-Pearson classification, convexity and stochastic constraints”. In: J. Mach. Learn. Res. 12 (2011), pp. 2831–2855. [101] Kathryn Roeder and Larry Wasserman. “Genome-wide significance levels and weighted hypothesis testing”. In: Statistical science: a review journal of the Institute of Mathematical Statistics 24.4 (2009), p. 398. 93 [102] Yaniv Romano, Rina Foygel Barber, Chiara Sabatti, and Emmanuel Candès. “With Malice Toward None: Assessing Uncertainty via Equalized Coverage”. In: Harvard Data Science Review (2020). [103] Yaniv Romano, Evan Patterson, and Emmanuel J Candès. “Conformalized quantile regression”. In: Adv. Neural Inf. Process. Syst. Vol. 32. 2019, pp. 3538–3548. [104] Yaniv Romano, Matteo Sesia, and Emmanuel J. Candès. “Classification with Valid and Adaptive Coverage”. In: Adv. Neural Inf. Process. Syst. Vol. 33. 2020. [105] Etienne Roquain and Mark Van de Wiel. “Optimal weighting for false discovery rate control”. In: Electronic Journal of Statistics 3 (July 2008). doi: 10.1214/09 − EJS430. [106] Richard Lockhart Ryan J. Tibshirani Jonathan Taylor and Robert Tibshirani. “Exact Post-Selection Inference for Sequential Regression Procedures”. In: Journal of the American Statistical Association 111.514 (2016), pp. 600–620. doi: 10.1080/01621459.2015.1108848. [107] Mauricio Sadinle, Jing Lei, and Larry Wasserman. “Least ambiguous set-valued classifiers with bounded error levels”. In: J. Am. Stat. Assoc. 
114.525 (2019), pp. 223–234. [108] Tim Salimans and Durk P Kingma. “Weight normalization: A simple reparameterization to accelerate training of deep neural networks”. In: Advances in neural information processing systems 29 (2016). [109] Craig Saunders, Alexander Gammerman, and Volodya Vovk. “Transduction with confidence and credibility”. In: IJCAI. 1999. [110] Marc A Schaub, Alan P Boyle, Anshul Kundaje, Serafim Batzoglou, and Michael Snyder. “Linking disease associations with regulatory information in the human genome”. In: Genome Res. 22.9 (2012), pp. 1748–1759. [111] Matteo Sesia and Emmanuel J Candès. “A comparison of some conformal quantile regression methods”. In: Stat 9.1 (2020). [112] Matteo Sesia, Stefano Favaro, and Edgar Dobriban. “Conformal Frequency Estimation with Sketched Data under Relaxed Exchangeability”. In: arXiv preprint arXiv:2211.04612 (2022). [113] Matteo Sesia and Yaniv Romano. “Conformal Prediction using Conditional Histograms”. In: Adv. Neural Inf. Process. Syst. 34 (2021). [114] Connor Shorten and Taghi M Khoshgoftaar. “A survey on image data augmentation for deep learning”. In: Journal of Big Data 6.1 (2019), pp. 1–48. [115] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529.7587 (2016), pp. 484–489. 94 [116] James Smith, Ilia Nouretdinov, Rachel Craddock, Charles Offer, and Alexander Gammerman. “Conformal anomaly detection of trajectories with a multi-class hierarchy”. In: International symposium on statistical learning and data sciences. Springer. 2015, pp. 281–290. [117] Hwanjun Song, Minseok Kim, and Jae-Gil Lee. “Selfie: Refurbishing unclean samples for robust deep learning”. In: International Conference on Machine Learning. PMLR. 2019, pp. 5907–5915. [118] C.N. Spracklen, M. Horikoshi, and Y.J. et al. Kim. “Identification of type 2 diabetes loci in 433,540 East Asian individuals.” In: Nature 582 (2020), pp. 240–245. [119] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”. In: J. Mach. Learn. Res. 15 (2014), pp. 1929–1958. [120] Michael L. Stein. “Fixed-Domain Asymptotics for Spatial Periodograms”. In: J. Am. Statist. Assoc. 90.432 (1995), pp. 1277–1288. doi: 10.1080/01621459.1995.10476632. [121] Ingo Steinwart and Andreas Christmann. “Estimating conditional quantiles with the help of the pinball loss”. In: Bernoulli 17.1 (2011), pp. 211–225. [122] John D Storey. “A direct approach to false discovery rates”. In: J. R. Stat. Soc. Series B 64.3 (2002), pp. 479–498. [123] John D Storey, Jonathan E Taylor, and David Siegmund. “Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach”. In: J. R. Stat. Soc. Series B 66.1 (2004), pp. 187–205. [124] John D. Storey. “The positive false discovery rate: a Bayesian interpretation and the q-value”. In: Ann. Statist. 31 (2003), pp. 2013–2035. issn: 0090-5364. [125] David Stutz, Krishnamurthy Dvijotham, Ali Taylan Cemgil, and Arnaud Doucet. “Learning Optimal Conformal Classifiers”. In: arXiv preprint arXiv:2110.09192 (2021). [126] Wenguang Sun and T. Tony Cai. “Oracle and adaptive compound decision rules for false discovery rate control”. In: J. Amer. Statist. Assoc. 102 (2007), pp. 901–912. issn: 0162-1459. 
[127] Wenguang Sun, Brian J Reich, T. T. Cai, Michele Guindani, and Armin Schwartzman. “False discovery control in large-scale spatial multiple testing”. In: J. R. Stat. Soc. B 77.1 (2015), pp. 59–83. [128] James W. Taylor. “A quantile regression neural network approach to estimating the conditional density of multiperiod returns”. In: Journal of Forecasting 19.4 (2000), pp. 299–311. doi: 10.1002/1099 − 131X(200007)19 : 4 < 299 :: AID − FOR775 > 3.0.CO; 2 − V.. [129] Sunil Thulasidasan, Gopinath Chennupati, Jeff A Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. “On mixup training: Improved calibration and predictive uncertainty for deep neural networks”. In: Adv. Neural. Inf. Process. Syst. 2019, pp. 13888–13899. [130] Ryan J Tibshirani, Rina Foygel Barber, Emmanuel Candes, and Aaditya Ramdas. “Conformal prediction under covariate shift”. In: Adv. Neural Inf. Process. Syst. 32 (2019). 95 [131] Xin Tong. “A plug-in approach to Neyman-Pearson classification”. In: J. Mach. Learn. Res. 14.1 (2013), pp. 3011–3040. [132] Xin Tong, Yang Feng, and Jingyi Jessica Li. “Neyman-Pearson classification algorithms and NP receiver operating characteristics”. In: Science Advances 4.2 (2018), eaao1659. [133] Vladimir Vovk. “Conditional Validity of Inductive Conformal Predictors”. In: Proceedings of the Asian Conference on Machine Learning. Vol. 25. 2012, pp. 475–490. [134] Vladimir Vovk. “Transductive conformal predictors”. In: IFIP International Conference on Artificial Intelligence Applications and Innovations. Springer. 2013, pp. 348–360. [135] Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic learning in a random world. Springer, 2005. doi: 10.1007/b106715.. [136] Vladimir Vovk, Alexander Gammerman, and Craig Saunders. “Machine-learning applications of algorithmic randomness”. In: International Conference on Machine Learning. 1999, pp. 444–453. [137] Vladimir Vovk, David Lindsay, Ilia Nouretdinov, and Alex Gammerman. Mondrian Confidence Machine. Technical Report. Royal Holloway, University of London, 2003. [138] Yin Xia, T Tony Cai, and Wenguang Sun. “GAP: A General Framework for Information Pooling in Two-Sample Sparse Inference”. In: J. Am. Statist. Assoc. 115 (2020), pp. 1236–1250. [139] Chen Xu and Yao Xie. “Conformal Anomaly Detection on Spatio-Temporal Observations with Missing Data”. In: preprint arXiv:2105.11886 (2021). [140] Yachong Yang and Arun Kumar Kuchibhotla. Finite-sample Efficient Conformal Prediction. 2021. arXiv: 2104.13871 [stat.ME]. [141] Yachong Yang and Arun Kumar Kuchibhotla. “Finite-sample efficient conformal prediction”. In: arXiv preprint arXiv:2104.13871 (2021). [142] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. “Understanding deep learning (still) requires rethinking generalization”. In: Communications of the ACM 64.3 (2021), pp. 107–115. [143] Anqi Zhao, Yang Feng, Lie Wang, and Xin Tong. “Neyman-Pearson classification under high-dimensional settings”. In: J. Mach. Learn. Res. 17.1 (2016), pp. 7469–7507. [144] Zilu Zhou, Weixin Wang, Li-San Wang, and Nancy Ruonan Zhang. “Integrative DNA copy number detection and genotyping from sequencing and array-based platforms”. In: Bioinformatics 34.14 (2018), pp. 2349–2355. 96 Appendix A Appendix to Chapter 2 A1 FDR control with TCV+ integrative conformal p-values The conditional FDR calibration method can also be adapted to deal with the TCV+ integrative p-values computed by Algorithm 3, although at higher computational cost. 
Our three-step solution is outlined below and summarized graphically in Figure A5. Step 1: Calibration. For each of the m test points, i ∈ Dtest = {n + 1, . . . , n + m}, let Φi be the following collection of variables: Φi = sˆ k(j) 1 (Xj ), ∀j ∈ D1, unordered set {sˆ k 1 (Xj )}j , for j ∈ D0 ∪ {i}, and for all k ∈ [K], sˆ k 1 (Xj ), for all j ∈ Dtest\{i} and k ∈ [K], unordered set {Xj}j , for j ∈ D0 ∪ {i}, Xj , for all j ∈ Dtest\{i}. (A1) For each test point i ∈ Dtest, compute a vector of m random variables u˜i(Xj ), for all j ∈ Dtest \{i}, by applying Algorithm 3 with D0 replaced by D0∪{i}. In contrast with Section 2.3, now evaluating each u˜i(Xj ) 97 involves re-fitting K machine-learning models, and is therefore quite expensive. Let R˜ i be the number of rejections obtained by applying BH at level α to {u˜i(Xn+1), . . . , u˜i(Xi−1), 0, u˜i(Xi+1), . . . , u˜i(Xn+m)}. Steps 2–3. Same as the corresponding steps 2–3 in Section 2.3. Theorem A1 below establishes that this solution rigorously controls the FDR. Theorem A1. Assume Xi is exchangeable with the inliers in D0, conditional on D1, for all i ∈ Dtest with Yi = 0. Then, the expected proportion of inliers in the set R output by the above three-step procedure is smaller than αm0/m, where m = |Dtest| and m0 is the number of inliers in Dtest . The computational cost of this guarantee is high. In fact, the conditional calibration algorithm needs to re-train a potentially large number of machine learning classifiers in the case of the TCV+ p-values, and therefore this approach may not be very practical when working with large data sets. Fortunately, pairwise correlations between integrative conformal p-values naturally tend to decrease as the number of labeled data points increases. Therefore, it is reasonable to expect that BH can be safely relied upon with large samples, as suggested by our numerical experiments. A2 Analysis of computational complexity This section studies the computational complexity of the methods presented in this paper, comparing the results to the benchmark cost of computing standard conformal p-values. These analyses focus on computing time because memory costs do not cause additional concerns for the algorithms considered here. Before we begin, it is useful to recall that n = |D0| + |D1| is the total number of labeled data points, n0 = |Dcal 0 | is the number of labeled inliers assigned in the calibration set, and n1 = |Dcal 1 | is the corresponding number of labeled outliers. In the case of Algorithm 3, which does not involve a clear distinction between training and calibration data, it will be understood that n0 = |D0| and n1 = |D1|. This slight abuse of notation will not introduce any ambiguity as long as O(n1) = O(n0) = O(n). Let 98 Table A1: Computational costs of computing conformal p-values using different methods. These results are derived in Sections A2–A2 Single test point Multiple (m) test points Standard conformal p-values [135] O(T + n) O(T + (m + n) log n) Algorithm 1 O(T + n log n) O(T + (m + mn ∧ n 2 ) log n) Algorithm 2 (with M candidate models) O(MT + n(M + log n)) O(MT + (M(m + n) + (mn ∧ n 2 )) log n) Algorithm 3 (with K folds) O(KT + Kn log n) O(KmT + (K(n + m) + mn) log n) Transductive conformal p-values [134] O(T + n) O(mT + mn) T0 = T0(n) and T1 = T1(n) respectively denote the computational costs of training the classifiers A0 and A1 utilized by Algorithm 1. 
In the case of Algorithm 2, we will assume for simplicity that each of the M0 classifiers Am 0 has the same training cost T0, and each of the M1 classifiers Am 1 has the same training cost T1. In the case of Algorithm 3, the numbers of folds for the classifiers Am 0 and Am 1 are K0 and K1, respectively. The results of this analysis, explained in detail below, are summarised in Table A1 under the simplifying assumptions that O(n1) = O(n0) = O(n), O(T0) = O(T1) = O(T), O(M0) = O(M1) = O(M), and O(K0) = O(K1) = O(K). Note that the following short-hand notation will be used: a ∧ b = min{a, b}. Cost analysis of standard conformal p-values The results of these section can be found in [135], but we repeat them here for completeness. Analysis for a single test point. The cost of computing a standard (inductive) conformal p-value for a single test point is clearly O(T + n), as shown below. • Training the classifier has cost O(T0). • After the classifier is trained, the cost of computing conformity scores for all i ∈ Dcal 0 ∪ {n + 1} is O(n0). 99 • The cost of computing uˆ0(Xn+1) by directly applying (2.1) is O(n0). Therefore, the overall cost is O(T0 + n0), which becomes O(T + n) under the simplifying assumptions. Analysis for m test points. The cost of computing standard conformal p-values for m distinct test points, indexed by Dtest, is O(T + (m + n) · log n), as shown below. • The fixed cost of training the classifier is O(T0), because this only needs to be trained once on the data in Dtrain 0 . • After the classifier is trained, the cost of computing conformity scores for all i ∈ Dcal 0 ∪ Dtest is O(n0 + m). • The cost of computing uˆ0(Xj )for all j ∈ Dtest with (2.1) is O((m+n0)·log n). In fact, the numerator of (2.1) is equal to one plus the rank of sˆ0(Xj ) among {sˆ0(Xi)}i for i ∈ Dcal 0 . Sorting the latter can be carried out at worst-case cost O(n0 · log n0). Then, the rank of each of the m statistics sˆ0(Xj ) can be found at cost O(log n0) with a binary search. Therefore, the overall cost is O(T0 + (m + n0) · log n0), which becomes O(T + (m + n) · log n) under the simplifying assumptions. Cost analysis of Algorithm 1 Analysis for a single test point. The cost of applying Algorithm 1 to compute an integrative conformal p-value for a single test point is O(T + n · log n). This result is derived below, by analyzing separately each component of Algorithm 1. • Training the classifiers has cost O(T0 + T1). • After the classifiers are trained, the cost of computing conformity scores for all i ∈ Dcal 0 ∪ Dcal 1 ∪ {n + 1} is O(n0 + n1). 100 • The cost of computing uˆ0(Xi) with (2.4) for all i ∈ Dcal 0 ∪ {n + 1} is O(n0 · log n0). In fact, the numerator of (2.4) is equal to the rank of sˆ0(Xj ) among {sˆ0(Xi)}i for i ∈ Dcal 0 ∪ {n+ 1}, and thus it suffices to sort the latter in order to compute sˆ0(Xj ) for all j ∈ Dcal 0 {n + 1}. This sorting operation can be carried out at worst-case cost O(n0 · log n0). • The cost of computing uˆ1(Xi) with (2.2) for all i ∈ Dcal 0 ∪ {n + 1} is O((n0 + n1) · log n1). In fact, note that evaluating the numerator of (2.2), for any j ∈ Dcal 0 ∪ {n + 1}, requires finding the rank of sˆ1(Xj ) among {sˆ1(Xi)}i for i ∈ Dcal 1 . Pre-sorting the vector {sˆ1(Xi)}i for i ∈ Dcal 1 has cost O(n1 · log n1). Then, after pre-sorting, the additional cost of finding the correct position of each sˆ1(Xj ), for j ∈ Dcal 0 ∪ {n + 1}, is only O(log n1) using a binary search. • Evaluating the statistics rˆ(Xi) for all i ∈ {n + 1} ∪ Dcal 0 using (2.3) has cost O(n0). 
• Calculating the output conformal p-value uˆ(Xn+1) with (2.5) has cost O(n0). Therefore, the overall cost is O(T0+T1+n0·log n0+n1·log n1+n0·log n1), which becomes O(T +n·log n) under the simplifying assumptions. Analysis for m test points. The cost of applying Algorithm 1 to compute integrative conformal pvalues for m distinct test points, indexed by Dtest, is O(T + min{m · n, m + n 2 )} · log n). This result is derived below, by analyzing separately each component of Algorithm 1. • The fixed cost of training the classifiers is O(T0 + T1), because these only need to be trained once on the data in Dtrain 0 and Dtrain 1 . • After the classifiers are trained, the cost of computing conformity scores for all i ∈ Dcal 0 ∪Dcal 1 ∪Dtest is O(n0 + n1 + m). • The cost of computing uˆ1(Xi) with (2.2) for all i ∈ Dcal 0 ∪ Dtest is O((n1 + n0 + m) · log n1). In fact, note that evaluating the numerator of (2.2), for any j ∈ Dcal 0 ∪ Dtest, requires finding the rank 101 of sˆ1(Xj ) among {sˆ1(Xi)}i for i ∈ Dcal 1 . Pre-sorting the vector {sˆ1(Xi)}i for i ∈ Dcal 1 has cost O(n1 · log n1). Then, after pre-sorting, the additional cost of finding the correct position of each sˆ1(Xj ), for j ∈ Dcal 0 ∪ Dtest, would only be O(log n1) using a binary search. • The cost of computing the output conformal p-values uˆ(Xi) for all i ∈ Dtest with (2.5) can be understood as follows. First, note that for each i ∈ Dtest: uˆ(Xi) = 1 1 + n0 1 + X j∈Dcal 0 I uˆ0(Xj ; Xi) uˆ1(Xj ) ≤ uˆ0(Xi ; Xi) uˆ1(Xi) , (A2) where, by (2.3) and (2.4), uˆ0(Xj ; Xi) = 1 n0 + 1 X k∈Dcal 0 Isˆ0(Xk) ≤ sˆ0(Xj ) + 1 n0 + 1 Isˆ0(Xi) ≤ sˆ0(Xj ) = ˜u0(Xj ) + 1 n0 + 1 Isˆ0(Xi) ≤ sˆ0(Xj ). Above, the function u˜0 is defined as: u˜0(x) := 1 n0 + 1 X k∈Dcal 0 Isˆ0(Xk) ≤ sˆ0(x). Note that the cost of computing u˜0(Xj ) with (2.2) for all j ∈ Dcal 0 ∪ Dtest is O((n0 + m)· log n0). In fact, pre-sorting the vector {sˆ0(Xk)}k for k ∈ Dcal 0 has cost O(n0 · log n0). Then, after pre-sorting, the additional cost of finding the correct position of each sˆ0(Xj ), for j ∈ Dcal 0 ∪ Dtest, would only be O(log n0) using a binary search. With this notation, (A2) can be re-written as: uˆ(Xi) = 1 + P j∈Dcal 0 I u˜0(Xj ) + Isˆ0(Xi)≤sˆ0(Xj ) n0+1 uˆ1(Xj ) ≤ u˜0(Xi) + 1 n0+1 uˆ1(Xi) 1 + n0 . (A3) 102 We now distinguish between two possible approaches. • Approach I (efficient for m ≤ n0). In this case, we evaluate the numerator of (A3) independently for each i ∈ Dtest. This amounts to finding the rank of [˜u0(Xi)+1/(1+n0)]/uˆ1(Xi) among {[˜u0(Xj )+ Isˆ0(Xi) ≤ sˆ0(Xj )/(n0 + 1)]/uˆ1(Xj )}j for j ∈ Dcal 0 . As discussed above, each of these operations can be carried out with cost O(n0 · log n0), so the total cost of this component is O(m · n0 · log n0). In conclusion, the total cost of Algorithm 1 with this approach is: O(T0 + T1 + (n1 + n0 + m) · log n1 + m · n0 · log n0), which becomes O(T + m · n · log n) under the simplifying assumptions. • Approach II (efficient for m > n0). In this case, we note that the rank of sˆ0(Xi) among {sˆ0(Xj )}j for j ∈ Dcal 0 can only take one of 1 + n0 possible values, and therefore there are at most 1 + n0 possible configurations of the vector {[˜u0(Xj ) + Isˆ0(Xi) ≤ sˆ0(Xj )/(n0 + 1)]/uˆ1(Xj )}j for j ∈ Dcal 0 , for any set of {Xj}j∈Dcal 0 , regardless of m. Therefore, pre-sorting all of these vectors has cost O(n 2 0 · log n0). 
Then, finding the rank of [˜u0(Xi) + 1/(1 + n0)]/uˆ1(Xi), for any one i ∈ Dtest , among the appropriate pre-sorted vector {[˜u0(Xj ) + Isˆ0(Xi) ≤ sˆ0(Xj )/(n0 + 1)]/uˆ1(Xj )}j , for j ∈ Dcal 0 , can be carried out at cost O(log n0) using a binary search. Repeating the last step for all i ∈ Dtest yields cost O(m·log n0). In conclusion, the total cost of Algorithm 1 with this approach is: O(T0+T1+ (n1+n0+m)·log n1+n 2 0 ·log n0+m·log n0), which becomes O(T + (m+n 2 )·log n) under the simplifying assumptions. Cost analysis of Algorithm 2 Let us assume for simplicity that each of the M0 classifiers Am 0 has the same training cost T0, and each of the M1 classifiers Am 1 has the same training cost T1. Analysis for a single test point. The cost of applying Algorithm 2 to compute an integrative conformal p-value for a single test point is O(M · (T0 + T1) + n · (M + log n)). This result is derived below, by analyzing separately each component of Algorithm 2. 103 • Training the classifiers Am 0 has cost O(M0 · T0 + M1 · T1). • After the classifiers Am 0 are trained, the cost of computing conformity scores sˆ m 0 (Xi) for all models m ∈ [M0] and all i ∈ Dcal 0 ∪ Dcal 1 ∪ {n + 1} is O(M0 · (n0 + n1)). • Selecting the optimal classifier m∗ 0 has cost O(M0 · (n0 + n1)). • After the classifiers Am 1 are trained, the cost of computing conformity scores sˆ m 1 (Xi) for all models m ∈ [M1] and all i ∈ Dcal 0 ∪ Dcal 1 ∪ {n + 1} is O(M1 · (n0 + n1)). • Selecting the optimal classifier m∗ 1 has cost O(M1 · (n0 + n1)). • The remaining part of Algorithm 2, which is based on the selected models m∗ 0 and m∗ 1 , has the same cost, O(n0 · log n0 + n1 · log n1 + n0 · log n1), as the corresponding part of Algorithm 1. Therefore, the overall cost is O(M0·T0+M1·T1+(M0+M1)·(n0+n1)+n0·log n0+n1·log n1+n0·log n1), which becomes O(MT + n · (M + log n)) under the simplifying assumptions. Analysis for m test points. The cost of applying Algorithm 2 to compute integrative conformal pvalues for m distinct test points, indexed by Dtest, is O(MT + M · (n + m) · log n + (mn ∧ n 2 ) log n). This result is derived below, by analyzing separately each component of Algorithm 1. • The fixed cost of training the classifiers is O(M0 · T0 + M1 · T1), because these only need to be trained once on the data in Dtrain 0 and Dtrain 1 . • After the classifiers Am 0 are trained, the cost of computing conformity scores sˆ m 0 (Xi) for all models m ∈ [M0] and all i ∈ Dcal 0 ∪ Dcal 1 ∪ Dtest is O(M0 · (n0 + n1 + m)). • Selecting the optimal classifier m∗ 0 that maximizes the median difference between the conformity scores sˆ m 0 (Xi) evaluated on {i} ∪ Dcal 0 and those evaluated on Dcal 1 , separately for each test point i ∈ Dtest, has total cost O(M0·(n0+m)·log n0+M0·n1·log n1). To see this, first note that the median 104 score evaluated on Dcal 1 only need to be evaluated once, at total cost O(M0 · n1 · log n1) and then they can be re-used for all test points. Further, to calculate the median scores for each {i} ∪ Dcal 0 , we can first sort the scores corresponding to Dcal 0 at cost O(M0 ·n0 ·log n0), and then update the median for each i ∈ Dtest at cost O(M0 · m · log n0) using insertions and deletions in the sorted array via a binary search. Thus, the overall cost of this component is O(M0 ·(n0+m)·log n0+M0 ·n1 ·log n1). • After the classifiers Am 1 are trained, the cost of computing conformity scores sˆ m 1 (Xi) for all models m ∈ [M1] and all i ∈ Dcal 0 ∪ Dcal 1 ∪ Dtest is O(M1 · (n0 + n1 + m)). 
• Selecting the optimal classifier m∗ 1 that maximizes the median difference between the conformity scores sˆ m 1 (Xi) evaluated on {i} ∪ Dcal 0 and those evaluated on Dcal 1 , separately for each test point i ∈ Dtest, has total cost O(M1 · (n0 + m) · log n0 + M1 · n1 · log n1). This follows by the same argument used above for m∗ 0 . • The remaining part of Algorithm 2, which is based on the selected models m∗ 0 and m∗ 1 , has the same cost, O((n1 + n0 + m)· log n1 + (n 2 0 ∧ (m · n0))· log n0), as the corresponding part of Algorithm 1. Therefore, the overall cost is O(M0 · T0 + M1 · T1 + M0 · (n0 + n1 + m) + M0 · (n0 + m) · log n0 + M0 ·n1 · log n1 +M1 ·(n0 +n1 + m) +M1 ·(n0 + m)· log n0 +M1 ·n1 · log n1 + (n1 +n0 + m)· log n1 + ((n 2 0 + m) ∧ (m · n0))· log n0), which becomes O(M · T + M ·(n + m)· log n + (mn ∧ n 2 ) log n) under the simplifying assumptions. Cost analysis of Algorithm 3 Let us assume for simplicity that each of the M0 classifiers Am 0 has the same training cost T0, and each of the M1 classifiers Am 1 has the same training cost T1, regardless of the number of folds K0 and K1. Further, note that n0 = |D0| and n1 = |D1| throughout this section. 105 Analysis for a single test point. The cost of applying Algorithm 3 to compute an integrative conformal p-value for a single test point is O(K0 · T0 + K1 · T1 + (n1 + K1 · n0) · log n1 + n0 · log n0). This result is derived below, by analyzing separately each component of Algorithm 3. • Training the classifiers has total cost O(K0 · T0 + K1 · T1). • After the classifiers Am 1 are trained, the cost of computing uˆ1(Xi) with (2.6) for all i ∈ D0 ∪ {n+ 1} is O((n1 + K1 · n0) · log n1). This can be understood as follows. Note that, for any value of x, the quantity uˆ1(x) defined in (2.6) can be equivalently re-written as uˆ1(x) = 1 1 + |D1| 1 +X K1 k=1 X j∈Dk 1 Isˆ k 1 (Xj ) ≤ sˆ k 1 (x) = 1 1 + |D1| " 1 +X K1 k=1 rank sˆ k 1 (x), {sˆ k 1 (Xj )}j∈Dk 1 # , (A4) where rank(a, v) denotes the rank of a scalar a within a vector v. For each k ∈ [K1], the vector of scores {sˆ k 1 (Xj )}j∈Dk 1 can be evaluated and sorted at cost O((n1/K) · log(n1/K)). Then, for each k ∈ [K1] and each i ∈ D0 ∪ {n + 1}, the rank of sˆ k 1 (Xi) in {sˆ k 1 (Xj )}j∈Dk 1 can be found at cost O(log(n1/K) via a binary search. Therefore, the total cost of evaluating uˆ1(Xi) for all i ∈ D0 ∪ {n + 1} is O(n1 · log(n1/K) + K1 · n0 · log(n1/K)), which we conservatively simplify to O((n1 + K1 · n0) · log n1). • After the classifiers Am 0 are trained, the cost of computing uˆ0(Xi) with (2.7) for all i ∈ D0 ∪ {n+ 1} is O(n0 · log n0). This can be understood as follows. Note that, by definition of (2.7), for any i ∈ D0 ∪ {n + 1}, the numerator of uˆ0(Xi) can be found directly from the rank of sˆ k(i) 0 (Xi) in the vector {sˆ k 0 (Xj )}j∈D0 . Evaluating and sorting this vector has cost O(n0 · log n0), and then each of the O(n0) ranks of interest can be found at cost log n0 using a binary search. • Evaluating the statistics rˆ(Xi) for all i ∈ {n + 1} ∪ D0 has cost O(n0). 106 Therefore, the overall cost is O(K0 · T0 + K1 · T1 + (n1 + K1 · n0) · log n1 + n0 · log n0), which becomes O(K · T + K · n · log n) under the simplifying assumptions. Analysis for m test points. The cost of applying Algorithm 3 to compute integrative conformal pvalues for m distinct test points, indexed by Dtest, is O(K · m · T + (K · (n + m) + m · n) · log n). This result is derived below, by analyzing separately each component of Algorithm 3. 
• The fixed cost of training the classifiers Am 1 is O(K1 · T1), because these only need to be trained once on the data in D1. • As long as D0 is split into the same K0 folds for each test point, only K0 − 1 new models need to be trained for each i ∈ Dtest. Therefore, the total cost of training the classifiers Am 0 for all test points is O(T0 + m · (K0 − 1) · T0), which we simply write as O(m · K0 · T0). • After the classifiers Am 1 are trained, the cost of computing uˆ1(Xi) with (2.6) for all i ∈ D0 ∪ Dtest is O((n1 + K1 · (n0 + m)) · log n1). This can be understood as follows. Note that, for any value of x, the quantity uˆ1(x) defined in (2.6) can be equivalently re-written as in (A4). For each k ∈ [K1], the vector of scores {sˆ k 1 (Xj )}j∈Dk 1 can be evaluated and sorted at cost O((n1/K) · log(n1/K)). Then, for each k ∈ [K1] and each i ∈ D0 ∪ Dtest, the rank of sˆ k 1 (Xi) in {sˆ k 1 (Xj )}j∈Dk 1 can be found at cost O(log(n1/K) via a binary search. Therefore, the total cost of evaluating uˆ1(Xi) for all i ∈ D0 ∪ Dtest is O(n1 · log(n1/K) + K1 ·(n0 + m) log(n1/K)), which we conservatively simplify to O((n1 + K1 · (n0 + m)) · log n1). • The cost of computing uˆ0(Xi) with (2.2) for all i ∈ Dcal 0 ∪ Dtest is O((n0 + m) · log n0). In fact, note that evaluating the numerator of (2.2), for any j ∈ Dcal 0 ∪ Dtest, requires finding the rank of sˆ0(Xj ) among {sˆ0(Xi)}i for i ∈ Dcal 0 . Pre-sorting the vector {sˆ0(Xi)}i for i ∈ Dcal 0 has cost O(n0 · log n0). Then, after pre-sorting, the additional cost of finding the correct position of each sˆ0(Xj ), for j ∈ Dcal 0 ∪ Dtest, would only be O(log n0) using a binary search. 107 • Finally, the cost of computing the output conformal p-values uˆ(Xi) for all i ∈ Dtest with (2.5) can be understood as follows. First, note that for each i ∈ Dtest: uˆ(Xi) = 1 1 + n0 1 + X j∈D0 I uˆ0(Xj ; k0(j), Xi) uˆ1(Xj ) ≤ uˆ0(Xi ; k0(i), Xi) uˆ1(Xi) , (A5) where uˆ0(Xj ; k0(j), Xi) = ˜u0(Xj ; k0(j)) + 1 n0 + 1 Isˆ k(i) 0 (Xi) ≤ sˆ k(j) 0 (Xj ), u˜0(Xj ; k(j)) = 1 n0 + 1 X l∈D0 Isˆ k(l) 0 (Xl) ≤ sˆ k(j) 0 (Xj ). Note that the cost of computing u˜0(Xj ; k(j)) for all j ∈ D0 ∪ Dtest is O((n0 + m)· log n0). In fact, pre-sorting the vector {sˆ k(l) 0 (Xl)}l for l ∈ D0 has cost O(n0 · log n0). Then, after pre-sorting, the additional cost of finding the correct position of each sˆ k(j) 0 (Xj ), for j ∈ D0∪Dtest, is only O(log n0) using a binary search. With this notation, (A5) can be re-written as: uˆ(Xi) = 1 + P j∈D0 I u˜0(Xj ) + Isˆ k(i) 0 (Xi)≤sˆ k(j) 0 (Xj ) n0+1 uˆ1(Xj ) ≤ u˜0(Xi) + 1 n0+1 uˆ1(Xi) 1 + n0 . (A6) We evaluate the numerator of (A6) independently for each i ∈ Dtest. This amounts to finding the rank of [˜u0(Xi) + 1/(1 + n0)]/uˆ1(Xi) among {[˜u0(Xj ) + Isˆ k(i) 0 (Xi) ≤ sˆ k(j) 0 (Xj )/(n0 + 1)]/uˆ1(Xj )}j for j ∈ Dcal 0 . As discussed above, each of these operations can be carried out with cost O(n0 ·log n0), so the total cost of this component is O(m · n0 · log n0). In conclusion, the total cost of Algorithm 3 108 with this approach is: O(K0 · m · T0 + K1T1 + (n1 + K1 · (n0 + m)) · log n1 + m · n0 · log n0), which becomes O(K · m · T + (K · (n + m) + m · n) · log n) under the simplifying assumptions. Therefore, under the simplifying assumptions, the total cost is O(KmT +(K ·(n+m)+m·n)·log n). 
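The recurring primitive in all of the cost bounds above is simple: pre-sort the calibration scores once, then rank each new score with a binary search. The following Python sketch (with hypothetical function and variable names, not code taken from the algorithms themselves) illustrates how m standard conformal p-values can be computed at cost O(n0 log n0 + m log n0) instead of O(m n0); the score orientation follows the convention used in this chapter, where smaller conformity scores indicate more atypical points.

```python
from bisect import bisect_right
import numpy as np

def conformal_pvalues(cal_scores, test_scores):
    """Split-conformal p-values for many test points via one sort plus binary searches.

    cal_scores:  conformity scores of the calibration inliers (smaller = more atypical).
    test_scores: conformity scores of the test points, computed with the same model.
    """
    cal_sorted = sorted(cal_scores)          # one-time O(n0 log n0) cost
    n0 = len(cal_sorted)
    pvals = []
    for s in test_scores:
        # number of calibration scores <= s, found in O(log n0) by binary search
        rank = bisect_right(cal_sorted, s)
        pvals.append((1 + rank) / (1 + n0))
    return np.array(pvals)

# Hypothetical usage: outliers tend to receive small scores, hence small p-values.
rng = np.random.default_rng(0)
cal = rng.normal(size=1000)
test = np.concatenate([rng.normal(size=3), rng.normal(loc=-3.0, size=2)])
print(conformal_pvalues(cal, test))
```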
Cost analysis of transductive conformal p-values In a nutshell, a transductive conformal p-value [134] for a test point Xi , where i ∈ Dtest, is obtained by fitting a one-class classifier on the data in D0 ∪ {i} and then using the scoring function sˆ (i) 0 corresponding to this model to evaluate: uˆ0(Xi) = 1 + |j ∈ Dcal 0 : ˆs (i) 0 (Xj ) ≤ sˆ (i) 0 (Xi)| 1 + |Dcal 0 | . Therefore, the cost of evaluating a single transductive conformal p-value is O(T + n0). Further, this procedure is completely independent for different test points, so the total cost of evaluating transductive conformal p-values for all i ∈ Dtest is O(m · T + m · n0). A3 TCV+ beyond conformal out-of-distribution testing Consider a set of n + 1 data points, (Xi , Yi) for i ∈ [n + 1] = {1, . . . , n + 1}, where Xi ∈ R d is a feature vector for the i-th sample, and Yi ∈ [C] is a corresponding categorical label with C possible values. Suppose the data are sampled exchangeably from some unknown distribution PX,Y . The goal is to predict Yn+1 given Xn+1 and the first n labeled observations. The TCV+ method in Section 2.2.3 can be extended to compute prediction sets with guaranteed marginal coverage for this multi-class classification problem, as outlined in Algorithm A1. This solution has the advantage of enjoying tighter theoretical coverage (1−α instead of 1−2α) compared to the traditional inductive cross-validation+ approach of [5] 109 and [104], although it is more computationally expensive. Further, this solution reduces to the leave-oneout transductive method of [134] if each fold contains only one sample. Theorem A2. Suppose (Xn+1, Yn+1) is exchangeable with the n labeled data points and Cˆ α(Xn+1) is computed by Algorithm A1, with training algorithms that are invariant to the order of the input data. Then, P h Yn+1 ∈ Cˆ α(Xn+1) i ≥ 1 − α for any α ∈ (0, 1). Algorithm A1 TCV+ prediction sets for multi-class classification 1: Input: labeled data set D, black-box classifier A, test point Xn+1, number of folds K, number of possible classes C, confidence level α ∈ (0, 1). 2: for y = 1 . . . C do 3: Randomly split D ∪ {n + 1} into K disjoint folds D1 , . . . , DK. 4: for k = 1 . . . K do 5: Train A on D \ Dk , pretending Yn+1 = y. 6: Calculate conformity scores s y (Xj ; k) for all j ∈ Dk based on A; e.g., as in [104]. 7: end for 8: Let k(i) denote the fold of Xi and calculate uˆ(Xn+1; y) = 1 + P i∈D Is y (Xi ; k(i)) ≤ s y (Xn+1; k(n + 1)) 1 + |D| . Define Cˆ α(Xn+1) ⊆ {1, . . . , C} as: Cˆ α(Xn+1) = {y ∈ {1, . . . , C} : ˆu(Xn+1; y) > α} . 9: end for 10: Output A prediction set Cˆ α(Xn+1) for the unknown label Yn+1. 110 The connection between Algorithm 3 and Algorithm A1 becomes clearer if we consider also Algorithm A2: a further extension of Algorithm A1 designed to construct prediction sets for multi-class classification with guaranteed label-conditional [135, 107, 104] coverage. Theorem A3. Fix any possible label y and suppose Yn+1 = y. Assume (Xn+1, Yn+1) is exchangeable with the labeled data points indexed by {i ∈ [n] : Yi = y}. Suppose that Cˆ α(Xn+1) is computed by Algorithm A1, using training algorithms that are invariant to the order of the input data points. Then, P h Yn+1 ∈ Cˆ α(Xn+1) | Yn+1 = y i ≥ 1 − α for any α ∈ (0, 1) and y ∈ [C]. 111 Algorithm A2 TCV+ label-conditional prediction sets for multi-class classification 1: Input: labeled data set D, black-box classifier A, test point Xn+1, number of folds K, number of possible classes C, confidence level α ∈ (0, 1). 2: for y = 1 . . . 
C do 3: Define Dy = {i ∈ D : Yi = y}. 4: Randomly split Dy ∪ {n + 1} into K disjoint folds D1 y , . . . , DK y . 5: for k = 1 . . . K do 6: Train A on D \ Dk y , pretending that Yn+1 = y. 7: Calculate s y (Xj ; k) for all j ∈ Dk y based on A; e.g., as in [104]. 8: end for 9: Let k(i) denote the fold of Xi and calculate uˆ(Xn+1; y) = 1 + P i∈Dy Is y (Xi ; k(i)) ≤ s y (Xn+1; k(n + 1)) 1 + |Dy| . Define Cˆ α(Xn+1) ⊆ {1, . . . , C} as: Cˆ α(Xn+1) = {y ∈ {1, . . . , C} : ˆu(Xn+1; y) > α} . 10: end for 11: Output A prediction set Cˆ α(Xn+1) for the unknown label Yn+1. 112 A4 Asymptotic power analysis A4.1 Simplified analysis under independence and a mixture working model Recall the statement of Assumption 1: the weights uˆ1(Xi), for i ∈ Dtest, satisfy P i∈Dtest P [Yi = 0 | Z1,i] P i∈Dtest P [Yi = 0 | Z1,i] ˆu1(Xi) · P i∈Dtest P [Yi = 1 | Z1,i] P i∈Dtest P [Yi = 1 | Z1,i] ˆu1(Xi)−1 ≥ 1. Recall also that, throughout this section, all probabilities and expected values are implicitly conditional on F; i.e., the only remaining randomness is that in the test data. To gain a deeper understanding of Assumption 1, we focus on making the expression for Ξ more explicit under a simplifying working model, generalizing the related analysis described in Section 2.4 but arriving to qualitatively similar conclusions. As stated in Assumption 3, suppose the test data are independent of one another and of the labeled data. Further, imagine that the latent features Z1,i learnt by the machine-learning algorithm A1 are binary, i.e., Z1,i ∈ {0, 1} for all i ∈ Dtest, and that the conditional distribution of Yi | Z1,i is P [Yi = 1 | Z1,i] = ϵ, if Z1,i = 0, 1 − ϵ, if Z1,i = 1, (A7) for some parameter ϵ ∈ [0, 1/2]. In the special case of ϵ = 0, this working model reduces to that described by Assumption 4 in Section 2.4, as it implies that the test outliers are separable from the inliers. In general, for ϵ > 0, this working model describes a test set sampled from a mixture distribution with two components. One component (with Z1 = 1) corresponds to test points that are more likely to be outliers, while the other component (Z1 = 0) corresponds to test points that are more likely to be inliers. The parameter ϵ thus represents a noise level and it controls the intrinsic difficulty of testing for outliers using the features learnt by A1. 113 Let Dtest 0 = {i ∈ Dtest : Z1,i = 0} and Dtest 1 = {i ∈ Dtest : Z1,i = 1} indicate the data points in the test set that are more likely to be inliers and outliers, respectively. Let λ0 denote the expected proportion of test points in Dtest 0 ; that is, λ0 = P [Z1,i = 0] for all i ∈ Dtest. Suppose that, for m large, m0 = |Dtest 0 | and m1 = |Dtest 1 | both grow linearly in m; i.e., m0 ≈ λ0m and m1 ≈ (1 − λ0)m. Then, it is easy to verify that, in the large-m limit, Assumption 1 becomes approximately equivalent to saying that Ξ ′ ≥ 1, where Ξ ′ is defined as: Ξ ′ := ϵ + λ0 − 2ϵλ0 (1 − ϵ)λ0ϕ0 + ϵ(1 − λ0)ϕ1 · ϵλ0 + (1 − ϵ)(1 − λ0) ϵλ0φ0(n1) + (1 − ϵ)(1 − λ0)φ1(n1) , (A8) and ϕ0(n1) := E [ˆu1(Xi) | Z1,i = 0] , ϕ1(n1) := E [ˆu1(Xi) | Z1,i = 1] φ0(n1) := E uˆ −1 1 (Xi) | Z1,i = 0 , φ1(n1) := E uˆ −1 1 (Xi) | Z1,i = 1 . In order to make the expression for Ξ ′ in (A8) even more explicit, let us also introduce a parametric working model for uˆ1(Xi) | Z1,i = 0 and uˆ1(Xi) | Z1,i = 1. For any z ∈ {0, 1}, suppose that uˆ1(X) | Z1 = z follows a discrete Beta distribution (see Section A4.2) with parameters (n1 + 1, νz, 1), for some νz > 0. That is, uˆ1(Xi) | Z1,i = 0 i.i.d. 
∼ dBeta(n1 + 1, ν0, 1), uˆ1(Xi) | Z1,i = 1 i.i.d. ∼ dBeta(n1 + 1, ν1, 1). (A9) Above, n1 = |Dcal 1 | denotes the number of labeled data points in the outlier set utilized to define the function uˆ1, as explained in Section 2.2.1. This working model can be interpreted as follows. 114 The parameter ν1 controls the distribution of uˆ1(Xi) for the likely outliers i ∈ Dtest 1 . The special case of ν1 = 1 corresponds to a uniform distribution, which is what should be expected if the labeled data in Dcal 1 are exchangeable with the test data in Dtest 1 . Recall that the test data in Dtest 1 contain on average a fraction ϵ of true inliers, and all inliers are exchangeable. Therefore, ν1 = 1 roughly corresponds to a situation where Dcal 1 contains true outliers exchangeable with those in Dtest 1 , as well as a fraction ϵ of mislabeled inliers. By contrast, ν1 < 1 corresponds to a situation in which the data in Dtest 1 may not be exchangeable with those in Dcal 1 , either because the outliers are sampled from a different distribution or because the proportions of (mislabeled) inliers are not the same. The case of ν1 > 1 has a similar interpretation, but it reflects situations in which the one-class classifier trained on Dcal 1 leads to unusually large (rather than small) values of uˆ1(Xi) for i ∈ Dtest 1 , as in the example of Figure A1. By the definition of the discrete beta distribution in Section A4.2, larger values of the parameter ν1 are associated with the true outliers in the test set receiving larger weights uˆ1(Xi), which should be expected to yield a larger power boost for integrative conformal p-values. The parameter ν0 similarly controls the distribution of uˆ1(Xi) for the likely inliers i ∈ Dtest 0 . Intuitively, smaller values of ν0 are associated with the true inliers in the test set receiving smaller weights uˆ1(Xi), which should be expected to yield a larger power boost for integrative conformal p-values. It follows from the asymptotic approximations presented in Section A4.2 that the working model defined in (A9) allows the expression for Ξ ′ in (A8) to be approximated, for m and n1 large enough, as: Ξ ′ ≈ ϵ + λ0 − 2ϵλ0 (1 − ϵ)λ0 ν0 ν0+1 + ϵ(1 − λ0) ν1 ν1+1 · ϵλ0 + (1 − ϵ)(1 − λ0) ϵλ0φ(n1, ν0) + (1 − ϵ)(1 − λ0)φ(n1, ν1) , (A10) 115 with φ(n1, ν) defined as in (A15): φ(n, ν) := γ + log(n), if ν = 1, ν ν − 1 + ν · ζ(2 − ν) · n 1−ν , if ν ̸= 1. Note that, as n1 → ∞, φ(n1, ν) → ν/(1 − ν) if ν > 1, and φ(n1, ν) → ∞ otherwise. In the special case of ϵ = 0, this reduces to the analogous ratio Ξ defined in Section 2.4 under the noiseless working model; i.e., see Equation (2.12): Ξ ′ ≈ ν0 + 1 ν0 · 1 φ(n1, ν1) ≈ Ξ. A4.2 The discrete beta distribution The distribution described in this section is useful for our asymptotic power analysis. We say that a random variable U ∈ (0, 1] follows a discrete Beta distribution with parameters (n, ν, 1), namely dBeta(n, ν, 1), where n ∈ N and ν > 0, if P U = k n + 1 = k −(1−ν) H (1−ν) n+1 , k ∈ {1, . . . , n + 1}. (A11) Above, H (ν) n := Pn k=1 k −ν is the n-th generalized harmonic number of order ν, for n ∈ N and ν ≥ 0. A useful approximation of H (ν) n is given by the Euler-Maclaurin formula: H(ν) n ≈ γ + log(n), if ν = 1, ζ(ν) + 1 1−ν n 1−ν , if ν ̸= 1, as n → ∞, (A12) where γ ≈ 0.577 is Eurler’s constant and ζ is the Riemann zeta function. 116 Consequently, for n large, the expected value of U according to (A11) is approximately the same as that of a corresponding continuous beta random variable. 
That is, E [U] = 1 n + 1 · H (−ν) n+1 H (1−ν) n+1 ≈ ϕ(n, ν), (A13) where ϕ(n, ν) := ν 1 + ν − ν 2 ζ(1 − ν) 1 + ν · 1 n . Similarly, E [1/U] is always finite and can be approximated as: E 1 U = n · H (2−ν) n+1 H (1−ν) n+1 ≈ φ(n, ν), (A14) where φ(n, ν) := γ + log(n), if ν = 1, ν ν − 1 + ν · ζ(2 − ν) · n 1−ν , if ν ̸= 1. (A15) Then, a first-order Taylor expansion also gives: 1 φ(n, ν) ≈ 1 ν · ζ(2 − ν) · 1 (n1 + 1)1−ν , if ν < 1, 1 γ + log(n1 + 1), if ν = 1, ν − 1 ν − (ν − 1)2 · ζ(2 − ν) ν · (n1 + 1)ν−1 , if ν > 1. 117 A4.3 A heuristic rule of thumb for deciding when to weight? This section builds upon the theoretical power analysis of Section 2.4 to explore a possible estimator Ξˆ of the unknown informativeness ratio Ξ defined in (2.10), relying on the simplified formula in (2.12). The reason why we are interested in this question is that it may be helpful to obtain a heuristic rule-of-thumb for predicting when the weighting approach of integrative p-values is likely to lead to higher power. Throughout this section, we imagine that n, n0 = |Dcal 0 | and n1 = |Dcal 1 | are all large, and that Assumptions 2–4 hold without attempting to verify them. These technical conditions may be difficult to validate in practice but they can be taken as reasonable approximations for our current purpose, especially if we think of possible applications involving data with relatively little intrinsic noise (e.g., image classification). An unbiased estimate of the denominator of Ξ in (2.12), namely E [ˆu1(X) | Y = 0, D], can be obtained by taking the empirical average of uˆ1(Xi) for i ∈ Dtrain 0 . This is because uˆ1 only depends on the data in D1 and the labeled inliers are exchangeable with the test inliers conditional on D1. If the outliers are exchangeable and ν = 1, the analysis from Section 2.4 suggests that an intuitive rule-of-thumb is to apply weighting if Ξˆexch ≥ 1, where Ξˆexch := 1/[γ + log(n1 + 1)] (1/|Dtrain 0 |) P i∈Dtrain 0 uˆ1(Xi) . (A16) Note that Ξˆexch can be assessed prior to seeing any test points. Further, this rule-of-thumb does not introduce any selection bias into our method because the integrative conformal p-values remain valid even conditional on Ξˆexch . A similar rule-of-thumb for the more general case in which the outliers may not be exchangeable requires the availability of more than one (unlabeled) test points, as explained next. The idea is that the unknown shape parameter ν of the beta distribution in (2.12) can be estimated by fitting a two-mixture model on {uˆ1(Xi)}i∈Dtest , as long as Dtest is sufficiently large. This is made possible by the fact that one 118 component of this mixture can be well-approximated by the empirical distribution of {uˆ1(Xi)}i∈Dtrain 0 , because the inliers are exchangeable. Letting νˆ denote the estimated value of ν, which can for example be calculated with the method of moments as explained in Section A4.4, one may thus expect weighting to be beneficial if Ξˆnon-exch ≥ 1, where Ξˆnon-exch := |Dtrain 0 | P i∈Dtrain 0 uˆ1(Xi) · 1 νˆ·ζ(2−νˆ) · 1 (n1+1)1−νˆ , if ν <ˆ 1, 1 γ+log(n1+1) , if νˆ = 1, νˆ−1 νˆ − (ˆν−1)2 ζ(2−νˆ) νˆ(n1+1)νˆ−1 , if ν >ˆ 1. (A17) A notable difference between Ξˆexch and Ξˆnon-exch is that the latter is based on a statistic νˆ that is not invariant to permutations of the inliers in the test set with those in Dcal 0 . Therefore, a rule-of-thumb based on Ξˆnon-exch would introduce some selection bias into the integrative p-values. 
This issue might be relatively innocuous if Dcal 0 is large, but it can be completely avoided with a slightly more complicated approach that estimates νˆ from {uˆ1(Xi)}i∈Dtest∪Dcal 0 instead of {uˆ1(Xi)}i∈Dtest . We do not detail this alternative approach here in the interest of conciseness. The heuristics derived above may at first sight appear to provide a useful extension of the methodology described in this paper, by offering an automatic way of deciding when our weighting approach should be applied. Unfortunately, however, these heuristics do not always work well in practice. The issue is that our integrative weighting method often empirically leads to desirable power gains even if Ξ < 1. This can be explained by noting that our power analysis is only one-directional and, in any case, its underlying technical assumptions may not always be valid in practice. Despite having observed that this rule-ofthumb has limited practical utility, we thought it would be worth discussing here because it is an intuitively appealing idea. 119 A4.4 Method of moments estimation of ν in Section A4.3 The shape parameter ν of the discrete beta distribution utilized in (2.12) may be estimated by fitting a two-mixture model on {uˆ1(Xi)}i∈Dtest , knowing that one component (with proportion λ ∈ [0, 1]) can approximated by the empirical distribution of {uˆ1(Xi)}i∈Dtrain 0 . To estimate ν using the method of moments as simply as possible, note that it follows from Section A4.2 that, for n1 large, E [ˆu1(X) | Y = 1, D1] ≈ ν ν + 1 , and E uˆ 2 1 (X) | Y = 1, D1 = 1 (n1 + 1)2 · H (−1−ν) n1+1 H (1−ν) n1+1 ≈ ν ν + 2 . For n1 large enough, the method of moments essentially consists of solving the following system of two non-linear equations in λ (the ancillary parameter indicating the mixture proportion) and ν (the desired shape parameter): 1 |Dtest| P i∈Dtest uˆ1(Xi) = λ · 1 |Dtrain 0 | P i∈Dtrain 0 uˆ1(Xi) + (1 − λ) ν ν+1 , 1 |Dtest| P i∈Dtest uˆ 2 1 (Xi) = λ · 1 |Dtrain 0 | P i∈Dtrain 0 uˆ 2 1 (Xi) + (1 − λ) ν ν+2 . Note that this gives a simple explicit expression for ν as a function of λ: ν = 1 |Dtest| P i∈Dtest uˆ1(Xi) − λ · 1 |Dtrain 0 | P i∈Dtrain 0 uˆ1(Xi) 1 − λ − 1 |Dtest| P i∈Dtest uˆ1(Xi) + λ · 1 |Dtrain 0 | P i∈Dtrain 0 uˆ1(Xi) . (A18) 120 By collecting the terms multiplying λ, the above system can be equivalently written as: 1 + 1 ν · 1 |Dtest| P i∈Dtest uˆ1(Xi) = λ · h 1 + 1 ν · 1 |Dtrain 0 | P i∈Dtrain 0 uˆ1(Xi) − 1 i + 1, 1 + 2 ν · 1 |Dtest| P i∈Dtest uˆ 2 1 (Xi) = λ · h 1 + 2 ν · 1 |Dtrain 0 | P i∈Dtrain 0 uˆ 2 1 (Xi) − 1 i + 1. Therefore, the unknown of interest, ν, is the solution to the following non-linear equation: 1 |Dtest| P i∈Dtest uˆ1(Xi) − ν ν+1 1 |Dtrain 0 | P i∈Dtrain 0 uˆ1(Xi) − ν ν+1 = λ = 1 |Dtest| P i∈Dtest uˆ 2 1 (Xi) − ν ν+2 1 |Dtrain 0 | P i∈Dtrain 0 uˆ 2 1 (Xi) − ν ν+2 . This equation can be re-organized as 1 |Dtest| X i∈Dtest uˆ1(Xi) − ν ν + 1! 1 |Dtrain 0 | X i∈Dtrain 0 uˆ 2 1 (Xi) − ν ν + 2 = 1 |Dtrain 0 | X i∈Dtrain 0 uˆ1(Xi) − ν ν + 1 1 |Dtest| X i∈Dtest uˆ 2 1 (Xi) − ν ν + 2! = 1 |Dtrain 0 | X i∈Dtrain 0 uˆ1(Xi) · 1 |Dtest| X i∈Dtest uˆ 2 1 (Xi) + ν 2 (ν + 1)(ν + 2) − ν ν + 2 · 1 |Dtrain 0 | X i∈Dtrain 0 uˆ1(Xi) − ν ν + 1 · 1 |Dtest| X i∈Dtest uˆ 2 1 (Xi), which in turn implies ν 2 = U(ν), where the function U is defined as U(ν) := (ν + 1)(ν + 2) 1 |Dtest| X i∈Dtest uˆ1(Xi) − ν ν + 1! 
1 |Dtrain 0 | X i∈Dtrain 0 uˆ 2 1 (Xi) − ν ν + 2 − (ν + 1)(ν + 2) · 1 |Dtrain 0 | X i∈Dtrain 0 uˆ1(Xi) · 1 |Dtest| X i∈Dtest uˆ 2 1 (Xi) + ν(ν + 1) 1 |Dtrain 0 | X i∈Dtrain 0 uˆ1(Xi) + ν(ν + 2) · 1 |Dtest| X i∈Dtest uˆ 2 1 (Xi). (A19) This suggests solving the equation ν 2 = U(ν) iteratively, as outlined in Algorithm A3. We have observed that this iterative algorithm typically converges very quickly, although it may sometimes lead to an unreliable solution with λˆ ∈ {0, 1} if the available data do not contain sufficient information to separate the two components of the mixture distribution; e.g., see Figure A21. In those cases when the method of moments leads to an estimate of νˆ associated with λˆ ∈ {0, 1}, we simply switch to Ξˆexch instead of Ξˆnon-exch , implicitly assuming that ν = 1. Algorithm A3 Method of moments estimation of the shape parameter ν in (A11) 1: Input: {uˆ1(Xi)}i∈Dtest , {uˆ1(Xi)}i∈Dtrain 0 . 2: Initialize λˆ = 0.5. 3: Initialize νˆ as the solution ν of (A18) applied with λ = λˆ. 4: while νˆ has not converged do 5: Update νˆ ← p max{0, U(ˆν)}, with the function U defined as in (A19). 6: end while 7: Output A method-of-moments estimate νˆ of the shape parameter ν in (A11). A5 Additional numerical experiments with synthetic data A5.1 Pairwise correlation between integrative conformal p-values In this section, we investigate empirically the correlation structure of integrative conformal p-values calculated for different test points using the same calibration data. It was discussed in Section 2.3 that integrative p-values may have more complicated dependencies compared to standard conformal p-values [10], and Figure A3 demonstrates this empirically. Synthetic data are generated separately from the same Gaussian mixture and binomial distributions as in Section 2.5, with p = 1000 features per observation. The sample size is varied as a control parameter. To reduce the computational cost of these experiments, the integrative method is applied with a narrower selection of possible one-class classification models compared 122 to Section 2.5: a support vector machine with radial basis function, sigmoid, or third-degree polynomial kernel. Figure A3 shows the average pairwise correlation between the integrative p-values for different test points as a function of the number of inliers in the calibration data set, separately for the two datagenerating distributions. These empirical correlations are averaged over 200,000 independent experiments. The results show the correlation between conformal p-values for different test points intuitively decreases as the number of calibration data points n0 increases. However, unlike in the case of standard conformal p-values [10], this relation appears to be data-dependent and the correlation is not exactly equal to 1/n0. In particular, integrative conformal p-values appear to experience generally stronger correlations compared to standard conformal p-values. This should not be surprising given that our integrative method extracts more information from the shared calibration data. A5.2 Validation of the asymptotic power analysis We seek to validate empirically the theoretical power analysis from Section 2.4, and to check whether the empirical rule-of-thumb proposed in Section A4.3 may be practically useful to predict whether our weighting method may lead to higher power. We begin from the special case in which the labeled outliers are exchangeable with the test outliers. 
For this purpose, independent data samples of size 200 are generated from the same Gaussian mixture model as in Section 2.5. Integrative conformal p-values are obtained as in Section 2.5, with the only difference that now a single model is utilized to compute conformity scores: a one-class support vector machine with radial basis kernel, for different values of the scale parameter. This parameter affects the informativeness of the resulting conformal p-values, which allows us to probe empirically the theoretical relation between power and the informativeness ratio Ξ defined in (2.10) and estimated by the statistic Ξˆexch in (A16). The size of the outlier calibration set, n1, is varied between 10 and 50. 123 Figure A20 summarizes the performance in these experiments of integrative conformal p-values, with and without adaptive weighting, as a function of the scale parameter of the support vector machine model. Here, the test and labeled outliers are exchangeable. The performance of our method is evaluated in terms of the power of BH applied to the integrative p-values at the nominal FDR level 10%. The results confirm that weighting tends to lead to higher power when the ratio Ξ defined in (2.10) is larger than one. Further, Figure A20 shows that the empirical estimate Ξˆexch discussed in Section A4.3 can provide a useful approximation of Ξ. Qualitatively similar conclusions can be drawn from Figure A21, which presents the results of similar experiments with non-exchangeable outliers. While these results are encouraging, they should be interpreted with care, bearing in mind that our asymptotic power analysis relies on several technical assumptions that are unlikely to hold exactly in practice. Further, our power analysis does not say that weighting cannot increase power even if Ξ < 1, and indeed this sometimes occurs, as we will observe later in different experiments based on real data. A6 Mathematical proofs Integrative conformal p-values Proof of Theorem 1. It suffices to show the variables Ri = ˆr(Xi) for i ∈ {n+1}∪Dcal 0 are exchangeable, as that implies the rank of Rn+1 is uniform on {1, 2, . . . , |Dcal 0 |+1}, and hence uˆ(Xn+1) ∼ Uniform 1/(|Dcal 0 | + 1), 2/(|Dcal 0 | + 1), . . . , 1 . To prove the claim, we start by introducing some helpful notation. Let us refer to full data set augmented with the unlabeled test point as D = D0 ∪ D1 ∪ {n+ 1}, and let σ be any permutation of {1, 2, . . . , n+ 1} that leaves invariant all elements of {1, 2, . . . , n} \ Dcal 0 . Then, let σ(D) denote the shuffled data set induced by σ. Imagine a “parallel universe” in which our method is applied to σ(D), with exactly the same recipe and the same random seeds, and let sˆ ′ 0 and sˆ ′ 1 denote the conformity score functions obtained in that universe. Clearly, sˆ ′ 0 = ˆs0 and sˆ ′ 1 = ˆs1 because these functions are determined my machine-learning 12 models trained only on subsets of the data in {1, 2, . . . , n} \ Dcal 0 , which are not altered by σ. Further, note the functions uˆ0 and uˆ1 are also identical to the respective counterparts uˆ ′ 0 and uˆ ′ 1 in the parallel universe, because the expression in (2.4) is invariant to permutations of {n + 1} ∪ Dcal 0 . This also implies rˆ ′ = ˆr. Therefore, the variables Ri , for i ∈ {n + 1} ∪ Dcal 0 , correspond in the parallel universe to variables R′ σ(i) = ˆr ′ (Xσ(i) ) such that {rˆ ′ (Xσ(i) )}i∈{n+1}∪Dcal 0 = ¯σ({rˆ(Xi)}i∈{n+1}∪Dcal 0 ), where σ¯ is understood to be the permutation obtained by restricting σ on {n + 1} ∪ Dcal 0 . 
The proof is completed by noting the data exchangeability assumption says {rˆ ′ (Xσ(i) )}i∈{n+1}∪Dcal 0 d= {rˆ(Xi)}i∈{n+1}∪Dcal 0 , and hence {rˆ(Xi)}i∈{n+1}∪Dcal 0 d= ¯σ({rˆ(Xi)}i∈{n+1}∪Dcal 0 ). Proof of Theorem 2. This follows from essentially the same argument as in the proof of Theorem 1. Note that the only difference between Algorithm 2 and Algorithm 1 is the automatic model selection. Hence, it suffices to show that the identity of the selected model is invariant to permutations, and then the rest of the proof follows that of Theorem 1. Note that the model selection is based on the unordered sets of conformity scores sˆ m 0 ({n + 1} ∪ D0) and sˆ m 0 (D1), which implies that the tuning process is invariant to permutations of {n + 1} ∪ D0 and m0∗ ′ = m0∗. The same argument also yields m1∗ ′ = m1∗. 125 Proof of Theorem 3. We assume throughout this proof that the null hypothesis is true; i.e., Yn+1 = 0. For simplicity, imagine D0 ∪ {n + 1} = [n + 1] = {1, . . . , n, n + 1}, so that n = |D0 | is the number of labeled inliers. This slight abuse of notation does not introduce ambiguity because we do not need to explicitly index the labeled outliers in the proof. Further, to simplify the notation, assume that n + 1 is a multiple of K0, the number of folds in which [n + 1] is divided. As in the proof of Theorem 1, for each i ∈ [n + 1], define Ri as : Ri = ˆr(Xi) = uˆ0[Xi ; k0(i)] uˆ1(Xi) . We begin by showing that, after conditioning on any partition of [n+ 1] into K folds D1 0 , . . . , DK 0 , the distribution of (R1, . . . , Rn+1) is invariant to certain permutations. In particular, let σ be any permutation of [n+1] that does not mix points assigned to different folds. That is, for any i, j ∈ [n+1], it is the case that k0(σ(i)) = k0(σ(j)) if and only if k0(i) = k0(j), where k0(i) ∈ [K0] is the fold to which observation i is assigned. This means that σ has a unique associated permutation π of [K0] such that k0(σ(i)) = π(k0(i)), for all i ∈ [n + 1]. Then, we claim that (Rσ(1), . . . , Rσ(n+1)) d= (R1, . . . , Rn+1). (A20) This can be established as follows. Imagine a “parallel universe” in which our method is applied, with the same recipe and random seeds, to a data set containing the same labeled outliers and an altered version of the inliers in D0 ∪ {n + 1} permuted by σ: (X′ 1 , . . . , X′ n , X′ n+1) = σ ◦ (X1, . . . , Xn, Xn+1). 126 Let R′ = (R′ 1 , . . . , R′ n1 ), and rˆ ′ , k ′ 0 , sˆ ′k 0 , uˆ ′ 0 , and uˆ ′ 1 indicate the analogues of R = (R1, . . . , Rn1 ), rˆ, k0, sˆ k 0 , uˆ0, and uˆ1 thus computed in the parallel universe, where R ′ i = ˆr ′ (X′ i ) = uˆ ′ 0 [X′ i ; k ′ 0 (i)] uˆ ′ 1 (X′ i ) . Because we are conditioning on the choice of folds and on any other possible sources of independent randomness in the classifiers, the vectors R′ and R are defined as the output of the same deterministic function applied to two different versions of the data, which are assumed to be equally distributed under the null hypothesis. Therefore, (R1, . . . , Rn+1) d= (R ′ 1 , . . . , R′ n+1). (A21) Next, note that permuting the inlier data indexed by D0 ∪ {n + 1} does not alter any of the machinelearning models sˆ k 1 learnt from the labeled outliers, which have exactly the same values in the “normal” and “parallel” universes. This means that uˆ ′ 1 = ˆu1. Further, if the machine learning algorithms Ak 0 are invariant to the order of their training data points, the set of models {sˆ ′k 0 }k∈[K0] is exactly the same as {sˆ k 0 }k∈[K0] , and sˆ ′k 0 = ˆs π(k) 0 . 
127 This implies that the only effect of permuting the data points in [n + 1] with σ is to permute accordingly the preliminary p-values uˆ0(Xi ; k0(i)). In other words, uˆ ′ 0 (X′ i ; k ′ 0 (i)) = |j ∈ [n + 1] : ˆs ′k0(j) 0 (X′ j ) ≤ sˆ ′k0(i) 0 (X′ i )| 1 + n = |j ∈ [n + 1] : ˆs ′k0(j) 0 (Xσ(j) ) ≤ sˆ ′k0(i) 0 (Xσ(i) )| 1 + n = |j ∈ [n + 1] : ˆs π(k0(j))) 0 (Xσ(j) ) ≤ sˆ π(k0(i)) 0 (Xσ(i) )| 1 + n = |j ∈ [n + 1] : ˆs k0(σ(j))) 0 (Xσ(j) ) ≤ sˆ k0(σ(i)) 0 (Xσ(i) )| 1 + n = |j ∈ [n + 1] : ˆs k0(j)) 0 (Xj ) ≤ sˆ k0(σ(i)) 0 (Xσ(i) )| 1 + n = ˆu0(Xσ(i) ; k0(σ(i))). Therefore, R ′ i = ˆr ′ (X′ i ) = uˆ ′ 0 [X′ i ; k ′ 0 (i)] uˆ ′ 1 (X′ i ) = uˆ0(Xσ(i) ; k0(σ(i))) uˆ1(Xσ(i) ) = Rσ(i) . Combining this result with (A21) leads to (A20). While this result is (slightly) weaker than full exchangeability because it only holds only for permutations σ preserving a given fold structure, it turns out to be sufficient for our purposes as we conclude the proof with a strategy similar to that of [5]. Define the comparison matrix A ∈ {0, 1} (n+1)×(n+1) with entries Ai,j = 1 {Ri > Rj} , and the set of “strange” points, for any fixed α ∈ (0, 1), Sα(A) = i ∈ {1, . . . , n + 1} : nX +1 j=1 Ai,j ≥ (1 − α)(1 + n) , 128 so that the event uˆ(Xn+1) ≤ α is equivalent to n + 1 ∈ Sα(A). To see this, note that, by definition of uˆ(Xn+1), uˆ(Xn+1) ≤ α ⇔ nX +1 i=1 1 [ˆr(Xi ; k(i)) ≤ rˆ(Xn+1; k(n + 1))] ≤ α(1 + n) ⇔ nX +1 i=1 1 [ˆr(Xi ; k(i)) > rˆ(Xn+1; k(n + 1))] > (1 − α)(1 + n) ⇔ nX +1 i=1 Ai,n+1 ≥ (1 − α)(1 + n). Therefore, P [ˆu(Xn+1) ≤ α] = P [n + 1 ∈ Sα(A)] . It follows directly from (A20) that, for any permutation σ preserving the fold structure, Aσ(i)σ(j) d= Ai,j . Therefore, for any i ∈ [n + 1], P [i ∈ Sα(A)] = P [n + 1 ∈ Sα(A)] , because there always exists a fold-preserving permutation σ such that σ(i) = n + 1. Hence, P [n + 1 ∈ Sα(A)] = P [i ∈ Sα(A)] = 1 n + 1 nX +1 j=1 P [j ∈ Sα(A)] = E [|Sα(A)|] n + 1 ≤ α(n + 1) (n + 1) = α. 129 The final inequality above follows from the deterministic bound |Sα(A)| ≤ α(n+1), which we explain next. One can think of the matrix A as keeping score of all pairwise games between n + 1 distinct players in a chess tournament. Is it easy to see that any j ∈ {1, . . . , n + 1} belongs to the set Sα(A) of strange points if and only if player j wins against at least (1 − α)(n + 1) other players. Therefore, only those players j with the largest α(n + 1) values of Rj can be strange, which implies |Sα(A)| ≤ α(n + 1). Extensions of TCV+ Proof of Theorem A2. By construction of Cˆ α(Xn+1), we know Yn+1 ∈/ Cˆ α(Xn+1)if and only if uˆ(Xn+1; Yn+1) ≤ α. Therefore, it is enough to prove that P [ˆu(Xn+1; Yn+1) ≤ α] ≤ α. This can be shown by proceeding as in the proof of Theorem 3. For simplicity, imagine each fold Dk has m = (n + 1)/K points and m is an integer (this assumption could be relaxed at the cost of heavier notation, as in [5]). For y = Yn+1, define R ∈ R (n+1) such that, for each i ∈ {n + 1} ∪ D, Ri = ˆs y (Xi ; k(i)). Define the comparison matrix A ∈ {0, 1} (n+1)×(n+1) with entries Ai,j = 1 {Ri > Rj} , and the set of “strange” points, for any fixed α ∈ (0, 1), Sα(A) = i ∈ {1, . . . , n + 1} : nX +1 j=1 Ai,j ≥ (1 − α)(1 + n) . 130 The event uˆ(Xn+1; Yn+1) ≤ α is equivalent to n + 1 ∈ Sα(A), and the probability of the latter event can be bounded from above by α exactly as in the proof of Theorem 3. Proof of Theorem A3. The proof is similar to that of Theorem A2. Fix any possible label y and condition on the event that Yn+1. 
Given that Yn+1 ∈/ Cˆ α(Xn+1) if and only if uˆ(Xn+1; Yn+1) ≤ α, it suffices to prove P [ˆu(Xn+1; Yn+1) ≤ α | Yn+1 = y] ≤ α. This can be shown by proceeding as in the proof of Theorem 3. Let ny denote the cardinality of Dk . For simplicity, imagine each fold Dk y has m = (ny + 1)/K points and m is an integer (this assumption could be relaxed at the cost of heavier notation, as in [5]). With a slight abuse of notation, we refer to the test point index {n + 1} as {ny + 1}. Define R ∈ R (n+1) such that, for each i ∈ {ny + 1} ∪ D, Ri = ˆs y (Xi ; k(i)). Define the comparison matrix A ∈ {0, 1} (ny+1)×(ny+1) with entries Ai,j = 1 {Ri > Rj} , and the set of “strange” points, for any fixed α ∈ (0, 1), Sα(A) = i ∈ {1, . . . , ny + 1} : nXy+1 j=1 Ai,j ≥ (1 − α)(1 + ny) . The event uˆ(Xny+1; Yny+1) ≤ α is equivalent to ny + 1 ∈ Sα(A), and the probability of the latter event can be bounded from above by α as in the proof of Theorem 3, because the test point is exchangeable with the calibration data in Dy conditional on its true label being y. 131 FDR control Proof of Theorem 4. Note that all the u˜i(Xj ) are measurable with respect to Φi , which implies R˜ i is also measurable with respect to Φi . Further, if a test point i is an inlier, it is easy to see uˆ(Xi) is still superuniform conditional on Φi because Φi is invariant to permutations of D0 cal ∪ {i}. Thus, this result is a special case of Theorem A4, proved below. Theorem A4. Suppose the three-step algorithm described in Section 2.3 is applied with conformal p-values uˆ(Xi) that are super uniform under the null hypothesis conditional on Φi ; i.e., for all i ∈ Dtest such that Yi = 0 and all α ∈ (0, 1), P [ˆu(Xi) ≤ α | Φi , Yi = 0] ≤ α. Assume also R˜ i is measurable with respect to Φi . Then, the algorithm controls the FDR below αm0/m. Proof of Theorem A4. Our proof of Theorem A4 follows the same strategy as [37]; in fact, our method can be seen as a special case of their theory. Nonetheless, it is more convenient and instructive to revisit the full proof from scratch, as opposed to explicitly connecting the details of our method to the general results in [37]. For any i ∈ Dtest, define the variable Vi such that Vi = 1 if Yi = 0 and the corresponding null hypothesis is falsely rejected, and Vi = 0 otherwise. Denote the total number of final rejections as R = |R|. Then, the FDR can be written as FDR = X i:Yi=0 E Vi R ∨ 1 ≤ α m0 m , where the last inequality is proved below. 132 Let R∗ = R(ˆu(Xi) ← 0) denote the hypothetical total number of rejections obtained by fixing uˆ(Xi) = 0 prior to applying our procedure. If Xi is an inlier, then E Vi R ∨ 1 = E " Ii ∈ R+ · Iϵi ≤ R/R˜ i R ∨ 1 # = E " Ii ∈ R+ · Iϵi ≤ R∗/R˜ i R∗ ∨ 1 # ≤ E Ii ∈ R+ R˜ i = E " E " Iuˆ(Xi) ≤ αR˜ i/m R˜ i | Φi ## = E E h Iuˆ(Xi) ≤ αR˜ i/m | Φi i 1 R˜ i ≤ E αR˜ i/m 1 R˜ i = α m . The inequality above follows from the fact that R˜ i is measurable with respect to Φi and uˆ(Xi) is superuniform conditional on Φi . Proof of Theorem A1. All u˜i(Xj ) are measurable with respect to Φi , which implies R˜ i is also measurable with respect to Φi . Further, if Xi is an inlier, it is easy to see the original uˆ(Xi) computed by Algorithm 3 is still super-uniform conditional on Φi because Φi is invariant to permutations of D0 ∪ {i}. Thus, this result is also a special case of Theorem A4. Asymptotic power analysis Proof of Theorem 5. This proof follows the same strategy as that of Theorem 2 in [81], but we nonetheless report all details here for completeness. 
To simplify the notation, we prove the equivalent result with the weights w = (ˆu1(Xn+1), . . . , uˆ1(Xn+m)) replaced by w˜, which is defined as in (2.9). Note that this 133 constant rescaling of the weights is allowed because the m elements of w˜ maintain the same ordering as those of w. Let F be the collection of random variables {Z1,i}i∈Dtest , D0, and D1. Denote F1,i(t) the conditional cumulative distribution function of uˆ0(Xi) given Yi = 1 and all the information in F. Then, the expected number of true discoveries obtained by applying δ(t; v) with a fixed threshold t and weights vi is: Ψ(t; v) = E " X i∈Dtest P [Yi = 1 | Z1,i] · F1,i(vi · t) | F# , (A22) because Ψ(t; v) = E " X i∈Dtest Yi · δi(t; vi) | F# = X i∈Dtest P [Yi = 1 | F] · P [ˆu0(Xi) ≤ vi · t | Yi = 1, F] = X i∈Dtest P [Yi = 1 | Z1,i] · F1,i(vi · t). First, we prove that if the same threshold t is applied with weights w and with constant unit weights, then δ(t; w) will have larger power: Ψ(t; w) ≥ Ψ(t; 1). (A23) To prove (A23), note that, assuming X i∈Dtest aiF1,i(t/xi) ≥ X i∈Dtest aiF1,i t Pm j=1 aj Pm j=1 ajxj ! , 134 for any 0 ≤ ai ≤ 1 and min1≤i≤m w˜ −1 i ≤ xi ≤ max1≤i≤m w˜ −1 i , we can prove: X i∈Dtest P [Yi = 1 | Z1,i] F1,i( ˜wit) ≥ X i∈Dtest P [Yi = 1 | Z1,i] F1,i(t). (A24) In fact, X i∈Dtest P [Yi = 1 | Z1,i] F1,i( ˜wit) = X i∈Dtest P [Yi = 1 | Z1,i] F1,i(t/w˜ −1 i ) ≥ X i∈Dtest P [Yi = 1 | Z1,i] F1,i Pm j=1 P h Yj = 1 | s 1 j i t Pm j=1 P h Yj = 1 s 1 j i w˜ −1 j = X i∈Dtest P [Yi = 1 | Z1,i] F1,i t · Pm j=1 P h Yj = 0 | s 1 j i Pm j=1 P h Yj = 0 | s 1 j i wj · Pm j=1 P h Yj = 1 | s 1 j i Pm j=1 P h Yj = 1 | s 1 j i w −1 j ≥ X i∈Dtest P [Yi = 1 | Z1,i] F1,i(t), where the last inequality follows from Assumption 1. Combining (A24) with (A22) proves (A23). Next, we prove that, for any fixed threshold t, Q(t; ˜w) ≤ Q(t; 1). (A25) 135 This is shown easily by recalling the definition of w˜ in (2.9): Q(t; ˜w) = P i∈Dtest w˜it · P [Yi = 0 | Z1,i] P i∈Dtest w˜it · P [Yi = 0 | Z1,i] + P i∈Dtest F1,i( ˜wit | Z1,i) · P [Yi = 1 | Z1,i] = P i∈Dtest t · P [Yi = 0 | Z1,i] P i∈Dtest t · P [Yi = 0 | Z1,i] + P i∈Dtest F1,i( ˜wit | Z1,i) · P [Yi = 1 | Z1,i] ≤ P i∈Dtest t · P [Yi = 0 | Z1,i] P i∈Dtest t · P [Yi = 0 | Z1,i] + P i∈Dtest F1,i(t | Z1,i) · P [Yi = 1 | Z1,i] = Q(t; 1). where the inequality follows from (A24). Now, let us apply (A23) and (A25) setting the generic threshold t equal to t α oracle(1) = sup{t : Q(t; (1, . . . , 1)) ≤ α}, which gives: Q(t α oracle(1); ˜w) ≤ Q(t α oracle(1); 1) ≤ α, Ψ(t α oracle(1); ˜w) ≥ Ψ(t α oracle(1); 1). The proof is completed by noting that, by definition of t α oracle( ˜w), Q(t α oracle( ˜w); ˜w) ≤ α, Ψ(t α oracle(t α oracle( ˜w)); ˜w) ≥ Ψ(t α oracle(1); ˜w). 136 A7 Supplementary figures Demonstration of automatic model tuning Figure A1: Empirical demonstration of the need to adaptively tune the sign of conformity scores calculated by one-class classifiers, as discussed in Section 2.2.2. Left: the inlier data are generated from the Gaussian mixture model described in Section 2.5, with p = 100 and a = 2, while the outlier data are generated with a = 1. Right: the data are generated from the same model as on the left, swapping the inlier/outlier labels. Top: first two principal components of the data. Bottom: distributions of log(ˆs0(Xi)) for inlier and outlier data points Xi . 137 Schematic of Algorithm 3 D0 ∪ {Xn+1} D1 D1 0 D K0 0 D1 1 D K1 1. . . . . . A1 0 A K0 0 A1 1 A K1 1. . . . . . 
[Figure A2 diagram: D0 ∪ {Xn+1} is split into K0 folds and D1 into K1 folds; the fold-wise classifiers produce conformity scores, which yield the preliminary p-values û0 and û1, the ratio statistics r̂, and finally the integrative p-value û(Xn+1).]
Figure A2: Schematic representation of Algorithm 3, computing integrative conformal p-values via TCV+ instead of data splitting. Other details are as in Figure 2.2.
Pairwise correlation between integrative conformal p-values
[Figure A3 panels: pairwise correlation versus calibration set size, for two data distributions; methods: Integrative, OCC, OCC (theory).]
Figure A3: Pairwise correlation between integrative conformal p-values for independent test points, on synthetic data from two different distributions. The results are shown as a function of the size of the shared calibration data set. The dotted line indicates the theoretical pairwise correlation between standard one-class classification (OCC) conformal p-values.
Additional method schematics
Figure A4: Schematic representation of the conditional FDR calibration method with integrative p-values described in Section 2.3. For each test point, Algorithm 1 is re-applied with a subset of the pre-computed conformity scores to estimate the number of rejections for the other test points, and the final rejection set is computed with pruning if necessary. The operations in the green nodes are not computationally expensive because they utilize pre-computed conformity scores. The dashed lines represent the flow of information utilized for the evaluation of the machine-learning models, but not for training.
Figure A5: Schematic representation of the conditional FDR calibration method with integrative TCV+ p-values described in Section A1. For each test point, Algorithm 2 is re-applied with perturbed data to estimate the number of rejections for the other test points, and the final rejection set is computed with pruning if necessary. The red nodes are computationally expensive. The dashed lines represent the flow of information utilized for the evaluation of the machine-learning models, but not for training.
Numerical experiments with synthetic data
Integrative conformal p-values
Figure A6: Performance of conformal p-values computed using different methods, in the experiments with simulated data of Figure 2.3. The results are shown as a function of the sample size, for calibration data with different proportions of inliers. Top: power obtained with Storey's BH procedure. Center: FDR; the dashed line corresponds to the nominal 10% level. Bottom: "informativeness" ratio Ξ defined in (2.10) and its empirical estimate Ξ̂ calculated using (A16). These results are consistent with the theoretical analysis of Section 2.4, which predicts that the weighting approach of integrative conformal p-values should lead to higher power if Ξ ≥ 1, even though the technical assumptions of Section 2.4 may not hold exactly. Further, these results show that the empirical estimate Ξ̂ of Ξ described in Section A4.3 is quite accurate as long as the outliers are exchangeable and the sample size is sufficiently large.
Figure A7: Performance of Storey's BH applied to conformal p-values computed with different methods, on simulated data. The results are shown as a function of the fraction of outliers in the labeled data, for different sample sizes. Other details are as in Figure 2.3.
Figure A8: Performance of conformal p-values computed with different methods, on simulated data. The performance is measured in terms of true positive rate (TPR) and false positive rate (FPR). The horizontal dashed line corresponds to the nominal 10% FPR level. Other details are as in Figure 2.3.
Figure A9: Performance of conformal p-values computed with different methods, on simulated data. The performance is measured in terms of true positive rate (TPR) and false positive rate (FPR). The horizontal dashed line corresponds to the nominal 10% FPR level. Other details are as in Figure A7.
Figure A10: QQ-plot comparing the empirical distribution of conformal p-values, computed with different methods on synthetic data, to the theoretical uniform distribution, separately for true inliers (left) and outliers (right). In the underlying experiments, the labeled sample size is 500 and the proportion of data points which are outliers is 50%. Other details are as in Figure 2.3.
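Several of the captions in this appendix refer to Storey's BH procedure applied to conformal p-values at a nominal FDR level of 10%. The sketch below shows one common implementation of that procedure; the exact variant and tuning used in these experiments is not reproduced here, and λ = 0.5 is only a conventional choice of Storey's thresholding parameter.

```python
import numpy as np

def storey_bh(pvals, alpha=0.1, lam=0.5):
    """Storey's BH: estimate the null proportion, then apply BH at level alpha / pi0_hat.

    Returns the indices of the rejected hypotheses.
    """
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    # Storey's (finite-sample corrected) estimate of the proportion of true nulls
    pi0_hat = min(1.0, (1.0 + np.sum(p > lam)) / (m * (1.0 - lam)))
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / (m * pi0_hat)
    below = np.nonzero(p[order] <= thresholds)[0]
    if below.size == 0:
        return np.array([], dtype=int)
    return order[: below.max() + 1]

# Hypothetical usage with p-values produced by any of the methods compared above:
# rejections = storey_bh(pvals, alpha=0.1)
```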
Figure A11: Performance of Storey's BH with conformal p-values based on different methods, on simulated data with non-exchangeable outliers. The results are shown as a function of the sample size, for data sets experiencing different amounts of distribution shift between the labeled and test outliers. The proportion of inliers in the labeled data is 50%. If we denote the outlier shift parameter as δa, the outliers in the test set are generated using the same model described at the beginning of Section 2.5 with the signal amplitude a replaced by (1 − δa)a. Note that the case with δa = 0 matches the results shown in Figure 2.3. Values of δa > 0 intrinsically make the problem easier because they result in test outliers that are more easily recognizable from the inliers, while δa < 0 makes the problem harder. Other details are as in Figure 2.3.
Figure A12: Performance of Storey's BH applied to conformal p-values computed with different methods, on simulated data. The data are simulated from a Gaussian mixture model as in Section 2.5, but with parameter a = 1.25 instead of a = 0.7. Other details are as in Figure 2.3.
Figure A13: Performance of Storey's BH applied to conformal p-values based on different underlying machine-learning models, on simulated data. Integrative conformal p-values without weighting (ensemble method) are compared to standard conformal p-values based on two different one-class classification models (IF and LOF), as well as to those based on the most powerful one-class classification model selected by an imaginary oracle. Other details are as in Figure A12.
Figure A14: Performance of Storey's BH applied to conformal p-values computed with different methods, on simulated data. The data are simulated from a logistic regression model with random parameters. Other details are as in Figure 2.3.
Figure A15: Performance of conformal p-values computed with different methods, on simulated data. The performance is measured in terms of true positive rate (TPR) and false positive rate (FPR). The data are simulated from a logistic regression model with random parameters. Other details are as in Figure A8.
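Several figures in this appendix report true and false positive rates obtained by thresholding the p-values at a fixed nominal level. A minimal sketch of that evaluation is given below, assuming the true inlier/outlier labels of the simulated test points are available; the function name is hypothetical.

```python
import numpy as np

def tpr_fpr(pvals, is_outlier, level=0.1):
    """TPR and FPR of the rule 'flag test point i as an outlier if p_i <= level'.

    pvals:      conformal p-values for the test points.
    is_outlier: boolean array with the (known, in simulations) test labels.
    """
    pvals = np.asarray(pvals)
    is_outlier = np.asarray(is_outlier, dtype=bool)
    flagged = pvals <= level
    tpr = flagged[is_outlier].mean() if is_outlier.any() else np.nan
    fpr = flagged[~is_outlier].mean() if (~is_outlier).any() else np.nan
    return tpr, fpr
```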
Integrative p-values via TCV+
Figure A16: Performance on simulated data of Storey's BH applied to integrative conformal p-values computed with either TCV+ or sample splitting. To serve as benchmarks, standard conformal p-values based on one-class or binary classification models tuned by an ideal oracle are computed using cross-validation+. Other details are as in Figure 2.3; other results are as in Figure 2.5.
Figure A17: Performance on simulated data of integrative conformal p-values computed with either TCV+ or sample splitting. The performance is measured in terms of true positive rate (TPR) and false positive rate (FPR). The horizontal dashed line corresponds to the nominal 10% FPR level. Other details are as in Figure 2.5.
Figure A18: Performance on simulated data of Storey's BH applied to integrative conformal p-values computed with either TCV+ or sample splitting. The data are simulated from a logistic regression model with random parameters. Other details are as in Figure 2.5.
Figure A19: Performance on simulated data of integrative conformal p-values computed with either TCV+ or sample splitting. The performance is measured in terms of true positive rate (TPR) and false positive rate (FPR). The horizontal dashed line corresponds to the nominal 10% FPR level. The data are simulated from a logistic regression model with random parameters. Other details are as in Figure A18.
Asymptotic power analysis
Figure A20: Empirical validation, using synthetic data with exchangeable outliers, of the power analysis for integrative p-values discussed in Section 2.4. Top: power of BH applied to weighted and unweighted integrative conformal p-values, as a function of the scale parameter of the underlying support vector machine. Bottom: "informativeness" ratio Ξ defined in (2.10) and its empirical estimate Ξ̂exch calculated using (A16). These results are consistent with the theoretical analysis of Section 2.4, which predicts that the weighting approach of integrative conformal p-values should lead to higher power if Ξ ≥ 1, even though the technical assumptions of Section 2.4 may not hold exactly.
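The experiments summarized in Figure A20 vary the scale parameter of a one-class support vector machine with radial basis kernel. As a point of reference, the sketch below shows how standard (non-integrative) conformal p-values could be computed from such a model with scikit-learn; this is a simplified illustration under assumed data splits, not the exact pipeline used in these experiments.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def occ_conformal_pvalues(X_train, X_cal, X_test, gamma=1e-3):
    """Split-conformal p-values from a one-class SVM with RBF kernel.

    X_train: inliers used to fit the one-class classifier.
    X_cal:   held-out inliers used for calibration.
    X_test:  test points to be scored.
    gamma:   scale parameter of the RBF kernel (the quantity varied in Figure A20).
    """
    model = OneClassSVM(kernel="rbf", gamma=gamma).fit(X_train)
    # Larger decision values indicate more typical points, so small scores flag
    # outliers, matching the convention of the p-value formulas in this chapter.
    cal_scores = np.sort(model.decision_function(X_cal))
    test_scores = model.decision_function(X_test)
    ranks = np.searchsorted(cal_scores, test_scores, side="right")
    return (1 + ranks) / (1 + len(cal_scores))
```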
Figure A21: Empirical validation, using synthetic data with non-exchangeable outliers, of the power analysis for integrative p-values discussed in Section 2.4. The data are generated from the same model utilized in Figure A11. The simulations corresponding to zero outlier shift are based on exchangeable outlier data, as in Figure A20. Top: power and FDR of BH applied to weighted and unweighted integrative conformal p-values, as a function of the scale parameter of the underlying support vector machine. Bottom: "informativeness" ratio Ξ defined in (2.10) and its empirical estimate Ξ̂ calculated either using (A16), assuming outlier exchangeability, or using (A17), without assuming exchangeability. These results are consistent with the theoretical analysis of Section 2.4, which predicts that the weighting approach of integrative conformal p-values should lead to higher power if Ξ ≥ 1, even though the technical assumptions of Section 2.4 may not hold exactly. Note that the values of Ξ̂ estimated using (A17) are not shown in the panels corresponding to negative outlier shift because the method-of-moments estimator ν̂ described in Section A4.4 does not converge to a non-trivial solution in those cases.
Numerical experiments with real data
Figure A22: Performance of conformal p-values for out-of-distribution testing with animal image data, for the same experiments shown in Figure 2.1. Top: power and FDR of BH applied to different types of conformal p-values. Bottom: "informativeness" ratio Ξ defined in (2.10) and its empirical estimate Ξ̂ calculated using (A16). These results are not inconsistent with the theoretical analysis of Section 2.4, because the latter predicts that the weighting approach of integrative conformal p-values should lead to higher power if Ξ ≥ 1, but does not imply the converse. Interestingly, the left-hand-side panels show that weighting can sometimes lead to higher power even when Ξ < 1. Note that the estimator Ξ̂ of Ξ is not accurate if the number of labeled outliers is very small.
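In the exchangeable case, the informativeness estimate Ξ̂ appearing in the bottom panels of Figures A20, A21, and A22 is the simple ratio defined in (A16). A direct transcription as a small helper is sketched below; the inputs (the values û1(Xi) on the inlier training set and the number n1 of labeled outliers used to define û1) are assumed to be available from the fitted method.

```python
import numpy as np

def xi_hat_exch(u1_train_inliers, n1):
    """Rule-of-thumb estimate of the informativeness ratio Xi from Equation (A16).

    u1_train_inliers: values of u1_hat(X_i) evaluated on the inlier training set.
    n1:               number of labeled outliers used to define u1_hat.
    Values >= 1 suggest that the adaptive weighting is likely to increase power.
    """
    numerator = 1.0 / (np.euler_gamma + np.log(n1 + 1))  # approximates E[1/U] for nu = 1
    denominator = np.mean(u1_train_inliers)
    return numerator / denominator
```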
Bottom: estimated “informativeness” ratio Ξˆ for the p-values calculated based on (A17). Values of Ξˆ ≥ 1 anticipate that integrative p-values may be more powerful. 151 (a) Outliers: 100 Outliers: 500 Outliers: 1000 Power FDR 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 Outlier shift Method Integrative OCC (oracle) Binary (oracle) (b) Outliers: 100 Outliers: 500 Outliers: 1000 Power FDR 0.0 0.5 1.00.0 0.5 1.00.0 0.5 1.0 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 Outlier shift Method Integrative OCC (oracle) Binary (oracle) Figure A24: Power of conformal p-values under FDR control with image data, as a function of the distribution shift between the labeled and test outliers. The number of inliers in the labeled data is 1000. (a): If the outlier shift is 1, the species of all labeled outliers belongs to S = {chimpanzee, orangutan}. (b): If the outlier shift is 1, the species of all labeled outliers belongs to S = {lynx, cheetah, jaguar, cat}. Other details are as in Figure 2.1, which corresponds to no outlier shift. 152 (a) Outliers: 100 Outliers: 500 Outliers: 1000 Power FDR 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 Outlier shift Method Integrative Integrative (OCC) OCC (oracle) Binary (oracle) (b) Outliers: 100 Outliers: 500 Outliers: 1000 Power FDR 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 Outlier shift Method Integrative Integrative (OCC) OCC (oracle) Binary (oracle) Figure A25: Performance of conformal p-values for out-of-distribution testing with image data, as a function of the distribution shift between the outliers in the labeled data and those in the test data. The number of inliers in the labeled data is 7824. (a): If the outlier shift is 1, the species of all labeled outliers belongs to S = {chimpanzee, orangutan}. (b): If the outlier shift is 1, the species of all labeled outliers belongs to S = {lynx, cheetah, jaguar, cat}. Other details are as in Figure A24. Note that the results in (b) include an additional method in the comparison: integrative conformal p-values based on a single one-class classifier (support vector machine with sigmoid kernel) instead of an ensemble of different one-class and binary classifiers. This helps differentiate the relative sensitivity to outlier shift of two separate components of our method: data-driven p-value weighting and automatic model selection. 153 Power FDR 10 100 1000 10 100 1000 0.00 0.25 0.50 0.75 1.00 Number of outliers Method Integrative OCC (oracle) Binary (oracle) Figure A26: Performance of Storey’s BH applied to conformal p-values based on different machine-learning models, on flower classification data. Other details are as in Figure 2.1. Power FDR 10 100 1000 10 100 1000 0.0 0.2 0.4 0.6 0.8 Number of outliers OCC Integrative IF LOF SVM−pol SVM−rbf SVM−sgd SVM−sig (a) Power FDR 10 100 1000 10 100 1000 0.00 0.25 0.50 0.75 Number of outliers BC Integrative KNN MLP NB QDA RF SVC (b) Figure A27: Performance of Storey’s BH applied to conformal p-values based on different underlying machine-learning models, on flower image classification data. Integrative conformal p-values are compared to standard conformal p-values based on six different one-class classification models (a) and six different binary classification models (b). Other details are as in Figure A26. 
Figure A28: Performance on flower classification data of Storey's BH applied to integrative conformal p-values computed with either TCV+ or sample splitting. To serve as benchmarks, standard conformal p-values based on one-class or binary classification models tuned by an ideal oracle are computed using cross-validation+. Other details are as in Figure A26.

Figure A29: Performance of Storey's BH applied to conformal p-values based on different underlying machine-learning models, on car image classification data. Other details are as in Figure A26.

Figure A30: Performance of Storey's BH with conformal p-values based on different machine-learning models, on two medical diagnostics data sets. Other details are as in Figure 2.1. [Panels: thyroid disease and mammography; horizontal axis: number of outliers.]

Appendix B

Appendix to Chapter 3

B1 Naive Early Stopping Benchmarks

Detailed Implementation of the Naive Benchmarks

We detail here the implementation of the naive benchmark discussed in Section 3.2.1. This approach can serve as an informative benchmark, and it also becomes useful in Appendix B4.1 to extend our rigorous conformalized early stopping method for regression problems in such a way as to explicitly avoid returning empty prediction intervals. For completeness, we present the implementation of the naive benchmark separately for outlier detection, multi-class classification, and regression, respectively in Algorithms B1, B2, and B3. Note that Algorithm B2 also allows for the possibility of computing prediction sets seeking (approximate) marginal coverage instead of (approximate) label-conditional coverage for multi-class classification problems; see Appendix B2 for further details on multi-class classification with marginal coverage.

Algorithm B1 Naive conformal outlier detection benchmark with greedy early stopping
1: Input: Exchangeable data points Z_1, . . . , Z_n; test point Z_{n+1}.
2: Input: Maximum number of training epochs t_max; storage period hyper-parameter τ.
3: Input: One-class classifier trainable via (stochastic) gradient descent.
4: Randomly split the exchangeable data points into D_train and D_es-cal.
5: Train the one-class classifier for t_max epochs and save the intermediate models M_{t_1}, . . . , M_{t_T}.
6: Pick the most promising model t* ∈ [T] minimizing L_es-cal(M_t) in (3.1), based on D_es-cal.
7: Compute nonconformity scores Ŝ_i(Z_{n+1}) for all i ∈ D_es-cal ∪ {n + 1} using model t*.
8: Output: Naive conformal p-value û_0^naive(Z_{n+1}) given by (3.3).
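To make the greedy selection step concrete, here is a minimal Python sketch of the logic of Algorithm B1, under simplifying assumptions: the intermediate models M_{t_1}, . . . , M_{t_T} are emulated by kernel density estimates with different bandwidths (standing in for networks saved at intermediate epochs), the hold-out loss is a negative log-likelihood, and the nonconformity score is the negative estimated density. This is only an illustration, not the implementation used in the experiments.

```python
# Minimal sketch of the naive benchmark in Algorithm B1 (illustration only).
# Assumptions: the saved intermediate models are emulated by Gaussian KDEs with
# different bandwidths, the hold-out loss is a negative log-likelihood, and the
# nonconformity score is the negative estimated density (larger = more anomalous).
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
Z = rng.normal(size=500)              # exchangeable inlier data Z_1, ..., Z_n
z_test = 3.0                          # test point Z_{n+1}
train, es_cal = Z[:250], Z[250:]      # split into D_train and D_es-cal

# Stand-ins for the saved models M_{t_1}, ..., M_{t_T}.
bandwidths = [0.05, 0.1, 0.2, 0.5, 1.0]
models = [gaussian_kde(train, bw_method=b) for b in bandwidths]

# Greedy selection: pick the model minimizing the hold-out loss on D_es-cal.
losses = [-np.mean(np.log(m.pdf(es_cal) + 1e-12)) for m in models]
t_star = int(np.argmin(losses))
best = models[t_star]

# Nonconformity scores for the calibration points and the test point.
scores_cal = -best.pdf(es_cal)
score_test = -best.pdf([z_test])[0]

# Naive conformal p-value, in the spirit of (3.3).
p_naive = (1 + np.sum(scores_cal >= score_test)) / (1 + len(es_cal))
print(f"selected bandwidth: {bandwidths[t_star]}, naive p-value: {p_naive:.3f}")
```

Because the same hold-out set is used both to select t* and to calibrate the p-value, this output is not exactly valid; the analysis below quantifies how far from valid it can be.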
Algorithm B2 Naive conformal multi-class classification benchmark with greedy early stopping
1: Input: Exchangeable data points (X_1, Y_1), . . . , (X_n, Y_n) with labels Y_i ∈ [K].
2: Input: Test point with features X_{n+1}. Desired coverage level 1 − α.
3: Input: Maximum number of training epochs t_max; storage period hyper-parameter τ.
4: Input: K-class classifier trainable via (stochastic) gradient descent.
5: Randomly split the exchangeable data points into D_train and D_es-cal.
6: Train the K-class classifier for t_max epochs and save the intermediate models M_{t_1}, . . . , M_{t_T}.
7: Pick the most promising model t* ∈ [T] minimizing L_es-cal(M_t) in (3.4), based on D_es-cal.
8: for y ∈ [K] do
9:   if label-conditional coverage is desired then
10:    Define D^y_es-cal = {i ∈ D_es-cal : Y_i = y}.
11:    Compute scores Ŝ^y_i(X_{n+1}) for all i ∈ D^y_es-cal ∪ {n + 1} using model t*; see Appendix B4.
12:    Compute the naive conformal p-value û^naive_y(X_{n+1}) according to (3.6).
13:  else
14:    Compute scores Ŝ^y_i(X_{n+1}) for all i ∈ D_es-cal ∪ {n + 1} using model t*; see Appendix B4.
15:    Compute the naive conformal p-value û^naive_y(X_{n+1}) according to
       û^naive_y(X_{n+1}) = [1 + |{i ∈ D_es-cal : Ŝ^y_i(X_{n+1}) ≤ Ŝ^y_{n+1}(X_{n+1})}|] / [1 + |D_es-cal|].
16:  end if
17: end for
18: Output: Naive prediction set Ĉ^naive_α(X_{n+1}) given by (3.7).

Algorithm B3 Naive conformal regression benchmark with greedy early stopping
1: Input: Exchangeable data points (X_1, Y_1), . . . , (X_n, Y_n) with outcomes Y_i ∈ R.
2: Input: Test point with features X_{n+1}. Desired coverage level 1 − α.
3: Input: Maximum number of training epochs t_max; storage period hyper-parameter τ.
4: Input: Regression model trainable via (stochastic) gradient descent.
5: Randomly split the exchangeable data points into D_train and D_es-cal.
6: Train the regression model for t_max epochs and save the intermediate models M_{t_1}, . . . , M_{t_T}.
7: Pick the most promising model t* ∈ [T] minimizing L_es-cal(M_t) in (3.8).
8: Evaluate nonconformity scores Ŝ_i(X_{n+1}) = |Y_i − μ̂_{t*}(X_i)| for all i ∈ D_es-cal.
9: Compute Q̂_{1−α}(X_{n+1}) as the ⌈(1 − α)(1 + |D_es-cal|)⌉-th smallest value among Ŝ_i(X_{n+1}) for i ∈ D_es-cal.
10: Output: Prediction interval Ĉ^naive_α(X_{n+1}) = μ̂_{t*}(X_{n+1}) ± Q̂_{1−α}(X_{n+1}).

Theoretical Analysis of the Naive Benchmark

Although the naive benchmarks described above often perform similarly to CES in practice, they do not enjoy the same desirable theoretical guarantees. Nonetheless, we can study their behaviour in sufficient detail to prove that their inferences are not too far from being valid. Unfortunately, as demonstrated in Section 3.3, these theoretical results are still not tight enough to be very useful in practice. For simplicity, we will begin by focusing on outlier detection.

Review of existing results based on the DKW inequality. [141] have recently studied the finite-sample coverage rate of a conformal prediction interval formed by naively calibrating a model selected among T possible candidates based on its performance on the calibration data set itself, which we denote by D_es-cal. Although [141] focus on conformal prediction intervals, here we find it easier to explain their ideas in the context of conformal p-values for outlier detection.

Let Ŝ_i(Z_{n+1}; t), for all i ∈ D_es-cal and t ∈ [T], denote the nonconformity scores corresponding to model t, and denote the ⌊α(1 + |D_es-cal|)⌋-th smallest value among the Ŝ_i(Z_{n+1}; t) as Q̂_α(Z_{n+1}; t). Let t* indicate the selected model. As we are interested in constructing a conformal p-value û_0^naive(Z_{n+1}), the goal is to bound from below the tail probability
\[
\mathbb{P}\left[\hat{u}^{\mathrm{naive}}_0(Z_{n+1}) > \alpha\right]
= \mathbb{E}\left[\,\mathbb{P}\left(\hat{S}_{n+1}(Z_{n+1}; t^*) > \hat{Q}_\alpha(Z_{n+1}; t^*) \mid \mathcal{D}_{\mathrm{es\text{-}cal}}\right)\right].
\tag{B1}
\]
Intuitively, if n_es-cal = |D_es-cal| is sufficiently large, the conditional probability inside the expected value on the right-hand side above can be well approximated by the following empirical quantity:
\[
\frac{1}{n_{\mathrm{es\text{-}cal}}} \sum_{i \in \mathcal{D}_{\mathrm{es\text{-}cal}}} \mathbb{1}\left\{\hat{S}_i(Z_{n+1}; t^*) > \hat{Q}_\alpha(Z_{n+1}; t^*)\right\}
= \frac{\lceil (1 + n_{\mathrm{es\text{-}cal}})(1-\alpha)\rceil}{n_{\mathrm{es\text{-}cal}}}
\geq \left(1 + \frac{1}{n_{\mathrm{es\text{-}cal}}}\right)(1-\alpha).
\]
The quality of this approximation in finite samples can be bounded with the DKW inequality, which holds for any ε ≥ 0:
\[
\mathbb{P}\left[\,\sup_{s \in \mathbb{R}} \left| \frac{1}{n_{\mathrm{es\text{-}cal}}} \sum_{i \in \mathcal{D}_{\mathrm{es\text{-}cal}}} \mathbb{1}\left\{\hat{S}_i(Z_{n+1}; t^*) > s\right\} - \mathbb{P}\left(\hat{S}_i(Z_{n+1}; t^*) > s \mid \mathcal{D}_{\mathrm{es\text{-}cal}}\right) \right| > \varepsilon \right] \leq 2 e^{-2 n_{\mathrm{es\text{-}cal}} \varepsilon^2}.
\tag{B2}
\]
Starting from this, Theorem 1 in [141] shows that
\[
\mathbb{P}\left[\hat{u}^{\mathrm{naive}}_0(Z_{n+1}) > \alpha\right] \geq \left(1 + \frac{1}{n_{\mathrm{es\text{-}cal}}}\right)(1-\alpha) - \frac{\sqrt{\log(2T)/2} + c(T)}{\sqrt{n_{\mathrm{es\text{-}cal}}}},
\tag{B3}
\]
where c(T) is a constant that can be computed explicitly and is generally smaller than 1/3. Intuitively, the term [√(log(2T)/2) + c(T)] / √n_es-cal above can be interpreted as the worst-case approximation error among all possible models t ∈ [T].

One limitation of this result is that it gives a worst-case correction that does not depend on the chosen level α, and one would intuitively expect this bound to be tighter for α = 1/2 and overly conservative for the small α values (e.g., α = 0.1) that are typically interesting in practice. (This intuition will be confirmed empirically in Figure B2.) This observation motivates the following alternative analysis, which can often give tighter results.

Alternative probabilistic bound based on Markov's inequality. Define W_t = P[û_0^naive(Z_{n+1}; t) > α | D_es-cal]. Lemma 3 in [133] tells us that W_t follows a Beta distribution, assuming exchangeability among D_es-cal and the test point. That is,
\[
W_t \sim \mathrm{Beta}(n_{\mathrm{es\text{-}cal}} + 1 - l,\; l), \qquad l = \lfloor \alpha (n_{\mathrm{es\text{-}cal}} + 1) \rfloor.
\]
In the following, we will denote the corresponding inverse Beta cumulative distribution function as I^{-1}(x; n_es-cal + 1 − l, l). This result can be used to derive an alternative lower bound for P[û_0^naive(Z_{n+1}) > α] based on Markov's inequality.

Proposition B1. Assume Z_1, . . . , Z_n, Z_{n+1} are exchangeable random samples, and let û_0^naive(Z_{n+1}) be the output of Algorithm B1. Then, for any fixed α ∈ (0, 1) and any b > 1, letting l = ⌊α(n_es-cal + 1)⌋,
\[
\mathbb{P}\left[\hat{u}^{\mathrm{naive}}_0(Z_{n+1}) > \alpha\right] \geq I^{-1}\left(\frac{1}{bT};\; n_{\mathrm{es\text{-}cal}} + 1 - l,\; l\right) \cdot \left(1 - \frac{1}{b}\right).
\]

Note that this bound depends on α in a more complex way compared to that of [141]. However, its asymptotic behaviour in the large-T limit remains similar, as shown below.

Lemma B1. Denote by I^{-1}(x; n_es-cal + 1 − l, l) the inverse Beta cumulative distribution function. For any fixed b > 1 and α ∈ (0, 1), letting l = ⌊α(n_es-cal + 1)⌋, for sufficiently large T and n_es-cal we have:
\[
I^{-1}\left(\frac{1}{bT};\; n_{\mathrm{es\text{-}cal}} + 1 - l,\; l\right) = (1-\alpha) - \sqrt{\frac{\alpha(1-\alpha)}{n_{\mathrm{es\text{-}cal}} + 1}} \cdot \sqrt{2\log(bT)} + O\!\left(\frac{1}{\sqrt{n_{\mathrm{es\text{-}cal}} \log(T)}}\right).
\]

In simpler terms, Lemma B1 implies that the coverage lower bound in Proposition B1 is approximately equal to
\[
\left[(1-\alpha) - \sqrt{\frac{\alpha(1-\alpha)}{n_{\mathrm{es\text{-}cal}} + 1}} \cdot \sqrt{2\log(bT)}\,\right] \cdot \left(1 - \frac{1}{b}\right),
\]
which displays an asymptotic behaviour similar to that of the bound from [141]. Further, the Markov bound is easy to compute numerically and often turns out to be tighter as long as b is moderately large (e.g., b = 100), as we shall see below. Naturally, the same idea can also be applied to bound the coverage of naive conformal prediction sets or intervals output by Algorithm B2 or Algorithm B3, respectively.
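Since the Markov bound involves nothing more than an inverse Beta CDF, it is straightforward to evaluate. The sketch below (assuming SciPy is available; the values of T, n_es-cal, α, and b are illustrative and not taken from the experiments in this chapter) computes the exact bound of Proposition B1 and the version obtained from the Lemma B1 approximation.

```python
# Sketch: evaluate the exact Markov lower bound of Proposition B1 and the bound
# obtained by plugging in the Lemma B1 approximation (remainder term dropped).
# The values of T, n_es_cal, alpha and b below are illustrative assumptions.
import numpy as np
from scipy.stats import beta

def markov_bound(T, n_es_cal, alpha, b=100.0):
    """Exact bound: I^{-1}(1/(bT); n+1-l, l) * (1 - 1/b), with l = floor(alpha*(n+1))."""
    l = int(np.floor(alpha * (n_es_cal + 1)))
    return beta.ppf(1.0 / (b * T), n_es_cal + 1 - l, l) * (1.0 - 1.0 / b)

def markov_bound_approx(T, n_es_cal, alpha, b=100.0):
    """Same bound with the inverse Beta CDF replaced by the Lemma B1 expansion."""
    lead = (1 - alpha) - np.sqrt(alpha * (1 - alpha) / (n_es_cal + 1)) * np.sqrt(2 * np.log(b * T))
    return lead * (1.0 - 1.0 / b)

for T in [10, 100, 1000]:
    print(T, round(markov_bound(T, 1000, 0.1), 4), round(markov_bound_approx(T, 1000, 0.1), 4))
```

This is only meant to show how the bound can be computed; the figures referenced below use the settings described in their captions.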
Corollary B1. Assume (X_1, Y_1), . . . , (X_n, Y_n), (X_{n+1}, Y_{n+1}) are exchangeable random samples, and let Ĉ^naive_α(X_{n+1}) be the output of Algorithm B2, for any given α ∈ (0, 1). Then, for any b > 1, letting l = ⌊α(n_es-cal + 1)⌋,
\[
\mathbb{P}\left[Y_{n+1} \in \hat{C}^{\mathrm{naive}}_\alpha(X_{n+1})\right] \geq I^{-1}\left(\frac{1}{bT};\; n_{\mathrm{es\text{-}cal}} + 1 - l,\; l\right) \cdot \left(1 - \frac{1}{b}\right).
\]

Corollary B2. Assume (X_1, Y_1), . . . , (X_n, Y_n), (X_{n+1}, Y_{n+1}) are exchangeable random samples, and let Ĉ^naive_α(X_{n+1}) be the output of Algorithm B3, for any α ∈ (0, 1). Then, for any b > 1, letting l = ⌊α(n_es-cal + 1)⌋,
\[
\mathbb{P}\left[Y_{n+1} \in \hat{C}^{\mathrm{naive}}_\alpha(X_{n+1})\right] \geq I^{-1}\left(\frac{1}{bT};\; n_{\mathrm{es\text{-}cal}} + 1 - l,\; l\right) \cdot \left(1 - \frac{1}{b}\right).
\]

Hybrid probabilistic bound. Since neither the DKW nor the Markov bound described above always dominates the other for all possible combinations of T, n_es-cal, and α, it makes sense to combine them to obtain a uniformly tighter hybrid bound. For any fixed b > 1 and any T, n_es-cal, and α, let
\[
M(T, n_{\mathrm{es\text{-}cal}}, \alpha) = I^{-1}\left(\frac{1}{bT};\; n_{\mathrm{es\text{-}cal}} + 1 - l,\; l\right) \cdot \left(1 - \frac{1}{b}\right)
\]
denote the Markov bound,
\[
D(T, n_{\mathrm{es\text{-}cal}}, \alpha) = \left(1 + \frac{1}{n_{\mathrm{es\text{-}cal}}}\right)(1-\alpha) - \frac{\sqrt{\log(2T)/2} + c(T)}{\sqrt{n_{\mathrm{es\text{-}cal}}}}
\]
denote the DKW bound, and define the hybrid bound as
\[
H(T, n_{\mathrm{es\text{-}cal}}, \alpha) = \max\left\{ M(T, n_{\mathrm{es\text{-}cal}}, \alpha),\; D(T, n_{\mathrm{es\text{-}cal}}, \alpha) \right\}.
\]
It then follows immediately from [141] and Proposition B1 that, under the same conditions of Proposition B1, for any fixed b > 1,
\[
\mathbb{P}\left[\hat{u}^{\mathrm{naive}}_0(Z_{n+1}) > \alpha\right] \geq H(T, n_{\mathrm{es\text{-}cal}}, \alpha).
\]
Of course, the same argument can also be utilized to tighten the results of Corollaries B1–B2.

Numerical comparison of different probabilistic bounds. Figure B1 compares the three probabilistic bounds described above (DKW, Markov, and hybrid) as a function of the number of candidate models T and of the number of hold-out data points n_es-cal, in the case of α = 0.1. For simplicity, the Markov and hybrid bounds are evaluated by setting b = 100, which may not be the optimal choice but appears to work reasonably well. These results show that the Markov bound tends to be tighter than the DKW bound for large values of T and for small values of n_es-cal, while the hybrid bound generally achieves the best of both worlds. Lastly, Figure B2 demonstrates that the Markov bound tends to be tighter when α is small. The Markov and hybrid bounds there are also evaluated using b = 100.

Figure B1: Numerical comparison of different theoretical lower bounds for the marginal coverage of conformal prediction sets computed with a naive early stopping benchmark (e.g., Algorithm B2). Left: lower bounds for the marginal coverage as a function of the number of candidate models T, when α = 0.1 and n_es-cal = 8000. Right: lower bounds for the marginal coverage as a function of the number of hold-out data points, n_es-cal, when α = 0.1 and T = 100. Higher values correspond to tighter bounds.

Figure B2: Numerical comparison of different theoretical lower bounds for the marginal coverage of conformal prediction sets computed with a naive early stopping benchmark (e.g., Algorithm B2), as a function of the nominal significance level α. Left: lower bounds for the marginal coverage as a function of α, when T = 1000 and n_es-cal = 1000. Right: theoretically corrected significance level needed to achieve the marginal coverage guarantees expected at the nominal α level, as a function of α, when T = 1000 and n_es-cal = 1000. The dashed grey lines indicate the ideal values corresponding to standard conformal inferences based on calibration data that are independent of those used for early stopping. Higher values correspond to tighter bounds.

B2 Classification with Marginal Coverage

The conformalized early stopping method presented in Section 3.2.3 can be easily modified to produce prediction sets with marginal rather than label-conditional coverage, as outlined in Algorithm B4. The difference between Algorithm 5 and Algorithm B4 is that the latter utilizes all calibration data in D_es-cal to compute each conformal p-value û_y(X_{n+1}), not only the samples with true label y.
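To make this distinction concrete, the following minimal sketch (with made-up scores; smaller score means better conformity, and the counting rule mirrors step 15 of Algorithm B2) contrasts the label-conditional and marginal p-value computations, assuming the nonconformity scores for a hypothesized label y are already available.

```python
# Sketch contrasting label-conditional and marginal conformal p-values for a
# hypothesized label y, assuming the nonconformity scores are already computed
# (smaller score = better conformity; counting rule as in step 15 of Algorithm B2).
import numpy as np

def label_conditional_pvalue(scores_cal_y, labels_cal, y, score_test_y):
    # Use only calibration points whose true label equals y (Algorithm 5 style).
    s = scores_cal_y[labels_cal == y]
    return (1 + np.sum(s <= score_test_y)) / (1 + len(s))

def marginal_pvalue(scores_cal_y, score_test_y):
    # Use all calibration points, regardless of their true labels (Algorithm B4 style).
    return (1 + np.sum(scores_cal_y <= score_test_y)) / (1 + len(scores_cal_y))

rng = np.random.default_rng(0)
scores_cal_y = rng.uniform(size=200)         # made-up scores S^y_i for the calibration set
labels_cal = rng.integers(0, 5, size=200)    # true labels Y_i
score_test_y = 0.8                           # made-up score S^y_{n+1} for the test point
print(label_conditional_pvalue(scores_cal_y, labels_cal, y=2, score_test_y=score_test_y))
print(marginal_pvalue(scores_cal_y, score_test_y))
```

The marginal version averages over many more calibration points, which is exactly the effect discussed next.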
An advantage of this approach is that conformal p-values based on a larger calibration samples are less aleatoric [10] and require less conservative finite-sample corrections (i.e., the “+1” term the numerator of the p-value formula becomes more negligible as the calibration set size increases). In turn, this tends to lead to smaller prediction sets with potentially more stable coverage conditional on the calibration data [111, 10] Of course, the downside of these prediction sets is that they can only be guaranteed to provide marginal coverage, although they can sometimes also perform well empirically in terms of label-conditional coverage [104]. Theorem B1. Assume (X1, Y1), . . . ,(Xn, Yn),(Xn+1, Yn+1) are exchangeable random samples, and let Cˆ α(Xn+1) be the output of Algorithm B4, for any α ∈ (0, 1). Then, P[Yn+1 ∈ Cˆ α(Xn+1)] ≥ 1 − α. 165 Algorithm B4 Conformalized early stopping for multi-class classification with marginal coverage 1: Input: Exchangeable data points (X1, Y1), . . . ,(Xn, Yn) with labels Yi ∈ [K]. 2: Input: Test point with features Xn+1. Desired coverage level 1 − α. 3: Input: Maximum number of training epochs t max; storage period hyper-parameter τ . 4: Input: K-class classifier trainable via (stochastic) gradient descent. 5: Randomly split the exchangeable data points into Dtrain and Des-cal. 6: Train the K-class classifier for t max epochs and save the intermediate models Mt1 , . . . , MtT . 7: for y ∈ [K] do 8: Imagine Yn+1 = y. 9: Pick the model Mˆ ces(Xn+1, y) according to (3.5), using the data in Des-cal ∪ {n + 1}. 10: Compute scores Sˆ i(Xn+1, y) for all i ∈ Des-cal ∪ {n + 1} using Mˆ ces(Xn+1, y); see Appendix B4. 11: Compute the conformal p-value uˆ marg y (Xn+1) according to uˆ marg y (Xn+1) = 1 + |i ∈ Des-cal : Sˆy i (Xn+1) ≤ Sˆy n+1(Xn+1)| 1 + |Des-cal| . (B4) 12: end for 13: Output: Prediction set Cˆ α(Xn+1) given by (3.7), with uˆ marg y (Xn+1) instead of uˆy(Xn+1). B3 Efficient Computation of the Lower Envelope This section explains how to implement a computationally efficient divide-and-conquer algorithm for finding the lower envelope of a family of T parabolas or a family of shifted pinball loss functions at cost O(T log T) [33, 92]. This solution, outlined in Algorithm B5 and Algorithm B6, is useful to implement the proposed CES method for regression problems, as detailed in Algorithm 6 and Algorithm 7. 166 Algorithm B5 Divide-and-conquer algorithm for finding the lower envelope of many parabolas 1: Input: A set of parabolas L = {l1, l2, . . . , lT } of forms li = aix 2 + bix + ci for i = 1, . . . , T. 2: Randomly split L into two subsets. Repeat splitting until each subset only contains one parabola or is empty. 3: For each subset with only one parabola, set the parabola itself as the lower envelope and set the initial breakpoint list to [−∞, +∞]. 4: for each interval constructed by adjacent breakpoints do 5: Within the interval, identify the two parabolas contributing to the previous lower envelopes, denoted as P1, P2. 6: Evaluate P1 and P2 at the current interval endpoints. 7: Calculate the intersection point p of P1 and P2. There exists at most one such p because ai = 1, ∀i, by (3.8). 8: if p not exists or p exists but lies outside the current interval then 9: Set the new lower envelope as the parabola with smaller values computed at the interval endpoints. 10: else 11: Add p as a breakpoint. 12: Within the current interval, set the new lower envelope below and above p based on evaluations of the parabolas at the interval endpoints. 
13: end if 14: Update and sort the breakpoint list and update the new lower envelope. 15: end for 16: Recursively merge two lower envelopes to form a new lower envelope by repeating Lines 4–15. 17: Output: A sorted dictionary of breakpoints and parabola indices characterizing the lower envelope of L. 167 Algorithm B6 Divide-and-conquer algorithm for finding the lower envelope of many pinball loss functions 1: Input: A set of shifted pinball loss functions L = {l1, l2, . . . , lT } of forms li = ci + ρβ(y, yˆ) for i = 1, . . . , T. 2: Randomly split L into two subsets. Repeat splitting until each subset only contains one pinball loss function or is empty. 3: For each subset with only one pinball loss function, set the function itself as the lower envelope and set the initial breakpoint list to [−∞, +∞]. 4: for each interval constructed by adjacent breakpoints do 5: Within the interval, identify the two pinball loss functions contributing to the previous lower envelopes; i.e., P1, P2. 6: Evaluate P1 and P2 at the current interval endpoints. 7: Calculate the intersection point p of P1 and P2. There exists at most one such p because β is the same ∀i, by (3.14). 8: if p not exists or p exists but lies outside the current interval then 9: Set the new lower envelope as the pinball loss function with smaller values computed at the interval endpoints. 10: else 11: Add p as a breakpoint. 12: Within the current interval, set the new lower envelope below and above p based on evaluations of the pinball loss functions at the interval endpoints. 13: end if 14: Update and sort the breakpoint list and update the new lower envelope. 15: end for 16: Recursively merge two lower envelopes to form a new lower envelope by repeating Lines 4–15. 17: Output: A sorted dictionary of breakpoints and pinball loss function indices characterizing the lower envelope of L. 168 B4 Review on related literature Review of Nonconformity Scores for Classification This section reviews the relevant background on the adaptive nonconformity scores for classification developed by [104]. For any x ∈ X and y ∈ [K], let πˆy(x) denote any (possibly very inaccurate) estimate of the true P[Y = y | X = x] corresponding to the unknown data-generating distribution. Concretely, a typical choice of πˆ may be given by the output of the final softmax layer of a neural network classifier, for example. For any x ∈ X and τ ∈ [0, 1], define the generalized conditional quantile function L, with input x, π, τ ˆ , as: L(x; ˆπ, τ ) = min{k ∈ [K] : ˆπ(1)(x) + ˆπ(2)(x) + . . . + ˆπ(k) (x) ≥ τ}, (B5) where πˆ(1)(x) ≤ πˆ(2)(x) ≤ . . . πˆ(K) (x) are the order statistics of πˆ1(x) ≤ πˆ2(x) ≤ . . . πˆK(x). Intuitively, L(x; ˆπ, τ ) gives the size of the smallest possible subset of labels whose cumulative probability mass according to πˆ is at least τ . Define also a function S with input x, u ∈ (0, 1), πˆ, and τ that computes the set of most likely labels up to (but possibly excluding) the one identified by L(x; ˆπ, τ ): S(x, u; ˆπ, τ ) = ‘y’ indices of the L(x; ˆπ, τ ) − 1 largest πˆy(x), if u ≤ V (x; ˆπ, τ ), ‘y’ indices of the L(x; ˆπ, τ ) largest πˆy(x), otherwise, (B6) where V (x; ˆπ, τ ) = 1 πˆ(L(x;ˆπ,τ))(x) L( Xx;ˆπ,τ) k=1 πˆ(k) (x) − τ . 169 Then, define the generalized inverse quantile nonconformity score function s, with input x, y, u; ˆπ, as: s(x, y, u; ˆπ) = min {τ ∈ [0, 1] : y ∈ S(x, u; ˆπ, τ )} . (B7) Intuitively, s(x, y, u; ˆπ)is the smallest value of τ for which the set S(x, u; ˆπ, τ ) contains the label y. 
Finally, the nonconformity score for a data point (Xi , Yi) is given by: Sˆ i = s(Xi , Yi , Ui ; ˆπ), (B8) whereUi is a uniform random variable independent of anything else. Note that this can also be equivalently written more explicitly as: Sˆ i = ˆπ(1)(Xi) + ˆπ(2)(Xi) + . . . + ˆπ(r(Yi,πˆ(Xi)))(Xi) − Ui · πˆ(r(Yi,πˆ(Xi)))(Xi), (B9) where r(Yi , πˆ(Xi))is the rank of Yi among the possible labels y ∈ [K] based on πˆy(Xi), so that r(y, πˆ(Xi)) = 1 if πˆy(Xi) = ˆπ(1)(Xi). The idea motivating this construction is that the nonconformity score Sˆ i defined above is guaranteed to be uniformly distributed on [0, 1] conditional on X if the model πˆ estimates the true unknown P[Y = y | X = x] accurately for all x ∈ X . This is a desirable property in conformal inference because it leads to statistically efficient prediction sets that can often achieve relatively high feature-conditional coverage in practice, even if the true data-generating distribution is such that some observations are much noisier than others; see [104] for further details. Finally, we conclude this appendix by noting that the nonconformity scores in Section 3.2.3 are written as Sˆ i(Xn+1, y), instead of the more compact notation Sˆ i adopted here, simply to emphasize that they are computed based on class probabilities πˆ estimated by a data-driven model Mˆ that depends on the test features Xn+1 as well as on the placeholder label y for Yn+1. 170 Review of Conformalized Quantile Regression This section reviews the relevant background on conditional quantile regression [66] and conformalized quantile regression (CQR) [103]. In contrast to the classical regression models that estimate the conditional mean of the test response Yn+1 given the test feature Xn+1 = x, quantile regression estimates the conditional quantile qβ of Yn+1 given Xn+1 = x, which is defined as qβ(x) = inf{y ∈ R : P(Yn+1 ≤ y|Xn+1 = x) ≥ β}. (B10) This can be formulated as solving the optimization problem: qˆβ(x) = f(x, ˆθ), ˆθ = arg min θ 1 n Xn i=1 ρβ(Yi , f(Xi , θ)), (B11) where f(x, θ) represents the quantile regression function [66] and ρβ is the convex “pinball loss" function [121], illustrated in Figure B3 and mathematically defined as ρβ(y, yˆ) = β(y − yˆ), if y − y >ˆ 0, (1 − β)(ˆy − y), otherwise. (B12) y − yˆ ρβ(y, yˆ) (1 − β)(ˆy − y) β(y − yˆ) Figure B3: Visualization of the pinball loss function defined in (B12). 171 To construct an efficient prediction interval Cˆ(Xn+1) whose length is adaptive to the local variability of Xn+1, CQR operates as follows. As in split conformal prediction, firstly the available data are randomly split into a proper training set, indexed by I1, and a calibration set, indexed by I2. Given any quantile regression algorithm A, two conditional quantile functions, qˆαlo and qˆαhi , are fitted on I1, where αlo = α/2 and αhi = 1 − α/2: {qˆαlo , qˆαhi} ← A({(Xi , Yi) : i ∈ I1}). (B13) Then, conformity scores are computed on the calibration data set I2 as: Ei = max{qˆαlo (Xi) − Yi , Yi − qˆαhi(Xi)} for i ∈ I2. (B14) The conformity score in (B14) can account both for possible under-coverage and over-coverage of the quantile regression model [103]. If Yi is outside the interval [ˆqαlo (Xi), qˆαhi(Xi)], then Ei is the (positive) distance of Yi from the closest endpoint of the interval. Otherwise, if Yi is inside the interval [ˆqαlo (Xi), qˆαhi(Xi)], then Ei is the negative of the distance of Yi from the closest endpoint of the interval. 
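As a small illustration of these conformity scores, the sketch below (with simulated data and deliberately crude stand-ins for the fitted conditional quantiles; α = 0.1 and all other numbers are illustrative assumptions) computes E_i on a calibration set together with the finite-sample quantile of the scores that is used to calibrate the fitted quantile band.

```python
# Sketch of the CQR conformity scores in (B14), with simulated data and crude
# stand-ins for the fitted conditional quantiles (alpha = 0.1 is illustrative).
import numpy as np

rng = np.random.default_rng(0)
alpha, n2 = 0.1, 500
X = rng.uniform(0, 5, size=n2)
Y = np.sin(X) + rng.normal(scale=0.3 * (1 + X), size=n2)   # heteroscedastic noise

# Stand-ins for q_hat_{alpha_lo} and q_hat_{alpha_hi} (in practice, a trained model).
q_lo = np.sin(X) - 1.645 * 0.3 * (1 + X)
q_hi = np.sin(X) + 1.645 * 0.3 * (1 + X)

# Conformity scores: positive if Y_i falls outside [q_lo, q_hi], negative otherwise.
E = np.maximum(q_lo - Y, Y - q_hi)

# Finite-sample quantile of the scores used to widen (or shrink) the fitted band.
k = int(np.ceil((1 - alpha) * (n2 + 1)))
Q = np.sort(E)[k - 1]
print(f"fraction of calibration points with E_i <= 0: {np.mean(E <= 0):.2f}, Q = {Q:.3f}")
```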
Therefore, if the quantile regression model is well-calibrated, approximately 90% of the calibration data points should have Ei ≤ 0 [103]. Finally, CQR constructs the prediction interval for the test response value Yn+1 through Cˆ(Xn+1) = [ˆqαlo (Xn+1) − Qˆ 1−α(E, I2), qˆαhi(Xn+1) + Qˆ 1−α(E, I2)], (B15) where Qˆ 1−α(E, I2) is the (1 − α)(1 + 1/|I2|)-th empirical quantile of {Ei : i ∈ I2}. 172 B4.1 Avoiding empty predictions Avoiding Empty Predictions in CES for Regression This section presents Algorithm B7, which extends Algorithm 6 from Section 3.2.4 in such a way as to explicitly avoid returning empty prediction intervals. Algorithm B7 Conformalized early stopping for regression, avoiding empty predictions 1: Input: Exchangeable data points (X1, Y1), . . . ,(Xn, Yn) with outcomes Yi ∈ R. 2: Input: Test point with features Xn+1. Desired coverage level 1 − α. 3: Input: Maximum number of training epochs t max; storage period hyper-parameter τ . 4: Input: Regression model trainable via (stochastic) gradient descent. 5: Randomly split the exchangeable data points into Dtrain and Des-cal. 6: Train the regression model for t max epochs and save the intermediate models Mt1 , . . . , MtT . 7: Evaluate Cˆ α(Xn+1) using Algorithm 6. 8: if Cˆ α(Xn+1) = ∅ then 9: Evaluate Cˆnaive α (Xn+1) using Algorithm B3. Set Cˆ α(Xn+1) = Cˆnaive α (Xn+1). 10: end if 11: Output: A non-empty prediction interval Cˆ α(Xn+1). Corollary B3. Assume (X1, Y1), . . . ,(Xn, Yn),(Xn+1, Yn+1) are exchangeable random samples, and let Cˆ α(Xn+1) be the output of Algorithm B7, for any α ∈ (0, 1). Then, P[Yn+1 ∈ Cˆ α(Xn+1)] ≥ 1 − α. Avoiding Empty Predictions for Regression with CQR In this section, we introduce Algorithm B8 as an extension of Algorithm 7 in the main text to address the rare possibility of generating empty prediction sets. Algorithm B8 ensures that the intervals it produces 173 are always non-empty, while still encompassing the intervals obtained from Algorithm 7. Consequently, Algorithm B8 maintains the guaranteed coverage provided by Theorem 9, as indicated in Corollary B4. Algorithm B8 Conformalized early stopping for quantile regression, avoiding empty predictions 1: Input: Exchangeable data points (X1, Y1), . . . ,(Xn, Yn) with outcomes Yi ∈ R. 2: Input: Test point with features Xn+1. Desired coverage level 1 − α. 3: Input: Maximum number of training epochs t max; storage period hyper-parameter τ . 4: Input: Trainable quantile regression model with target quantiles [βlow, βhigh]. 5: Randomly split the exchangeable data points into Dtrain and Des-cal. 6: Train for t max epochs and save the intermediate models Mβlow,t1 , . . . , Mβlow,tT , Mβhigh,t1 , . . . , Mβhigh,tT . 7: Evaluate Cˆ α(Xn+1) using Algorithm 7. 8: if Cˆ α(Xn+1) = ∅ then 9: Evaluate Cˆnaive α (Xn+1) using Algorithm B9. Set Cˆ α(Xn+1) = Cˆnaive α (Xn+1). 10: end if 11: Output: A non-empty prediction interval Cˆ α(Xn+1). Corollary B4. Assume (X1, Y1), . . . ,(Xn, Yn),(Xn+1, Yn+1) are exchangeable random samples, and let Cˆ α(Xn+1) be the output of Algorithm B8, for any α ∈ (0, 1). Then, P[Yn+1 ∈ Cˆ α(Xn+1)] ≥ 1 − α. 174 B4.2 Implementation of the Naive Benchmark Algorithm B9 Naive conformal quantile regression benchmark with greedy early stopping 1: Input: Exchangeable data points (X1, Y1), . . . ,(Xn, Yn) with outcomes Yi ∈ R. 2: Input: Test point with features Xn+1. Desired coverage level 1 − α. 3: Input: Maximum number of training epochs t max; storage period hyper-parameter τ . 
4: Input: Trainable quantile regression model with target quantiles [βlow, βhigh]. 5: Randomly split the exchangeable data points into Dtrain and Des-cal. 6: Train for t max epochs and save the intermediate models Mβlow,t1 , . . . , MβlowtT , Mβhigh,t1 , . . . , MβhightT . 7: Pick the most promising models t ∗ low, t∗ high ∈ [T] minimizing Les-cal(Mt) in (3.14). 8: Evaluate the scores Eˆ i(Xn+1) = max{qˆt ∗ low (Xi) − Yi , Yi − qˆt ∗ high (Xi)} for all i ∈ Des-cal. 9: Compute Qˆ 1−α(Xn+1) = ⌈(1 − α)(1 + |Des-cal|)⌉-th smallest value in Eˆ i(Xn+1) for i ∈ Des-cal. 10: Output: Cˆnaive α (Xn+1) = [ˆqt ∗ low (Xn+1) − Qˆ 1−α(Xn+1), qˆt ∗ high (Xn+1) + Qˆ 1−α(Xn+1)]. 175 B5 Mathematical Proofs Proof of Theorem 6. It suffices to show that the nonconformity scores Sˆ i for i ∈ {n + 1} ∪ Des-cal are exchangeable. In fact, if the nonconformity scores are almost-surely unique, this implies the rank of Sˆ n+1 is uniformly distributed over {Sˆ i}i∈{n+1}∪Des-cal , and in that case the conformal p-value is uniformly distributed over {1/(1 + |Des-cal|), 2/(1 + |Des-cal|), . . . , 1}. If the nonconformity scores are not almostsurely unique and ties are not broken at random, then the distribution of the conformal p-value becomes stochastically larger than uniform, in which case the result still holds. To prove the exchangeability of the nonconformity scores, let σ be any permutation of {n + 1} ∪ Des-cal, and imagine applying Algorithm 4, with the same random seed, to the shuffled data set indexed by σ({n + 1} ∪ Des-cal), which has the same distribution as the original data set. To clarify the notation, we will refer to quantities computed under this data shuffling scenario with their usual symbol followed by an apostrophe; i.e., M′ t1 instead of Mt1 . As the gradient updates only involve the unperturbed observations in Dtrain and the maximum number of epochs t max is fixed, the sequence of saved models remains exactly the same under this scenario: (M′ t1 , . . . , M′ tT ) = (Mt1 , . . . , MtT ). Further, the loss function in (3.1) is also invariant to permutations of {n + 1} ∪ Des-cal, in the sense that L +1′ es-cal = L +1 es-cal, because L is additive. Therefore, the model selected according to (3.2) is also invariant, Mˆ ′ ces = Mˆ ces, which implies the nonconformity scores are simply re-ordered: Sˆ′ σ(i) = Sˆ i . Therefore, we have: σ({Sˆ i}i∈{n+1}∪Dcal) = {Sˆ′ i}i∈{n+1}∪Dcal d= {Sˆ i}i∈{n+1}∪Dcal, where the last equality follows from the initial data exchangeability assumption. Proof of Theorem 7. Note that, conditional on Yn+1 = y, the miscoverage event Yn+1 ̸∈ Cˆ α(Xn+1) occurs if and only if uˆy(Xn+1) ≤ α, where uˆy(Xn+1) is defined as in (3.6). Therefore, it suffices to show 176 P [ˆuy(Xn+1) ≤ α | Yn+1 = y] ≤ α for any α ∈ (0, 1). However, this is directly implied by Theorem 6, because the uˆy(Xn+1) calculated by Algorithm 5 is equivalent to the conformal p-value uˆ0(Zn+1) given by Algorithm 4 applied to the subset of the data in Des-cal with Yi = y, with the understanding that Zi = (Xi , Yi) for all i ∈ {n + 1} ∪ Des-cal. Proof of Theorem B1. Note that Yn+1 ̸∈ Cˆm α (Xn+1)if and only if uˆ marg(Xn+1; Yn+1) ≤ α, where uˆ marg(Xn+1; Yn+1) is defined as in (B4). Hence it suffices to show that P [ˆu marg(Xn+1; Yn+1) ≤ α] ≤ α for any α ∈ (0, 1). This can be established using the same approach as in the proof of Theorem 6, setting Zi = (Xi , Yi) for all i ∈ {n + 1} ∪ Des-cal. 
In fact, the maximum number of epochs t max is fixed, the sequence of saved models is invariant to permutations of {n + 1} ∪ Des-cal, and the model Mˆ ces selected according to (3.5) is also invariant. Thus, it follows that the nonconformity scores Sˆ i are exchangeable with one another for all i ∈ {n + 1} ∪ Des-cal. Proof of Theorem 8. Consider an imaginary oracle algorithm producing an interval Cˆoracle α (Xn+1) defined as Cˆoracle α (Xn+1) = Bl ∗(Yn+1) T Cˆ α(Xn+1, Bl ∗(Yn+1) ), where l ∗ (Yn+1) is the exact index of the bin Bl to which the true Yn+1 belongs. Clearly, this oracle is just a theoretical tool, not a practical method because the outcome value for the test point is unknown. However, this oracle is useful because it is easier to analyze, and it suffices to establish that P[Yn+1 ∈ Cˆoracle α (Xn+1)] ≥ 1 − α, for any α ∈ (0, 1), since Cˆ α(Xn+1) ⊇ Cˆoracle α (Xn+1) almost-surely. The coverage property for the oracle can be established using an approach similar to that of the proof of Theorem 6, setting Zi = (Xi , Yi) for all i ∈ {n + 1} ∪ Des-cal. In fact, the maximum number of epochs t max is fixed, the sequence of saved models is invariant to permutations of {n + 1} ∪ Des-cal, and the model Mˆ ces selected by the oracle according to (3.10) is also invariant. Thus, it follows that the oracle nonconformity scores Sˆ∗ i = Sˆ i(Xn+1, Bl ∗(Yn+1) ) are exchangeable with one another for all i ∈ {n + 1} ∪ Des-cal. Further, by construction of the prediction intervals (3.12), we know that the miscoverage event Yn+1 ̸∈ Cˆoracle α (Xn+1) occurs if and only if Sˆ∗ i > Qˆ∗ 1−α , where Qˆ∗ 1−α is 177 the ⌈(1−α)(1 +|Des-cal|)⌉-th smallest value among all nonconformity scores Sˆ i(Xn+1, Bl). However, it is a well-known exchangeability result that P[Sˆ∗ i ≤ Qˆ∗ 1−α ] ≥ 1 − α; see for example Lemma 1 in [103]. Proof of Theorem 9. Same as the proof of Theorem 8. Proof of Corollary B3. This corollary follows immediately from Theorem 8 because the prediction interval given by Algorithm B7 is always contained in that output by Algorithm 6. Proof of Corollary B4. Same as the proof of Corollary B3. Proof of Proposition B1. Note that uˆ naive 0 (Zn+1) = ˆu naive 0 (Zn+1;t ∗ ), hence P uˆ naive 0 (Zn+1) > α = E P uˆ naive 0 (Zn+1;t ∗ ) > α | Des-cal ≥ E min t∈[T] Wt ≥ sup a∈[0,1] a · P min t∈[T] Wt ≥ a = sup a∈[0,1] a 1 − P min t∈[T] Wt ≤ a ≥ sup a∈[0,1] a (1 − T · P [Wt ≤ a]), where the last inequality follows from a union bound. To simplify the right-hand-side term above, let a = I −1 1 bT ; nes-cal + 1 − l, l , where b is any large constant. Hence we obtain P uˆ naive 0 (Zn+1) > α ≥ I −1 1 bT ; nes-cal + 1 − l, l · (1 − 1/b). 17 Proof of Corollary B1. Note that Yn+1 ∈ Cˆnaive α (Xn+1) if and only if uˆ naive Yn+1 (Xn+1;t ∗ ) > α. Let Wt denote the calibration conditional coverage P h uˆ naive Yn+1 (Xn+1;t) > α | Des-cali . Then, we have P h Yn+1 ∈ Cˆnaive α (Xn+1) i = E h P h uˆ naive Yn+1 (Xn+1;t ∗ ) > α | Des-calii = E [Wt ∗ ] ≥ E min t∈[T] Wt . The rest of the proof follows the same argument as in the proof of Proposition B1. Proof of Corollary B2. Let Sˆ i(Xn+1, t) = |Yi − µˆt(Xi)| denote the residual score calculated with model t ∈ [T], for all i ∈ Des-cal. Note that Yn+1 ∈ Cˆnaive α (Xn+1) if and only if SˆXn+1 (Xn+1, t∗ ) ≤ Qˆ 1−α. Then, we just need to bound Wt = P h SˆXn+1 (Xn+1, t∗ ) ≤ Qˆ 1−α | Des-cali , and the rest of the proof follows the same steps as the proof of Proposition B1. Proof of Lemma B1. 
Recall that l = ⌊α(nes-cal + 1)⌋, and define the following helpful notations: Beta (nes-cal + 1 − l, l) := Beta (nes-cal · c, nes-cal · d), where c = nes-cal + 1 − l nes-cal , d = l nes-cal . Denote Gamma(k, θ) as the gamma distribution with shape parameter k and scale parameter θ. It is a well known fact that the beta distribution can be expressed as a ratio of gamma distributions as: Beta (nes-cal · c, nes-cal · d) = Gamma(nes-cal · c, 1) Gamma(nes-cal · c, 1) + Gamma(nes-cal · d, 1). Further, Gamma(nes-cal · c, 1) can be seen as the distribution of a sum of nes-cal · c independent exponentially distributed random variables with mean equal to 1; therefore, by the central limit theorem, Gamma(nes-cal · c, 1) has an asymptotic Gaussian distribution as nes-cal → ∞. Denote Φ(x, µ, σ2 ) as the 179 cumulative distribution function of a Gaussian random variable with mean µ and variance σ 2 . Applying the delta method, it follows that, in the limit of large nes-cal, I (x; nes-cal · c, nes-cal · d) = Φ x; c c + d , 1 nes-cal · cd (c + d) 3 + O 1 nes-cal , for any x ∈ [0, 1]. Since I and Φ are continuous and strictly increasing over [0, 1], letting Φ −1 be the inverse Gaussian CDF, we have I −1 1 bT ; nes-cal · c, nes-cal · d = Φ−1 1 bT ; c c + d , 1 nes-cal · cd (c + d) 3 + O 1 nes-cal = Φ−1 1 bT ; 1 − α, α(1 − α) nes-cal + 1 + O 1 nes-cal = (1 − α) + s α(1 − α) nes-cal + 1 · Φ −1 1 bT ; 0, 1 + O 1 nes-cal = (1 − α) − s α(1 − α) nes-cal + 1 · p 2 log(bT) + O 1 p nes-cal log(T) ! , where the second equality is obtained by substituting c and d with their defined values and the last inequality follows from Equation 26.2.23 in [1] for sufficiently large T. 180 Appendix C Appendix to Chapter 4 C1 Additional applications LASLA has a wide range of applications aside from the network-structured data like the GWAS example discussed in the main article. In this section, we introduce two additional challenging settings: data-sharing regression and integrative inference with multiple auxiliary data sets. In both scenarios, traditional frameworks are not applicable since the auxiliary data U and the primary data T do not match in dimension. Example 1. Data-sharing high-dimensional regression. Suppose we are interested in identifying genetic variants associated with type II diabetes (T2D). Consider a high-dimensional regression model: Y = µ + Xβ + ϵ, (C1) where Y = (Y1, . . . , Yn) T are measurements of phenotypes, µ = µ1 T is the intercept, with 1 T being a vector of ones, β = (β1, . . . , βm) T is the vector of regression coefficients, X ∈ R n×m is the matrix of measurements of genomic markers, and ϵ = (ϵ1, . . . , ϵn) T are random errors. 181 Both genomics and epidemiological studies have provided evidence that complex diseases may have shared genetic contributions. The power for identifying T2D associated genes can be enhanced by incorporating data from studies of related diseases such as cardiovascular disease (CVD) and ischaemic stroke. Consider models for other studies: Y k = µ k + X kβ k + ϵ k , (C2) where the superscript k indicates that the auxiliary data are collected from disease type k ∈ [K]. The notations Y k , µ k , β k , Xk and ϵ k have similar explanations as above. The identification of genetic variants associated with T2D can be formulated as a multiple testing problem (4.1), where θ = (θi : i ∈ [m]) = {I(βi ̸= 0) : i ∈ [m]} is the primary parameter of interest. The primary and auxiliary data sets are T = (Y , X) and U = {(Y k , Xk ) : k ∈ [K]}, respectively. 
The auxiliary data U can provide useful guidance by prioritizing the shared risk factors and genetic variants. Example 2. Integrative “omics” analysis with multiple auxiliary data sets. The rapidly growing field of integrative genomics calls for new frameworks for combining various data types to identify novel patterns and gain new insights. Related examples include (a) the analysis of multiple genomic platform (MGP) data, which consist of several data types, such as DNA copy number, gene expression and DNA methylation, in the same set of specimen ([23]); (b) the integrated copy number variation (iCNV) caller that aims to boost statistical accuracy by integrating data from multiple platforms such as whole exome sequencing (WES), whole genome sequencing (WGS) and SNP arrays ([144]); (c) the integrative analysis of transcriptomics, proteomics and genomic data ([91]). The identification of significant genetic factors can be formulated as (4.1) with mixed types of auxiliary data. 182 C2 Forming local neighborhoods: illustrations Recall that, in Section 4.1, LASLA first summarize the structural knowledge in a distance matrix D ∈ R m×m where m is the number of hypotheses. The distance matrix describes the relation between each pair of hypotheses in the light of the auxiliary data. For the GWAS example detailed in Section 4.1, D = (1 − r 2 ij : i, j ∈ [m]) where rij measures the linkage disequilibrium between the two SNPs i and j. In Example 1 (data-sharing regression) from Supplement C1, we can extract the structural knowledge provided by the related regression problems via Mahalanobis distance [68]. Specifically, let { ˆβ k = (βˆk 1 , . . . , βˆk m) T : k ∈ [K]} denote the estimation of {β k = (β k 1 , . . . , βk m) T : k ∈ [K]}. Denote by ˆβi = (βˆk i : k ∈ [K]) the vector of estimated coefficients for the ith genomic marker across K different studies. The distance matrix D = (Dij )i,j∈[m] is then constructed via Mahalanobis distance with Dij = (ˆβi − ˆβj )Σˆ −1 β ( ˆβi − ˆβj ) T , where Σˆ β is the estimated covariance matrix based on { ˆβi : i ∈ [m]}. Similarly, in Example 2 (analysis with multiple auxiliary data sets), suppose we collect a multivariate variable Ui from different platforms as the side information for gene i, then the Mahalanobis distance can be used to construct a distance matrix D = (Dij )i,j∈[m] with Dij = (Ui − Uj )Σˆ −1 U (Ui − Uj ) T , where Σˆ U is the estimated covariance matrix based on the auxiliary sample {Ui : i ∈ [m]}. We emphasize that LASLA is not limited to the aforementioned examples. Most of the traditional covariate-assisted methods focus on the array-like auxiliary data U = {Ui : i = 1, 2, . . .} that matches primary data coordinate by coordinate. LASLA can also handle this dimension-matching side information as the latter can be represented by a distance matrix D through simple manipulations. Below, we provide a list of practical types of side information and their corresponding methods for constructing the distance matrix. 183 (a) A vector of categorical covariates. The elements in U take discrete values and the local neighborhoods can be defined as groups. With suitably chosen weights LASLA reduces to the methods considered in [57], [80] and [138] that are developed for multiple testing with groups. (b) A vector of continuous covariates. We can define distance as either the absolute difference or the standardized difference in rank Dij = |Fˆm(Ui) − Fˆm(Uj )|, where Fˆm(t) is the empirical CDF. (c) Spatial locations. 
Such structures have been considered in, for example, [86], [77] and [21]. The locations are viewed as covariates and Dij is the Euclidean distance between locations i and j. (d) The correlations in a network or partial correlations in graphical models. See the GWAS example discussed in Section 4.1 of the main article. (e) Multiple auxiliary samples. The Mahalanobis distance or its generalizations [68] can be used to calculate the distance matrix D. Note that in practical applications, it could be beneficial to “standardize" the distance matrix D; this step ensures algorithm robustness. A more comprehensive discussion on the implementation details is relegated to Supplement C5. C3 Details on sparsity-adaptive weights Recall the definition from Section 4.2.1 that the primary statistics Ti has the hypothetical mixture distribution: F ∗ i (t) = (1 − π ∗ i )F0(t) + π ∗ i F ∗ 1i (t) for i ∈ [m]. The quantity π ∗ i indicates the sparsity level of signals at location i, and π ∗ i is allowed to be heterogeneous across m testing locations. 184 The key idea in existing weighted FDR procedures such as GBH [57], SABHA ([80]) and LAWS ([21]) is to construct weights that leverage π ∗ i by prioritizing the rejection of the null hypotheses in groups or at locations where signals appear to be more frequent. Specifically, SABHA defines the weight as w sabha i = 1/(1 − π ∗ i ), and LAWS as w laws i = π ∗ i /(1 − π ∗ i ). The sparsity adaptive-weights have an intuitive interpretation. Consider the LAWS weight w laws i , if π ∗ i is large, indicating a higher occurrence of signals at location i, the weighted p-value P w i := Pi/wlaws i = (1−π ∗ i )Pi/π∗ i will be smaller, up-weighting the significance level of hypothesisi. However, compared to the proposed weights, such weighting scheme ignores structural information in alternative distributions as discussed in Section 4.2.4. C4 Alternative weight construction Recall that Ni = {j ̸= i : Dij ≤ aϵ} is the neighborhood index set which only contains indexes with distance to i smaller than aϵ and satisfies |Ni | = m1−ϵ for some small constant ϵ > 0. Additionally, Dij1 ≤ Dij2 for any j1 ∈ Ni , j2 ∈ N/ i , and j2 ̸= i. The alternative computationally efficient procedure mentioned in Section 4.2.3 is summarized in Algorithm C10 below. Algorithm C10 Simplified Oracle-assisted weights 1: Input: the nominal FDR level α, ϵ specifying the size of the sub-neighborhood, kernel function K(·), primary statistics T = {Ti : i ∈ [m]} and distance matrix D = (Dij : i, j ∈ [m]). 2: For i ∈ [m] do: 3: Estimate πi as given by (4.5) with summation taken over Ni . 4: Estimate fi(t) as given by (4.6) with summation taken over Ni . 5: Compute Li as given by (4.7). Denote the sorted statistics by L(1) ≤ . . . ≤ L(m) . 6: Choose L(k) to be the oracle threshold, where k = max{j ∈ [m] : j −1 Pj i=1 L(i) ≤ α}. 7: For i ∈ [m] do: 8: If Ti ≥ 0 then: Compute t + i as given by (4.9), and the weight by wi = 1 − F0(t + i ). 9: If Ti < 0 then: Compute t − i as given by (4.10), and the weight by wi = F0(t − i ). 10: Output: Oracle-assisted weights {wi : i ∈ [m]}. 185 C5 Implementation Details and Additional Numerical Results In this section, we provide the numerical implementation details and collect additional simulation results for data-sharing high-dimensional regression, latent variable model and multiple auxiliary samples. 
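Before turning to those implementation details, here is a toy sketch of the sparsity-adaptive weighting reviewed in Supplement C3, under simplifying assumptions: the p-values, the local sparsity levels π*_i, and the use of a plain BH step on the weighted p-values are all illustrative stand-ins rather than the LASLA procedure itself, whose weights also exploit the alternative distribution.

```python
# Toy illustration of sparsity-adaptive weighting (Supplement C3): LAWS-style
# weights w_i = pi_i / (1 - pi_i), weighted p-values P_i / w_i, then a plain BH
# step. The p-values and local sparsity levels below are simulated stand-ins.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
m, alpha = 1000, 0.1
pi = np.where(np.arange(m) < 200, 0.4, 0.05)   # assumed local signal frequencies
theta = rng.binomial(1, pi)                     # true signal indicators
pvals = 2 * norm.sf(np.abs(rng.normal(loc=3.0 * theta)))

w = pi / (1 - pi)
p_weighted = np.minimum(pvals / w, 1.0)         # weighted p-values, capped at 1

# Benjamini-Hochberg step-up applied to the weighted p-values.
order = np.argsort(p_weighted)
passed = np.nonzero(np.sort(p_weighted) <= alpha * np.arange(1, m + 1) / m)[0]
k = passed.max() + 1 if passed.size else 0
rejected = order[:k]
print(f"{k} rejections, {int(theta[rejected].sum())} true signals among them")
```

Hypotheses in the high-frequency block receive weights above one and are therefore easier to reject, which is the intuition behind the weighting schemes discussed above.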
Implementation Details In all of our numerical results, the bandwidth h for the kernel estimations in (4.5) and (4.6) is chosen automatically by applying the density function with the option “SJ-ste” in R package stats. For the size of neighborhoods m1−ϵ , the default choice for ϵ is 0.1 for marginally independent p-values, while for dependent p-values, we set ϵ = 0 to comply with our FDR control theory under weak dependence in Supplement C7. To enhance algorithmic robustness and numerical performance, we perform a data-driven scaling of the distance matrix D by a constant factor a. A practical guideline is to ensure that the spread of entries in the scaled distance matrix D/a is similar to that of the entries in T . One can use the interquartile range (IQR) to measure the data spread. Below are the adjustments made for each numerical experiment in order of appearance: (1) a = 1/2 for the network example in Section 4.4; (2) a = 1.4 for the GWAS application in Section 4.5; (3) a = 4 for the high-dimensional regression models in Supplement C5; (4) a = 1 for the latent variable settings in Supplement C5; (5) a = 5 for the multiple auxiliary experiments in Supplement C5. Data-sharing high-dimensional regression Example 2 in Supplement C1 discussed how the knowledge in regression models from related studies can be transferred to improve the inference on regression coefficients from the primary model. This section designs simulation studies to illustrate the point. 186 Consider the regression model (C1) defined in Supplement C1 with Xij ∼ N(0, 1) for i ∈ [n], j ∈ [m] where Xij denotes the entry of X at coordinate (i, j); ϵi ∼ N(0, 1) for i ∈ [n]. Let P(βi = 0) = 0.9. For the non-null locations, βi ∼ (−1)u |N(µ, 0.1)|; u ∼ Bernoulli(0.2). Note that signals will be more likely to take positive signs, hence asymmetric rejection rules are desired. Models from K related studies are generated by model (C2). If the auxiliary model is closely related to the primary model, they tend to share similar coefficients, therefore we generate the coefficients for study k ∈ [K] as β k = β + σ, where each coordinate of σ is drawn from normal distribution N(0, σ). Other quantities are defined similarly as the primary model. We compute the distance matrix D using the Mahalanobis distance on the estimated coefficients as specified in Supplement C2. Fix K = 3, n = 1000, m = 800, consider the following settings: • Setting 1: Fix µ = 0.3, vary the noise level σ from 0.05 to 0.1 by 0.01. • Setting 2: Fix σ = 0.05, vary the signal strength µ from 0.25 to 0.35 by 0.025. We compare BH with the data-driven LASLA. To apply LASLA, it’s essential to have knowledge of the null distribution for the test statistics. In this simulation we use the ordinary least square estimators and Ti follows a t-distribution. Alternatively, one can explore the approach outlined in [138], where the test statistics follow the N(0, 1) distribution asymptotically. Figure C1 shows that LASLA can effectively leverage the side information from related studies. Latent variable setting Suppose the primary and auxiliary data are associated with a common latent variable ξ = (ξi : i ∈ [m]) where ξi ∼ (1 − θi)∆0 + θiN(µ, 1) and ∆0 is the Dirac delta function, namely, ξi = 0 if θi = 0. 
The primary data T = (Ti : i ∈ [m]) and auxiliary data U = (Ui : i ∈ [m]) respectively follow: Ti ∼ N(ξi , 1), Ui ∼ N(ξi , σ2 s ), (C3) 187 FDR Power 0.05 0.06 0.07 0.08 0.09 0.10 0.05 0.06 0.07 0.08 0.09 0.10 0.800 0.825 0.850 0.000 0.025 0.050 0.075 0.100 σ Method BH LASLA.DD (a) FDR Power 0.250 0.275 0.300 0.325 0.350 0.250 0.275 0.300 0.325 0.350 0.7 0.8 0.9 0.000 0.025 0.050 0.075 0.100 µ Method BH LASLA.DD (b) Figure C1: Empirical FDR and power comparison. (a): Regression Setting 1; (b): Regression Setting 2. where σs controls the informativeness the auxiliary data. Our goal is to test m hypotheses on θi as stated in (4.1). Fix m = 1200 and let θi ∼ Bernoulli(0.1), independently for i ∈ [m]. We consider two settings: • Setting 1: Fix µ = 2.5, vary σs from 0.5 to 2 by 0.25. • Setting 2: Fix σs = 1 , vary µ from 3 to 4 by 0.2. We compute the distance matrix D from the auxiliary data U using the Euclidean distance, i.e. Dij = |Ui − Uj |. The following methods are implemented: • Benjamini-Hochberg procedure (BH); • LASLA with known π ∗ i and f ∗ i (t) (LASLA.OR); • Data-driven LASLA (LASLA.DD); • Data-driven SABHA (SABHA.DD) as reviewed in Supplement C3; • WBH method with LASLA weights (WBH) discussed in Section 4.2.5, Remark 4.2.5. 188 The result is summarized in Figure C2. In both settings, LASLA has smaller FDR than SABHA but still dominate SABHA in power due to the fact that SABHA is entirely p-value based and its weights only takes sparsity into account. Some other p-value based methods like AdaPT [77] may also suffer from power loss since the potential asymmetry in the primary statistics is ignored. FDR Power 0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0 0.3 0.4 0.5 0.6 0.7 0.8 0.000 0.025 0.050 0.075 0.100 σ Method BH LASLA.DD LASLA.OR SABHA WBH (a) FDR Power 3.00 3.25 3.50 3.75 4.00 3.00 3.25 3.50 3.75 4.00 0.5 0.6 0.7 0.8 0.9 0.000 0.025 0.050 0.075 0.100 µ Method BH LASLA.DD LASLA.OR SABHA WBH (b) Figure C2: Empirical FDR and power comparison. (a): Latent variable Setting 1; (b): Latent variable Setting 2. Multiple auxiliary samples We explore two scenarios with multiple auxiliary samples: (1) all samples are informative; (2) some samples are non-informative. Similar to the previous section, consider a latent variable ξ = (ξi : i ∈ [m]) where ξi ∼ (1 − θi)∆0 + θiN(µ, 1) and θi ∼ Bernoulli(0.1). The primary statistics Ti ∼ N(ξi , 1) for i ∈ [m]. The goal is to make inference on the unknown θi . Let U k = (U k i : i ∈ [m]) denote the kth auxiliary sequence for k ∈ [K]. If U is informative, it should carry knowledge on the underlying signal θi . Hence we introduce the first setting where all auxiliary samples are associated with the latent variable ξ: • Setting 1: U k i ∼ N(ξi , σ2 s ) for i ∈ [m], k = 1, ..., 4. 189 Let γi ∼ Bernoulli(0.1) for i ∈ [m] independently of everything else, and ψi ∼ (1 − γi)∆0 + γiN(µ, 1). Consider: • Setting 2: U k i ∼ N(ξi , σ2 s ) , for k = 1, 2; U k i ∼ N(ψi , σ2 s ) , for k = 3, 4. Note that γi being independent of θi can lead to significant divergence between the latent variables ψi and ξi , potentially making U 3 and U 4 anti-informative. The construction of D from {U k}k∈[K] is not unique, we explore two different methods: using Mahalanobis distance vs using Euclidean distance with the averaged data U avg i = 1 4 (U 1 i +U 2 i +U 3 i +U 4 i ) for i ∈ [m]. We assess their effectiveness under varying degrees of informativeness exhibited by the auxiliary samples. In both settings, we fix m = 1200, µ = 3, and change σs from 0.5 to 2 by 0.25. 
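For concreteness, the sketch below generates data from Setting 1 of the multiple-auxiliary-samples experiment (with m = 1200 and µ = 3 as in the text, and σs = 1 as one illustrative value from the grid) and builds the distance matrix D in the two ways compared above: Mahalanobis distance across the K auxiliary coordinates, as in Supplement C2, versus Euclidean distance on the averaged auxiliary data.

```python
# Sketch of Setting 1 for the multiple-auxiliary-samples experiment, and of the
# two distance-matrix constructions compared in the text. m and mu follow the
# text; sigma_s = 1 is one illustrative value from the grid 0.5, ..., 2.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
m, K, mu, sigma_s = 1200, 4, 3.0, 1.0

theta = rng.binomial(1, 0.1, size=m)                         # signal indicators
xi = np.where(theta == 1, rng.normal(mu, 1.0, size=m), 0.0)  # latent variable
T = rng.normal(xi, 1.0)                                      # primary statistics
U = rng.normal(xi[None, :], sigma_s, size=(K, m)).T          # auxiliary data, shape (m, K)

# Option 1: (squared) Mahalanobis distance across the K auxiliary coordinates,
# matching the definition in Supplement C2.
VI = np.linalg.inv(np.cov(U, rowvar=False))
D_mahal = squareform(pdist(U, metric="mahalanobis", VI=VI)) ** 2

# Option 2: Euclidean distance on the averaged auxiliary data.
U_avg = U.mean(axis=1)
D_avg = squareform(pdist(U_avg[:, None], metric="euclidean"))

print(D_mahal.shape, D_avg.shape)   # both (m, m), ready to define local neighborhoods
```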
The results are summarized in Figure C3. Intuitively, the averaging method reduces variance when all auxiliary samples are informative and leads to power gain over the Mahalanobis approach. However, the latter appears to be more robust when some samples are anti-informative. FDR Power 0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0 0.5 0.6 0.7 0.8 0.9 1.0 0.000 0.025 0.050 0.075 0.100 σ Method BH LASLA.OR Mahalanobis Average (a) FDR Power 0.5 1.0 1.5 2.0 0.5 1.0 1.5 2.0 0.5 0.6 0.7 0.8 0.9 0.000 0.025 0.050 0.075 0.100 σ Method BH LASLA.OR Mahalanobis Average (b) Figure C3: Empirical FDR and power comparison. (a): Multiple auxiliary samples Setting 1; (b): Multiple auxiliary samples Setting 2. 190 C6 Proof of Main Results Recall that Di be the ith column of D, and Di is a continuous finite domain (w.r.t. coordinate i) in R with positive measure by adapting the fixed-domain asymptotics in [120]. Each d ∈ Di is a distance and 0 ∈ Di . The two sets Di and Di can be viewed as distances measured from a partial/full network, respectively, and Di ⊂ Di . Throughout the proofs, we assume that Di → Di as m → ∞ in the sense that, for any d0 ∈ Di , there exists at least an index j such that |Dij − d0| = O(m−1 ) as m → ∞. Proof of Proposition 2 Proof. For simplicity of notation, throughout we omit the conditioning on D, and use P(Pj > τ |Dij = x) and P(Pi > τ ) to denote P(Pj > τ | Dj , Dij = x) and P(Pi > τ | Di) respectively. Recall that 1 − πi = P j∈Ni [Kh(Dij )I{Pj > τ}] (1 − τ ) P j∈Ni Kh(Dij ) Also note that, for all i ∈ [m], E X j∈Ni [Kh(Dij )I{Pj > τ} | D = X j∈Ni [Kh(Dij ) P(Pj > τ | Dj )] Then by m−1 ≪ h ≪ m−ϵ , as Di → Di , we have E(1 − πi | D) R D˜ i Kh(x) P(Pjx > τ | Dijx = x) dx/(1 − τ ) R D˜ i Kh(x) dx → 1, 191 where jx represents the index such that Dijx = x and D˜ i is the limit of {Dij , j ∈ Ni} in the asymptotic framework described in Section 4.2.2. Using Taylor expansion at x = 0, combined with Assumption (A1), we have Z D˜ i Kh(x) P(Pjx > τ | Dijx = x) dx = P(Pi > τ ) Z D˜ i Kh(x) dx + P ′ (Pi > τ ) Z D˜ i xKh(x) dx + P ′′(Pi > τ ) 2 Z D˜ i x 2Kh(x) dx + O(h 2 ). Thus, by the assumptions of K(·) in (4.4), uniformly for all index i, there exists some constant c > 0 such that [E(πi | D) − π τ i ] 2 ≤ c Z D˜ i xKh(x) dxZ D˜ i Kh(x) dx + c Z D˜ i x 2Kh(x) Z D˜ i Kh(x) dx2 + o(1) → 0, as h → 0. Now we inspect the variance term. By Condition (A2), there exists a constant c ′ > 1, VarX j∈Ni [Kh(Dij )I{Pj > τ}] | D ≤ c ′ X j∈Ni [K2 h (Dij ) Var I{Pj > τ} | Dj ] = c ′ X j∈Ni K2 h (Dij ) P(Pj > τ | Dj ){1 − P(Pj > τ | Dj )} . 19 Hence, as h ≫ m−1 , by the assumptions of K(·) in (4.4) and that it is positive and bounded, we have Var(1 − πi) ≤ c ′′m−1 R D˜ i K2 h (x) dx [(1 − τ ) R D˜ i Kh(x) dx] 2 ≤ c ′′(mh) −1 R R K2 (y) dy [(1 − τ ) R D˜ i Kh(x) dx] 2 ≤ c ′′′(mh) −1 = o(1), for some constant c ′′, c′′′ > 0. Hence, as Di → Di , by combining the bias term and variance term, the consistency result is proved. Proof of Theorem 10 Proof. For simplicity of notation, throughout we omit the conditioning on D, and use P(θi = 0) to denote P(θi = 0 | Di). 
Proof of Theorem 10

Proof. For notational simplicity, we omit the conditioning on $D$ throughout and write $\mathbb{P}(\theta_i = 0)$ for $\mathbb{P}(\theta_i = 0 \mid D_i)$. Note that, by Algorithm C10, the FDP of LASLA at the thresholding level $t$ can be written as
$$\mathrm{FDP}(t) = \frac{\sum_{i=1}^m \mathbb{I}\{P^w_i \le t, \theta_i = 0\}}{\max\big[\sum_{i=1}^m \mathbb{I}\{P^w_i \le t\}, 1\big]} = \frac{\sum_{i=1}^m \mathbb{P}(P^w_i \le t, \theta_i = 0 \mid w_i, D_i)}{\max\big[\sum_{i=1}^m \mathbb{I}\{P^w_i \le t\}, 1\big]} \cdot \frac{\sum_{i=1}^m \mathbb{I}\{P^w_i \le t, \theta_i = 0\}}{\sum_{i=1}^m \mathbb{P}(P^w_i \le t, \theta_i = 0 \mid w_i, D_i)},$$
where, for ease of notation, we write $\mathbb{P}(P^w_i \le t, \theta_i = 0)$ for $\mathbb{P}(P^w_i \le t, \theta_i = 0 \mid w_i, D_i)$ and $\mathbb{P}(P^w_i \le t \mid \theta_i = 0)$ for $\mathbb{P}(P^w_i \le t \mid \theta_i = 0, w_i)$ throughout, and similarly for the corresponding z-values.

Step 1: We first show that, uniformly for all $i \in [m]$,
$$\sum_{i=1}^m \mathbb{P}(P^w_i \le t, \theta_i = 0) \le (1 + o_{\mathbb{P}}(1)) \sum_{i=1}^m w_i (1 - \pi_i) t. \qquad \text{(C4)}$$
Note that, in Algorithm 8, $T_i$ is not used in the computation of $w_i$. Then, by the assumption that all p-values are marginally independent, $T_i$ is independent of $w_i$. Hence,
$$\sum_{i=1}^m \mathbb{P}(P^w_i \le t, \theta_i = 0) = \sum_{i=1}^m \mathbb{P}(P^w_i \le t \mid \theta_i = 0)\, \mathbb{P}(\theta_i = 0) = \sum_{i=1}^m w_i (1 - \pi_i^*) t \le \sum_{i=1}^m w_i (1 - \pi_i^{\tau}) t,$$
where the last inequality follows from the fact that $\pi_i^{\tau}$ is a conservative approximation of $\pi_i^*$, as shown in [21]. By the result of Proposition 2 and Assumption (A3), together with the fact that $\xi \le w_i \le 1$ for $i \in [m]$, we have
$$\sum_{i=1}^m \mathbb{P}(P^w_i \le t, \theta_i = 0) \le \sum_{i=1}^m w_i (1 - \pi_i + o_{\mathbb{P}}(1)) t = (1 + o_{\mathbb{P}}(1)) \sum_{i=1}^m w_i (1 - \pi_i) t.$$
Hence (C4) is proved.

Step 2: We next show that
$$\frac{\sum_{\theta_i = 0} \mathbb{P}(P^w_i \le t \mid \theta_i = 0) - \sum_{i=1}^m \mathbb{P}(P^w_i \le t, \theta_i = 0)}{\sum_{i=1}^m \mathbb{P}(P^w_i \le t, \theta_i = 0)} \to 0 \qquad \text{(C5)}$$
in probability. Define the event
$$B = \Big[\{\theta_i\}_{i=1}^m : \sum_{i=1}^m \mathbb{I}\{\theta_i = 0\} \ge cm \text{ for some constant } c > 0\Big].$$
It follows from Condition (A3) that $\mathbb{P}(B) \to 1$. Then, by the fact that $\xi \le w_i \le 1$, we obtain
$$\mathbb{E}\left\{\frac{\sum_{i=1}^m [\mathbb{P}(P^w_i \le t \mid \theta_i = 0)\mathbb{I}\{\theta_i = 0\} - \mathbb{P}(P^w_i \le t, \theta_i = 0)]}{\sum_{i=1}^m \mathbb{P}(P^w_i \le t, \theta_i = 0)}\right\}^2 = \mathbb{E}\left\{\frac{\sum_{i=1}^m \mathbb{P}(P^w_i \le t \mid \theta_i = 0)\big[\mathbb{I}\{\theta_i = 0\} - \mathbb{P}(\theta_i = 0)\big]}{\sum_{i=1}^m \mathbb{P}(P^w_i \le t \mid \theta_i = 0)\mathbb{P}(\theta_i = 0)}\right\}^2 = \frac{\mathrm{Var}\big(\sum_{i=1}^m \mathbb{P}(P^w_i \le t \mid \theta_i = 0)\mathbb{I}\{\theta_i = 0\}\big)}{\big\{\sum_{i=1}^m \mathbb{P}(P^w_i \le t \mid \theta_i = 0)\mathbb{P}(\theta_i = 0)\big\}^2} = O(m^{\zeta - 1}),$$
where $0 \le \zeta < 1$. Hence (C5) is proved.

Step 3: Finally, we handle the quantity $\sum_{i=1}^m \mathbb{I}\{P^w_i \le t, \theta_i = 0\} \big/ \sum_{i=1}^m \mathbb{P}(P^w_i \le t, \theta_i = 0)$. We first check the range of the cutoff $t$, or equivalently the threshold for the weighted z-values $z^w_i = \Phi^{-1}(1 - P^w_i/2)$, $i \in [m]$. As shown in [21], replacing their weights $\pi_i/(1-\pi_i)$ with $w_i$, applying the BH procedure at level $\alpha$ to the adjusted p-values with weights $w_i$ yields a threshold no larger than the LASLA threshold for the adjusted p-values with the same weights $w_i$. Hence it suffices to bound the threshold for the weighted z-values $z^w_i = \Phi^{-1}(1 - P^w_i/2)$ of this BH procedure with weights $w_i$. Let $t_m = (2\log m - 2\log\log m)^{1/2}$. By Condition (A4), we have
$$\sum_{\theta_i = 1} \mathbb{I}\{|z_i| \ge (c\log m)^{1/2 + \rho/4}\} \ge \{1/(\pi^{1/2}\alpha) + \delta\}(\log m)^{1/2}$$
with probability tending to one. Recall that $\xi \le w_i \le 1$ for some constant $\xi > 0$. Thus, for those indices $i \in \mathcal{H}_1$ (equivalently $\theta_i = 1$) such that $|z_i| \ge (c\log m)^{1/2+\rho/4}$, we have $P^w_i \le \{1 - \Phi((c\log m)^{1/2+\rho/4})\}/w_i = o(m^{-M})$ for any constant $M > 0$. Thus
$$\sum_{i \in [m]} \mathbb{I}\{z^w_i \ge (2\log m)^{1/2}\} \ge \{1/(\pi^{1/2}\alpha) + \delta\}(\log m)^{1/2}$$
with probability tending to one. Hence, with probability tending to one,
$$\frac{2m}{\sum_{i \in [m]} \mathbb{I}\{z^w_i \ge (2\log m)^{1/2}\}} \le 2m\{1/(\pi^{1/2}\alpha) + \delta\}^{-1}(\log m)^{-1/2}.$$
Because $1 - \Phi(t_m) \sim \{(2\pi)^{1/2} t_m\}^{-1}\exp(-t_m^2/2)$, it suffices to show that
$$\sup_{0 \le t \le t_m} \left|\frac{\sum_{i=1}^m \mathbb{I}\{z^w_i \ge t, \theta_i = 0\} - \sum_{i=1}^m \mathbb{P}(z^w_i \ge t, \theta_i = 0)}{\sum_{i=1}^m \mathbb{P}(z^w_i \ge t, \theta_i = 0)}\right| \to 0 \qquad \text{(C6)}$$
in probability. Let the event $A = [\{\theta_i\}_{i=1}^m : \text{(C5) holds}]$. By the proofs in Step 2, we have $\mathbb{P}(A) \to 1$.
Hence, it is enough to show that, for $\{\theta_i\}_{i=1}^m \in A$,
$$\sup_{0 \le t \le t_m} \left|\frac{\sum_{\theta_i = 0} [\mathbb{I}\{z^w_i \ge t\} - \mathbb{P}(z^w_i \ge t \mid \theta_i = 0)]}{\sum_{\theta_i = 0} \mathbb{P}(z^w_i \ge t \mid \theta_i = 0)}\right| \to 0 \qquad \text{(C7)}$$
in probability. Let $0 \le t_0 < t_1 < \cdots < t_b = t_m$ be such that $t_\iota - t_{\iota-1} = v_m$ for $1 \le \iota \le b-1$ and $t_b - t_{b-1} \le v_m$, where $v_m = 1/\{\sqrt{\log m}\,(\log_4 m)\}$; thus $b \sim t_m/v_m$. For any $t$ with $t_{\iota-1} \le t \le t_\iota$, since $G(t + o((\log m)^{-1/2}))/G(t) = 1 + o(1)$ with $G(t) = 2(1-\Phi(t))$, uniformly in $0 \le t \le c(\log m)^{1/2}$ for any constant $c$, by [138] it suffices to prove that
$$\max_{0 \le \iota \le b} \left|\frac{\sum_{\theta_i = 0} [\mathbb{I}\{z^w_i \ge t_\iota\} - \mathbb{P}(z^w_i \ge t_\iota \mid \theta_i = 0)]}{\sum_{\theta_i = 0} \mathbb{P}(z^w_i \ge t_\iota \mid \theta_i = 0)}\right| \to 0 \qquad \text{(C8)}$$
in probability. Thus, it suffices to show that, for any $\epsilon > 0$,
$$\int_0^{t_m} \mathbb{P}\left\{\left|\frac{\sum_{\theta_i = 0} [\mathbb{I}\{z^w_i \ge t\} - \mathbb{P}(z^w_i \ge t \mid \theta_i = 0)]}{\sum_{\theta_i = 0} \mathbb{P}(z^w_i \ge t \mid \theta_i = 0)}\right| \ge \epsilon\right\} dt = o(v_m). \qquad \text{(C9)}$$
Note that
$$\mathbb{E}\left\{\frac{\sum_{\theta_i = 0} [\mathbb{I}\{z^w_i \ge t\} - \mathbb{P}(z^w_i \ge t \mid \theta_i = 0)]}{\sum_{\theta_i = 0} \mathbb{P}(z^w_i \ge t \mid \theta_i = 0)}\right\}^2 = \frac{\sum_{\theta_i = 0, \theta_j = 0} \mathbb{P}\{z^w_i \ge t, z^w_j \ge t \mid \theta_i = 0, \theta_j = 0\} - \big\{\sum_{\theta_i = 0} \mathbb{P}(z^w_i \ge t \mid \theta_i = 0)\big\}^2}{\big\{\sum_{\theta_i = 0} \mathbb{P}(z^w_i \ge t \mid \theta_i = 0)\big\}^2}.$$
Recall that, by Algorithm 8, only $O(m^{1-\epsilon})$ neighbors are used to construct $w_i$, for any small enough constant $\epsilon > 0$. Hence we can divide the index pairs $\tilde{\mathcal{H}}_0 = \{(i,j) : \theta_i = 0, \theta_j = 0\}$ into two subsets:
$$\tilde{\mathcal{H}}_{01} = \{(i,j) \in \tilde{\mathcal{H}}_0 : \text{either } P^w_i \text{ is correlated with } P_j \text{ or } P^w_j \text{ is correlated with } P_i\}, \qquad \tilde{\mathcal{H}}_{02} = \tilde{\mathcal{H}}_0 \setminus \tilde{\mathcal{H}}_{01},$$
where $|\tilde{\mathcal{H}}_{01}| = O(m^{2-\epsilon})$, and among these pairs the $m$ pairs with $i = j$ are perfectly correlated. Note that, for $(i,j) \in \tilde{\mathcal{H}}_{01}$,
$$\frac{\sum_{(i,j)\in\tilde{\mathcal{H}}_{01}} \big[\mathbb{P}\{z^w_i \ge t, z^w_j \ge t \mid \theta_i = 0, \theta_j = 0\} - \prod_{h=i,j} \mathbb{P}(z^w_h \ge t \mid \theta_h = 0)\big]}{\big\{\sum_{\theta_i = 0} \mathbb{P}(z^w_i \ge t \mid \theta_i = 0)\big\}^2} \le \frac{\sum_{(i,j)\in\tilde{\mathcal{H}}_{01}} \mathbb{P}\{z^w_i \ge t, z^w_j \ge t \mid \theta_i = 0, \theta_j = 0\}}{\big\{\sum_{\theta_i = 0} \mathbb{P}(z^w_i \ge t \mid \theta_i = 0)\big\}^2}.$$
Recall the event $B = [\{\theta_i\}_{i=1}^m : \sum_{i=1}^m \mathbb{I}\{\theta_i = 0\} \ge cm$ for some constant $c > 0]$ with $\mathbb{P}(B) \to 1$. Because $\xi \le w_i \le 1$, for $\{\theta_i\}_{i=1}^m \in A \cap B$ we have
$$\frac{\sum_{(i,j)\in\tilde{\mathcal{H}}_{01}} \big[\mathbb{P}\{z^w_i \ge t, z^w_j \ge t \mid \theta_i = 0, \theta_j = 0\} - \prod_{h=i,j} \mathbb{P}(z^w_h \ge t \mid \theta_h = 0)\big]}{\big\{\sum_{\theta_i = 0} \mathbb{P}(z^w_i \ge t \mid \theta_i = 0)\big\}^2} \le \frac{1}{\sum_{\theta_i = 0} \mathbb{P}(z^w_i \ge t \mid \theta_i = 0)} + O\Big(\frac{m^{2-\epsilon}}{m^2}\Big) \le O\Big(\frac{1}{m G(t)}\Big) + O(m^{-\epsilon}),$$
where the first term reflects the pairs with $i = j$. On the other hand,
$$\sum_{(i,j)\in\tilde{\mathcal{H}}_{02}} \Big[\mathbb{P}\{z^w_i \ge t, z^w_j \ge t \mid \theta_i = 0, \theta_j = 0\} - \prod_{h=i,j} \mathbb{P}(z^w_h \ge t \mid \theta_h = 0)\Big] = 0.$$
Then, by the fact that $\int_0^{t_m} \{1/(mG(t)) + m^{-\epsilon}\}\, dt = o(v_m)$ and that $\mathbb{P}(A \cap B) \to 1$, (C9) is proved and hence (C7) holds. Combining (C7) and (C5), we obtain (C6). This, together with (C4), proves Theorem 10.

Proof of Theorem 2

Proof. Note that
$$Q^1(t) = \frac{\sum_{i=1}^m (1 - \pi_i^*) t}{\sum_{i=1}^m (1 - \pi_i^*) t + \sum_{i=1}^m \pi_i^* F^*_{1i}(t \mid D)}.$$
Recalling that $\tilde{w}_i = w_i \sum_{j=1}^m (1 - \pi_j^*) \big/ \sum_{j=1}^m \{(1 - \pi_j^*) w_j\}$, we have
$$Q^{\tilde{w}}(t) = \frac{\sum_{i=1}^m (1 - \pi_i^*) \tilde{w}_i t}{\sum_{i=1}^m (1 - \pi_i^*) \tilde{w}_i t + \sum_{i=1}^m \pi_i^* F^*_{1i}(\tilde{w}_i t \mid D)} = \frac{\sum_{i=1}^m (1 - \pi_i^*) t}{\sum_{i=1}^m (1 - \pi_i^*) t + \sum_{i=1}^m \pi_i^* F^*_{1i}(\tilde{w}_i t \mid D)}.$$
Under Assumption (A6), we have
$$\sum_{i=1}^m \pi_i^* F^*_{1i}(\tilde{w}_i t \mid D) = \sum_{i=1}^m \pi_i^* F^*_{1i}(t/\tilde{w}_i^{-1} \mid D) \ge \sum_{i=1}^m \pi_i^* F^*_{1i}\Big(\frac{\sum_{j=1}^m \pi_j^*\, t}{\sum_{j=1}^m \pi_j^* \tilde{w}_j^{-1}} \,\Big|\, D\Big).$$
By Assumption (A5) and the construction $\tilde{w}_i = w_i \sum_{j=1}^m (1 - \pi_j^*) \big/ \sum_{j=1}^m \{(1 - \pi_j^*) w_j\}$, we have $\sum_{i=1}^m \pi_i^* \big/ \sum_{i=1}^m \pi_i^* \tilde{w}_i^{-1} \ge 1$. Therefore,
$$\sum_{i=1}^m \pi_i^* F^*_{1i}(\tilde{w}_i t \mid D) \ge \sum_{i=1}^m \pi_i^* F^*_{1i}(t \mid D).$$
Hence, by the definition of $t_o^1$, it is easy to see that $Q^{\tilde{w}}(t_o^1) \le Q^1(t_o^1) \le \alpha$. It follows that $t_o^{\tilde{w}} \ge t_o^1$, and thus $\Psi^{\tilde{w}}(t_o^{\tilde{w}}) \ge \Psi^{\tilde{w}}(t_o^1) \ge \Psi^1(t_o^1)$.

C7 Asymptotic theories under weak dependence

In this section, we study the asymptotic control of the FDP and FDR for dependent p-values. We collect some additional regularity conditions to develop the theory under weak dependence.
We first introduce the benchmark oracle weight in Section C7.1. The proofs are then developed in two stages: Section C7.2 establishes the consistency of the weight estimators, and Section C7.3 shows that the oracle-assisted LASLA controls the FDP and FDR asymptotically.

C7.1 Oracle weight

With a slight abuse of notation, let $L_i^* = (1 - \pi_i^{\tau}) f_0(t_i)/f_i^*(t_i \mid D)$, where $f_i^*(\cdot \mid D)$ can be interpreted as the density function of the primary statistic in light of the full side information. Again, we omit the conditioning on $D$ throughout for notational simplicity. Since $f_i(t)$ is computed from the partial side information $D_i$, it should become close to $f_i^*(t)$ as $D_i \to \mathcal{D}_i$; this is shown rigorously in Section C7.2. As with the oracle-assisted weights defined in Section 4.2.3, denote the sorted statistics by $L^*_{(1)} \le \cdots \le L^*_{(m)}$ and let $L^*_{(k^*)}$ be the threshold, where $k^* = \max\{j : j^{-1} \sum_{i=1}^j L^*_{(i)} \le \alpha\}$. Then, for $T_i > 0$, let $t_i^{*,+} = \infty$ if $(1 - \pi_i^{\tau}) f_0(t)/f_i^*(t) \ge \mathbb{E}\{L^*_{(k^*)}\}$ for all $t \ge 0$; otherwise let
$$t_i^{*,+} = \inf\big[t \ge 0 : \{(1 - \pi_i^{\tau}) f_0(t)/f_i^*(t)\} \le \mathbb{E}\{L^*_{(k^*)}\}\big],$$
and define $w_i^* = 1 - F_0(t_i^{*,+})$. For $T_i < 0$, let $t_i^{*,-} = -\infty$ if $(1 - \pi_i^{\tau}) f_0(t)/f_i^*(t) \ge \mathbb{E}\{L^*_{(k^*)}\}$ for all $t \le 0$; otherwise let
$$t_i^{*,-} = \sup\big[t \le 0 : \{(1 - \pi_i^{\tau}) f_0(t)/f_i^*(t)\} \le \mathbb{E}\{L^*_{(k^*)}\}\big],$$
and the corresponding weight is $w_i^* = F_0(t_i^{*,-})$. Again, we truncate $w_i^* = \max\{w_i^*, \xi\}$ and $w_i^* = \min\{w_i^*, 1-\xi\}$ for a sufficiently small constant $0 < \xi < 1$. The oracle thresholding rule is then given by
$$k^{*,w} = \max\Big\{j : \frac{P^{w^*}_{(j)}}{j} \sum_{i=1}^m w_i^* (1 - \pi_i^{\tau}) \le \alpha\Big\}. \qquad \text{(C10)}$$
We show next that the oracle-assisted weight $w_i$ in Algorithm C10 estimates $w_i^*$ consistently under the regularity conditions stated in the following section.

C7.2 Consistency of the weight estimator

The weight consistency result is built upon the consistency of the sparsity estimator (4.5) and the density estimator (4.6). The theoretical properties of the former can be proved in the same way as Proposition 2 under Conditions (A1) and (A2), with $\mathcal{N}_i = \{j \in [m] : j \ne i\}$ and $h \gg m^{-1}$. We therefore focus on the consistency of the density estimator below. Recall that
$$f_i(t) = \frac{\sum_{j \ne i} V_h(i,j)\, K_h(t_j - t)}{\sum_{j \ne i} V_h(i,j)}.$$
We focus on the case where the support of the primary statistics $T = \{T_i : i \in [m]\}$ is $\mathbb{R}$, e.g. z-statistics and t-statistics.

(A7) Assume that, for all $i, j$, $f_j^*(t \mid D_{ij} = x)$ has bounded first and second partial derivatives in $t$ and $x$.

(A8) Assume that, for all $i \in [m]$ and all $t$,
$$\mathrm{Var}\Big(\sum_{j=1}^m K_h(D_{ij}) K_h(t_j - t) \,\Big|\, D\Big) \le C \sum_{j=1}^m \mathrm{Var}\{K_h(D_{ij}) K_h(t_j - t) \mid D_j\}$$
for some constant $C > 1$.

Remark. Assumption (A7) is a mild regularity condition on the densities of the primary statistics. Condition (A8) requires that most of the primary statistics be weakly correlated.

Lemma C2. Let $K(\cdot)$ be a kernel function satisfying (4.4) and let $T$ be a random variable with support $\mathbb{R}$. Assume that its conditional density $f(\cdot \mid D)$ has bounded first and second derivatives. Then, for any fixed $t$, as the bandwidth $h \to 0$,
$$\mathbb{E}(K_h(T - t) \mid D) = f(t \mid D) + O(h^2)\sigma_K^2, \qquad \mathbb{E}(K_h^2(T - t) \mid D) = \frac{f(t \mid D)\, R(K)}{h} + O(h)\, G(K),$$
where $R(K) = \int_{\mathbb{R}} K^2(x)\, dx$ and $G(K) = \int_{\mathbb{R}} x^2 K^2(x)\, dx$.

Once Lemma C2 is established, we obtain the following proposition on the consistency of the density estimator.

Proposition C2. Under Assumptions (A7) and (A8), if $h \gg m^{-1/2}$, then for any $t$, uniformly for all $i \in [m]$, $\mathbb{E}[\{f_i(t) - f_i^*(t)\}^2 \mid D] \to 0$ as $D_i \to \mathcal{D}_i$.
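As a concrete illustration of the estimator analyzed in Proposition C2, the sketch below computes $f_i(t)$ by double kernel smoothing, over distances and over the primary statistics. It assumes NumPy, Gaussian kernels, and the simple choice $V_h(i,j) = K_h(D_{ij})$; the helper names and kernel choice are illustrative assumptions rather than the thesis's exact implementation.

```python
import numpy as np

def gaussian_kernel(u, h):
    """K_h(u) = K(u / h) / h with a standard Gaussian kernel K."""
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def density_estimate(T, D, i, t, h):
    """Estimate f_i(t): smooth the statistics T_j over j != i, weighting
    each T_j by the kernelized distance V_h(i, j) = K_h(D_ij)."""
    mask = np.arange(len(T)) != i
    v = gaussian_kernel(D[i, mask], h)      # distance weights V_h(i, j)
    k = gaussian_kernel(T[mask] - t, h)     # kernel in the statistic, K_h(T_j - t)
    return np.sum(v * k) / np.sum(v)
```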
Next, we develop the consistency result for the oracle-assisted weight in Algorithm C10. Without loss of generality, assume that $-\infty < t_i^{*,-} \le t_i^{*,+} < +\infty$ for all $i \in [m]$. Let $g_i(t) = (1 - \pi_i^{\tau}) f_0(t)/f_i^*(t)$ and define the functions $g_{i,+}^{-1} : x \mapsto t$ and $g_{i,-}^{-1} : x \mapsto t$ by
$$g_{i,+}^{-1}(x) = \inf\{t \ge 0 : g_i(t) \le x\}, \qquad g_{i,-}^{-1}(x) = \sup\{t \le 0 : g_i(t) \le x\}.$$
We set $g_{i,+}^{-1}(x) = +\infty$ if $g_i(t) \ge x$ for all $t \ge 0$, and $g_{i,-}^{-1}(x) = -\infty$ if $g_i(t) \ge x$ for all $t \le 0$. For simplicity, we also assume that $-\infty < t_i^- \le t_i^+ < +\infty$ for all $i \in [m]$; if not, the data-driven testing procedure is more conservative than the oracle one, and the asymptotic FDR control is again guaranteed. Based on Proposition C2, we obtain the following corollary.

Corollary C5. Assume that $g_{i,+}^{-1}(x)$ and $g_{i,-}^{-1}(x)$ have bounded first derivatives for all $0 < x < 1$ such that $-\infty < g_{i,-}^{-1}(x) \le g_{i,+}^{-1}(x) < +\infty$, and that there exist constants $\alpha_1$ and $\alpha_2$ such that $\frac{1}{k^*}\sum_{i=1}^{k^*} L^*_{(i)} \le \alpha_1 < \alpha < \alpha_2 \le \frac{1}{k^*+1}\sum_{i=1}^{k^*+1} L^*_{(i)}$ with probability tending to 1. Assume that $1/f_i^*(t)$ is bounded with probability tending to 1, uniformly for all $i \in [m]$. Further assume that $\pi_i^{\tau} \le 1 - \xi$ for a sufficiently small constant $\xi > 0$ and that $\mathrm{Var}(L^*_{(k^*)}) = o(1)$. Then, under the conditions of Propositions 2 and C2, we have, as $m \to \infty$, $w_i = w_i^* + o_{\mathbb{P}}(1)$, uniformly for all $i \in [m]$.

Remark. The conditions on $g_{i,+}^{-1}$, $g_{i,-}^{-1}$ and $L^*_{(i)}$ are mild and are easily satisfied by commonly used distributions such as the normal and t-distributions. The condition on $1/f_i^*(t)$ can be further relaxed through a more refined calculation of the convergence rate of $f_i(t)$ in the proof of Proposition C2. The condition $\mathrm{Var}(L^*_{(k^*)}) = o(1)$ is mild and holds in most of the settings within the scope of this paper; for example, in Setting 1 of Section C5, $\mathrm{Var}(L^*_{(k^*)})$ is of order $10^{-2}$.

C7.3 FDP control under weak dependence

Recall that the z-values are defined by $Z_i = \Phi^{-1}(1 - P_i/2)$, and let $Z = (Z_1, \ldots, Z_m)^{\top}$. We collect below one additional regularity condition for the asymptotic control of the error rates. We allow dependence to come from two sources: dependence among the $\theta_i$'s and dependence among the p-values given the $\theta_i$'s. Our conditions on these two types of correlation are specified in (A3) and (A9), respectively.

(A9) Define $(r_{i,j})_{m \times m} = R = \mathrm{Corr}(Z)$. Assume $\max_{1 \le i < j \le m} |r_{i,j}| \le r < 1$ for some constant $r > 0$. Moreover, there exists $\gamma > 0$ such that $\max_{\{i : \theta_i = 0\}} |\Gamma_i(\gamma)| = o(m^{\kappa})$ for some constant $0 < \kappa < \frac{1-r}{1+r}$, where $\Gamma_i(\gamma) = \{j : 1 \le j \le m, |r_{i,j}| \ge (\log m)^{-2-\gamma}\}$.

We first consider the oracle case. Recall that
$$k^{*,w} = \max\Big\{j : \frac{P^{w^*}_{(j)}}{j} \sum_{i=1}^m w_i^*(1 - \pi_i^{\tau}) \le \alpha\Big\}.$$
Denote the corresponding threshold for the weighted p-values by $t^{w^*}$ and the resulting set of decision rules by $\delta^{w^*}$. The next theorem shows that both the FDP and the FDR are controlled at the nominal level asymptotically under dependence.

Theorem C2. Under (A3), (A4) and (A9), for any $\epsilon > 0$,
$$\lim_{D_i \to \mathcal{D}_i, \forall i} \mathrm{FDR}(\delta^{w^*}) \le \alpha, \qquad \lim_{D_i \to \mathcal{D}_i, \forall i} \mathbb{P}\big(\mathrm{FDP}(\delta^{w^*}) \le \alpha + \epsilon\big) = 1.$$

The next theorem establishes the theoretical properties of data-driven LASLA: by the weight consistency result, the FDP and FDR are asymptotically controlled as well.

Theorem C3. Under the conditions of Corollary C5 and Theorem C2, for any $\epsilon > 0$,
$$\lim_{D_i \to \mathcal{D}_i, \forall i} \mathrm{FDR}(\delta^{w}) \le \alpha, \qquad \lim_{D_i \to \mathcal{D}_i, \forall i} \mathbb{P}\big(\mathrm{FDP}(\delta^{w}) \le \alpha + \epsilon\big) = 1.$$
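To fix ideas, the weighted step-up rule whose error rates Theorems C2 and C3 control can be written in a few lines. The sketch below assumes NumPy, takes the weights and sparsity estimates as given, and uses the weighted p-values $P_i^w = P_i/w_i$ (a reading consistent with $\mathbb{P}(P_i^w \le t \mid \theta_i = 0) = w_i t$ in the proofs); the function name is illustrative.

```python
import numpy as np

def lasla_step_up(p_values, w, one_minus_pi, alpha):
    """Reject the j smallest weighted p-values, where j is the largest index with
    (P^w_(j) / j) * sum_i w_i (1 - pi_i) <= alpha."""
    pw = p_values / w
    order = np.argsort(pw)
    scale = np.sum(w * one_minus_pi)
    ok = pw[order] * scale / np.arange(1, len(pw) + 1) <= alpha
    k = np.nonzero(ok)[0].max() + 1 if ok.any() else 0
    reject = np.zeros(len(pw), dtype=bool)
    reject[order[:k]] = True
    return reject
```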
C8 Proof of the theoretical results under dependency

Proof of Lemma C2

Proof. By a Taylor expansion of $f(y \mid D)$ at $y = t$, we have
$$\mathbb{E}(K_h(T - t) \mid D) = \int K_h(y - t) f(y \mid D)\, dy = \int K_h(y - t)\Big\{f(t \mid D) + f'(t \mid D)(y - t) + \frac{f''(t \mid D)}{2}(y - t)^2\Big\}\, dy + O(h^2) = f(t \mid D) + O(h^2)\sigma_K^2.$$
Similarly,
$$\mathbb{E}(K_h^2(T - t) \mid D) = \int K_h^2(y - t) f(y \mid D)\, dy = \int K_h^2(y - t)\Big\{f(t \mid D) + f'(t \mid D)(y - t) + \frac{f''(t \mid D)}{2}(y - t)^2\Big\}\, dy + O(h) = \frac{f(t \mid D)\, R(K)}{h} + O(h)\, G(K).$$

Proof of Proposition C2

Proof. By Lemma C2, we have
$$\mathbb{E}(f_i(t) \mid D) = \frac{\sum_{j \ne i} K_h(D_{ij})\, \mathbb{E}(K_h(t_j - t) \mid D_j)}{\sum_{j \ne i} K_h(D_{ij})} = \frac{\sum_{j \ne i} K_h(D_{ij})\, f_j^*(t \mid D_j)}{\sum_{j \ne i} K_h(D_{ij})} + O(h^2).$$
Since $h \gg m^{-1/2}$, as $D_i \to \mathcal{D}_i$ we have
$$\frac{\sum_{j \ne i} K_h(D_{ij})\, f_j^*(t \mid D_j) \big/ \sum_{j \ne i} K_h(D_{ij})}{\int_{\mathcal{D}_i} K_h(x)\, f_{j_x}^*(t \mid D_{ij_x} = x)\, dx \big/ \int_{\mathcal{D}_i} K_h(x)\, dx} \to 1,$$
where $j_x$ denotes the index such that $D_{ij_x} = x$. By a Taylor expansion of $f_{j_x}^*(t \mid D_{ij_x} = x)$ at $x = 0$, we have
$$\frac{\int_{\mathcal{D}_i} K_h(x)\, f_{j_x}^*(t \mid D_{ij_x} = x)\, dx}{\int_{\mathcal{D}_i} K_h(x)\, dx} = \frac{\int_{\mathcal{D}_i} K_h(x)\big\{f_i^*(t) + (f_i^*)'(t)x + \tfrac{(f_i^*)''(t)}{2}x^2\big\}\, dx}{\int_{\mathcal{D}_i} K_h(x)\, dx} + O(h^2) = f_i^*(t) + \frac{\int_{\mathcal{D}_i} K_h(x)\big\{(f_i^*)'(t)x + \tfrac{(f_i^*)''(t)}{2}x^2\big\}\, dx}{\int_{\mathcal{D}_i} K_h(x)\, dx} + O(h^2).$$
Under Assumption (A7) and the condition that $\mathcal{D}_i$ is finite, we have, for some constant $c > 0$,
$$[\mathbb{E}(f_i(t) \mid D) - f_i^*(t)]^2 \le c\, \frac{\int_{\mathcal{D}_i} |x| K_h(x)\, dx}{\int_{\mathcal{D}_i} K_h(x)\, dx} + c\, \frac{\int_{\mathcal{D}_i} x^2 K_h(x)\, dx}{\big(\int_{\mathcal{D}_i} K_h(x)\, dx\big)^2} + o(1) \to 0, \qquad \text{as } h \to 0.$$
For the variance term, by Assumption (A8) we have
$$\mathrm{Var}\Big(\sum_{j \ne i} K_h(D_{ij}) K_h(t_j - t) \,\Big|\, D\Big) \le c' \sum_{j \ne i} K_h^2(D_{ij})\, \mathrm{Var}(K_h(t_j - t) \mid D_j).$$
Hence, as $h \gg m^{-1/2}$, by Lemma C2, Assumption (4.4), and the fact that $K(\cdot)$ is positive and bounded, applying the Taylor expansion once more gives
$$\mathrm{Var}(f_i(t) \mid D) \le \frac{c' m^{-1}\big[\int_{\mathcal{D}_i} K_h^2(x)\big\{f_{j_x}^*(t \mid D_{ij_x} = x)\, R(K)/h - f_{j_x}^*(t \mid D_{ij_x} = x)\big\}\, dx + O(1)\big]}{\big(\int_{\mathcal{D}_i} K_h(x)\, dx\big)^2} \le \frac{c' m^{-1}\big[\tfrac{R(K)}{h}\int_{\mathcal{D}_i} K_h^2(x)\big\{f_i^*(t) + (f_i^*)'(t)x + \tfrac{(f_i^*)''(t)}{2}x^2\big\}\, dx + O(1)\big]}{\big(\int_{\mathcal{D}_i} K_h(x)\, dx\big)^2} \le \frac{c' m^{-1}\big[\tfrac{R(K)}{h}\int_{\mathcal{D}_i} K_h^2(x)\, f_i^*(t)\, dx + O(h^{-1})\big]}{\big(\int_{\mathcal{D}_i} K_h(x)\, dx\big)^2} \le \frac{c'' m^{-1}\big[\tfrac{R(K)}{h}\int_{\mathcal{D}_i} K_h^2(x)\, dx + O(h^{-1})\big]}{\big(\int_{\mathcal{D}_i} K_h(x)\, dx\big)^2} \le \frac{c'' m^{-1} h^{-2}\, R(K) \int_{\mathbb{R}} K^2(y)\, dy}{\big(\int_{\mathcal{D}_i} K_h(x)\, dx\big)^2} \le c''' m^{-1} h^{-2} = o(1),$$
for some constants $c'', c''' > 0$. Hence, as $D_i \to \mathcal{D}_i$, combining the bias and variance terms proves the consistency result.

Proof of Corollary C5

Proof. Recall that $L_i = (1 - \pi_i) f_0(T_i)/f_i(T_i)$. Based on the consistency results for $\pi_i$ and $f_i(t)$ in Proposition 2 (with $\mathcal{N}_i = \{j \in [m] : j \ne i\}$ and $h \gg m^{-1}$) and Proposition C2, together with the conditions that $1/f_i^*(t)$ is bounded and $\pi_i^{\tau} \le 1 - \xi$, we have, uniformly for all $i \in [m]$, $L_i = (1 + o_{\mathbb{P}}(1)) L_i^*$. Then, by the condition that $\frac{1}{k^*}\sum_{i=1}^{k^*} L^*_{(i)} \le \alpha_1 < \alpha < \alpha_2 \le \frac{1}{k^*+1}\sum_{i=1}^{k^*+1} L^*_{(i)}$ with probability tending to 1 and that $\mathrm{Var}(L^*_{(k^*)}) = o(1)$, it follows that
$$L_{(k)} = L^*_{(k^*)} + o_{\mathbb{P}}(1) = \mathbb{E}\{L^*_{(k^*)}\} + o_{\mathbb{P}}(1).$$
Then, based on the definitions of $g_{i,+}^{-1}$ and $g_{i,-}^{-1}$, we have $t_i^{*,+} = g_{i,+}^{-1}\big[\mathbb{E}\{L^*_{(k^*)}\}\big]$ and $t_i^{*,-} = g_{i,-}^{-1}\big[\mathbb{E}\{L^*_{(k^*)}\}\big]$, and, using the condition $\pi_i^{\tau} \le 1 - \xi$, $t_i^+ = g_{i,+}^{-1}\{(1 + o_{\mathbb{P}}(1)) L_{(k)}\}$ and $t_i^- = g_{i,-}^{-1}\{(1 + o_{\mathbb{P}}(1)) L_{(k)}\}$. Because $g_{i,+}^{-1}$ and $g_{i,-}^{-1}$ have bounded first derivatives, we have $t_i^+ = t_i^{*,+} + o_{\mathbb{P}}(1)$ and $t_i^- = t_i^{*,-} + o_{\mathbb{P}}(1)$. By Assumption (A7), $f_0$ is bounded, and we obtain $w_i = w_i^* + o_{\mathbb{P}}(1)$, uniformly for all $i \in [m]$.
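The quantity $L_{(k)}$ that drives this consistency argument is straightforward to compute. Here is a minimal sketch, assuming NumPy/SciPy, a standard normal null density $f_0$, and taking estimates of the mixture densities $f_i(T_i)$ as inputs; the function and argument names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def clfdr_threshold(T, pi, f_mix, alpha):
    """Compute L_i = (1 - pi_i) f0(T_i) / f_i(T_i) and return the threshold L_(k),
    where k = max{ j : (L_(1) + ... + L_(j)) / j <= alpha }."""
    L = (1.0 - pi) * norm.pdf(T) / f_mix        # f_mix[i] approximates f_i(T_i)
    L_sorted = np.sort(L)
    running_avg = np.cumsum(L_sorted) / np.arange(1, len(L) + 1)
    passing = np.nonzero(running_avg <= alpha)[0]
    return L_sorted[passing.max()] if passing.size else None
```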
Proof of Theorem C2

Proof. The FDP of the oracle procedure at the thresholding level $t$ can be written as
$$\mathrm{FDP}(t) = \frac{\sum_{i=1}^m \mathbb{I}\{P^{w^*}_i \le t, \theta_i = 0\}}{\max\{\sum_{i=1}^m \mathbb{I}\{P^{w^*}_i \le t\}, 1\}} = \frac{\sum_{i=1}^m \mathbb{P}(P^{w^*}_i \le t, \theta_i = 0)}{\max\{\sum_{i=1}^m \mathbb{I}\{P^{w^*}_i \le t\}, 1\}} \cdot \frac{\sum_{i=1}^m \mathbb{I}\{P^{w^*}_i \le t, \theta_i = 0\}}{\sum_{i=1}^m \mathbb{P}(P^{w^*}_i \le t, \theta_i = 0)} = \frac{\sum_{i=1}^m w_i^*(1 - \pi_i^*) t}{\max\{\sum_{i=1}^m \mathbb{I}\{P^{w^*}_i \le t\}, 1\}} \cdot \frac{\sum_{i=1}^m \mathbb{I}\{P^{w^*}_i \le t, \theta_i = 0\}}{\sum_{i=1}^m \mathbb{P}(P^{w^*}_i \le t, \theta_i = 0)} \le \frac{\sum_{i=1}^m w_i^*(1 - \pi_i^{\tau}) t}{\max\{\sum_{i=1}^m \mathbb{I}\{P^{w^*}_i \le t\}, 1\}} \cdot \frac{\sum_{i=1}^m \mathbb{I}\{P^{w^*}_i \le t, \theta_i = 0\}}{\sum_{i=1}^m \mathbb{P}(P^{w^*}_i \le t, \theta_i = 0)}.$$
Then, by Steps 2 and 3 in the proof of Theorem 10 with the $w_i$'s replaced by the true $w_i^*$'s, together with the proof of Theorem 2 in [21] and Assumption (A9), we have
$$\sup_{0 \le t \le t_m} \left|\frac{\sum_{i=1}^m \mathbb{I}\{Z^{w^*}_i \ge t, \theta_i = 0\} - \sum_{i=1}^m \mathbb{P}(Z^{w^*}_i \ge t, \theta_i = 0)}{\sum_{i=1}^m \mathbb{P}(Z^{w^*}_i \ge t, \theta_i = 0)}\right| \to 0$$
in probability. The FDP and FDR are therefore controlled, and Theorem C2 is proved.

Proof of Theorem C3

Proof. Note that the FDP of the data-driven procedure at the thresholding level $t$ can be written as
$$\mathrm{FDP}(t) = \frac{\sum_{i=1}^m \mathbb{I}\{P^{w}_i \le t, \theta_i = 0\}}{\max\{\sum_{i=1}^m \mathbb{I}\{P^{w}_i \le t\}, 1\}} = \frac{\sum_{i=1}^m w_i(1 - \pi_i) t}{\max\{\sum_{i=1}^m \mathbb{I}\{P^{w}_i \le t\}, 1\}} \cdot \frac{\sum_{i=1}^m \mathbb{P}\{P^{w}_i \le t, \theta_i = 0\}}{\sum_{i=1}^m w_i(1 - \pi_i) t} \cdot \frac{\sum_{i=1}^m \mathbb{I}\{P^{w}_i \le t, \theta_i = 0\}}{\sum_{i=1}^m \mathbb{P}\{P^{w}_i \le t, \theta_i = 0\}}.$$
Define the event $A = [\{w_i\}_{i=1}^m : w_i = w_i^* + o(1)]$; by the result of Corollary C5, $\mathbb{P}(A) \to 1$. We focus on the event $A$ in what follows. For $\{w_i\}_{i=1}^m \in A$, uniformly for all $i \in [m]$,
$$\mathbb{P}(P^{w}_i \le t \mid \theta_i = 0) = (1 + o(1))\, \mathbb{P}(P^{w^*}_i \le t \mid \theta_i = 0),$$
uniformly over the range of $t$ specified in Step 3 of the proof of Theorem 10. Then, uniformly for all $i \in [m]$,
$$\mathbb{P}(P^{w}_i \le t, \theta_i = 0) = (1 + o(1))(1 - \pi_i^*)\, \mathbb{P}(P^{w^*}_i \le t \mid \theta_i = 0) = (1 + o(1))\, w_i^*(1 - \pi_i^*) t \le (1 + o(1))\, w_i^*(1 - \pi_i^{\tau}) t.$$
Thus, by the results of Proposition 2 and Corollary C5 and the proofs of Theorems 10 and C2, the oracle-assisted weights yield an asymptotically more conservative procedure, which concludes the proof of Theorem C3.
Abstract
This thesis explores the development of computationally efficient algorithms to extract powerful and reliable inferences from large-scale datasets, focusing on conformalized post-selection inference and structured multiple testing. Conformalized post-selection inference utilizes conformal inference techniques to address the challenge of providing valid inferences after model or parameter selection. The problem is especially challenging when the same hold-out dataset is used for both selection and inference, owing to potential selection biases. I present innovative conformalized methods that address this selection bias dilemma with provable finite-sample guarantees in outlier detection, multi-class classification, and regression settings. In the area of structured multiple testing, I present a weighted hypothesis testing framework that tackles the dual challenges of leveraging side information to enhance test power and controlling error rates amid data dependencies. The framework effectively utilizes generic side information, with particular emphasis on applications to genome-wide association studies.