Neighborhood and Graph constructions using Non-Negative Kernel regression (NNK)

by

Sarath Shekkizhar

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL AND COMPUTER ENGINEERING)

August 2023

Copyright 2023 Sarath Shekkizhar

To my grandfather, Gunasekaran, who instilled in me a desire to learn and inspire.

Acknowledgements

My Ph.D., and the roller-coaster of experience that led to it, would not have been possible without several people's support, guidance, and company.

First and foremost, I would like to thank my advisor, Antonio Ortega, for his patience in helping me grow as a researcher. My acquaintance with Antonio started well before my Ph.D., in a course (EE586L) I took in Spring 2013. Little did I know that what started as playful banter about his football favorite, Real Madrid, winning over Manchester United was the beginning of an important relationship in my life. I still remember the conversation I had with him when I was trying to pivot from being a software engineer to a Ph.D. in 2016 – research is about creating knowledge and not just about learning new things – and the subsequent time he took to mentor me virtually before I joined his lab in 2017. In my early Ph.D. days, Antonio allowed me to pursue and explore several topics, carefully feeding me advice and criticism that helped me see a research path by myself. His support during the pandemic, a difficult and isolated time both socially and mentally, gave me hope and motivation to keep at my research. Throughout my Ph.D., there were several personal and professional discussions, all of which have left an impression on me and shaped who I am today. His approach to research, work ethic, and always positive attitude are some things that I hope to incorporate and hold on to in my career.

Another significant influence in shaping my outlook toward research and academia is Bart Kosko. Bart is the type of academic who takes an interest in the well-being and growth of all his students. To say his classes are unique would be an understatement – there were several occasions where I was mesmerized by the knowledge that was imparted on me in a few hours (to quote him, "And now, to pull the rabbit out of the hat"). I enjoyed our walks to the coffee room and then to his office, where Bart would discuss everything from math, copyright laws, and neural networks to world history, music, and movies. I thank him for his guidance and mentoring, things that have truly helped me mature as a person and as a researcher.

I want to thank my other qualification and defense committee members for their time, support, and encouragement. I got acquainted with Salman and Mahdi through a DARPA-funded project that Antonio was part of – Learning with Less Labels. Their questions, comments, and presentations during the project helped me polish and be critical of my work. Additionally, I sincerely thank Mohammad Rostami, Fei Sha, and Cyrus Shahabi for their input in improving and positioning my research in a broader domain. My meetings with Mohammad helped me stay in touch with the problems in the machine-learning community. I'm thankful to all my course faculty and the Viterbi Student Government for providing me a platform to learn, collaborate, and develop professionally. I'm indebted to the staff at USC, ECE, and MHI for all their time and help.
My time at USC would not have been possible if not for the company of my fellow STAC-USC lab mates (Pratyusha, Ajinkya, Eduardo, Romain), mentees (David, Keisuke, Aryan, and Carlos), and several others – I thank them for all the discussions and encouragement throughout my Ph.D.

Most importantly, I would like to thank my parents, Sekkilar and Amutha, my brother, Gowtham, and countless other relatives for their never-ending love, inspiration, and faith in me. It would be remiss of me not to mention the crucial role of my friends – Pooja (the one person outside my research community who most understands this thesis and the efforts behind it!), Monika, Mythili, Shantanu, Harish, Rahul, Snehal, Nitish, and Anjali – their support and company helped me through my ups and downs during my Ph.D. Special mention to my dogs, Ralph and Maui, for the never-ending stream of kisses and cuddles. Thank you all.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Algorithms
List of Figures
Abstract

Chapter 1: Introduction
1.1 Research contributions
1.1.1 Neighborhood and Graph constructions on data
1.1.2 Applications in graph-based learning and analysis
1.1.3 Graph-based framework for understanding deep learning
1.2 Thesis organization

Chapter 2: Non-negative kernel regression (NNK) neighborhood
2.1 Introduction
2.2 Background
2.2.1 Notation
2.2.2 Similarity Kernels
2.2.3 Locally linear neighborhoods
2.2.4 Related work
2.3 Neighborhoods and sparse signal approximations
2.3.1 Problem formulation
2.3.2 kNN and ϵ-neighborhood
2.3.3 Proposed: Non-negative kernel regression
2.3.3.1 NNK and constrained LLE objective
2.3.4 Basis Pursuits
2.3.4.1 Matching Pursuits
2.3.4.2 ℓ1 regularized pursuits
2.3.5 Graph construction
2.3.6 Complexity
2.4 Geometric Interpretation
2.4.1 NNK and active constraint set condition
2.4.2 Geometry of NNK neighbors
2.4.3 NNK Geometry in kernel space
2.4.4 Geometry of MP, OMP neighborhoods
2.5 Experiments and Results
2.5.1 NNK neighborhoods
2.5.2 NNK graphs
2.5.2.1 Sparsity and runtime complexity
2.5.2.2 Label Propagation
2.5.2.3 Laplacian Eigenmaps
2.6 Generalizations and Extensions
2.6.1 General Kernel Ratio Interval
2.6.2 Neighborhoods in subvector spaces
2.6.3 Gaussian kernels and Multi-scale neighborhoods
2.7 Discussion and Open questions
2.8 Proofs
2.8.1 Karush-Kuhn-Tucker conditions of the NNK objective
2.8.2 Proof of Proposition 2.1
2.8.3 Proof of Proposition 2.2
2.8.4 Proof of Proposition 2.3
2.8.5 Proof of Theorem 2.1
2.8.5.1 Proof of Corollary 2.1.1
2.8.5.2 Proof of Corollary 2.1.2
2.8.6 Proof of Proposition 2.4
2.8.7 Proof of Theorem 2.2
2.8.8 Proof of Theorem 2.3

Chapter 3: Applications in graph-based learning and analysis
3.1 Image graphs
3.1.1 Background
3.1.1.1 Bilateral Filter
3.1.1.2 Spectral Graph Wavelet (SGW) Transforms
3.1.2 NNK image graphs
3.1.2.1 Kernel Ratio Interval for images
3.1.2.2 Sparse graph representation of images
3.1.3 Representation and filtering with NNK image graphs
3.1.3.1 Energy compaction
3.1.3.2 Image Denoising
3.2 Data summarization
3.2.1 Background
3.2.1.1 Sparse Dictionary Learning
3.2.1.2 Related work
3.2.2 NNK-Means
3.2.3 Classification with summarized data
3.3 Neighborhood based interpolation
3.3.1 Local Polytope Interpolation
3.3.2 Theoretical analysis of NNK interpolation
3.3.2.1 A general bound on the NNK classifier
3.3.2.2 Leave one out stability
3.3.3 DeepNNK: Neural Networks + NNK Interpolation
3.3.3.1 Predicting generalization (L−L)
3.3.3.2 A simple few shot framework (UL−L)
3.3.3.3 Instance based explainability (UL−L)
3.3.3.4 Evaluating self-supervised learning models (UL−L)
3.3.3.5 Distance between datasets (UL−UL)
3.4 Discussion and Open questions
3.5 Proofs
3.5.1 Proof of Theorem 3.1
3.5.1.1 Lemma
3.5.2 Proof of Proposition 3.1
3.5.3 Proof of Proposition 3.2
3.5.4 Proof of Theorem 3.2
3.5.5 Proof of Proposition 3.3
3.5.5.1 Active set Lemma
3.5.6 Proof of Theorem 3.3
3.5.6.1 Proof of Corollary (3.3.1)
3.5.6.2 Proof of Corollary (3.3.2)
3.5.7 Proof of Theorem 3.4
3.5.8 Proof of Proposition 3.4

Chapter 4: Geometry of deep learning
4.1 Background
4.2 Manifold Graph Metrics
4.2.1 Graph signal variation
4.2.2 Invariance to augmentations
4.2.3 Curvature
4.2.4 Intrinsic Dimension
4.3 Case study: Self-supervised learning models
4.3.1 Geometry of SSL models
4.3.2 Transfer performances of SSL models
4.4 Discussion and Open questions
4.5 Appendix

References

List of Tables

2.1 Contributions of our proposed method (NNK) in relation to previous works. NNK combines the concepts of non-negativity, kernel-based similarity, and locally linear neighborhoods. Earlier works on LLE-neighborhood/graph construction did not explore the sparsity or the geometric consequences of non-negativity (why a data point i is selected while another j is not selected for representing a query q) in their definitions. In this work, we present the geometry of NNK, which applies to these earlier works and provides insights into their solutions. Further, unlike our work, these related approaches do not make the connection between neighborhoods and sparse approximation problems. The last column demonstrates the geometry of the neighborhood obtained (shaded in blue), based on Theorem 2.1, for linear and Gaussian kernels – the points are color-coded as data (gray), query (pink), and neighbors selected with the non-negative optimization (blue).

2.2 Test classification error (in %, lower is better) using local neighborhood methods on datasets with train (N_tr) and test (N_te) sample size, dimension (d), and classes (N_class).
NNK-based classification consistently outperforms baseline methods while being robust to k and σ choices.

2.3 Performance summary of neighborhood methods on 70 OpenML datasets. Listed metrics include (i) average classification error and standard deviation (in parentheses) across all datasets, (ii) the number of datasets where a method performs better than the other two (Win), performs within one standard deviation of the best performance (Tie), and performs poorly in comparison (Loss), and (iii) average rank of the method based on the classification (1st, 2nd, 3rd).

2.4 Computational runtime T (in seconds) and number of edges |ξ| (> 10^{-8}) for various graph constructions on a subset of MNIST data (N=1000, d=784) with k=30.

3.1 Key differences between earlier dictionary learning (DL) methods and NNK-Means. All methods iteratively use two steps: sparse coding and dictionary update. Our sparse coding procedure is explicitly based on local neighbors, with the sparsity defined adaptively based on the relative position of the data and atoms (i.e., the data geometry). Further, the atoms obtained with our method are points in the input space and correspond to a smooth partition of the data space. In contrast, atoms obtained in previous approaches need not belong to the input space, as they strive to represent both the inputs and their approximation residuals [1,2].

3.2 Classification accuracy (in %, higher is better) on MNIST, CIFAR-10, and their subsets (S, 20% of the randomly sampled training set). Each method learns a 50-atom dictionary per class, initialized randomly, with a sparsity constraint, where applicable, of 30, and solved for at most 10 iterations. NNK-Means consistently produces better test classification accuracy while having a reduced runtime compared to kSVD approaches and comparable to kMeans. Kernel kSVD produces comparable performance but at the cost of 67× and 7× slower train and test times relative to NNK-Means.

3.3 Overview of local neighborhood methods based on the availability of labels (Labeled, UnLabeled) at the data point and its corresponding neighbors, with a few applications in each category. The last column in the table links to relevant materials in our work corresponding to the setting in each group.

3.4 1-shot and 5-shot accuracy (in %, higher is better) for 5-way classification on mini-ImageNet and tiered-ImageNet averaged over 600 runs. Results from kNN and NNK transductive classification are compared to an inductive method, SimpleShot (CL2N) [3], and listed performances from recently studied transductive methods such as [4–7]. We see that NNK outperforms kNN as k increases while achieving robust performance. Further, our simple framework is comparable to, and often better than, recent and more complex transductive FSL algorithms that require additional training, fine-tuning of hyperparameters, or preprocessing as in [7].

3.5 Top-1 Linear, weighted kNN, and NNK classification accuracy on ImageNet. We report performance on the validation dataset with ResNet-50 models trained using different self-supervised training strategies. The kNN and NNK evaluations were done on the VISSL framework using officially released model weights with the number of neighbors k = 50. We do not run linear evaluations but list the performance as reported in the corresponding works for comparison. Code for the experiment is available at github.com/shekkizh.

4.1 (Left) Projection of SSL models' MGMs onto principal components. We observe three distinct clusters based on the MGMs, which represent the geometric similarities and differences between the various models. Note that these clusters are not necessarily aligned with the underlying SSL training paradigm, i.e., contrastive, non-contrastive, or prototype-clustering based. (Right) MGMs that make up the principal components and their values for each model, where we indicate by the two yellow boxes (a) and (b) the MGMs that capture the maximum variation of the models along the principal directions.

4.2 Evaluation setting and composed augmentation functions (sequentially applied in the order listed from left to right). All images are resized to 224 (bicubic interpolation) if not randomly cropped to that size and are mean-centered and standardized along each color channel based on statistics obtained with the ImageNet training set.

4.3 Proposed Manifold Graph Metrics, their relationship to the geometric property of the manifold, and method of estimation using observed embeddings. Note that a similar diameter evaluation with a k-nearest neighbor will explicitly depend on the choice of k, while using 1-nearest neighbor will reduce the local geometry to only one direction. Thus, the use of a neighborhood definition that is adaptive to the local geometry of the data is crucial to successfully observe the properties of the manifold.

4.4 SSL model details and the associated dendrogram split learned by our geometric metrics. We highlight the structural differences between SSL models that would possibly correspond to the geometric differences and similarities we observe. It appears that the main difference between SimCLR models and all the other SSL models could be explained by input normalization. Then, the batch size appears to also affect the geometry drastically. Finally, we observe on a finer scale that the momentum encoder, the presence of a projection head, and a memory bank could also lead to geometric differences. We believe that the current classification of SSL approaches (contrastive, non-contrastive, cluster-based) is not sufficient to capture their geometric differences. All the parameters described here should be taken into account, as we believe there exists a complex interplay between these choices and the induced geometry of the trained SSL model.

4.5 Proposed MGMs observed for different SSL models. We display here the values of each MGM for each SSL model. These correspond to the 26 MGMs that we consider in all the analyses we provide in the paper.

List of Algorithms

1 NNK Neighborhood algorithm
2 MP and OMP Neighborhood algorithms
3 NNK Graph algorithm
4 NNK Algorithm for Image Graphs
5 NNK-Means Algorithm
6 NNK Transductive Few Shot Learning
7 MGMs for SSL
List of Figures

2.1 Geometric difference in neighborhood definitions using k-nearest neighbors (kNN) and the proposed NNK approach with a Gaussian kernel. kNN selects neighbors based on the choice of k and can lead to the selection of geometrically redundant neighbors, i.e., the vector from the query to neighbors is almost collinear with other vectors joining the query and neighbors. In contrast, the proposed NNK definition selects only neighbors corresponding to non-collinear directions.

2.2 Problem setup for a query q with two points i, j and θ_i, θ_j the corresponding assigned weights. kNN or ϵ-neighborhood methods assign these weights using a kernel, i.e., K_{q,i}, K_{q,j}, irrespective of the relative positions of i and j (K_{i,j}). In contrast, the NNK neighborhood assigns non-zero θ_i and θ_j iff equation (2.15) is satisfied.

2.3 Geometry (represented in red) of an NNK neighborhood for a kernel similarity proportional to the distance between the data points. (a) Hyperplane associated with the selected NNK neighbor i. (b) Convex polytope corresponding to the NNK neighbors around a query.

2.4 (a) Geometric setup of NNK in RKHS. (b) Problem setup at the 2nd step of pursuit in MP, OMP neighborhoods.

2.5 Top: Average cross-validation error for different values of hyperparameters associated with k*NN, kNN, and our NNK method. The value of k in kNN and NNK is set to 10 in these experiments. Bottom: Test classification error of kNN and NNK for different values of k with hyperparameter σ selected using cross-validation. Here we include k*NN for reference. We see that the proposed NNK approach produces better performance for a wide range of values of the hyperparameter σ, with improvements in the estimation as k increases. In contrast, we observe that the kNN and k*NN methods require a careful selection of the hyperparameters for a good classification, which is still inferior to NNK.

2.6 Graph constructions in synthetic data settings. Unlike kNN and LLE, NNK achieves a sparse representation where the relative locations of nodes in the neighborhood are key. In the seven-node example (top row), the node in the center has only three direct neighbors, which in turn are neighbors of nodes extending the graph further in those three directions. Graphs are constructed with k = 5 and a Gaussian kernel (σ = 1). Edge thickness indicates neighbor weights normalized globally to have values in [0,1].

2.7 Ratio of the number of edges (with weights ≥ 10^{-8}) to the number of nodes. Left: Noisy Swiss roll (N=5000, d=3), Right: MNIST (N=1000, d=784). The sparsity of NNK, MP, and OMP graphs saturates for increasing k and is reflective of the intrinsic dimensionality of the data.

2.8 SSL performance (mean over 10 different subsets with 10 random initializations of available labels) with graphs constructed by different algorithms. Left column: Classification error at different percentages of labeled data on the USPS (top row) and MNIST (bottom row) datasets with k=30. The time taken and the sparsity of the constructed graphs are indicated as part of the legend. Right column: Boxplots showing the robustness of graph constructions using L and L Laplacians for different choices of k (10, 15, ..., 50) in an SSL task with 10% labeled data on USPS and MNIST.

2.9 Synthetic datasets used in graph experiments. Left: Swiss roll (N=5000, d=3), Right: Severed sphere (N=3000, d=3). The color map corresponding to the location on one axis is used to identify the relative position of the points in dimensionality reduction experiments.

2.10 Two-dimensional embedding learned by different algorithms. Left: Swiss roll, Right: Severed sphere. Each row corresponds to a different choice of k = 10, 20, 40 (top to bottom). We include the time taken to construct the graph and the number of edges obtained for each method in the subplot title. We see that NNK graphs produce a robust embedding for both datasets, with the sparsity of the constructed graph correlating with the intrinsic dimension of the dataset.

3.1 Left. A simple scenario of 4-connected neighbors and remaining pixels in a 7×7 window with their associated threshold factor (∆). For example, given that pixel j is connected to i, the proposed graph construction eliminates all pixel intensities in the green region (right of i) which satisfy the condition in Theorem 3.1. The threshold factor corresponding to each connected pixel is independent of pixel intensity and can be computed offline once for a given window size. The algorithm continues pruning by moving radially outwards, connecting pixels that are not pruned and removing ones that are to be pruned based on the proposition, until no pixel is left unprocessed. Right. Average processing time per pixel for our proposed simplified NNK and the original NNK construction. We observe a similar trend on all our test images, with the difference widening further for increasing window sizes.

3.2 (Top: 11×11 Bilateral Filter Graph vs. Bottom: Proposed NNK Graph Construction) Energy compaction using spectral graph wavelets. NNK graphs capture most of the image in the lower bands.

3.3 The energy captured by the BF graph (blue) and NNK graph (red) for different polynomial degree approximations of SGW. The wavelets were designed with frame bounds A = 1.71, B = 2.35 as in [8]. NNK consistently captures the image content better than the BF graph, irrespective of the Chebyshev polynomial degree.

3.4 Images used for the denoising experiment.

3.5 Denoising performance using SGW on the BF graph and the proposed method, with comparisons to the original BF and BM3D algorithms. NNK graphs significantly improve over the BF graph version in SSIM and PSNR. Our method improves the SSIM of the output while matching the PSNR performance of the original BF. The BM3D method, included for completeness, shows that the proposed graph method with the BF kernel achieves comparable SSIM measures.

3.6 Left: Proposed NNK-Means. The algorithm alternates between sparse coding (W) using NNK and dictionary update (A) until either the dictionary elements converge (up to a given error) or a given number of iterations is reached. Middle: During sparse coding, kMeans assigns each data point to its nearest neighbor, while NNK represents each data point in an adaptively formed convex polytope made of the dictionary atoms.
Right: Comparison between existing dictionary learning methods and the proposed NNK-Means. kMeans offers a 1-sparse dictionary learning approach. kSVD offers a more flexible representation, where the sparse coding stage uses a chosen, fixed k_0-sparsity in its representation but lacks geometry. NNK-Means has adaptive sparsity that relies on the relative positions of atoms around each data point to be represented.

3.7 100 atoms for MNIST digits (contrast scaled for visualization) obtained using kMeans (a), DL methods and their constrained variants (b, c, d), and NNK-Means (e) using a cosine kernel. Unlike earlier DL approaches, the proposed NNK-Means learns individual atoms that are linear combinations of the digits and can be associated geometrically with the input data. Such explicit properties of atoms are lost when working with the ℓ1-regularized or thresholding-based sparse coding methods used previously in DL. We also observe that atoms learned by NNK-Means are more visually similar to the input data.

3.8 Visualizing predictions (each color corresponds to a class) obtained using various dictionary-based classification schemes on a 4-class synthetic dataset (a). Each method learns a 10-atom dictionary per class based on the given training data (N=600) with a sparsity constraint, where applicable, of 5. The learned per-class dictionary is then used to classify test data (N=200), with accuracy indicated in parentheses for each method. In (a), training and test data are denoted by × and •, respectively. We see that the kSVD (d) approach cannot adapt to the nonlinear structure of the data, and adding a kernel (e) is crucial in such scenarios. We see that NNK-Means (f) is more adaptive to the data geometry. Also, we observe that NNK-Means has a runtime comparable to kMeans (b, c) while having 4× and 2× faster train and test times than kernel kSVD.

3.9 Test classification accuracy (Left), train time (Middle), and test time (Right) as a function of the number of dictionary atoms per class on the USPS dataset for various DL methods. Each method is initialized similarly and is trained for a maximum of 10 iterations with a sparsity constraint, where applicable, of 30. The plots demonstrate the benefits of NNK-Means in classification accuracy and runtime. The major gain in runtime for NNK-Means comes from the pre-selection of atoms in the form of nearest neighbors, which leads to fast sparse coding (as can be seen from the test time plot, which performs only sparse coding) relative to the kSVD approaches, which sequentially perform a linear search for atoms that correlate with the residue at that stage. Training time in the kSVD approaches decreases as the number of atoms increases, since the sparse coding stage requires fewer atom selection steps than for smaller dictionaries.

3.10 (a) Comparison of simplicial and polytope interpolation methods. In the simplex case, the label of node x_i can be approximated based on different triangles (simplices), one of which must be chosen. With the chosen triangle, two out of the three points are used for interpolation. Thus, in this example, only half the neighboring points are used for interpolation. Instead, NNK interpolation is based on all four data points, which together form a polytope. (b) KRI plane (dashed orange line) corresponding to chosen neighbor x_j. NNK will not select data points to the right of this plane as neighbors of x_i. (c) KRI boundary and associated convex polytope formed by NNK neighbors at x_i.

3.11 Misclassification error (ξ) using a fully connected softmax classifier model and interpolating classifiers (weighted kNN, NNK) for different values of the parameter k at each training epoch on the CIFAR-10 dataset. The training data (Top) and test data (Bottom) performance for three different model settings is shown in each column. NNK classification consistently performs as well as the actual model, with classification error decreasing slightly as k increases. On the contrary, the weighted kNN model error increases for increasing k, showing robustness issues. The classification error gap between the DNN model and the leave-one-out NNK interpolator for train data is suggestive of underfitting (ξ_NNK < ξ_model) and overfitting (ξ_NNK > ξ_model). We claim that models are good when their performance on the training data agrees with that of the local NNK model.

3.12 Histogram (normalized) of the leave-one-out interpolation score (3.24) after 20 epochs with k = 50 on CIFAR-10. While the network performance on the training dataset is considerably different in each setting, we see that the change in the interpolation (classification) landscape associated with the input data is minimal, and, consequently, all networks have a similar test dataset performance. However, the interpolation score spread is shifted towards zero in a regularized model, indicating a relatively better generalization or classification performance.

3.13 Two test examples (first image in each set) with identified NNK neighbors from CIFAR-10 for k = 50. We show the assigned and predicted label for the test sample, and the assigned label and NNK weight for neighboring (and influential) training instances. Though we were able to identify the correct label for the test sample, one might want to question such duplicates in the dataset for downstream applications.

3.14 Two training set examples (first images in each set) observed to have maximum classification error in the LOO NNK interpolation score, and their respective neighbors, for k = 50. We show the assigned and predicted label for the image being classified, and the assigned label and NNK weight for the neighbors. These instances exemplify the possible brittleness of the classification model, which can better inform a user about the limits of the trained model.

3.15 Selected black-box adversarial examples (first image) and their NNK neighbors from the CIFAR-10 training dataset with k = 50. Though changes in the input image are imperceptible to a human eye, one can better characterize a prediction using NNK by observing the interpolation region of the test instance.

3.16 Histogram (normalized) of the number of neighbors for (a) generated images [9], (b) black-box adversarial images [10], and actual CIFAR-10 images. We see that generated and adversarial images on average have fewer neighbors than real images, suggestive of the fact that these examples often fall in interpolating regions where few train images span the space. An adversarial image is born when these areas of interpolation belong to unstable regions in the classification surface.

3.17 Weighted kNN vs. proposed NNK evaluation of self-supervised models from [11].
Self-supervised vision transformers and their distilled versions trained with patch sizes 8 (Left) and 16 (Right) are evaluated for different values of k.

3.18 (Left) CIFAR-10 and (Right) CIFAR-100. Wide-ResNet-28-10 model accuracy vs. NNK distance (3.26) between the clean dataset and five different noise levels of various corrupted datasets. The dashed line denotes accuracy on the clean dataset, and the scatter point size corresponds to the standard deviation of the terms within the summation in each distance. We see that the proposed distance NNK(D_clean | D_corrupt) is indicative of the model's ability to transfer, with performance decreasing with increasing distance.

4.1 Left: Data-driven view of the geometry of the embedding manifold using a graph. Because neighborhoods and graphs are intrinsically independent of the exact data position, we can compare fundamentally heterogeneous observations (such as representations from different dimensional spaces or models). Right: Progressive transformation of the input space over successive layers of a DNN. The samples in the dataset are the same, and thus their attributes (e.g., labels) are the same, but their position in feature space, and hence the graph constructed, changes as the model is optimized for a particular task.

4.2 MGMs in feature space are based on NNK polytopes. We display data samples (red and blue dots) and two NNK polytopes (P(x_i), P(x_j)) in the encoder space that approximate the underlying manifold of the deep learning model (gray surface). Our proposed MGMs capture the invariance (polytope diameter), manifold curvature (angle between neighboring polytopes), and local intrinsic dimension (number of vertices in a polytope) of the output manifold for a given DNN.

4.3 BigGAN images corresponding to samples from three manifolds. Observed NNK diameter (Invariance) and angle between neighboring NNK polytopes (Affinity/Curvature) for three embedding manifold settings: Random (Left), Linear (Middle), and Spherical (Right). As expected, the randomly sampled examples produce neighboring polytopes oriented almost orthogonally with respect to each other, indicative of the absence of a smooth manifold. Further, the large diameter in this setting shows that the examples are locally distinct, i.e., there is no collapse in the representation of the neighbors that form the polytope. In contrast, we see that the embeddings from linear and spherical subspaces have polytopes that are closely related to each other and are locally invariant.

4.4 (Left) Approach toward the analysis of SSL models. For each model, we use as input to the DNN the images in the validation set of ImageNet and their augmented versions. The output of the backbone encoder is used to quantify the properties of the manifold induced by the SSL algorithm. Specifically, we develop Manifold Graph Metrics (Section 4.2) that capture manifold properties known to be crucial for transfer learning. The MGMs allow us to capture the specificity of each SSL model (Section 4.3.1) and to characterize their transfer learning capability (Section 4.3.2). (Right) We provide the dendrogram of the SSL models considered in this chapter based on our proposed MGMs. Although the underlying hyper-parameters, loss functions, and SSL paradigms are different, the manifolds induced by the SSL algorithms can be categorized into three types. An important observation is that the resulting clusters are not necessarily aligned with the different classes of SSL algorithms. This result shows that although some training procedures appear more similar, one must provide a deeper analysis to understand in what aspect SSL models differ.

4.5 Histogram of observed equivariance (Semantic, Augmentation) for a convolutional encoder backbone (ResNet50) and a vision transformer backbone (ViT-S) at (Left) initialization and (Right) after training with the same SSL procedure (DINO [11]). We observe that at initialization the inductive bias in convolutional networks leads to an invariant representation with respect to both the semantic and augmentation manifolds. However, a ViT leads to more scattered representations of input images belonging to the semantic as well as augmentation manifolds. After training, both architectures converge to a similar representation, where the marginal variation in spread between the two models impacts the performance in downstream tasks.

4.6 Dendrogram of SSL models based on their transfer learning accuracy (Left) and proposed MGMs (Right). We observe that the groupings obtained are highly correlated with the clustering performed using our MGMs, thus showing an intrinsic connection between the geometric properties of the SSL model and the transfer learning performances.

4.7 Correlation between equivariance and transfer learning performances. We display the Pearson coefficient ρ at the top left of each subplot. We observe that the equivariance to semantic (inputs with same class labels) and rotation augmentations of the SSL model is negatively correlated with its capability to perform well in few-shot learning tasks (small domain distance). However, these quantities are positively correlated with the accuracy of the DNN on a dense surface normal estimation task. This observation confirms common intuitions regarding the properties that an embedding should have to transfer accurately on these two tasks. The p-values for all results are ≤ 0.01.

4.8 Correlation between MGMs and transfer performances. We display the Pearson coefficient ρ at the top left of each subplot. We observe that for most transfer learning tasks, there exists an MGM that correlates with the per-task transfer performance. We recover some intuitive results, such as: for many-shot linear learning, higher invariance in the SSL model corresponds to better transfer learning capability. We also note that dense segmentation appears not to be correlated with the considered geometrical metrics of SSL models. The p-values for each result were ≤ 0.01, except for the last plot, where the p-value = 0.34.

4.9 Manifold graph metrics variability across SSL models. We depict the variation in evaluated MGMs across the 14 pre-trained SSL models considered in this paper. We note that the differences in the different SSL model embeddings can be summarized using five manifold properties, namely, (i) Sem. equivariance, (ii) Rotate equivariance, (iii) Sem. equivariance spread, (iv) Colorjit. equivariance spread, (v) Sem.-Aug. affinity. These properties highlight the key manifold properties with the largest variation across the different SSL models.

4.10 (Sorted) Evaluation of manifold graph metrics variability across SSL models. We compute here the standard deviation of the normalized manifold graph metrics to highlight the varying manifold properties across SSL models. We observe that five manifold properties explain most of the differences in SSL models: (i) affinity between the semantic and augmentations manifolds, (ii) spread of semantic equivariance, (iii) spread of color jitter equivariance, (iv) rotation equivariance, (v) semantic equivariance.

4.11 Absolute value of principal components of the per-feature manifold graph metrics matrix. We display here the absolute value of the two principal components obtained after sparse PCA of the matrix of dimension number of models × number of graph manifold metrics. This corresponds to the principal components used to visualize the distribution of the SSL models in the two-dimensional plane as in Figure 4.4. We observe that few manifold graph metrics encapsulate most of the variance contained across SSL models, namely: Semantic Equivariance, Augmentations Equivariance, Crop Equivariance, Rotation Equivariance, Semantic Equivariance Spread, Augmentations Equivariance Spread, Rotation Equivariance Spread, Colorjitter Equivariance Spread, and Semantic-Augmentations Affinity.

4.12 Dendrogram of SSL models. We compute the dendrogram of the MGMs of SSL models. This shows us that there are in fact three main classes of SSL models in terms of manifold properties, as well as the proximity between different models. This result also confirms the clustering based on PCA visualized in Figure 4.4. In particular, simCLR-v1 and simCLR-v2 appear to be the most distant models from all the SSL paradigms tested here. Interestingly, the clustering obtained here does not correspond to the different classes of SSL algorithms: contrastive, non-contrastive, cluster-based, and memory-based.

4.13 Per transfer learning task MGM importance - Decision Tree. For each transfer learning task, we exploit the feature selection of decision trees (depth = 5) to visualize which MGMs are crucial to characterize the transfer learning accuracy. To do so, we fit the MGMs to the transfer learning accuracies. In this figure, we highlight the MGMs that best explain the per-task transfer learning accuracy. Note that we are not interested in the regression error but in the importance of each MGM for predicting each transfer learning task. While intuitive, this result shows that mainly the invariance and curvature of the DNN characterize its transfer learning capability. The first observation is that the intrinsic dimension of the DNN (displayed in green as # Neighbors) does not allow one to characterize any task-specific transfer learning accuracy. Depending on the task, different geometrical properties matter. For many-shot linear, the equivariance to the semantic direction of the data manifold is crucial. For few-shot with small transfer domain distance, the linearization capability of the DNN with respect to the augmentations captures most of the information regarding the transfer learning capability. While most tasks appear to be explained by a single or few MGMs, in the case of frozen detection and dense segmentation, multiple MGMs are required to explain the transfer learning capability of SSL models.

4.14 MGM Importance Per Transfer Learning Task - Lasso.
For each transfer learning task, we exploit the feature selection of the LASSO method to visualize which MGMs are crucial to characterize the transfer learning accuracy. As intuitively expected, we observe that, depending on the transfer learning task, specific manifold properties are more important. For instance, for dense detection, the equivariance spread of the semantic manifold highly correlates with the transfer learning accuracy, indicating that the variation of equivariance/invariance induced by the DNN manifold w.r.t. the input data is critical to accomplish dense detection. Similarly, for dense segmentation, the curvature of the semantic manifold appears to be the most important feature.

Abstract

This dissertation advances the algorithmic foundations and applications of neighborhood- and graph-based data processing by introducing (i) a novel sparse signal approximation perspective and an improved method, non-negative kernel regression (NNK), for neighborhood and graph construction; (ii) geometric and scalable extensions of the proposed NNK method for image representation, summarization, and non-parametric estimation; and (iii) a graph framework that forms the basis for a geometric theory of deep neural networks (DNNs).

Neighborhoods are data representations based on sparse signal approximations. This perspective leads to a new optimality criterion for neighborhoods. Conventional kNN and ϵ-neighborhood methods are sparse approximations based on thresholding, with their hyperparameters k/ϵ used to control the sparsity level. We introduce NNK, an improved and efficient approach inspired by basis pursuit methods. We derive the polytope geometry of neighborhoods defined with NNK and basis pursuit-based approaches. In particular, we show that, unlike earlier approaches, NNK accounts for the relative position of the neighbors in its definitions. Our experiments demonstrate that NNK produces robust, adaptive, and sparse neighborhoods with superior performance in several signal processing and machine learning tasks.

We then study the application of NNK in three domains. First, we use the properties of images to obtain a scalable NNK graph construction algorithm. Image graphs obtained with our approach are up to 90% sparser and possess better energy compaction and denoising ability than conventional methods. Second, we extend the NNK algorithm for data summarization using dictionary learning. The proposed NNK-Means algorithm has runtime and geometric properties similar to the kMeans approach. However, unlike kMeans, our approach leads to a smooth partition of the data and superior performance in downstream tasks. Third, we demonstrate the use of NNK for interpolative estimation. We bound the estimation risk of the proposed estimators based on the geometry of NNK and show, empirically, its superior performance in classification, few-shot learning, and transfer learning.

Finally, we present a data-driven graph framework to understand and improve DNNs. In particular, we propose a geometric characterization of DNN models by constructing NNK graphs on the feature embeddings induced by the sequential mappings in the DNN. Our proposed manifold graph metrics provide insights into the similarities and disparities between different DNNs, their invariances, and the transfer learning performances of pre-trained DNNs.

Chapter 1
Introduction

Pattern recognition is about guessing the unknown, a discrete choice like real or fake information, or a continuous quantity such as the future price of a stock.
The nearest neighbors method is a fundamental paradigm in learning: given a new data item, find the closest matches encountered in the past to make an informed decision. This primitive idea – birds of a feather flock together – seems sensible and has been, over the years, improved and made formal. Further, studies show that this approach is possibly biological, as there is evidence that fruit flies execute an approximate nearest neighbor strategy for sensing odors and coming up with a response [12]. However, despite the computational simplicity and history of nearest neighbor approaches, several open questions still exist and are an important topic of study by researchers in machine learning and non-parametric statistics. Comprehensive monographs [13–15] provide an extensive survey of the existing literature, theoretical guarantees, and limitations of neighborhood methods.

1.1 Research contributions

The last decade has seen a deluge in the data available for processing and analysis [16]. Almost every avenue of human endeavor is now recorded as data, from electronic medical records to social, personal, and financial activities. The internet and increasingly powerful sensors, whether a relatively inexpensive smartphone or more elaborate devices such as magnetic resonance imaging scanners, make collecting and storing massive amounts of data common.

A fundamental problem in learning from and analyzing such data is that we do not know the rich structure underlying the data a priori. Today, practitioners have found that one can sidestep the explicit modeling of the structure of data altogether and can instead let the data directly drive our learning systems and predictions. Though this black-box approach, e.g., deep learning [17], has led to significant advances in several domains, the lack of reliability and understanding of these systems impedes their adoption in critical applications.

This thesis aims to tackle the problem of modeling the structure of the data and its impact on data-driven learning. Graphs and neighborhood approaches are powerful tools that can model data and complex interactions among them [18–21]. Graph structures are intrinsically independent of the input modality, i.e., each node in a graph can be associated with different data types, with edges capturing relationships between the nodes [22]. Thus, in this work, rather than directly model the data by estimating a distribution based on available observations ({x_1, ..., x_N}), we will consider an abstraction using graphs to achieve our goals.

1.1.1 Neighborhood and Graph constructions on data

A critical first step towards our goal of understanding data and learning systems using graph-based models is the construction of neighborhoods and graphs from the data. Note that a graph can be constructed as a series of local neighborhoods at each data point. A good neighborhood construction is essential for accurately capturing the intricate structure of the data. Without it, the model may deviate from reality and fail to account for the full range of possibilities in the data [23,24].

Local neighborhood methods such as the ϵ-neighborhood, the k-nearest neighbor (kNN) [25], and the related Nadaraya-Watson estimator [26,27] are some of the most popular approaches for neighborhood construction or definition, with applications in density estimation, classification, and regression [14]. What is meant by local in these methods is left open-ended and is based on the choice of an appropriate feature space for data representation and a similarity kernel or distance.
Thus, using these methods, e.g., kNN, requires a careful choice of parameters (k) to guarantee that a good construction is obtained. Theoretical results in [14,28,29] suggest that the value of k in the asymptotic regime, where the number of samples (N) goes to infinity, should be such that k → ∞ and k/N → 0. However, in practice, for finite N, there is no clear interpretation of these parameters or a strategy applicable in all settings. A similar conundrum exists for graph construction frameworks based on neighborhood methods [30,31].

A key contribution in this thesis is a novel interpretation of neighborhood construction as a non-negative sparse approximation [32,33]. This view of neighborhoods allows one to analyze graph and neighborhood-based problems using tools from sparse signal processing and opens up several directions for further research. In particular, we show that kNN and ϵ-neighborhood are equivalent to a signal approximation based on thresholding, with their respective hyperparameters, k or ϵ, used to control the level of sparsity. This perspective allows us to establish (i) a new notion of optimality for neighborhoods and (ii) an improved neighborhood and graph definition, non-negative kernel regression (NNK), motivated by the well-known limitations of thresholding. Moreover, we show that enforcing orthogonality in a non-negative approximation of a point by its neighbors corresponds to selecting neighboring points that are not "geometrically redundant".

1.1.2 Applications in graph-based learning and analysis

The novel view of neighborhoods as a sparse signal representation has allowed us to re-explore and improve graph signal processing and machine learning applications:

• Image processing [34] – we show that one can obtain a better image graph representation with up to 90% fewer edges while achieving better performance in energy compaction and denoising. This work was recognized as Best Student Paper at ICIP 2020;

• Data summarization [35] – we show that one can obtain an adaptive (soft) data space partition as an alternative to the hard partitions (boundaries defined based on hyperplanes) resulting from conventional algorithms. We propose a geometric clustering approach that associates each data item to multiple (neighboring) cluster centers, selected using NNK, and leads to a better representation of data;

• Non-parametric estimation [36–39] – we show that NNK neighbors produce robust and adaptive interpolative estimators. The resulting NNK interpolation is close to the Bayes optimal estimator, with the difference in the estimation bounded by factors dependent on the number of NNK neighbors and the diameter of the NNK polytope. The NNK interpolation can perform similarly to the learned parametric classification boundary in neural networks. It can be used to understand and improve the performance in classification, few-shot learning, and evaluating dataset overlaps for transfer learning.

1.1.3 Graph-based framework for understanding deep learning

Deep learning has achieved significant advancements through supervised and self-supervised approaches. However, in these deep network models, the data representations and the process behind their formation remain unclear. To this end, we have developed a data-driven framework for understanding deep neural networks (DNNs) by characterizing the geometry of the representation manifolds using graphs.

The proposed framework is motivated by the following observation. Although DNNs involve complex non-linear mappings, it is possible to deduce the induced transformations and the structure of the representation space using graphs. For example, consider a DNN trained or evaluated on a dataset. The dataset itself can be represented as a graph:
2.1 Introduction

Formally, given data points {x_1, x_2, …, x_N} and a query x_q, a successful neighborhood definition for the query x_q involves two tasks, namely, (1) selection – choosing a subset of the N points (neighbors) that are part of the neighborhood; and (2) weight assignment – assigning a non-negative value that represents the level of similarity of each selected neighbor with respect to the query.

k-nearest neighbor (kNN) [26,27] and ϵ-neighborhood are among the most common local neighborhood approaches used in practice, with fundamental applications in density estimation, classification, and regression [14]. These methods define a local neighborhood based on a parameter choice for neighborhood selection, namely the number of nearest neighbors (k) in kNN or the maximum distance (equivalently, minimum similarity) of the data points from the query (ϵ) in ϵ-neighborhood. Selected neighbors are then assigned weights based on their similarity, usually defined using a positive definite function dependent on the query.

Local neighborhood definition or construction is often the starting point for graph-based algorithms in scenarios where no graph is given a priori and a graph has to be constructed to fit the data. A popular approach is to start by constructing a directed graph using the neighborhoods and weights provided by kNN or ϵ-neighborhood, leading to kNN-graphs and ϵ-graphs, respectively. If necessary, an undirected graph can be obtained from the directed graph. Note that graph construction is the first step in several graph signal processing [19,21] and graph-based learning methods [18,20]. Consequently, the quality (optimality, robustness, sparsity) of the graph representation is crucial to the success of these downstream algorithms [23,24].

Approaches for neighborhood definition have focused on the selection task, namely, the choice of a single optimal value for k or ϵ [13,28]. However, these techniques can fail in non-uniformly distributed data, where it might be desirable to adapt the number of neighbors (k/ϵ) to the local characteristics of the data. Intuitively, it would be beneficial to have a compact set of neighbors that captures the local geometry of the data while avoiding redundancy among the neighbors. For example, for one-dimensional data, it would be sufficient to consider two neighbors for a query, namely, those closest on each side of the query. Several approaches have been proposed to address neighborhood adaptivity: [46] introduced a cross-validation method to select k locally; [47,48] defined k using a class population-based heuristic; while [49,50] optimized for k using a Bayesian and a neural network-based learning model. It is important to note that these methods typically do not address the weight assignment task. Recently, [51] proposed an algorithm (k*NN) for solving both the selection and weighting tasks, assuming that the functions defined on the data are smooth. However, all of these adaptive neighborhood approaches [46–51] can only be applied in labeled data settings, where the labels can be used for (hyper)parameter selection. Since no extensions are available for unlabeled data, a common scenario for neighborhood definitions [14,15], this severely restricts their application. Further, a shortcoming common to all existing approaches is that they have limited geometrical interpretation: they only consider the distance (or similarity) between the query and the data and ignore the relative position of the neighbors themselves.
For example, two data points i and j, both at a distance d from the query, may be included in the local neighborhood irrespective of whether i and j are far from or close to each other. In contrast, our proposed method considers the similarity between the neighbors i and j and removes redundant neighbors, leading to sparser and better representative neighborhoods.

Figure 2.1: Geometric difference in neighborhood definitions using k-nearest neighbors (kNN) and the proposed NNK approach with a Gaussian kernel. kNN selects neighbors based on the choice of k and can lead to the selection of geometrically redundant neighbors, i.e., the vector from the query to a neighbor is almost collinear with other vectors joining the query and neighbors. In contrast, the proposed NNK definition selects only neighbors corresponding to non-collinear directions.

A key contribution of this thesis is a novel interpretation of the neighborhood definition as a non-negative sparse approximation [52] of the query x_q using the data points {x_1 … x_N}. This view of neighborhoods allows one to analyze the problem using tools from sparse signal processing [52] and opens up several directions for further research. In particular, we show that kNN and ϵ-neighborhood are signal approximations based on thresholding, with their corresponding hyperparameters, k and ϵ, used to control the sparsity. Our work is motivated by the well-known limitations of thresholding, which is only optimal when the candidate representation vectors (in our case, the neighbors) are orthogonal.

In this chapter, we leverage the sparse signal representation perspective to (i) establish a new notion of optimality for neighborhood definition and (ii) propose an improved algorithm, non-negative kernel regression (NNK), based on this optimality criterion. The optimality criterion requires approximation errors to be orthogonal to the candidate representation vectors. We show that this criterion prevents the selection of points that are "geometrically redundant". This idea is illustrated in Figure 2.1. Geometrically, the proposed NNK neighborhoods can be viewed as constructing a polytope around the query using points closest to it while eliminating those candidate neighbors that are farther away along a similar direction as an already selected neighbor. Thus, instead of selecting the number of neighbors based on pre-set parameters, e.g., k or ϵ, the NNK neighborhood is adaptive to the local data geometry and results in a principled neighbor selection and weight assignment.

Contributions and organization of the chapter

This chapter is based on our conference [32] and journal [33] publications. There are two main contributions in this chapter.
First, we present a thorough theoretical analysis of the geometry, sparse representation, and optimization involved in the NNK framework, where we

• make explicit the connection between NNK and LLE-based neighborhoods [53] (Section 2.3.3) and basis pursuit approaches (Section 2.3.4),

• formalize, with proofs, our geometric characterization of NNK neighbors in input and kernel space via our kernel ratio interval (KRI) theorem (Section 2.4.2),

• demonstrate the equivalence between NNK geometry and non-negative basis pursuit algorithms (Section 2.4.4).

Second, we provide a comprehensive experimental evaluation of the proposed NNK method in neighborhood and graph-based machine learning problems, where we

• use 70 OpenML classification datasets to show that NNK neighborhoods produce robust local classifiers outperforming those of kNN and k*NN [51] (Section 2.5.1),

• quantify the scalability and sparsity of NNK graph construction relative to kNN, AEW [54], LLE [53], and Kalofolias [55] graph constructions (Section 2.5.2.1), and

• validate that NNK graphs lead to better performance in graph-based semi-supervised learning and dimensionality reduction (Sections 2.5.2.2 and 2.5.2.3).

In subsequent chapters, we will further discuss applications of NNK in image processing [34], data summarization [56], machine learning [37,38], and geometric evaluation of deep learning models [40,41,57]. While these complementary works have reinforced the benefits of NNK, this chapter provides the theoretical background and fundamental concepts behind the NNK formulation.

The chapter is organized as follows: Section 2.2 introduces notation, background, and related work. We discuss the connection between neighborhood definition and basis pursuit for sparse signal approximation and introduce our NNK algorithm in Section 2.3. We then derive the geometrical properties of NNK neighbors in Section 2.4. We conclude with experimental validation, extensions of the NNK framework, and open problems in Sections 2.5, 2.6, and 2.7, respectively. The proofs for all statements can be found in Section 2.8.

2.2 Background

We first introduce the notation and two key components, kernel similarity and local linearity of data, that lead to distinct categories of existing neighborhood and graph constructions [58]. We also provide a brief review of related work.

2.2.1 Notation

Throughout this thesis, we use lowercase (e.g., x and θ), lowercase bold (e.g., x and θ), and uppercase bold (e.g., M and Φ) letters to denote scalars, vectors, and matrices, respectively. We reserve the use of K for the matrix of the kernel function, so that the kernel evaluated between two data points i and j is written K_{i,j} = κ(x_i, x_j). We use x_{−i} to denote the subvector of x with its i-th component removed and M_{−i,−j} to denote the submatrix of M with its i-th row and j-th column removed. Notations such as M_{i,−j} (i.e., the i-th row of M with its j-th entry removed) are defined similarly. Given a subset of indices S, x_S denotes the subvector of x obtained by taking the elements of x at the locations in S. A submatrix M_{S,S} can be obtained similarly from M. The set complement S̄ corresponds to the set of indices not in S. We denote the vectors of all zeros and all ones by 0 and 1, respectively. The indicator function is represented using I: {False, True} → {0, 1}, and the Hadamard product of matrices is denoted by the ⊙ operator.
A graph G = (V, ξ) is a collection of nodes indexed by the set V = {1, …, N} and connected by edges ξ = {(i, j, w_ij)}, where (i, j, w_ij) denotes an edge of weight w_ij ∈ R_+ between nodes i and j. The weighted adjacency matrix W of the graph is an N × N matrix with W_ij = w_ij. A graph signal is a function f: V → R defined on the vertices of the graph (a scalar value assigned to each node, e.g., class labels). It can be represented as a vector f, where the i-th entry, f_i, is the signal value on the i-th vertex. The combinatorial Laplacian of a graph is defined as L = D − W, where D is the diagonal degree matrix with D_ii = Σ_j W_ij. The symmetric normalized Laplacian is defined as ℒ = I − D^{−1/2} W D^{−1/2}.

2.2.2 Similarity Kernels

Kernels have a wide range of applications in machine learning [59,60] due to their ability to capture the abstract similarity between data points [61]. As in most kernel-based works, we focus on kernels satisfying Mercer's theorem [62] (Definition 2.1).

Definition 2.1. If κ: R^d × R^d → R is a continuous symmetric kernel of a positive integral operator in the L² space of functions, i.e.,

∀ g ∈ L²:  ∫_{R^d} ∫_{R^d} κ(x, y) g(x) g(y) dx dy ≥ 0,

then there exists F, referred to as a Reproducing Kernel Hilbert Space (RKHS), and a mapping ϕ: R^d → F such that

κ(x_i, x_j) = ϕ_i^⊤ ϕ_j,

where ϕ_i, ϕ_j are the kernel space representations of x_i, x_j.¹

From Definition 2.1, a Mercer kernel κ evaluated between two points corresponds to an inner product in a transformed space (RKHS) [63]. Note that, due to the non-linear nature of the mapping, linear operations in the RKHS correspond to nonlinear operations in the input data space.

Similarity kernels are used for assigning weights to selected neighbors in neighborhood and graph constructions. The choice of the kernel is, in general, task- or domain-specific, with some predefined kernels widely used in practice. Alternatively, one can learn a kernel based on available data [64,65]. Examples of data-driven kernel learning methods for neighborhood definition include kernel learning [66] and adaptive edge weighting (AEW) [54]. It should be emphasized that the learned kernels in these methods are used with k or ϵ neighborhood selection and thus only offer a solution for the weight assignment task. In this chapter, we primarily use the Gaussian kernel:

κ_{σ²}(x_i, x_j) = exp( −‖x_i − x_j‖² / (2σ²) ),    (2.1)

where σ is the bandwidth (variance) of the kernel. However, we emphasize that the theoretical statements and algorithms we present, unless stated otherwise, are applicable to all symmetric and positive-definite kernels, including those learned from data.

¹ With a slight abuse of notation, we use inner products as if the kernel representations ϕ are real vectors. This allows us to use a common notation without considering the specific choice of kernel. In particular, our statements do generalize to the RKHS setting with continuous functions and inner products defined as ⟨f, g⟩ = ∫ f(u) g(u) du.

2.2.3 Locally linear neighborhoods

A canonical hypothesis often made in analyzing high-dimensional data is that the data lies on or near a smooth manifold and can be approximated using locally linear patches. This is the principal assumption behind local linear embedding (LLE) [67], which relies on a local regression objective to obtain the neighborhood weights, i.e.,

θ_S = argmin_θ ‖x_q − X_S θ‖²_2,    (2.2)

where X_S is the matrix containing the k-nearest neighbors of x_q, whose indices are denoted by the set S.
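As a concrete illustration, the following minimal Python sketch computes the Gaussian kernel (2.1) and the unconstrained local regression weights of (2.2); the function names and toy data are hypothetical, and this is not code from the thesis:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Pairwise Gaussian kernel, as in (2.1), between the rows of X and Y."""
    sq_dists = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def lle_weights(x_q, X_S):
    """Unconstrained local regression weights of (2.2): min_theta ||x_q - X_S theta||^2."""
    # Neighbors are stored as rows of X_S; least squares is solved on its transpose.
    theta, *_ = np.linalg.lstsq(X_S.T, x_q, rcond=None)
    return theta  # entries may be negative, which motivates the constraint in (2.3)

# Toy usage
rng = np.random.default_rng(0)
X_S = rng.normal(size=(5, 3))      # 5 candidate neighbors in R^3
x_q = rng.normal(size=3)           # query point
print(lle_weights(x_q, X_S))
print(gaussian_kernel(x_q[None, :], X_S, sigma=1.0))
```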
A solution to (2.2) can produce both positive and negative values for the entries of θ, i.e., the neighborhood weights associated with the elements in the set S. Thus, [53] reformulated this problem for defining neighborhoods and graphs by introducing non-negativity constraints on the weights:

θ_S = argmin_{θ: θ ≥ 0} ‖x_q − X_S θ‖²_2.    (2.3)

Subsequent modifications to objective (2.3) for neighborhood construction include: post-processing of the neighborhood weights obtained from (2.3) by using b-matching to enforce regularity [68]; formulating a global objective to construct symmetrized graphs [69]; and additional regularizations in the objective to facilitate robust optimization [70,71]. Alternative approaches, such as [55,72], can be considered a reformulation of the LLE objective from a graph signal processing perspective with appropriate regularizers. Our proposed neighborhood method is similar to these related works in that it assumes local linearity and involves optimization. However, unlike previous works, the solutions obtained (i) are sparse and robust; and (ii) possess an explicit geometric understanding of why a data point i is selected as a neighbor, while another data point j is not selected, for a query q.

Table 2.1: Contributions of our proposed method (NNK) in relation to previous works. NNK combines the concepts of non-negativity, kernel-based similarity, and locally linear neighborhoods. Earlier works on LLE-neighborhood/graph construction did not explore the sparsity or the geometric consequences of non-negativity (why a data point i is selected while another j is not selected for representing a query q) in their definitions. In this work, we present the geometry of NNK, which applies to these earlier works and provides insights into their solutions. Further, unlike our work, these related approaches do not make the connection between neighborhoods and sparse approximation problems. The last column demonstrates the geometry of the neighborhood obtained (shaded in blue), based on Theorem 2.1, for linear and Gaussian kernels – the points are color-coded as data (gray), query (pink), and neighbors selected with the non-negative optimization (blue).

2.2.4 Related work

The proposed NNK framework reduces to non-negativity constrained LLE algorithms under a specific choice of cosine kernel (see Section 2.3.3.1). We note that kernelized versions of LLE [67] have been previously studied for dimensionality reduction [73,74], ridge regression [75], subspace clustering [76], and matrix completion [77]. However, we emphasize that these methods did not (i) develop a connection between neighborhood construction and sparse signal approximation or (ii) study the kernelized objective under non-negativity constraints to define neighborhoods. The lack of non-negativity in the kernelized formulations in these related works prevented them from developing the geometrical and theoretical properties of their solutions, which are possible for those developed in our framework. We summarize these differences in Table 2.1. To our knowledge, the view of neighborhoods as a sparse signal representation has not been made explicit previously. We believe this perspective can lead to new ideas and even help improve solutions to related problems in data analysis. Other related works, such as [78], solve a complementary problem where the graph is given, and kernels based on the graph are constructed for solving machine learning tasks.
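To make the non-negativity constraint in (2.3) concrete, the next sketch solves it with SciPy's non-negative least squares routine; it is an illustrative example rather than the implementation used in [53] or in this thesis:

```python
import numpy as np
from scipy.optimize import nnls

def nonnegative_lle_weights(x_q, X_S):
    """Non-negative local regression weights of (2.3): min_{theta >= 0} ||x_q - X_S theta||^2."""
    theta, residual = nnls(X_S.T, x_q)   # columns of X_S.T are the candidate neighbors
    return theta, residual

rng = np.random.default_rng(0)
X_S = rng.normal(size=(5, 3))            # 5 candidate neighbors in R^3
x_q = rng.normal(size=3)
theta, res = nonnegative_lle_weights(x_q, X_S)
print(theta)                             # several entries are typically exactly zero
```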
2.3 Neighborhoods and sparse signal approximations

In this section, we first formulate the neighborhood definition problem as a signal approximation. We then show that the kNN/ϵ-neighborhood methods are equivalent to thresholding-based sparse approximation and introduce a better representation approach with our NNK framework.

2.3.1 Problem formulation

Given a dataset of N data points {x_1, x_2, …, x_N}, the problem of neighborhood definition for a query x_q is that of obtaining a weighted subset of the given N data points that best represents the query. The weights in a neighborhood definition are required to be non-negative, as negative weights would imply a neighbor with negative influence or dissimilarity, which are not meaningful in typical neighborhood-based tasks, e.g., density estimation [14] or label propagation [79]. Using non-negative constraints, either implicitly or explicitly, is a common feature in all neighborhood definitions.

Neighborhood definition for a query can be viewed as a sparse signal approximation problem with non-negative approximation coefficients. The basic idea is to approximate a signal ϕ_q in a kernel space, corresponding to the query x_q, as a sparse weighted sum of functions or signals (called atoms) corresponding to the given N data points, i.e.,

ϕ_q ≈ Φ θ,    (2.4)

where Φ = [ϕ_1 ϕ_2 … ϕ_N] is the dictionary of atoms based on the N data points and θ is a sparse non-negative vector.

2.3.2 kNN and ϵ-neighborhood

We first analyze standard approaches such as kNN and ϵ-neighborhoods from the signal approximation perspective. A kNN or ϵ-neighborhood approach uses the kernel weights to select the neighbors and assign the corresponding edge weights. A kernel value can be interpreted as the correlation between two data points, i.e., for a query q,

ϕ_q^⊤ ϕ_j = κ(x_q, x_j) = K_{q,j}    (2.5)

can be interpreted as the similarity between q and j.

Thus, kNN and ϵ-neighborhood methods correspond to a thresholding technique for basis selection wherein the atoms corresponding to the k largest correlations are selected from Φ to approximate the signal ϕ_q. This approach is reminiscent of early methods for basis pursuit using thresholding [80], i.e., selecting and weighing atoms based on the magnitude of the correlation between the atoms and the signal. This strategy is optimal only when the dictionary is orthogonal (Φ^⊤ Φ = I) but is sub-optimal for over-complete or non-orthogonal dictionaries. In our problem setting, the dictionary Φ is not orthogonal since its atoms typically have a non-zero correlation with a few other data points, i.e., ϕ_i^⊤ ϕ_j > 0.

2.3.3 Proposed: Non-negative kernel regression

We now propose an improved neighborhood definition, NNK, that avoids the sub-optimality due to thresholding that arises in kNN/ϵ-neighborhood. NNK can be viewed as a sparse representation approach where the coefficients θ for the representation are such that the error in representation is orthogonal to the space spanned by the selected atoms. In terms of sparse approximation, this is the property found in approaches such as orthogonal matching pursuits (OMP) [81]. We note that this reformulation of the neighborhood definition results in neighbors that are adaptive to the local geometry of the data and robust to parameters such as k/ϵ. For a query x_q, the NNK weights θ are found by solving:

θ_S = argmin_{θ: θ ≥ 0} ‖ϕ_q − Φ_S θ‖²_2,    (2.6)

where Φ_S corresponds to the kernel space representations of an approximate set of data point candidates S for the neighborhood, such as the k nearest neighbors.
Now, employing the kernel trick, i.e., K_{i,j} = ϕ_i^⊤ ϕ_j, the objective in (2.6) can be rewritten as the minimization of

J_q(θ) = (1/2) ‖ϕ_q − Φ_S θ‖²_2 = (1/2) θ^⊤ K_{S,S} θ − K_{S,q}^⊤ θ + (1/2) K_{q,q}.    (2.7)

Consequently, the NNK neighborhood is obtained as the solution to the problem

θ_S = argmin_{θ: θ ≥ 0} (1/2) θ^⊤ K_{S,S} θ − K_{S,q}^⊤ θ,    (2.8)

with the non-zero elements of the solution θ_S corresponding to the selected neighbors and the weights given by the corresponding non-zero values in θ. As in other kernel-based learning methods [82], the NNK objective (2.8) does not require the explicit kernel space representations; it requires only knowledge of the similarity, i.e., the kernel matrix K, of a subset S for the neighborhood definition. The pseudo-code for the proposed method is presented in Algorithm 1.

Algorithm 1: NNK Neighborhood algorithm
Input: Query x_q, Data D = {x_1, …, x_N}, Kernel function κ, No. of neighbors k
  S = {k nearest neighbors of query x_q}
  θ_S = argmin_{θ ≥ 0} (1/2) θ^⊤ K_{S,S} θ − K_{S,q}^⊤ θ
  Ŝ = {elements of S corresponding to I(θ_S > 0)}
Output: Neighbors Ŝ, Neighbor weights θ_Ŝ

The NNK framework combines two key principles that have been part of neighborhood definitions: kernels and local linearity. Unlike earlier methods, such as kNN, which used kernels (predefined or learned) only to define the weights of selected neighbors, the NNK neighborhood can be viewed as identifying a small number of non-negative regression coefficients (as in LLE) for an approximation in the representation space associated with a kernel (see Table 2.1).

2.3.3.1 NNK and the constrained LLE objective

Previously studied locally linear neighborhoods [53] can be seen as a special case of the proposed NNK approach with a kernel similarity defined in the input data space.

Proposition 2.1. For the query-dependent cosine similarity,

κ_q(x_i, x_j) = 1/2 + (x_i − x_q)^⊤ (x_j − x_q) / (2 ‖x_i − x_q‖ ‖x_j − x_q‖),    (2.9)

the NNK algorithm reduces to the LLE neighborhood definition of [53].

Proof. See Section 2.8.2.

2.3.4 Basis Pursuits

In this section, we present basis pursuit [83] approaches to sparse signal approximation and their application to defining neighborhoods. We show, alongside, that the proposed NNK is an efficient sparse approximation that is equivalent to these pursuit methods.

2.3.4.1 Matching Pursuits

Matching pursuits (MP) [84] is a greedy algorithm for sparse approximation which iteratively selects atoms that have the highest correlation with the signal's approximation error at a given iteration. Variants of this method include orthogonal matching pursuits (OMP) [81] and stagewise OMP (block selection + OMP) [85], where the coefficients for the selected atoms are recalculated at each step so that the residue is orthogonal to the space spanned by the selected vectors. We now present the steps involved so that MP and OMP can be used in neighborhood definition and then establish the analogy between NNK and OMP.

The first step in MP and OMP is the same, where we find

j_1 = argmax_j ϕ_q^⊤ ϕ_j = argmax_j K_{q,j}.

With this choice, we can compute the error or residue incurred by approximating ϕ_q with a single vector ϕ_{j_1} as

ϕ_{r_1} = ϕ_q − K_{q,j_1} ϕ_{j_1}.

Now, at a step s, we would find

j_s = argmax_{j ≠ j_1, j_2, …, j_{s−1}} ϕ_j^⊤ ϕ_{r_{s−1}}.

Then, denote

Φ_S = [ϕ_{j_1} ϕ_{j_2} … ϕ_{j_s}],

where S = {j_1, …, j_s} are the indices of all the atoms selected so far. In MP, one assigns the correlation between the residue and the selected atom as the weight for the approximation.
However, in OMP, we find the weights w_S associated with each selected basis such that the energy of the residue is minimized, i.e.,

w_S = argmin_{w: w ≥ 0} ‖ϕ_{r_S}‖² = argmin_{w: w ≥ 0} K_{q,q} − 2 K_{S,q}^⊤ w_S + w_S^⊤ K_{S,S} w_S,    (2.10)

where ϕ_{r_S} = ϕ_q − Φ_S w_S is the residue at step s. This ensures that the approximation error is orthogonal to the span of the selected atoms Φ_S. Note that, by constraining the weights to be positive, the span of the selected bases at each step is a convex cone, while the residue is orthogonal to this cone [86].

We see that solving for the weights w_S in (2.10) is equivalent to the NNK objective (2.8) given the set S. Thus, the proposed NNK algorithm (Algorithm 1) bypasses the greedy selection by making use of a pre-selected set of good atoms (such as the k nearest neighbors), thus avoiding expensive computation (iterative selection and orthogonalization). Given a set of pre-selected atoms, we note that it is straightforward to perform an orthogonal projection, such as the one performed at each OMP step. The neighborhood definition problem setup makes it possible to perform this pre-selection using computationally efficient approximate neighborhood methods. Note that this one-shot selection of the set S can lead to an optimal sparse approximation as long as S is large enough (so that all candidates needed to form an optimal neighborhood are likely to be included). Algorithm 2 provides the pseudo-code for neighborhood definition based on MP and OMP with sequential neighbor selection.

Algorithm 2: MP and OMP Neighborhood algorithms
Input: Query x_q, Data D = {x_1, …, x_N}, Kernel function κ, Maximum no. of neighbors k
  j_1 = argmax_j K_{q,j}
  θ_1 = K_{q,j_1}, S = {j_1}
  for s = 2, 3, …, k do
      j_s = argmax_{j ∉ S} K_{q,j} − K_{S,j}^⊤ θ_{s−1}
      if K_{q,j_s} − K_{S,j_s}^⊤ θ_{s−1} < 0 then break
      if OMP then
          θ_s = argmin_{θ ≥ 0} (1/2) θ^⊤ K_{S,S} θ − K_{S,q}^⊤ θ
      else
          θ_s^⊤ = [θ_{s−1}^⊤  (K_{q,j_s} − K_{S,j_s}^⊤ θ_{s−1})]
      S = S ∪ {j_s}
  end
Output: Neighbors S, Neighbor weights θ_S

2.3.4.2 ℓ₁-regularized pursuits

Another set of methods for solving sparse approximation problems formulates a convex relaxation of (2.4) where one replaces the sparsity constraint (ℓ₀ norm of the reconstruction coefficients) with an ℓ₁ norm constraint [87,88]. This problem setup, referred to as least absolute shrinkage and selection operator (LASSO) regression [89], can be solved using optimization techniques, such as linear programming, to obtain an approximate sparse solution. A neighborhood definition at a query q using an ℓ₁ regularization penalty would correspond to an optimization problem of the form

θ_D = argmin_{θ: θ ≥ 0} (1/2) θ^⊤ K_{D,D} θ − K_{D,q}^⊤ θ + η ‖θ‖_1,

where D is the set of all N data points and η is the Lagrangian hyperparameter. Given the non-negativity constraint on the coefficients in our setting, the ℓ₁ norm of the coefficients reduces to the sum of the coefficients. Thus, the objective simplifies to

θ_D = argmin_{θ: θ ≥ 0} (1/2) θ^⊤ K_{D,D} θ − K_{D,q}^⊤ θ + η θ^⊤ 1.    (2.11)

Proposition 2.2. Let θ* be the solution to objective (2.11). Then,

∀ j: K_{q,j} < η  ⟹  θ*_j = 0.    (2.12)

Proof. See Section 2.8.3.

Proposition 2.2 implies that the set S̄ = {i: K_{q,i} < η} can be safely removed from the optimization since θ*_S̄ = 0. Thus, objective (2.11) is equivalent to the NNK objective (2.8) with a set S such that D = S ∪ S̄ and S ∩ S̄ = ∅, i.e., S is the set of data points with similarity greater than η. Thus, choosing S as a set of kNN neighbors to initialize the objective (2.8) provides an optimal solution to (2.11) with parameter η.
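Before turning to graph construction, the optimization step of Algorithm 1 can be sketched in a few lines. The reduction of (2.8) to a non-negative least squares problem through a Cholesky factor is one convenient way to solve it and is an assumption of this sketch; this is not the released implementation at github.com/STAC-USC, and the function names, jitter, and tolerance values are hypothetical:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.optimize import nnls

def gaussian_kernel(X, Y, sigma=1.0):
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def nnk_neighbors(x_q, X, k=10, sigma=1.0, reg=1e-10, tol=1e-8):
    """Sketch of Algorithm 1: pre-select k nearest neighbors, then solve the objective (2.8)."""
    dists = np.linalg.norm(X - x_q, axis=1)
    S = np.argsort(dists)[:k]                               # candidate set S (kNN pre-selection)
    K_SS = gaussian_kernel(X[S], X[S], sigma) + reg * np.eye(k)   # small jitter for stability
    K_Sq = gaussian_kernel(X[S], x_q[None, :], sigma).ravel()
    # With K_SS = L L^T, minimizing 0.5 theta^T K_SS theta - K_Sq^T theta over theta >= 0
    # equals (up to a constant) minimizing ||L^T theta - L^{-1} K_Sq||^2 over theta >= 0.
    L = cholesky(K_SS, lower=True)
    b = solve_triangular(L, K_Sq, lower=True)
    theta, _ = nnls(L.T, b)
    keep = theta > tol                                      # active constraints give (near-)zero weights
    return S[keep], theta[keep]

# Toy usage: NNK typically keeps far fewer than k candidates
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
neighbors, weights = nnk_neighbors(rng.normal(size=2), X, k=15, sigma=0.5)
print(neighbors, weights)
```

In this sketch, the number of retained neighbors is determined by the non-negativity constraint rather than by k, which is the adaptivity discussed above.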
2.3.5 Graph construction

The NNK neighborhood definition in Section 2.3.3 and the alternative basis pursuit approaches in Section 2.3.4 can be adapted for graph construction by solving for the neighborhood at each data point x_i to first obtain a directed graph adjacency matrix W, namely, W_{S,i} = θ_S and W_{S̄,i} = 0. To obtain an undirected graph, we observe that an edge conflict can occur, say between nodes i and j, due to the varying influence of one node over the other node's representation. In such a scenario, we consider the approximation errors corresponding to nodes i and j, namely J_i(θ) and J_j(θ) from (2.7), and keep the weight corresponding to the sparse approximation with the smaller error. We note that the proposed symmetrization, though intuitive from a representation perspective, requires further study, and we leave its formulation and its connection to performance in downstream tasks as open problems for future work. The pseudo-code for NNK graph construction is presented in Algorithm 3.

Algorithm 3: NNK Graph algorithm
Input: Data D = {x_1, …, x_N}, Kernel function κ, No. of neighbors k
  for i = 1, 2, …, N do
      S = {k nearest neighbors of x_i in D_{−i}}
      θ_S = argmin_{θ ≥ 0} (1/2) θ^⊤ K_{S,S} θ − K_{S,i}^⊤ θ
      J_i = (1/2) θ_S^⊤ K_{S,S} θ_S − K_{S,i}^⊤ θ_S + (1/2) K_{i,i}
      W_{S,i} = θ_S,  W_{S̄,i} = 0
      E_{S,i} = J_i 1,  E_{S̄,i} = 0
  end
  W = I(E ≤ E^⊤) ⊙ W + I(E > E^⊤) ⊙ W^⊤
Output: Graph adjacency W, Local error E

2.3.6 Complexity

The proposed NNK method consists of two steps. First, we find the k nearest neighbors corresponding to each node. Although a brute-force implementation has complexity O(Nkd), there exist efficient algorithms to find approximate neighborhoods using sub-linear-time search with additional memory [90]. Second, a non-negative kernel regression (2.8) is solved at each node. This objective is a constrained quadratic function of θ and can be solved efficiently using structured programming methods that require O(k³) or, with a more careful analysis, O(k k̂²) complexity, where k̂ is the number of neighbors with non-zero weights in the optimal solution. In practice, as observed in our experiments (Section 2.5), the added complexity of NNK, after initializing with kNN, is often negligible compared to the cost of kNN. We note that the choice of k for the initial set S provides a trade-off between runtime complexity and optimality: a small k has lower complexity but results in a set S that may not be representative enough, while a larger k has higher complexity but is more likely to lead to an optimal neighborhood selection. For the graph construction problem, the complexity of the NNK optimization is O(N k k̂²), in addition to the time required for kNN. Note that, due to the local nature of the optimization problem, the NNK algorithm can be executed in parallel, so the overall complexity is dominated by the problem of finding a good set of initial neighbors.

2.4 Geometric Interpretation

Unlike existing neighborhood definitions, such as kNN, where the selection of two points i and j as neighbors of a query q is driven solely by the similarities of the pairs (q, i) and (q, j), NNK also considers the relative positions of nodes i and j through the similarity of the pair (i, j). We now present a theoretical analysis of the geometry of NNK neighborhoods.

2.4.1 NNK and the active constraint set condition

In constrained optimization problems, some constraints will be strongly binding, i.e., the solution at some indices will be zero to satisfy the KKT condition of optimality (as outlined in Section 2.8.1).
These constraints are referred to as active constraints, knowledge of which helps reduce the problem for optimization and analysis. This is because any constraints that are active at a current feasible solution will remain active at the optimum [91]. In particular, the active constraint set in the NNK objective (Proposition 2.3) allows for an interpretable geometric analysis of the NNK neighborhoods obtained.

Proposition 2.3. The NNK optimization objective (2.7) at a query x_q satisfies active constraint conditions. Given a partition set P such that θ_P > 0 (inactive) and θ_P̄ = 0 (active), the solution [θ_P θ_P̄]^⊤ is the optimal solution to NNK iff

K_{P,P} θ_P = K_{P,q}    (2.13)
K_{P,P̄}^⊤ θ_P − K_{P̄,q} ≥ 0.    (2.14)

Proof. See Section 2.8.4.

Proposition 2.3 allows us to analyze the neighbors obtained in the NNK framework one pair at a time, since a data point that is zero-weighted (active constraint) will remain zero-weighted at the optimal solution. We leverage this property to obtain the condition for the existence of NNK neighbors in the form of the Kernel Ratio Interval (KRI) theorem, which, applied inductively, unfolds the geometry of NNK neighborhoods.

2.4.2 Geometry of NNK neighbors

Figure 2.2: Problem setup for a query q and two points i, j, with θ_i, θ_j the corresponding assigned weights. kNN or ϵ-neighborhood methods assign these weights using a kernel, i.e., K_{q,i}, K_{q,j}, irrespective of the relative positions of i and j (K_{i,j}). In contrast, the NNK neighborhood assigns non-zero θ_i and θ_j iff equation (2.15) is satisfied.

Theorem 2.1. Kernel Ratio Interval: Given a scenario with a query q and two data points, i and j (see Figure 2.2), and a similarity kernel with range [0,1], the necessary and sufficient condition for both i and j to be chosen as neighbors of q by NNK is given by

K_{i,j} < K_{q,i} / K_{q,j} < 1 / K_{i,j}.    (2.15)

Proof. See Section 2.8.5.

In words, Theorem 2.1 states that when i and j are very similar, it is less likely that both will be chosen as neighbors of q, because the interval in (2.15), which defines when this can happen, becomes narrower. The KRI condition of (2.15) does not make any assumption on the kernel, other than that it be symmetric with values in [0,1].²

² The general form of the KRI will be presented in Section 2.6.1. We restrict ourselves to kernels with values in [0,1] for ease of understanding and to make the basis pursuit connection explicit in our analysis. We note that the statements and properties of NNK apply to all Mercer kernels.

Corollary 2.1.1. Plane Property: Each NNK neighbor corresponds to a hyperplane with normal in the direction of the line joining the neighbor and the query (Figure 2.3a). For the Gaussian kernel (2.1), only points in the half-space separated by this hyperplane that contains the query point can be selected as NNK neighbors of the query.

Proof. See Section 2.8.5.1.

Figure 2.3: Geometry (represented in red) of an NNK neighborhood for a kernel similarity proportional to the distance between the data points. (a) Hyperplane associated with the selected NNK neighbor i. (b) Convex polytope corresponding to the NNK neighbors around a query.

Corollary 2.1.1 can be described in terms of projections, as illustrated in Figure 2.3a. Assume that query q is connected to a neighbor i (θ_i > 0) and define a hyperplane that contains i and is perpendicular to the line going from q to i. Then any data point j that is beyond this hyperplane (i.e., on the half-space not containing q) will not be an NNK neighbor (θ_j = 0).
This is because the orthogonal projection of (x_j − x_q) along the direction (x_i − x_q) falls in the half-space not containing q.

Corollary 2.1.2. Polytope Property (Local geometry of NNK): The local geometry of the NNK neighborhood with the Gaussian kernel (2.1), for a given query q, is a convex polytope around the query (Figure 2.3b). The optimal solution to the NNK objective (2.8) with θ = [θ_P, θ_P̄]^⊤, where θ_P > 0 and θ_P̄ = 0, satisfies

(A)  K_{P,j}^⊤ θ_P − K_{q,j} ≥ 0  ⟺  ∃ i ∈ P: K_{q,i} / K_{q,j} ≥ 1 / K_{i,j}
(B)  K_{P,j}^⊤ θ_P − K_{q,j} < 0  ⟺  ∀ i ∈ P: K_{q,i} / K_{q,j} < 1 / K_{i,j}

Proof. See Section 2.8.5.2.

The conditions in Corollary 2.1.2 result directly from applying Corollary 2.1.1 to a series of points. Gathering successive conditions leads to forming a polytope, as illustrated in Figure 2.3b, and to the optimality conditions of the active set in Proposition 2.3. In words, the NNK neighborhood algorithm constructs a polytope around a query using the selected neighbors while disconnecting geometrically redundant data points outside the polytope. This suggests that the local connectivity of NNK neighbors is a function of the local dimension of the data manifold [92].

We note that Corollaries 2.1.1 and 2.1.2 apply only to kernels monotonic with respect to the distance between data points. However, equivalent geometric conditions based on the KRI condition can be obtained for other kernels. For example, in the case of linear kernels, it can be shown that the NNK geometry obtained is that of a convex cone, with each selected neighbor corresponding to a hyperplane passing through the origin [86] (see Figure 2.1).

2.4.3 NNK Geometry in kernel space

In some cases, such as bio-sequences or text documents, it may be difficult to represent the input as explicit feature vectors for use with a distance-based kernel, such as the Gaussian kernel in (2.1). Then, alternative kernel similarity measures need to be employed [93]. However, the RKHS associated with these alternative kernels still possesses an inner product and norm. Thus, it is valuable to consider the geometry of the kernel space to understand neighborhood algorithms, even when the input space geometry is not interpretable. To this end, we now derive properties of NNK neighborhoods in terms of the distances and angles between the RKHS mappings of the input data.

Proposition 2.4. Points in an RKHS associated with any kernel function κ with range in [0,1] are characterized by an RKHS distance given by

d̃²(i,j) = ‖ϕ_i − ϕ_j‖² = 2 − 2 κ(x_i, x_j).    (2.16)

Proof. See Section 2.8.6.

Theorem 2.2. Given the distance (2.16), the necessary and sufficient conditions for two points i, j to be NNK neighbors of a query q are

θ_i ≠ 0  ⟺  d̃²(q,j) + d̃²(i,j) − d̃²(q,i) > d̃²(q,j) d̃²(i,j) / 2
θ_j ≠ 0  ⟺  d̃²(q,i) + d̃²(i,j) − d̃²(q,j) > d̃²(q,i) d̃²(i,j) / 2

Proof. See Section 2.8.7.

Corollary 2.2.1. Using the law of cosines,

θ_i ≠ 0  ⟺  cos α > d̃(q,j) d̃(i,j) / 4    (2.17)
θ_j ≠ 0  ⟺  cos β > d̃(q,i) d̃(i,j) / 4    (2.18)

where α and β are the angles between the chord joining (ϕ_i, ϕ_j) and the chords joining (ϕ_q, ϕ_j) and (ϕ_q, ϕ_i), respectively, as in Figure 2.4a.

Proof. The proof follows directly from the application of the law of cosines to the conditions obtained in Theorem 2.2.

Corollary 2.2.1 allows us to bound the angle subtended and the length of the chords for an NNK neighbor, namely,

θ_i ≠ 0  ⟺  −π/3 < α < π/3,    (2.19)

where we use the fact that the squared RKHS distance in (2.16) is upper bounded by 2. A similar result holds for j and β.

Figure 2.4: (a) Geometric setup of NNK in the RKHS.
(b) Problem setup at the second step of pursuit in MP and OMP neighborhoods.

2.4.4 Geometry of MP, OMP neighborhoods

We now show that Theorem 2.1 can also be applied to matching pursuit-based neighborhood definitions. We note that the results in this section apply to other problems with non-negative basis pursuits [94,95]. The geometric conditions presented here provide a potential approach to efficiently screen and select atoms in iterative basis pursuits for non-negative approximations with large dictionaries.

Assume atom ϕ_i is selected in the first step of either the MP or the OMP algorithm (Algorithm 2). Then, the approximation residue is given by ϕ_{r_1} = ϕ_q − K_{q,i} ϕ_i. Now, observe that

ϕ_{r_1}^⊤ ϕ_q = 1 − K²_{q,i} ≥ 0,    (2.20)
ϕ_{r_1}^⊤ ϕ_i = K_{q,i} − K_{q,i} K_{i,i} = 0, since K_{i,i} = 1.    (2.21)

Equation (2.20) shows that the residue at the end of the first step can be exactly zero if and only if K_{q,i} = 1. Moreover, the residue at the end of the first step is orthogonal to the selected atom (ϕ_i), as shown in (2.21). For the second step of the algorithm, consider two points j_1 and j_2 as in Figure 2.4b. By Theorem 2.1, we expect j_2 not to be selected and j_1 to be selected. The equations below show that the pursuit algorithms carry out a selection process in which points that do not satisfy the KRI conditions are indeed not selected for representation:

ϕ_{r_1}^⊤ ϕ_{j_2} = K_{q,j_2} − K_{q,i} K_{i,j_2} ≤ 0  ⟺  K_{q,i} / K_{q,j_2} ≥ 1 / K_{i,j_2}
ϕ_{r_1}^⊤ ϕ_{j_1} = K_{q,j_1} − K_{q,i} K_{i,j_1} > 0  ⟺  K_{q,i} / K_{q,j_1} < 1 / K_{i,j_1}

The weights after selection for the set S = {i, j_1} in the case of MP are given by θ^{(MP)}_S = [K_{q,i}  K_{q,j_1} − K_{q,i} K_{i,j_1}]^⊤, while in OMP, θ^{(OMP)}_S is computed by minimizing (2.8). In MP, the updated residue is orthogonal only to the last selected data point j_1,

ϕ^{(MP)⊤}_{r_2} ϕ_i = K_{q,i} − K_{q,i} K_{i,i} − (K_{q,j_1} − K_{q,i} K_{i,j_1}) K_{i,j_1} = −(K_{q,j_1} − K_{q,i} K_{i,j_1}) K_{i,j_1} ≠ 0    (2.22)
ϕ^{(MP)⊤}_{r_2} ϕ_{j_1} = K_{q,j_1} − K_{q,i} K_{i,j_1} − (K_{q,j_1} − K_{q,i} K_{i,j_1}) = 0    (2.23)

while in OMP, the residue is orthogonal to all selected atoms, as guaranteed by the first-order optimality condition of the objective (2.8), i.e., K_{S,S} θ^{(OMP)} − K_{S,q} = 0:

ϕ^{(OMP)⊤}_{r_2} ϕ_i = K_{q,i} − θ^{(OMP)⊤}_S K_{S,i} = 0    (2.24)
ϕ^{(OMP)⊤}_{r_2} ϕ_{j_1} = K_{q,j_1} − θ^{(OMP)⊤}_S K_{S,j_1} = 0    (2.25)

This scenario repeats in the subsequent pursuit steps, where the residue in MP is orthogonal only to the last selected neighbor, while in OMP it is orthogonal to all selected neighbors. The iterative selection ends when the maximum number of allowed neighbors is reached or when no positively correlated atom is available for representing the residue, i.e.,

∀ m ∉ S:  K_{q,m} − θ_S^⊤ K_{S,m} ≤ 0.

The above criteria prevent pursuit algorithms from selecting points that do not contribute to the non-negative approximation, as these correspond to points outside the polytope (Corollary 2.1.2). Note that, unlike unconstrained pursuit, where adding more atoms often leads to a better approximation, increasing the number of selected atoms does not necessarily correspond to a better representation in non-negative pursuit [95].

2.5 Experiments and Results

We demonstrate the benefits of the proposed NNK framework in neighborhood-based classification, graph-based label propagation, and dimensionality reduction. Code for the experiments is available at github.com/STAC-USC.
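Before presenting the experiments, the kernel ratio interval (2.15), which drives the geometric pruning discussed in Section 2.4, can be checked numerically. The following small sketch uses a Gaussian kernel and an arbitrary three-point configuration chosen purely for illustration:

```python
import numpy as np

def gauss(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def kri_allows_both(x_q, x_i, x_j, sigma=1.0):
    """Check the kernel ratio interval (2.15): K_ij < K_qi / K_qj < 1 / K_ij."""
    K_qi, K_qj, K_ij = gauss(x_q, x_i, sigma), gauss(x_q, x_j, sigma), gauss(x_i, x_j, sigma)
    ratio = K_qi / K_qj
    return K_ij < ratio < 1.0 / K_ij

x_q = np.array([0.0, 0.0])
x_i = np.array([1.0, 0.0])
print(kri_allows_both(x_q, x_i, np.array([2.0, 0.1])))   # nearly collinear with x_i: False (j is pruned)
print(kri_allows_both(x_q, x_i, np.array([0.0, 1.0])))   # orthogonal direction: True (both can be neighbors)
```

The two cases match the intuition in Figure 2.1: a candidate farther away along the direction of an already selected neighbor falls outside the interval, while one in a new direction does not.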
2.5.1 NNK neighborhoods

We study the proposed NNK neighborhood for classification (binary and multi-class) using a plug-in classifier I(f̂(x_q) > 0) based on the neighborhood estimate

f̂(x_q) = Σ_{i ∈ N(x_q)} ( θ_i / Σ_{j ∈ N(x_q)} θ_j ) y_i,    (2.26)

where N(x_q) is the defined neighborhood for the query x_q, and θ_i and y_i are the weight and the one-hot encoded label associated with the neighbor x_i.

Experiment setup: We compare the NNK neighborhood against two baselines, the standard weighted kNN and an adaptive neighborhood approach, k*NN [51]. We use the Gaussian kernel (2.1) for neighborhoods defined with the kNN and NNK algorithms. Two groups of datasets are considered: (i) AR face [96], Extended YaleB [97], Isolet, and USPS datasets [98] with their standard train/test splits, and (ii) 70 datasets from the OpenML repository [99]. The datasets are selected based on the number of dimensions (d ∈ [20, 1000]) and the number of samples (N ∈ [1000, 20000]), with no restriction on the number of classes, as in [100]. Consequently, these datasets represent a wide range of settings with varying dimensions, class imbalances, and non-homogeneous data features. All datasets are standardized using the empirical mean and variance estimated with the training data. We repeat each experiment 10 times and report average performances.

AR, YaleB, Isolet, USPS: We use these datasets to perform ablation and 5-fold cross-validation studies on the parameters k and σ used in kNN and NNK, and on the L/C value in k*NN. We consider values of k in the set {10, 20, 30, 40, 50} and values of σ and L/C in the set {0.1, 0.5, 1, 5, 10}, as in [51]. Table 2.2 presents the classification performance on the test split of the datasets for the different neighborhood classifiers trained on the training data. Reported performance corresponds to the best set of hyperparameters (obtained with cross-validation) for each method and demonstrates the gains of the NNK algorithm relative to the other two baselines.

Table 2.2: Test classification error (in %, lower is better) using local neighborhood methods on datasets with train (N_tr) and test (N_te) sample sizes, dimension (d), and number of classes (N_class). NNK-based classification consistently outperforms baseline methods while being robust to the choices of k and σ.

Dataset   N_tr, N_te    d     N_class   kNN     k*NN    NNK
AR        2000, 600     540   100       43.68   50.5    21.68
YaleB     1216, 1198    504   38        25.88   25.46   10.18
Isolet    6238, 1559    617   26        6.50    8.08    4.61
USPS      7291, 2007    256   10        6.33    13.45   4.49

Figure 2.5 shows the cross-validation and test classification performance for various parameter choices. We note that the performance of k*NN is heavily influenced by the choice of the hyperparameter L/C. The kNN-based classifier achieves maximum performance only for a narrow range of σ values, as a larger σ leads to higher weights assigned to farther-away neighbors. In contrast, the NNK algorithm produces robust performance for a broad range of values of the hyperparameter σ. Further, it can be seen that the NNK estimates saturate with increasing k. The NNK estimation is more stable because each neighbor is selected only if it belongs to a new direction in space not previously represented. Note that NNK estimation outperforms kNN and k*NN on these datasets even for sub-optimal choices of the kernel parameter σ and the number of neighbors k.

OpenML datasets: Each dataset's train/test split is obtained by randomly dividing the data into two halves, with the hyperparameters σ or L/C obtained via cross-validation as in the previous set of experiments. The value of k in kNN and NNK is fixed to 30. Table 2.3 provides a summary of the classification performance on all 70 datasets.
We observe that NNK achieves better overall performance in comparison to the baselines. k*NN, though successful in some cases, falls short on several datasets, indicative of its inability to adapt to varied dataset scenarios. In contrast, the NNK method demonstrates flexibility to feature and class disparities across several datasets. The failure cases of NNK, such as the binarized versions of the regression problems designed in [101] (11 of 17 losses), correspond to datasets with non-informative feature dimensions containing only noise. In such scenarios, redundancy in the representation, as in kNN, can provide a better estimate. We leave this NNK limitation for future study.

Figure 2.5: Top: Average 5-fold cross-validation error for different values of the hyperparameters (L/C for k*NN; σ for kNN and NNK) on (a) AR Face, (b) YaleB Face, (c) Isolet, and (d) USPS. The value of k in kNN and NNK is set to 10 in these experiments. Bottom: Test classification error of kNN and NNK for different values of k, with the hyperparameter σ selected using cross-validation; k*NN is included for reference. We see that the proposed NNK approach produces better performance for a wide range of values of the hyperparameter σ, with improvements in the estimation as k increases. In contrast, we observe that the kNN and k*NN methods require a careful selection of the hyperparameters for a good classification, which is still inferior to NNK.

Table 2.3: Performance summary of neighborhood methods on 70 OpenML datasets. Listed metrics include (i) average classification error and standard deviation (in parentheses) across all datasets, (ii) the number of datasets where a method performs better than the other two (Wins), performs within one standard deviation of the best performance (Ties), or performs poorly in comparison (Losses), and (iii) the average rank of the method based on the classification (1st, 2nd, 3rd).

Method   Average error (%)   Wins/Ties/Losses   Average Rank
kNN      18.06 (0.83)        7/24/39            2.0571
k*NN     20.71 (1.04)        19/10/50           2.2714
NNK      16.26 (0.74)        44/9/17            1.6714

2.5.2 NNK graphs

In this section, we compare NNK graphs (Algorithm 3) and their basis pursuit variants (MP and OMP from Algorithm 2) with graph construction methods based on weighted kNN, LLE-based graphs [53], AEW [54], and Kalofolias [55]. We measure the goodness of the graphs in terms of their sparsity, runtime, and downstream performance. As in the neighborhood experiments, we use the Gaussian kernel defined in (2.1). However, we set the σ parameter such that the kNN approach can assign a nonzero weight to all of its k neighbors, by ensuring the k-th neighbor distance is within 3σ. For a fair comparison, we use the same σ value for the AEW, MP, OMP, and NNK graph constructions.

Figure 2.6: Graph constructions in synthetic data settings ((a) kNN graph, (b) LLE graph, (c) NNK graph). Unlike kNN and LLE, NNK achieves a sparse representation where the relative locations of nodes in the neighborhood are key.
In the seven-node example (top row), the node in the center has only three direct neighbors, which in turn are neighbors of nodes extending the graph further in those three directions. Graphs are constructed with k = 5 and a Gaussian kernel (σ = 1). Edge thickness indicates neighbor weights normalized globally to have values in [0,1].

Figure 2.7: Ratio of the number of edges (with weights ≥ 10⁻⁸) to the number of nodes as a function of the number of neighbors k, for kNN, LLE, AEW, Kalofolias, MP, OMP, and NNK. Left: Noisy Swiss Roll (N=5000, d=3). Right: MNIST (N=1000, d=784). The sparsity of the NNK, MP, and OMP graphs saturates with increasing k and is reflective of the intrinsic dimensionality of the data.

2.5.2.1 Sparsity and runtime complexity

Graph sparsity, the ratio of the number of edges to the number of nodes, is desirable for scaling downstream graph analysis in large datasets [102]. Unlike a kNN graph, where sparsity is controlled by the choice of k, an NNK graph corresponds to an adaptive sparse representation. Figure 2.6 presents two examples that compare the sparsity obtained in NNK graphs with that of the kNN and LLE graphs. In Figure 2.7, we show the effect of the parameter k on the sparsity of the various graph methods on a synthetic and a real dataset. We see that the MP, OMP, and NNK algorithms lead to the sparsest graph representations. Note that LLE graphs suffer from unstable optimization due to the large dimensionality of the features in the real dataset. Thus, the observed sparsity in this case is not reflective of a good graph and is only a shortcoming of the optimization (weights < 10⁻⁸).

As shown in Table 2.4, the NNK framework outperforms the other methods in terms of both sparsity and the time required to construct the graphs. An OMP graph construction produces the same graph as NNK at a 10× slower runtime. The MP method provides a faster alternative to OMP but still remains slow, with a runtime 2.5× that of NNK. Note that, unlike the AEW and Kalofolias constructions, the NNK optimization can be solved in parallel and can thus be made faster. In summary, the NNK approach offers a sparse construction while having a runtime that is similar to that of kNN, the most widely used approach in practice.

Table 2.4: Computational runtime T (in seconds) and number of edges |ξ| (> 10⁻⁸) for various graph constructions on a subset of MNIST data (N=1000, d=784) with k=30.

       kNN     LLE     AEW       Kalofolias   MP      OMP     NNK
T      0.16    0.83    3.3×10³   0.26         0.79    2.55    0.33
|ξ|    21410   7554    21410     13385        7475    7827    7827

2.5.2.2 Label Propagation

We evaluate the quality of the constructed graphs by observing their performance in a downstream semi-supervised learning (SSL) task using label propagation [79]. We consider subsets of digits datasets, namely USPS and MNIST, for the experiment. For USPS, we sample each class non-uniformly based on its labels (2.6c², c = 1, 2, …, 10), as in [79], [72], producing a dataset of size 1001. For MNIST, we use 100 randomly selected samples for each digit class to create one instance of a 1000-sample dataset (100 samples/class × 10 digit classes). Figure 2.8 shows the average classification error with different graph constructions for an increasing percentage of available labels (randomly selected) and different choices of the parameter k. We see that the basis pursuit and NNK constructions perform best with both the combinatorial (L) and symmetric normalized (ℒ) Laplacians in label propagation for all settings.
In particular, we note that the NNK and OMP-based constructions result in the same graph structure and, consequently, the same performance, as identified in our analysis in Sections 2.3.4.1 and 2.4.4, but with a significant difference in runtimes (the NNK approach requires much less time than OMP).

Figure 2.8: SSL performance (mean over 10 different subsets with 10 random initializations of the available labels) with graphs constructed by different algorithms. Left column: Classification error at different percentages of labeled data on the USPS (top row) and MNIST (bottom row) datasets with k = 30. The time taken and the sparsity of the constructed graphs are indicated as part of the legend (USPS: kNN 0.19 s, 20784 edges; LLE graph 0.51 s, 6973; AEW 1018.29 s, 20775; Kalofolias 0.31 s, 14112; MP 0.73 s, 7892; OMP 2.48 s, 8371; NNK 0.34 s, 8371. MNIST: kNN 0.16 s, 21410 edges; LLE graph 0.83 s, 7554; AEW 3271.09 s, 21410; Kalofolias 0.26 s, 13385; MP 0.79 s, 7475; OMP 2.55 s, 7827; NNK 0.33 s, 7827). Right column: Boxplots showing the robustness of the graph constructions using the L and ℒ Laplacians for different choices of k (10, 15, …, 50) in the SSL task with 10% labeled data on USPS and MNIST.

2.5.2.3 Laplacian Eigenmaps

We demonstrate the effectiveness of the NNK graph construction in manifold learning using Laplacian Eigenmaps [103] on two synthetic datasets (Figure 2.9).

Figure 2.9: Synthetic datasets used in the graph experiments. Left: Swiss roll (N=5000, d=3). Right: Severed sphere (N=3000, d=3). The color map, corresponding to the location along one axis, is used to identify the relative position of the points in the dimensionality reduction experiments.

The embedding obtained using NNK is compared to that of kNN and LLE [53] graphs, and that of the original
LLE algorithm [67]. Figure 2.10 presents a visual comparison of the embeddings obtained using the different methods and their robustness to the choice of k. We see that NNK graphs have better sparsity, runtime, and robustness in the embeddings obtained relative to the baseline methods considered. Moreover, the eigenmaps computation using NNK graphs is significantly faster due to the sparsity of the constructed graph. We note that the sparsity of NNK graphs captures the intrinsic dimensionality of the data, where the ratio of the number of edges to the number of data points corresponds to the dimension of the manifold associated with the datasets, namely, 2 for the Swiss roll, corresponding to a plane, and 3 for the severed sphere, indicating that the data is close to a 3-dimensional surface.

Figure 2.10: Two-dimensional embeddings learned by different algorithms (kNN graph, original LLE, LLE graph, NNK graph). Left: Swiss roll. Right: Severed sphere. Each row corresponds to a different choice of k = 10, 20, 40 (top to bottom). The time taken to construct each graph and the number of edges obtained are included in the subplot titles (Swiss roll, k=10/20/40 – kNN: 4.0 s (28578 edges), 4.1 s (55547), 4.2 s (109123); LLE: 5.2 s, 6.0 s, 8.0 s; LLE graph: 26.6 s (11116), 63.1 s (20891), 147.3 s (31142); NNK: 4.5 s (10244), 5.1 s (10967), 6.3 s (11078). Severed sphere, k=10/20/40 – kNN: 1.7 s (17981), 1.8 s (34560), 1.7 s (66888); LLE: 2.1 s, 2.6 s, 3.4 s; LLE graph: 3.4 s (16981), 5.0 s (32272), 7.1 s (61318); NNK: 2.0 s (8418), 2.3 s (9569), 2.6 s (9916)). We see that NNK graphs produce a robust embedding for both datasets, with the sparsity of the constructed graph correlating with the intrinsic dimension of the dataset.

2.6 Generalizations and Extensions

The proposed NNK framework, as presented so far, is applicable to any data modality provided one has access to a similarity kernel that is normalized with values in [0,1]. In this section, we first generalize the Kernel Ratio Interval (2.15) to all Mercer kernels. This analysis establishes the framework for neighborhood/graph construction on all data inputs with any (kernel) similarity function. We then present an analysis that allows for scaling NNK much like the approaches used for scaling kNN, and we conclude with a discussion about defining multi-scale neighborhoods in a principled manner.
2.6.1 General Kernel Ratio Interval Theorem 2.3. Given a three-node scenario as in Fig.2.2 and a Mercer kernel (Definition 2.1), the necessary and sufficient condition for two data points i and j to be connected to q in an NNK framework is K i,j K j,j < K q,i K q,j < K i,i K i,j (2.27) Proof. See Section 2.8.8. 41 Remark 2.1. Note that the general KRI condition in Theorem 2.3 is independent of the self- similarity or norm of the data to be represented, i.e., K q,q . This makes intuitive sense from a neighborhood definition or signal approximation point of view: the self-similarity of q does not change the projection order of neighbors (nearest to farthest) sinceK q,q is non-negative for Mercer kernels. K q,i K i,i ≤ K q,j K j,j ⇐⇒ K q,i K q,q K i,i ≤ K q,j K q,q K j,j Remark 2.2. Itshouldbealsonotedthatifthedatatobeselectedasneighbors(e.g.,i,j)have unit self-similarity, then the condition of Theorem 2.3 reduces to the simpler KRI condition of Theorem 2.1 irrespective of the self-similarity of the query point being approximated i.e., K q,q . 2.6.2 Neighborhoods in subvector spaces In this section, we present our extension on NNK neighborhood construction where input data isobtainedasaconcatenation ofsubvectoredspaces[104]. Weincludeheretheoretical resultsinvolvingtheKRI(Theorem2.1)anditsimpactonNNKneighborsineachsubvector and their concatenated data space. Note that such subvector spaces occur naturally in severalmachinelearningsystems, forinstance, inconvolutionalneuralnetworks[105], where the output at each convolution is the aggregation of the outputs of multiple channels or lower dimensional subvectors, each corresponding to the output of a learned image filter. Consider the case where an input data x∈ R D is divided into two subspaces 3 , namely x ⊤ = [x ⊤ 1 x ⊤ 2 ], wherex (1) ∈R D 1 andx (2) ∈R D 2 are the subspaces such that D 1 +D 2 =D. Let N denote the NNK neighbors obtained when considering the complete data x, while N (1) and N (2) denote the NNK neighbors obtained with respect to each subspace. In this setting, we make two observations presented in Propositions 2.5 and 2.6. These observations 3 The theoretical results presented here extend to an arbitrary number of subspaces and the choice of two is made for the sake of brevity. 42 allow us to solve for the NNK neighbors of a query by working on subset problems in lower- dimensional subspaces. This form of neighborhood search by partitioning input space, using data structures such as the k-d tree, is commonplace for scaling nearest neighbor search in large datasets, and with theoretical guarantees such as the ones presented here, a similar strategy can be employed for NNK neighborhood definitions. Proposition 2.5. If a data point j is an NNK neighbor in each subspace then it is an NNK neighbor in the input data space. ∀j∈N (1) andN (2) =⇒ j∈N i.e. N (1) ∩N (2) ⊆N (2.28) Proposition 2.6. If a data point k is pruned by an NNK neighbor j in both subspaces, then k is pruned by j in the concatenated input space for a given query x i . θ (1) ik =0|θ (1) ij >0 and θ (2) ik =0|θ (2) ij >0 =⇒ θ ik =0 (2.29) Wereferto[40,41,104]fordetailedanalysisandexperimentalevaluationsonNNKneigh- borhood definitions in subvector spaces. 2.6.3 Gaussian kernels and Multi-scale neighborhoods Despite providing flexibility to adapt to data distribution, finding an optimal bandwidth σ inGaussiankernels(2.1)isachallengingproblemwithseveralheuristicapproaches(median, local-scaling, average). Not surprisingly, different choice of σ leads to different results. 
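The bandwidth heuristics mentioned above are standard choices rather than part of the NNK framework; as a rough illustration (not taken from the thesis), the sketch below computes two of them: the median pairwise distance, and a local-scaling kernel in which each point gets its own bandwidth from the distance to its k-th nearest neighbor.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.neighbors import NearestNeighbors

def median_bandwidth(X):
    """Median heuristic: sigma set to the median pairwise distance."""
    return np.median(pdist(X))

def local_scaling_kernel(X, k=7):
    """Local scaling: K_ij = exp(-d_ij^2 / (s_i s_j)), with s_i the distance
    from x_i to its k-th nearest neighbor (excluding itself)."""
    D = squareform(pdist(X))
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    s = dists[:, -1]
    return np.exp(-(D ** 2) / np.outer(s, s))

X = np.random.default_rng(0).standard_normal((200, 3))
print(median_bandwidth(X))
print(local_scaling_kernel(X, k=7).shape)  # (200, 200)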
We present an alternative perspective to bandwidth choice in terms of dictionary design and discuss its use for defining multi-scale neighborhoods where data points that have high similarity (equivalent to RKHS mapping to the same function) are iteratively merged. 43 Consider two extreme choices of σ and the corresponding local kernel matrix, namely σ = 0 and ∞. The corresponding kernel matrix for these choices is the identity matrix I (sparse, full rank) and the all-one matrix 11 ⊤ (dense, rank-1). From a neighborhood perspective,theidentitymatrixcorrespondstothecasewheretheneighborsofthequeryhave no overlap in space, while the other extreme is representative of a collapsed neighborhood. Thus, wepostulatethechoiceofσ determinestherepresentationscaleofthedictionaryused to define the neighborhood. An optimal choice of bandwidth ( σ ) trades of the coherence, µ in equation (2.30), of the local dictionary Φ S and the scale used to represent a query. µ =max i,j∈S ϕ ⊤ i ϕ j = max i,j∈S K i,j (2.30) Basedonthisobservation, weproposeanapproachthatsimultaneouslyincreases σ while mergingthecollapseddatapointsfordefiningmulti-scaleneighborhoods. EarlierkNNbased methods [106,107] relied on k for scale and representation while paying no heed to kernel parameters such as σ in Gaussian kernel. In contrast, our proposed approach makes distinct the role of the hyperparameters in multi-scale representations and produces neighborhoods that are adaptive to the curvature of the manifold where the observed data lies. Specifically, in cases where a kNN method overestimates the flatness or the intrinsic dimension of a manifold, the NNK approach can capture the geometry and outperform previous baseline data analysis methods. We refer the reader to [92] for algorithmic details and practical evaluation of the multi-scale NNK neighborhood approach. 2.7 Discussion and Open questions We introduced a novel view of neighborhoods where we show that neighborhoods are non- negative sparse signal approximations. We then use this perspective to propose an improved approach, Non-Negative Kernel Regression (NNK), for neighborhood and graph construc- tions. We characterize the geometry of proposed NNK neighborhoods via the kernel ratio 44 interval theorem and establish its connection to ℓ 1 -regularized and basis pursuit approaches. Our experiments demonstrate that NNK performs better than earlier methods in neigh- borhood classification and graph-based semi-supervised and manifold learning. Further, we show that our approach leads to the sparsest solutions and is more robust to hyperparame- ters involved in both neighborhood and graph constructions. NNK has a desirable runtime complexityandcanleverageneighborhoodsearchtoolsdevelopedforkNN.Moreover,thelo- calnatureoftheNNKalgorithmallowsforfurtherruntimeimprovementsviaparallelization and additional memory. Weconcludethischapterwithafewopenquestions. Inasense,thesechallengesguarantee that there will be interesting and important work in the future on this topic. • The view of neighborhoods as sparse approximation problems raise the question of whether one can leverage this view to revisit sparse approximation, theoretically and computationally, from the perspective of neighborhoods. In particular, understanding the conditions, such as appropriate kernel similarity and local linearity in transformed space, under which the two topics overlap is a potential avenue for research. • The NNK method is aimed at obtaining the optimal solution for given data and kernel similarity. 
Thecurrentformulationdoesnotaccountfornoiseineitherthesimilarityor theinputdata. Addressingtheimpactofnoiseusingredundancyandothertechniques, as in traditional sparse signal processing, is left open for future work. • Theselectionofkernelparameters,itsconsequenceintheneighborhooddefinition,and its geometry remain to be resolved. Our study on multi-scale neighborhoods presents a way to interpret these parameters in one kernel similarity function, namely, the Gaussian kernel. A better understanding and investigation are required to generalize the view of kernel parameters in neighborhood and graph constructions. • Kernels allow us to model non-linear relationships in the input data using NNK in a transformed space. However, there is still little known about how to factor properties 45 such as symmetry and global invariances into the NNK framework. Unifying the con- straints of spectral methods is an ongoing challenge, as it is not always obvious which constraint is a necessity in a particular problem. 2.8 Proofs 2.8.1 Karush-Kuhn-Tucker conditions of the NNK objective A solution to (2.8) satisfies the following Karush-Kuhn-Tucker (KKT) conditions: K S,S θ − K S,q − λ =0, (2.31a) λ ⊤ θ =0, (2.31b) θ ≥ 0, λ ≥ 0. (2.31c) We will use these conditions to analyze and prove the properties of the NNK framework in the following sections. 2.8.2 Proof of Proposition 2.1 Proof. An optimum solution at node q satisfies the KKT condition (2.31) where K S,S θ =K S,q and θ S ≥ 0. (2.32) This can be interpreted as a solution to a set of linear equations under constraint: ∀i∈S, X j∈S θ j κ (x i ,x j )=κ (x q ,x i ) s.t. θ ≥ 0. (2.33) 46 Local Linear Embedding (LLE) with positive weight constraint minimizes Z ⊤ Zw =1, whereZ =X S − x q s.t. X i∈S w i =1, w≥ 0. LetM =Z ⊤ Z, i.e., ∀i,j∈S, M i,j =(x i − x q ) ⊤ (x j − x q ). Then, the LLE objective corresponds to solving a set of equations with constraints: ∀i∈S, X j∈S w j (x i − x q ) ⊤ (x j − x q )=1 s.t X i∈S w i =1,w≥ 0. (2.34) An equivalent set of equations is given by: X j∈S µ j ∥x j − x q ∥∥x i − x q ∥ (x j − x q ) ⊤ (x i − x q ) 2∥x j − x q ∥∥x i − x q ∥ = 1 2 . Sinceeachweightw j ispositive,itcanbefactoredasaproductofpositiveterms,specifically, let w j =µ j ∥x j − x q ∥∥x i − x q ∥. Further, using the constraint that weights add up to one, (2.34) is rewritten as ∀i∈S, X j∈S w j (x j − x q ) ⊤ (x i − x q ) 2∥x j − x q ∥∥x i − x q ∥ =1− X j∈S w j 2 ⇐⇒ X j∈S w j 1 2 + (x j − x q ) ⊤ (x i − x q ) 2∥x j − x q ∥∥x i − x q ∥ =1 s.t. w≥ 0. (2.35) 47 CombinedwithLemma2.3.1,weseethat(2.35)isexactlytheNNKobjectivewiththekernel at node q defined as in (2.9), i.e., ∀i∈S, X j∈S w j κ q (x i ,x j )=κ q (x q ,x i ) s.t. w≥ 0. (2.36) Lemma 2.3.1. lim h→0 κ q (x q +h,x i )=1 Proof. κ q (x q ,x i )= 1 2 + (x q − x q ) ⊤ (x i − x q ) 2∥x q − x q ∥∥x i − x q ∥ The above expression is indeterminate in its current form. Let us consider the limit of the expression, namely, lim h→0 κ q (x q +h,x i )= lim h→0 1 2 + (x q +h− x q ) ⊤ (x i − x q ) 2∥x q +h− x q ∥∥x i − x q ∥ = 1 2 + lim h→0,α →0 ∥x i − x q ∥∥h∥cosα 2∥x i − x q ∥∥h∥ , where α is the angle between (x i − x q ) andh. Thus, lim h→0 κ q (x q +h,x i )=1 since lim α →0 cosα =1. 48 2.8.3 Proof of Proposition 2.2 Proof. The Karush-Kuhn-Tucker (KKT) optimality conditions for equation (2.11) are K D,D θ ∗ − K D,q +η 1− λ =0, (2.37a) λ j θ ∗ j =0, (2.37b) and θ ∗ ≥ 0, λ ≥ 0. (2.37c) ThenonnegativityofbothK D,D (kernelsarenon-negativebydesign)andθ ∗ (byconstraint) ensures K D,D θ ∗ ≥ 0. 
This implies that the stationarity condition (2.37a) is satisfied only when η 1− K D,D − λ ≤ 0. Combining with the slackness condition (2.37b) we have the results of the proposition, i.e., η − K q,j >0 =⇒λ j >0, since η − K q,j − λ j ≤ 0, =⇒θ ∗ j =0 since λ j θ ∗ j =0. 2.8.4 Proof of Proposition 2.3 Proof. Under the partition, the objective of NNK (2.8) can be rewritten using block parti- tioned matrices as K S,S = K P,P K P, ¯ P K ⊤ P, ¯ P K¯ P, ¯ P , K S,q = K P,q K¯ P,q . The optimization for the sub-problem corresponding to indices in θ P is min θ P :θ P ≥ 0 1 2 θ ⊤ P K P,P θ P − K ⊤ P,q θ P . (2.38) 49 An optimal solution to this sub-objective (θ ∗ P ) satisfies the KKT conditions (2.31) namely, K P,P θ ∗ P − K P,q − λ P =0, λ ⊤ P θ ∗ P =0, and θ ∗ P ≥ 0, λ P ≥ 0. Specifically, given θ ∗ P >0, the solution satisfies the stationarity condition (2.31a), i.e., K P,P θ ∗ P − K P,q =0 since λ P =0. (2.39) Thus, the zero augmented solution withθ ¯ P =0 is optimal for the original problem provided K P,P K P, ¯ P K ⊤ P, ¯ P K¯ P, ¯ P θ ∗ P 0 − K P,q K¯ P,q − λ P λ ¯ P =0, λ P λ ¯ P θ ∗ P 0 =0, and θ ∗ P 0 ≥ 0, λ P λ ¯ P ≥ 0. Given the optimality conditions onθ ∗ P , the conditions for [θ ∗ P θ ¯ P ] ⊤ to be optimal requires K ⊤ P, ¯ P θ ∗ P − K¯ P,q − λ ¯ P =0 with λ ¯ P ≥ 0. =⇒ K ⊤ P, ¯ P θ ∗ P − K¯ P,q ≥ 0. (2.40) 50 2.8.5 Proof of Theorem 2.1 Proof. An exact solution to objective (2.8) without the constraint on θ at a data point q and set S ={i,j} satisfies 1 K i,j K i,j 1 θ i θ j = K q,i K q,j (2.41) ⇐⇒ θ i +θ j K i,j =K q,i , θ i K i,j +θ j =K q,j . Taking the ratio of the above equations θ i +θ j K i,j θ i K i,j +θ j = K q,i K q,j . (2.42) Now, 0 ≤ K i,j ≤ 1, where K i,j = 1 if and only if the points are same. Without loss of generality, let us assume points i and j are distinct. Then, K i,j <1≤ K q,i K q,j ⇐⇒ K i,j < θ i +θ j K i,j θ i K i,j +θ j ⇐⇒ θ i +θ j K i,j >θ i K 2 i,j +θ j K i,j ⇐⇒ θ i >θ i K 2 i,j ⇐⇒ θ i >0 ∴ θ i >0 ⇐⇒ K i,j < K q,i K q,j . (2.43) 51 Similarly, K q,i K q,j < 1 K i,j ⇐⇒ θ i +θ j K i,j θ i K i,j +θ j < 1 K i,j ⇐⇒ θ i K i,j +θ j >θ i K i,j +θ j K 2 i,j ⇐⇒ θ j >θ j K 2 i,j ⇐⇒ θ j >0 ∴ θ j >0 ⇐⇒ K q,i K q,j < 1 K i,j . (2.44) Equations (2.43) and (2.44) combined give the necessary and sufficient condition of (2.15). K i,j < K q,i K q,j < 1 K i,j . 2.8.5.1 Proof of Corollary 2.1.1 Proof. Let j be a node beyond the plane as in Figure 2.3a. For a Gaussian kernel with bandwidth σ ,K q,i >K q,j . Thus, condition corresponding to (2.43) is satisfied, K q,i K q,j >1>K i,j =⇒ θ i >0. Let∥x q − x i ∥=a, ∥x q − x j ∥=b and α be the angle between the difference vectors. Then, K q,i K q,j =exp 1 2σ 2 (b 2 − a 2 ) , K i,j =exp − 1 2σ 2 ((bcosα − a) 2 +(bsinα ) 2 ) =exp − 1 2σ 2 (b 2 +a 2 − 2abcosα ) ⇐⇒ 1 K i,j =exp − 1 2σ 2 (b 2 +a 2 − 2abcosα ) . 52 Now, the condition on nonzero θ j (2.44) can be rewritten as follows. θ j ̸=0 ⇐⇒ K q,i K q,j < 1 K i,j ⇐⇒ exp 1 2σ 2 (b 2 − a 2 ) <exp 1 2σ 2 (b 2 +a 2 − 2abcosα ) ⇐⇒ exp − a 2 2σ 2 <exp 1 2σ 2 (a 2 − 2abcosα ) ⇐⇒ exp 1 2σ 2 abcosα <exp a 2 2σ 2 ⇐⇒ bcosαa, which describes the half-plane not containing the query q. 2.8.5.2 Proof of Corollary 2.1.2 Proof. Statements A and B are related in that one is the contra-positive of the other. Here we prove statement A. Equation (2.13) can be interpreted as the boundaries of a convex polytope (P) formed by the half planes, boundary(P)={K ⊤ P,i θ P =K q,i i∈P}. (2.45) The interior of the polytope formed by the above half-planes is given by interior(P)={p : K ⊤ P,p θ P − K q,p <0}. 
(2.46) 53 Thus, K ⊤ P,j θ P − K q,i ≥ 0 (2.47) corresponds to a point outside the polytopeP. Now, to prove ∃i∈P : K q,i K q,j ≥ 1 K i,j , let us assume there exists no such i. Then ∀i∈P : K q,i K q,j < 1 K i,j ⇐⇒K q,i K i,j <K q,j ⇐⇒ K ⊤ P,i θ P K i,j <K q,j since i∈P ⇐⇒ (K P,i K i,j ) ⊤ θ P <K q,j since i∈P. (2.48) Using the triangle inequality corresponding to kernels for each term in P, we have K ⊤ P,j θ P <K q,j sinceK P,q ≤ K P,i K i,j . (2.49) Equation (2.49) contradicts our earlier statement, (2.47), on point j, namely, j lies outside P and hence does not belong to the interior as defined in equation (2.46). Thus, ∃i ∈ P : K q,i K q,j ≥ 1 K i,j . 54 2.8.6 Proof of Proposition 2.4 Proof. ˜ d 2 (i,j)=(ϕ i − ϕ j ) t (ϕ i − ϕ j ) =ϕ t i ϕ i − ϕ t i ϕ j − ϕ t j ϕ i +ϕ t j ϕ j =2− 2ϕ t i ϕ j ∵∥ϕ i ∥=∥ϕ j ∥=1 =2− 2 κ (x i ,x j ). 2.8.7 Proof of Theorem 2.2 Proof. The proof follows from the kernel ratio interval Theorem 2.1 and Proposition 2.4. Consider the condition for θ j ̸= 0: K q,i K q,j < 1 K i,j ⇐⇒K q,i K i,j <K q,j ⇐⇒ " 1− ˜ d 2 (q,i) 2 #" 1− ˜ d 2 (i,j) 2 # < " 1− ˜ d 2 (q,j) 2 # ⇐⇒1− ˜ d 2 (q,i) 2 − ˜ d 2 (i,j) 2 + ˜ d 2 (q,i) ˜ d 2 (i,j) 4 <1− ˜ d 2 (q,j) 2 ⇐⇒ ˜ d 2 (q,i)+ ˜ d 2 (i,j)− ˜ d 2 (q,j)> ˜ d 2 (q,i) ˜ d 2 (i,j) 2 . The condition for θ i can be derived with similar steps. 55 2.8.8 Proof of Theorem 2.3 Proof. A solution to the NNK objective without the constraint onθ at data point q and set S ={i,j} satisfies K i,i K i,j K i,j K j,j θ i θ j = K q,i K q,j (2.50) ⇐⇒ θ i K i,i +θ j K i,j =K q,i , θ i K i,j +θ j K j,j =K q,j . Taking the ratio of the above equations, θ i K i,i +θ j K i,j θ i K i,j +θ j K j,j = K q,i K q,j . (2.51) In the following statements, we will show that under conditions of (2.27), the solutions will be positive and vice versa. Now, K 2 i,j ≤ K i,i K j,j . This is because kernels with RKHS property are positive definite functions, with equality iff the points i and j are the same. Without loss of generality, let us assume points i and j are distinct. Then, K i,j K j,j < K q,i K q,j ⇐⇒ K i,j K j,j < θ i K i,i +θ j K i,j θ i K i,j +θ j K j,j ⇐⇒θ i K i,i K j,j +θ j K i,j K j,j >θ i K 2 i,j +θ j K j,j K i,j ⇐⇒θ i K i,i K j,j >θ i K 2 i,j ⇐⇒θ i >0 since K 2 i,j <K i,i K j,j . Therefore, θ i >0 ⇐⇒ K i,j K j,j < K q,i K q,j . (2.52) 56 Similarly, we can show that θ j >0 ⇐⇒ K q,i K q,j < K i,i K i,j . (2.53) Equations (2.52) and (2.53) combined give the necessary and sufficient condition and com- plete the proof: K i,j K j,j < K q,i K q,j < K i,i K i,j . 57 Chapter 3 Applications in graph-based learning and analysis Our coverage of general results in the previous chapter demonstrates the usefulness of the NNK approach and the view of neighborhoods as sparse representation problems. However, applying the framework in specific domains allows for additional insights and improvements. In this chapter, we present three application domains – image processing, representation learning, and interpolative estimators – where we adapt and analyze, theoretically and via experiments,theproposedNNKneighborhoodsandtheirextensionsforthespecificusecase. 3.1 Image graphs A modern trend in image processing has been to move from simple non-adaptive filters to image-dependent filters such as the bilateral filter [108], non-local means [109], block matching [110], kernel regression methods [111], expected patch likelihood maximization (EPLL) [112] or window nuclear norm minimization (WNNM) [113]. 
One shortcoming of theseadaptivefiltersisthattheycannotbedescribedusingtraditionalimagedomainFourier techniques since these models are highly nonlinear. As an alternative, graph-based perspec- tives have been introduced to analyze data-dependent image processing models [114,115]. In the graph formulation, pixels correspond to the nodes of a graph and are connected with edges having edge weights capturing pixel similarity. [114,116] shows the correspondence 58 between window-based filters and graph-based filtering, where the graph is formed by con- necting each pixel (node) to only those within a window centered at the pixel. In this graph constructionapproach, aw× w filtercorrespondstoak-NearestNeighbor(kNN)graphwith k = w 2 , where neighbors are selected based on their spatial location relative to the pixel at the window center. Thus, the choice of window size w, similar to the choice of k in kNN graphs, can be considered a hyperparameter offering very coarse control of the sparsity and complexity of the graph representation. The computational complexity of operations, such as spectral graph wavelets and other graph signal processing tools [19,117,118], on image graphs scales with the number of edges in the graph O(mnw 2 ), where m× n is the image size. This is a significant hurdle for applying several graph tools and approximate graph filters for image processing. The study of graph construction for image representation is a specific application often overlooked in data-driven graph learning methods [58]. As previously discussed in Section 2.5.2.1, the NNKframeworkleadstoaprincipledwaytoobtainsparsegraphs, consequentlyallowingfor efficient application of graph operations. In this section, we investigate the use of the NNK graph construction (Section 2.3.3) to represent images and, in particular, leverage image properties to reduce the runtime complexity of the NNK algorithm. The pixel position regularity and specific characteristics of kernels used for image filtering allow us to construct image graphs quickly and efficiently, scaling the NNK framework to graphs with millions of nodes, as often required by image processing applications. Experimentally, our image- specific simplifications lead to average speed-ups in graph construction of at least a factor 10 for w =11, relative to the original NNK algorithm. Our proposedgraph construction andfiltering are developed by takingthe bilateral filter graph as starting point. However, similar constructions can be adapted to other image processing models with a graph interpretation (e.g., those described in [114]). Of particular relevance to our work is [119], where the authors construct sparse graph alternatives by approximating the inverse of the bilateral filter matrix. The authors motivate the idea by 59 drawing parallels to the graph construction methods that estimate sparse inverse covariance or precision matrix of a Gaussian Markov Random Field model [120]. The authors note that this method is expensive and they resort to a heuristic algorithm that can only be used for small images. In contrast, our proposed method is scalable and applies to images of much larger sizes. We combine our method with spectral graph wavelets (SGW) [8] to illustrate its benefits for image representation, showing that our graphs have 90% fewer edges than window-based bilateral filter graph construction while offering better low-frequency representation and improved performance in the context of a simple denoising task. 
The runtime of SGWs and other graph filter operations for images are notably reduced due to the sparse nature of our graphs (e.g., 15 times faster than the same algorithm using BF graph). This section is an extension of our conference paper [34]. 3.1.1 Background 3.1.1.1 Bilateral Filter The bilateral filter (BF) can be interpreted as a graph filter on a dense, image-dependent graph, with edge weights between nodes (pixels) i and j given by the kernel K i,j =exp − ∥x i − x j ∥ 2 2σ 2 d exp − ∥f i − f j ∥ 2 2σ 2 f ! , (3.1) where x i and f i denote the pixel position (row, column coordinate) ∈R 2 and the intensity of pixel i (scalar for grayscale images, a 3-dimensional vector for RGB/YUV color images), respectively. Thebilateralfilteroperationonagraphsignal f canbewritteninmatrixform as D − 1 Kf, where D is the degree matrix of the graph. With this interpretation, it is also possible to develop alternative graphs via symmetrization of D − 1 K [121] or sparsification of the original BF graph [119]. 60 Note that the pixel positions x and the distances∥x i − x j ∥ 2 in the BF kernel (3.1) are image-independent and are known in advance. Thus, unlike in the general NNK graph con- struction, wherealldatadimensionsareirregularandunknown, pixellocationsareregularly spaced and known beforehand in image graphs. This observation allows us to reduce the KRI conditions (Theorem 2.1) to simple intensity thresholding rules for removing neighbors in images (Section 3.1.2.1). 3.1.1.2 Spectral Graph Wavelet (SGW) Transforms The Graph Fourier Transform (GFT) [19,21] is defined as the expansion of the graph signal intermsoftheeigenvaluesandeigenvectors(λ l ,e l )ofthechosengraphoperator, suchasthe combinatorial graph Laplacian (L). For example, the GFT at frequency (eigenvalue) λ l for a graph signal f is given by ˆ f(λ l ) = P i f(i)e l (i), where e l corresponds to the eigenvector associated with eigenvalue λ l . We refer the reader to Section 2.2.1 for details on the graph notations used in this section. SGWs [8] are based on defining a scaling operator in the GFT domain. SGWs are obtained using a graph operator based on a function g asT g =g(L) whose application on a signalf is obtained as the modulation of the GFT, namely, d T g f(λ l )=g(λ l ) ˆ f(λ l ). (3.2) Scaling of the wavelets is defined in the spectral domain with the function g in (3.2) scaled accordingly. In other words, the wavelet coefficients for a given signal at scale s at a vertex i,W f (s,i), are obtained, as a function of a graph operator T g =g(L) as: W f (s,i)=T s g f(i)= |V| X l=1 g(sλ l ) ˆ f(λ l )e l (i) (3.3) 61 These coefficients can be computed with a fast algorithm based on Chebychev polyno- mials. We refer the reader to [8] for further details on the approximations for the practical realization of SGWs. 3.1.2 NNK image graphs Giventheneighborsetateachnode, theNNKgraphcanbeobtainedwithO(k 3 )complexity at each node, where k is the number of neighbors. In this section, we present image-specific simplifications to compute NNK graphs efficiently. 3.1.2.1 Kernel Ratio Interval for images TheKRIconditionof(2.15)allowsustoidentifyneighboringnodes(pixelsinaw× wwindow for images) that, given the nodes that are already connected, will have zero edge weights (Figure 2.3a). From an image point of view, this corresponds to removing graph edges to pixels that are farther away when there exist pixels with a similar intensity that are closer. 
More formally, we can reduce the graph construction complexity by pruning specific pixels (nodes) according to the following condition.

Theorem 3.1. A pixel k does not have an edge to pixel i, given that i is connected to j, i.e., θ_{i,k} = 0 | (θ_{i,j} > 0), if and only if
(f_j − f_k)^⊤ (f_j − f_i) < (σ_f / σ_d)^2 (x_k − x_j)^⊤ (x_j − x_i).    (3.4)

Proof. See Section 3.5.1.

3.1.2.2 Sparse graph representation of images

In the right-hand term of (3.4), the kernel parameters (σ_f, σ_d) are constants, while (x_k − x_j)^⊤ (x_j − x_i) depends only on pixel locations (x) and can be determined for any given window size w and saved beforehand. As a further simplification, we need to consider only threshold factors (∆ = (x_k − x_j)^⊤ (x_j − x_i)) that are positive, corresponding to regions along the same direction as the connected pixel j (see Figure 3.1). This is intuitive, as the KRI plane (Figure 2.3a) would not influence the selection of pixels on the other side of the window. These simplifications reduce the runtime of steps 4–9 in Algorithm 4, where the only unknown is the image f.

Figure 3.1: Left: A simple scenario of 4-connected neighbors and the remaining pixels in a 7 × 7 window with their associated threshold factor (∆). For example, given that pixel j is connected to i, the proposed graph construction eliminates all pixel intensities in the green region (right of i) that satisfy the condition in Theorem 3.1. The threshold factor corresponding to each connected pixel is independent of pixel intensity and can be computed offline once for a given window size. The algorithm continues pruning by moving radially outwards, connecting pixels that are not pruned and removing the ones that are to be pruned based on the proposition, until no pixel is left unprocessed. Right: Average processing time per pixel (ms) for our proposed simplified NNK and the original NNK construction, as a function of window size (7–13). We observe a similar trend on all our test images, with the difference widening further for increasing window sizes.

Further, given the set of neighbors for each pixel after pruning, we assign edge weights using the bilateral kernel (3.1). This can be justified by Proposition 3.1 and the fact that both NNK and the original BF kernel weights maintain the same relative order of importance and would serve as reasonable approximations.

Proposition 3.1. The edge weight between nodes i and j assigned in the NNK graph construction is upper bounded by the kernel similarity between the two nodes:
θ_{i,j} < K_{i,j}.    (3.5)

Proof. See Section 3.5.2.

Now, we show that the NNK weights maintain the same relative order as the edge weights assigned using the bilateral kernel (3.1). Consider the simple case where only two neighboring nodes, j and k, remain around the query i after pruning. The NNK weights for the two nodes are obtained as
[θ_{i,j}; θ_{i,k}] = 1/(1 − K_{j,k}^2) [K_{i,j} − K_{j,k} K_{i,k}; K_{i,k} − K_{j,k} K_{i,j}].
Thus,
θ_{i,j} − θ_{i,k} = (K_{i,j} − K_{i,k}) [(1 + K_{j,k}) / (1 − K_{j,k}^2)].    (3.6)
The fractional term on the right, which depends only on K_{j,k}, is strictly positive. Thus, if j is less similar to i than k (K_{i,j} < K_{i,k}), then the order of the NNK weights will be the same (θ_{i,j} < θ_{i,k}).

3.1.3 Representation and filtering with NNK image graphs

We experimentally validate the effectiveness of our proposed NNK construction compared to the BF graph in terms of energy compaction and denoising performance. The source code for experiments is available at github.com/STAC-USC/NNK Image graph.
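Algorithm 4, restated next, gives the complete construction; as a preview, the following NumPy sketch illustrates only the pruning test of Theorem 3.1, with the spatial threshold factors ∆ precomputed from pixel positions. The grayscale intensities and the bandwidths (σ_f, σ_d) are arbitrary example values, not the settings used in the thesis experiments.

import numpy as np

def threshold_factors(w):
    """Delta[j, k] = (x_k - x_j)^T (x_j - x_i) for all pairs of pixels in a w x w window,
    with the window center i placed at the origin. Depends only on geometry, so it can
    be computed once per window size and cached."""
    r = w // 2
    offsets = np.array([(dy, dx) for dy in range(-r, r + 1)
                                 for dx in range(-r, r + 1) if (dy, dx) != (0, 0)], dtype=float)
    diff = offsets[None, :, :] - offsets[:, None, :]         # diff[j, k] = x_k - x_j
    return np.einsum('jkd,jd->jk', diff, offsets), offsets   # offsets[j] = x_j - x_i

def pruned_by(j, f_center, f_window, delta, sigma_f=20.0, sigma_d=3.0):
    """Boolean mask of window pixels k pruned by an already-connected pixel j via (3.4);
    only directions with non-negative Delta need to be checked (Figure 3.1)."""
    mu = (sigma_f / sigma_d) ** 2
    lhs = (f_window[j] - f_window) * (f_window[j] - f_center)   # (f_j - f_k)(f_j - f_i), grayscale
    mask = (delta[j] >= 0) & (lhs <= mu * delta[j])
    mask[j] = False                                             # a pixel does not prune itself
    return mask

delta, offsets = threshold_factors(w=7)
rng = np.random.default_rng(0)
f_window = rng.uniform(0, 255, size=len(offsets))   # intensities of the 48 window pixels
f_center = 128.0
mask = pruned_by(0, f_center, f_window, delta)
print(int(mask.sum()), "of", len(offsets), "candidates pruned by the first connected pixel")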
Algorithm 4: NNK Algorithm for Image Graphs

Function Precompute(w):
    x = pixel positions in the w × w window, S = {pixels in the w × w window}
    window center = i, S = S − {i}
    S = sort S by ∥x_j − x_i∥_2 for all j ∈ S
    for j, k in S do
        ∆_{j,k} = (x_k − x_j)^⊤ (x_j − x_i)
    end
    return ordered window pixels S, threshold factors ∆

Input: Image f ∈ R^{m×n}, window size w, σ_f, σ_d

Function NNK_Image_Graph():
    µ = (σ_f / σ_d)^2, [S*, ∆] = Precompute(w)
    for each pixel i do
        S = S*, P = {}
        for j in S do                       /* pick neighbors in spatial-distance sorted order */
            if j in P then continue         // skip if pruned
            P = P + {j}
            for pixel k in S with ∆_{j,k} ≥ 0 do    /* consider pixels in the same direction as j */
                if (f_j − f_k)^⊤ (f_j − f_i) ≤ µ ∆_{j,k} then P = P + {k}, S = S − {k}
            end
        end
        W_{i,S} = K_{i,S}, W_{i,S^c} = 0
    end
    return graph adjacency W

3.1.3.1 Energy compaction

In this section, we evaluate our graph construction for image representation. The variance of the image (graph signal) in the wavelet domain is an indicator of the information content in the corresponding spectral frequency band of the wavelet, i.e., the energy in an SGW sub-band depends on the energy that the original image has in the GFT domain in the frequencies corresponding to the passband for that sub-band. As shown in Figure 3.2, the SGWs corresponding to the NNK graph have significantly less information in the higher frequency bands, which is natural for images as they are inherently smooth. Another feature that identifies a good representation of images is the fraction of the image energy captured by each wavelet, i.e., ∥f_w∥^2 / ∥f∥^2. Figure 3.3 shows that NNK graphs capture much of the image energy earlier than the BF graph, corresponding to compact support in the wavelet domain.

Figure 3.2: (Top: 11 × 11 bilateral filter graph vs. Bottom: proposed NNK graph construction) Energy compaction using spectral graph wavelets. NNK graphs capture most of the image in the lower bands.

Figure 3.3: The energy captured by the BF graph (blue) and the NNK graph (red) for different polynomial degree approximations of the SGW (degrees 4, 6, 12, 15). The wavelets were designed with frame bounds A = 1.71, B = 2.35 as in [8]. NNK consistently captures the image content better than the BF graph, irrespective of the Chebychev polynomial degree.

Figure 3.4: Images used for the denoising experiment.

Figure 3.5: Denoising performance (PSNR in dB and SSIM versus noise level) using SGW on the BF graph and the proposed method, with comparisons to the original BF and BM3D algorithms. NNK graphs significantly improve over the BF graph version in SSIM and PSNR. Our method improves the SSIM of the output while matching the PSNR performance of the original BF. The BM3D method, included for completeness, shows that the proposed graph method with the BF kernel achieves comparable SSIM measures.

3.1.3.2 Image Denoising

We consider the problem of image denoising to evaluate the performance of filters based on our proposed graph construction. We consider 12 standard images (256 × 256) used in image processing (see Figure 3.4) with Gaussian noise at 5 different variances (σ = 10, 15, 20, 25, 30). We use an 11 × 11 window for constructing the graphs, with hyperparameters chosen as in [122]. Denoising is performed directly in the spectral graph wavelet domain corresponding to each
graph construction by retaining all scaling and low pass wavelet coefficients corresponding tothelargestscales(inourexperiments, weselect3outofthe7bands). Wereferthereader to [123] for details on implementing the denoising algorithm using SGWs. The average performance and quantiles for original BF [108], BM3D [110], and SGW based on the BF graphandproposedNNKareshowninFigure3.5. Akeythingtonoticeisthatperformance worsens when using SGW with the BF graph. We attribute this to the fact that the BF graph construction leads to a dense graph. Consequently, higher degree polynomials of the BF adjacency used in SGW lead to averaging over a larger window, resulting in increased blurring. 3.2 Data summarization Massive high-dimensional datasets are becoming an increasingly common input for system design. Whilelargedatasetsareeasiertocollect,themethodsforexploratory(understanding orcharacterizingthedata)andconfirmatory(confirmingthevalidityandstabilityofasystem 68 designed using the data) analysis are not as scalable and require new techniques that can cope with big data sizes [124,125]. Data summarization or sparse representation learning methodsaimtorepresentlargedatasetsbyasmallsetofelements,whichcanprovideinsights to organize the dataset into clusters, classify observations to its clusters, or understand the memorization/generalization of models learned with the data [126,127]. In datasets with label information, a summary can be obtained for each class. Still, summaries are generally unsupervised and decoupled from downstream data-driven system designs, thus different from coresets and sketches [128,129]. ClusteringmethodssuchaskMeans[125,130],vectorquantization[131]andtheirvariants [132], are among the most prevalent approaches to data summarization [133]. A desirable propertyforsummarization,whichcanbeobtainedwithclusteringmethods,isthegeometric interpretability of elements in the summary. For example, in kMeans, the elements in the summary are centroids, which are obtained by averaging points in the input data space, and thus are themselves in the same data space so that one can associate properties (e.g., labels) to the summary points based on the data points these are derived from. In kMeans, each dataset point can be considered a 1-sparse representation based on the nearest cluster center. This leads to a “hard” partitioning of the input data space, which suggests that better summarization is possible if the optimization allows for points to be approximated by a sparse linear combination of summary points. Inthissection,weinvestigatedatasummarizationusingadictionarylearning(DL)frame- workwherethesummary,ordictionary,isoptimizedformaximumk-sparsity,withk >1,i.e., each data point is represented adaptively by a sparse combination of elements (atoms) from a learned set, the dictionary. The additional flexibility in the sparse representation relative to kMeans, which is akin to 1-sparse approximation, allows for each summary to be involved in the approximation of larger regions, in contrast with kMeans where only points in the Voronoiregionaroundanatom(clustercenter)arerepresentedbythatatom(seeFigure3.8). 69 It is important to note that using previous DL schemes, such as the method of optimal di- rections (MOD) [134], the kSVD algorithm [135], and their kernel extensions [136–138], for DL-based data summarization is not possible for several reasons. (i) Existing DL methods learn dictionary atoms that are optimized to represent data and their approximation resid- uals [139]. 
Therefore, atoms in the dictionary are not guaranteed to be on or near the input data manifold and do not have geometric properties as those of cluster centers in kMeans. (ii) Although DL methods perform well in signal and image processing tasks, their appli- cation to machine learning problems is largely limited to learning class-specific dictionaries that can be later used for classification [140,141]. This is because the individual (residual) atoms learned by DL cannot be directly associated with labels or other properties of the data and can only be assigned labels if separate class-wise dictionaries are learned. (iii) Moreover, current DL schemes are impractical, even for datasets of modest size [142,143], andarethusunsuitableforsummarizationinvolvinglargedatasets. Thisisduetotheexpen- sive, iterative search over the entire dictionary to achieve sparse approximation in DL-based representations. Toovercometheselimitationsandlearndictionarieswithatomsthatcanbeusedfordata summarization, we leverage the sparse approximation view of neighborhoods and the NNK algorithm from Section 2.3. The proposed NNK-Means algorithm performs non-negative sparse coding using NNK, where each selected neighbor corresponds to a geometrically non- redundant dictionary atom or cluster center that is learned as part of the procedure. Our experiments show that the NNK-Means (i) selects atoms for summarization that are sim- ilar to the input data, (ii) outperforms DL methods in terms of downstream classification using class-specific summaries on several datasets (USPS, MNIST, and CIFAR10), and ( iii) achieves runtimes similar to kernel kMeans, while being 67× and 7× faster training and testingruntimes, respectively, thankernelkSVD.Finally, wepresentastudyonusingNNK- Means for learning with summarized data, where we compare our method with a recent coreset approach [144] and present strategies to enhance the learning. 70 Figure 3.6: Left: Proposed NNK-Means. The algorithm alternates between sparse coding (W)usingNNKanddictionaryupdate(A)untileitherthedictionaryelementsconverge(up toa givenerror)oragivennumberofiterationsisachieved. Middle: Duringsparsecoding, kMeansassignseachdatapointtoitsnearestneighborwhileNNKrepresentseachdatapoint inan adaptively formedconvexpolytopemadeofthedictionaryatoms. Right: Comparison betweenexistingdictionarylearningmethodsandtheproposedNNK-Means. kMeansoffersa 1-sparsedictionarylearningapproach. kSVDoffersamoreflexiblerepresentation, wherethe sparse coding stage uses a chosen, fixed k 0 -sparsity in its representation but lacks geometry. NNK-Means has adaptive sparsity that relies on the relative positions of atoms around each data point to be represented. 3.2.1 Background 3.2.1.1 Sparse Dictionary Learning Given a dataset of N data points represented by a matrix X ∈R d× N , the goal of DL is to findadictionary D∈R d× M withM <<N, andasparsematrixW ∈R M× N thatoptimizes data reconstruction. ˆ D, ˆ W = argmin D,W:∀i ||W i || 0 ≤ k ||X− DW|| 2 F (3.7) where the ℓ 0 constraint on W corresponds to the sparsity requirements on the columns of the reconstruction coefficients W i ∈ R M and || . || F represents the Frobenius norm of the reconstruction error associated with the representation. WhilekMeanscanbewrittenintermsoftheDLobjective(3.7)witha1-sparseconstraint on the sparse coding, i.e., each column of W can have only one nonzero value, there are several important differences between the two problems. 
In particular, we can see that: ( i) The coefficients involved in the sparse coding of kMeans are non-negative, (ii) in kMeans, the sparse coding is based on the proximity of the data to the atoms (i.e., cluster centers), whereas in kSVD or MOD, coding is done by searching for atoms that maximally correlate 71 withtheresidual,and(iii)thechoiceofdictionaryupdateleadstodictionarieswithdifferent atoms. Thus, in kMeans, atoms are obtained via weighted averages. In contrast, in kSVD, atom updates aim to minimize the residue in representation. Kernelized extensions of DL objective (3.7) involve replacing the input data with their respective Reproducing Kernel Hilbert Space (RKHS) representation (see Section 2.2.2 for more details on kernels). A kernelized DL objective is thus obtained as ˆ D, ˆ W = argmin D,W:∀i ||W i || 0 ≤ k ||Φ − DW|| 2 F , (3.8) whereΦ =ϕ (X)correspondstotheRKHSmappingofthedata. Thekernelallowsforrepre- sentingdatathatbelongtoanon-linearmanifoldbylearningdictionariesinthetransformed, possibly infinite-dimensional, RKHS associated with the kernel. However, for the problem (3.8), theinitialization and dictionary learningcannot leverage the kernel trick [59,62], where one does not require the explicit mapping of the data in RKHS. This is because the term DW involving the dictionary is not written in terms of innerproductsontheinputdata. Toovercomethisproblem,[137]suggestssolvingamodified version of objective (3.7), namely, ˆ A, ˆ W = argmin A,W:∀i ||W i || 0 ≤ k ||Φ − Φ AW|| 2 F . (3.9) Here, one learns a dictionary (D =Φ A) via the coefficient matrix A∈R N× M , so that the kernel trick can be used in inner products arising in the resulting optimization. 3.2.1.2 Related work Non-negative, Kernelized Dictionary Learning: In all DL methods, as in ours, the learning is done by alternating two minimization steps, namely sparse coding and dictionary update, with differences between the methods arising given the choice of constraints or optimization in these two steps. The main novelty in NNK-Means comes from using a non-negative 72 sparse coding procedure in kernel space that can be described, similar to kMeans, in terms of the local data geometry. Non-negative DL and kernelized DL were separately studied in [1,145,146] and [137,138], respectively. Closest to our work are [147,148], where dictionaries are learned in kernel space with non-negative sparse coding performed using optimization schemes,suchasanℓ 1 -constrainedquadraticsolverwithmultiplicativeupdates[94,148,149], or make use of iterative thresholding approaches with non-negativity constraint [1,145,146]. In contrast, our approach uses a geometric sparse coding approach based on local neighbors, apreviouslyunexploredprocedureinDL.Thesparsityoftherepresentationinourapproach depends on the relative position of the data and atoms (i.e., the data geometry) and is thus interpretable and adaptive. Consequently, unlike earlier DL methods, individual atoms learned by NNK-Means have explicit geometric properties, with representations obtained as averages of input data examples similar to kMeans. These can be associated with data properties, such as class labels. This makes the atoms learned by our approach suitable for data summarization. Note that previous DL methods and sparse coding approaches lacked such properties, so the proposed NNK-Means is the first DL framework to study sparse coding and dictionary update, emphasizing geometry, for data summarization. 
Thus, our method keeps the desirable properties of kMeans (geometric interpretation of data via cluster centers, neighborhood-based coding) while allowing for flexibility in the representation sparsity in the coding procedure. Moreover, the computation complexity of NNK-Means is similar to that of kMeans. This is in stark contrast to previous DL approaches, which are impractical for large datasets. This makes our scheme suitable for data summarization, which was not possible with previous kernelized DL methods with non-negative constraints. A summary of the relationship between our method and previous DL methods is tabulated in Table 3.1, along with a visual comparison in Figure 3.7.

Table 3.1: Key differences between earlier dictionary learning (DL) methods and NNK-Means. All methods iteratively use two steps: sparse coding and dictionary update. Our sparse coding procedure is explicitly based on local neighbors, with the sparsity defined adaptively based on the relative position of the data and atoms (i.e., the data geometry). Further, the atoms obtained with our method are points in the input space and correspond to a smooth partition of the data space. In contrast, atoms obtained in previous approaches need not belong to the input space, as they strive to represent both the inputs and their approximation residuals [1, 2].

Method | Kernel | Non-negative coding | Sparse coding | Sparsity | Dictionary update | Geometry
[134] | No | No | Basis pursuit | Fixed | Least squares solution | No
[137, 138] | Yes | No | Kernel basis pursuit | Fixed | Kernel SVD of residuals | No
[1, 145, 146] | No | Yes | Non-negative basis pursuit / multiplicative update with thresholding | Fixed / ℓ1 norm | SVD of residuals / optimization with multiplicative updates | No
[147, 148] | Yes | Yes | Non-negative kernel basis pursuit / multiplicative update with thresholding | Fixed / ℓ1 norm | Optimization with multiplicative updates and thresholding | No
Ours | Yes | Yes | Nearest-neighbor based | Adaptive | Least squares solution (weighted averaging like kMeans) | Yes

Figure 3.7: 100 atoms for MNIST digits (contrast scaled for visualization) obtained using (a) kMeans, (b)–(d) DL methods and their constrained variants (unconstrained; W ≥ 0 with ℓ1 regularization; D and W ≥ 0), and (e) NNK-Means, all using a cosine kernel. Unlike earlier DL approaches, the proposed NNK-Means learns individual atoms that are linear combinations of the digits and can be associated geometrically with the input data. Such explicit properties of the atoms are lost when working with the ℓ1-regularized or thresholding-based sparse coding methods used previously in DL. We also observe that the atoms learned by NNK-Means are more visually similar to the input data.

Distillation approaches: Meta-learning or gradient-based learning methods aim to synthesize a small training set that can be used to train neural networks [150–152]. These methods optimize the generated synthetic data such that the difference in training loss between the synthetic and real input data is minimized. More recently, [153] proposes to condense the training dataset by matching the gradients from the dataset with those of a smaller synthetic set. Distillation methods can include data augmentations and aim to learn summarized data for a downstream task and network architecture that is known in advance. Further, these methods usually do not provide guarantees on the quality of their summarization and may require expert tuning of the hyperparameters involved in the training procedure. In contrast, our proposed summarization method is unsupervised and does not target a specific downstream task.
Coresets: The idea of coresets originated in computational geometry and optimization, whereacoresetisdefinedasadatasetsummary,often(butnotalways)requiredtobeasubset of the original dataset, such that it can be used as a proxy for the entire dataset [128,129]. Recent coreset methods for machine learning are studied in a supervised setting. Here, one makes use of proxy models such as neural tangent kernels [154] for obtaining subsets of data that can be used for subsequent learning [144,155]. These methods are expensive in terms of runtime and do not scale well with an increase in the size of the coreset or the dataset. Our proposed NNK-Means is conceptually similar to early unsupervised coreset solutions that aimed at capturing the distribution of a dataset [156,157]. We leave for future work exploring this relationship to obtain approximation guarantees for the summaries learned in our method. 3.2.2 NNK-Means This section presents our proposed method for data summarization, NNK-Means. We use a two-stage learning scheme where we solve sparse coding and dictionary update until con- vergence or until a given number of iterations or a reconstruction error is reached. The two steps, the respective optimization involved, interpretation, and runtime complexity, are described below. 75 Sparse Coding Given a dictionary, A, in this step, we seek to find a sparse matrix W that optimizes data reconstruction in kernel space. We will additionally require the entries of W to be non-negative, with at most k nonzero entries. Thus, the objective to minimize at this step is ˆ W = argmin ∀i W i ≥ 0,||W i || 0 ≤ k ||Φ − Φ AW|| 2 F = argmin ∀i W i ≥ 0,||W i || 0 ≤ k N X i=1 ||ϕ i − Φ AW i || 2 2 , (3.10) where ϕ i corresponds to the RKHS representation of data x i . Solving for each W i in equation (3.10) involves working with an N × N kernel matrix, leading to run times that scale poorly with the size of the dataset. However, the geometric understanding of the NNK objective in [32] allows us to solve for the sparse coefficients ( W i ) efficiently for each data point by selecting and optimizing starting from a small subset of data points, namely, the k-nearest neighbors. Objective (3.10) can be rewritten for each data point and solved with NNK to obtainW i as ˆ W i,S =argmin θ i ≥ 0 ||ϕ i − Φ A S θ i || 2 2 and ˆ W i,S c =0, (3.11) where the set S corresponds to the subset of indices corresponding to selected dictionary atoms. ThisobjectivecanbesolvedefficientlyusingthemethodsdevelopedforNNKgraphs (Section 2.3.3). Sparse coding using NNK allows us to explain the obtained sparse codes, leverage nearest-neighbor tools to scale to large datasets and analyze the obtained atoms geometrically. This is similar to kMeans, but each data point here is represented by an adaptive set of non-redundant neighbors rather than just 1 as in kMeans. This step includes aneighborhoodsearchandanon-negativequadraticoptimizationwithruntimecomplexities O(NMd) andO(Nk 3 ). 76 Dictionary Update Assuming that the sparse codes for each training data, W, are cal- culated and fixed, the goal is to update A such that the reconstruction error is minimized, i.e., ˆ A=argmin A ||Φ − Φ AW|| 2 F . (3.12) Here, we propose an update similar to that in MOD, where the dictionary matrix A is obtained based onW as ˆ A=W ⊤ (WW ⊤ ) − 1 . (3.13) The runtime complexity associated with this step isO(M 3 +NMk), where we use the fact that W has at most Nk non-zero elements. 
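To make the two alternating steps concrete, the following is a rough sketch of the loop for the special case of a linear kernel, with SciPy's NNLS standing in for the NNK solver of (3.11) and the MOD-style update (3.13) rewritten in terms of the atoms themselves. The thesis formulation (and the released code) operates in kernel space via the kernel trick, so this sketch is only illustrative.

import numpy as np
from scipy.optimize import nnls
from sklearn.neighbors import NearestNeighbors

def nnk_means_linear(X, M=50, k=5, n_iter=10, seed=0):
    """Simplified NNK-Means-style loop for a linear kernel.
    X: (N, d) data. Returns atoms (M, d) and the sparse code matrix W (M, N)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    atoms = X[rng.choice(N, M, replace=False)].copy()      # initialize with random data points
    for _ in range(n_iter):
        W = np.zeros((M, N))
        # Sparse coding: non-negative reconstruction from the k nearest atoms.
        _, nbr = NearestNeighbors(n_neighbors=k).fit(atoms).kneighbors(X)
        for i in range(N):
            S = nbr[i]
            theta, _ = nnls(atoms[S].T, X[i])               # min_{theta >= 0} ||x_i - A_S^T theta||
            W[S, i] = theta
        # Dictionary update, A = W^T (W W^T)^{-1}, i.e. atoms = (W W^T)^{-1} W X for a linear kernel.
        G = W @ W.T + 1e-8 * np.eye(M)                      # small ridge in case an atom is unused
        atoms = np.linalg.solve(G, W @ X)                   # (M, d)
    return atoms, W

X = np.random.default_rng(1).standard_normal((1000, 8))
atoms, W = nnk_means_linear(X, M=20, k=5, n_iter=5)
print(atoms.shape, (W > 0).sum(axis=0).mean())              # average number of active atoms per point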
In our experiments, we observe that using k-nearest neighbors directly for sparse coding lacks adaptivity and optimality and leads to over-smoothing and instabilities at the dictionary update stage. Thus, kNN would be unsuitable for DL, as well as for data summarization using DL.

Algorithm 5: NNK-Means Algorithm

Input: Data X, max. sparsity k, max. iterations T, tolerance ξ, kernel κ(x, y) ∈ [0, 1]
Initialize: Set A to select M random points of the input data
for t = 1, 2, ..., T do
    for i = 1, 2, ..., N do
        S = {k nearest neighbors of node i in the dictionary}
        θ_S = argmin_{θ ≥ 0} (1/2) θ^⊤ K_{S,S} θ − K_{S,i}^⊤ θ
        E_i = 1 − 2 K_{S,i}^⊤ θ_S + θ_S^⊤ K_{S,S} θ_S
        W_{i,S} = θ_S, W_{i,S^c} = 0
    end
    if Σ_{i=1}^{N} E_i ≤ N ξ then break    // reconstruction error is small
    A = W^⊤ (W W^⊤)^{-1}
end
Output: Dictionary matrix A, sparse code W

Proposition 3.2. The dictionary update rule in (3.13) reduces to the kMeans cluster update A = W^⊤ Σ^{-1} when W consists of N columns drawn from (e_1 ... e_M), where e_m is the m-th canonical basis vector, i.e., e_{mi} = 0 for all i ≠ m and e_{mm} = 1, and Σ ∈ R^{M×M} is a diagonal matrix containing the degree, or number of times, each basis vector e_m appears in W.

Proof. See Section 3.5.3.

Proposition 3.2 shows that the proposed method reduces to the kMeans method when the sparsity of each column in W is constrained to be 1, and is thus a DL generalization that maintains the geometric and interpretable properties of kMeans. One can easily verify that our iterative procedure for DL, alternating between sparse coding and dictionary update, does converge (Theorem 3.2) and produces atoms that belong to the input data manifold.

Theorem 3.2. The residual ||Φ − ΦAW||²_F decreases monotonically under the NNK sparse coding step (3.11) for W given matrix A. For a fixed W, the dictionary update (3.13) for A is the optimal solution to min_A ||Φ − ΦAW||²_F. Thus, the NNK-Means objective decreases monotonically and converges.

Proof. See Section 3.5.4.

3.2.3 Classification with summarized data

In this section, we validate the properties of NNK-Means that make it suitable for data summarization. We focus on a standard experimental setting in DL, namely DL-based classification [137, 143], to compare NNK-Means with previous DL approaches. Note that learning a good summary leads to better classification. Since existing DL methods cannot associate labels directly with the atoms obtained, our experiments are constrained to learning a dictionary for each class ({A_i}_{i=1}^C) in the training data. These dictionaries are later used to classify queries based on the class-specific reconstruction errors {e_i}_{i=1}^C, i.e., we sparse code a query x_q using each dictionary A_i and assign the query to the class (c) with the lowest reconstruction error (e_c). NNK-Means consistently outperforms all other methods in classification while having desirable runtimes relative to kMeans, kSVD, and their kernelized versions¹ on both synthetic and real datasets. We normalize the datasets using the empirical mean and variance obtained from the training data and use a Gaussian kernel κ(x, y) = exp(−||x − y||²₂ / 2). The results reported are averages over 10 runs for all experiments. The source code for all the experiments is available at github.com/STAC-USC/NNK Means.
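As an illustration of this decision rule only, the sketch below classifies a query by the lowest non-negative reconstruction error across per-class summaries. For brevity, the per-class atoms here are plain kMeans centroids rather than NNK-Means dictionaries, and the toy data and parameters are assumptions; the numbers it prints are not those of the thesis experiments.

import numpy as np
from scipy.optimize import nnls
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def reconstruction_error(x, atoms, k=5):
    """Non-negative reconstruction error of x from its k nearest atoms (linear kernel)."""
    _, nbr = NearestNeighbors(n_neighbors=min(k, len(atoms))).fit(atoms).kneighbors(x[None, :])
    A_S = atoms[nbr[0]]
    theta, _ = nnls(A_S.T, x)
    return float(np.sum((x - A_S.T @ theta) ** 2))

def fit_class_dictionaries(X, y, M=10):
    """One summary per class (kMeans centroids used as stand-in atoms)."""
    return {c: KMeans(n_clusters=M, n_init=4, random_state=0).fit(X[y == c]).cluster_centers_
            for c in np.unique(y)}

def classify(x, dictionaries, k=5):
    errors = {c: reconstruction_error(x, atoms, k) for c, atoms in dictionaries.items()}
    return min(errors, key=errors.get)      # class with the lowest reconstruction error

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=3.0 * c, size=(200, 2)) for c in range(4)])
y = np.repeat(np.arange(4), 200)
dicts = fit_class_dictionaries(X, y, M=10)
print(classify(X[5], dicts), y[5])          # predicted class vs. true class of a training point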
Each method learns a 10-atom dictionary per class based on the given training data (N = 600), with a sparsity constraint, where applicable, of 5. Panels: (a) input data, (b) kMeans (0.85), (c) K-kMeans (0.8), (d) kSVD (0.24), (e) K-kSVD (0.98), (f) proposed NNK-Means (1.0). The learned per-class dictionary is then used to classify test data (N = 200), with accuracy indicated in parentheses for each method. In (a), training and test data are denoted by × and • markers, respectively. We see that the kSVD approach (d) cannot adapt to the nonlinear structure of the data, and adding a kernel (e) is crucial in such scenarios. We see that NNK-Means (f) is more adaptive to the data geometry. Also, we observe that NNK-Means has a runtime comparable to kMeans (b, c) while having 4× and 2× faster train and test times than kernel kSVD.

Synthetic dataset. We consider a 4-class dataset consisting of samples generated from a non-linear manifold and corrupted with Gaussian noise (as in Figure 3.8). Since the data corresponding to each class have similar support, namely the entire space R², dictionaries learned using kSVD are indistinguishable for each class and lead to at-chance performance in the classification of test queries. On the contrary, a kernelized version of kSVD can handle the non-linearity of the data manifold, but at the cost of increased computational complexity. Interestingly, we observe that a non-negative, neighborhood-based sparse coding can adapt to input-space non-linearity even when constrained to 1-sparsity (Kernel kMeans), which indicates the importance of non-negativity and geometry in data summarization.

¹ We use the efficient implementations, as in [143], from the omp-box and kSVD-box libraries [158] and the Kernel kSVD code of [137].

Figure 3.9: Test classification accuracy (left), train time (middle), and test time (right) as a function of the number of dictionary atoms per class (50, 100, 200) on the USPS dataset for various DL methods (kMeans, Kernel kMeans, kSVD, Kernel kSVD, NNK-Means). Each method is initialized similarly and is trained for a maximum of 10 iterations with a sparsity constraint, where applicable, of 30. The plots demonstrate the benefits of NNK-Means in classification accuracy and runtime. The major gain in runtime for NNK-Means comes from the pre-selection of atoms in the form of nearest neighbors, which leads to fast sparse coding (as can be seen from the test-time plot, which performs only sparse coding) relative to the kSVD approaches, which sequentially perform a linear search for atoms that correlate with the residue at that stage. Training time in the kSVD approaches decreases as the number of atoms increases, since the sparse coding stage then requires fewer atom-selection steps than for smaller dictionaries.

Method | MNIST-S | MNIST | CIFAR-S | CIFAR
kMeans | 94.89 | 96.34 | 83.88 | 84.91
K-kMeans | 91.56 | 93.19 | 84.22 | 85.06
kSVD | 95.53 | 95.86 | 86.01 | 86.28
K-kSVD | 96.45 | − | 86.71 | −
NNK-Means | 96.70 | 97.79 | 86.95 | 87.21

Table 3.2: Classification accuracy (in %, higher is better) on MNIST, CIFAR10, and their subsets (S, a randomly sampled 20% of the training set). Each method learns a 50-atom dictionary per class, initialized randomly, with a sparsity constraint, where applicable, of 30, and is solved for at most 10 iterations. NNK-Means consistently produces better test classification accuracy while having a reduced runtime compared to the kSVD approaches and comparable to kMeans. Kernel kSVD produces comparable performance, but at the cost of 67× and 7× slower train and test times relative to NNK-Means.
80 USPS,MNIST,CIFAR10 Wenowevaluateourmethodonhigh-dimensionalrealdatasets: USPS, MNIST, and CIFAR10. We use as features the pixel values of the images for the USPS (d = 256) and MNIST (d = 784) datasets. For CIFAR10, we train a self-supervised model using SimCLR loss [159] on unlabelled training data to obtain features (d=512) for our experiment. We use the train/test split provided with each dataset and normalize the featurevectorsusingthemeanandvarianceofthetrainingset. Wereportheretheresultsof DL with a subset of the training data, namely MNIST-S and CIFAR10-S, for a fair compari- son with kernel kSVD. We note that kernel kSVD scaled poorly with dataset size and timed out when working with the entire training set of MNIST and CIFAR10. NNK-Means can efficiently learn a compact set of atoms that are capable of representing each class which in turn provides a better classification of test data in all settings as made evident in Figure 3.9 and the results in Table 3.2. 3.3 Neighborhood based interpolation Machine learning, particularly deep learning, has led to significant advances with transfor- mativeapplicationsinvariousareasinrecentyears. Thestandardpracticeindesigningthese modern systems is to first collect significant amounts of data and then train a model, with a number of parameters several orders of magnitude larger than the dataset size, to achieve zero or near-zero error on the available training data. A model achieving zero loss on a training dataset is referred to as being interpolative. Formally, an estimator is a valid interpolation function when it fits the dataset exactly, i.e., theestimator,whenappliedtotrainingdata,estimatestheproperties,suchaslabels,ofthat dataset perfectly, thus achieving zero error. Conventionally, machine learning models were trained to be not interpolative on the training dataset as interpolative models were thought to generalize poorly (a large gap between test and training error). However, the empirical 81 successofmodernlearningsystems[160,161]suggeststhatatleast some interpolative modes can indeed generalize well, i.e., perform well on unseen data. In a series of papers, Belkin et al. [162,163] investigate neighborhood-based estimators that are interpolative on a given dataset, such as simplicial interpolation and singular-kernel weighted k-nearest neighbors (wiNN), possess good generalization properties and these con- cepts (interpolation, generalization) can occur simultaneously. (a) (b) (c) Figure3.10: (a)Comparisonofsimplicialandpolytopeinterpolationmethods. Inthesimplex case, the node x i label can be approximated based on different triangles (simplices), one of which must be chosen. With the chosen triangle, two out of the three points are used for interpolation. Thus, in this example, only half the neighboring points are used for interpolation. Instead, NNK interpolation is based on all four data points, which together form a polytope. (b) KRI plane (dashed orange line) corresponding to chosen neighbor x j . NNK will not select data points to the right of this plane as neighbors of x i . (c) KRI boundary and associated convex polytope formed by NNK neighbors at x i . Consider the example of Figure 3.10a: a simplicial interpolation constrains itself to a simplexstructure, triangles inR 2 , where theverticesofthesimplex correspondtotheneigh- bors of the test query, but requires choosing a simplex among several available ones. 
Thus, in Figure 3.10a, only one triangle can be used, and only two of the four points in the neighborhood contribute to the interpolation. This situation, where several simplices exist around a given query, becomes increasingly common in high dimensions. A kNN-based estimator in such a setting can possibly make use of all four data points, provided $k$ is set to 4, but requires singular neighborhood weights, i.e., when the query coincides with a neighbor, the neighborhood weights must be such that only that neighbor is used for estimation. In contrast, an NNK-based neighborhood estimator formulates the interpolation using convex polytope structures, generalizing simplicial interpolation, where the shape of the polytope (the number of neighbors) is determined adaptively based on the observed data positions. Note that a kNN estimator is generally not interpolative on training data (unless singularly weighted or $k = 1$) despite having access to all the training data points, while an NNK-based estimator is interpolative irrespective of the choice of the kernel and $k$. In summary, NNK neighborhood-based estimation combines some of the best features of existing methods, providing a geometric interpretation and performance guarantees similar to simplicial interpolation [162] while having a complexity comparable to that of kNN-based schemes.

3.3.1 Local Polytope Interpolation

NNK starts with $k$ nearest neighbors. However, instead of directly using these points for estimation, it optimizes and reweights the neighbors such that only a selected set of points is identified as neighbors. This section first shows that NNK neighborhood-based estimation is interpolative (Proposition 3.3). Then, similar to the simplicial interpolation of [162], we use the local geometry of NNK to obtain theoretical guarantees on the generalization of the NNK estimator for regression and classification (Theorem 3.3).

Proposition 3.3. Given a query $x_q$ and a dataset $\{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, the following NNK estimate at $x_q$ is a valid interpolation function:

\hat{\eta}(x_q) = \mathbb{E}(Y \mid X = x_q) = \frac{\sum_{i \in \mathrm{NNK}_{\mathrm{poly}}(x_q)} \theta_i y_i}{\sum_{j \in \mathrm{NNK}_{\mathrm{poly}}(x_q)} \theta_j}    (3.14)

where $\mathrm{NNK}_{\mathrm{poly}}(x_q)$ is the polytope formed by the $\hat{k}$ neighbors identified by NNK and $\theta$ denotes the $\hat{k}$-length vector of non-zero values obtained from the solution to the NNK objective (3.15), namely,

\theta = \arg\min_{\theta_S \geq 0} \|\phi(x_q) - \Phi_S \theta_S\|^2 = \arg\min_{\theta_S \geq 0} \; K_{q,q} - 2\theta_S^\top K_{S,q} + \theta_S^\top K_{S,S} \theta_S    (3.15)

where $\Phi_S = [\phi(x_1) \dots \phi(x_k)]$ corresponds to the kernel space representation of an approximate set of neighbors (say, kNN) of $x_q$, and $K_{S,q}$ corresponds to the kernel evaluated between the neighbors (set $S$) and $x_q$.

Proof. See Section 3.5.5.

The NNK interpolator of (3.14) can be adapted to classification scenarios ($y \in \{0, 1\}$) using a plug-in classifier with indicator function $\mathbb{I}$, namely, $\mathbb{I}(\hat{\eta}(x_q) > 0.5)$. One consequence of Proposition 3.3 is that, unlike earlier kNN-based algorithms such as wiNN [14,162] and DkNN [164,165] that rely on hyperparameters such as $k$ and $\epsilon$, which directly impact explainability and confidence characterizations, our approach adapts to the local data manifold by identifying a robust set of $\hat{k}$ training instances that most influence an estimate. Moreover, kNN-based interpolators, such as wiNN [162], rely on a suitable choice of the kernel to be interpolative. In contrast, NNK-based estimation is interpolative for all choices of kernels.
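To make Proposition 3.3 concrete, the following is a minimal Python sketch of the NNK estimate, assuming that the kernel values between the query and its $k$ approximate neighbors ($K_{S,q}$), the kernel matrix among those neighbors ($K_{S,S}$), and the neighbor labels are precomputed. It solves (3.15) by rewriting the quadratic as a standard non-negative least squares problem; the thesis uses its own NNK solver, so this generic NNLS route is only illustrative.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.optimize import nnls

def nnk_interpolate(K_Sq, K_SS, y_S, reg=1e-10, tol=1e-8):
    """Sketch of the NNK estimate (3.14)-(3.15) for a single query."""
    # (3.15): min_{theta >= 0} K_qq - 2 theta^T K_Sq + theta^T K_SS theta.
    # With K_SS = L L^T (Cholesky), this equals ||L^T theta - L^{-1} K_Sq||^2 + const,
    # which is a standard non-negative least squares problem.
    k = len(K_Sq)
    L = cholesky(K_SS + reg * np.eye(k), lower=True)
    b = solve_triangular(L, K_Sq, lower=True)
    theta, _ = nnls(L.T, b)
    mask = theta > tol                     # non-zero weights define NNK_poly(x_q)
    w = theta[mask] / theta[mask].sum()    # normalized weights as in (3.14)
    return float(w @ np.asarray(y_S)[mask])
```

The non-zero entries of the solution identify the polytope neighbors, and normalizing them reproduces the weighted average of (3.14).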
3.3.2 Theoretical analysis of NNK interpolation

3.3.2.1 A general bound on the NNK classifier

We study the generalization risk associated with the NNK estimator of (3.14) under a general smoothness assumption. Our analysis follows a setup and proof style similar to the simplicial interpolation analysis in [162], but adapted to NNK interpolation. Note that simplicial interpolation [162] is impractical for high-dimensional data, a typical setting in modern machine learning, while a simpler method such as kNN does not have the geometric properties required for our analysis. We first study NNK in a regression setting and then extend the results to classification.

Let $\mathcal{D}_{\mathrm{train}} = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$ be the training data made available to NNK. Further, assume each $y_i$ is corrupted by independent noise and hence can deviate from the Bayes optimal estimate $\eta(x_i)$. Note that our result does not make specific assumptions about $y$ and holds for any signal (class label, cluster, or set membership) associated with each data point $x$. In a regression setting, the generalization error of a function $\hat{\eta}$ is given by the mean squared error, i.e., $R_{\mathrm{gen}}(\hat{\eta}) = \mathbb{E}[(\hat{\eta}(x) - y)^2]$. Statistically, the Bayes estimate corresponding to the conditional mean $\eta(x)$ is the optimal predictor and bounds other estimators as $\mathbb{E}[R(\hat{\eta}, x) - R(\eta, x)] \leq \mathbb{E}[(\hat{\eta}(x) - \eta(x))^2]$. In Theorem 3.3, we present a data-dependent bound on the excess risk of NNK as compared to the Bayes estimator in a non-asymptotic setting.

Theorem 3.3. For the conditional mean estimate $\hat{\eta}(x)$ obtained using unbiased NNK interpolation given training data $\mathcal{D}_{\mathrm{train}} = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$ in $\mathbb{R}^d \times [0,1]$, the excess mean squared risk satisfies

\mathbb{E}[(\hat{\eta}(x) - \eta(x))^2 \mid \mathcal{D}_{\mathrm{train}}] \leq \mathbb{E}[\mu(\mathbb{R}^d \setminus \mathcal{C})] + A^2 \mathbb{E}[\delta^{2\alpha}] + 2A' \mathbb{E}\!\left[\frac{\delta^{\alpha'}}{\hat{k} + 1}\right] + 2\,\mathbb{E}\!\left[\frac{(Y - \eta(x))^2}{\hat{k} + 1}\right]    (3.16)

under the following assumptions:
1. $\mu$ is the marginal distribution of $X \in \mathbb{R}^d$ and $\mathcal{C} = \mathrm{Hull}(\phi(x_1), \phi(x_2), \dots, \phi(x_N))$ is the convex hull of the training data in the transformed kernel space.
2. The conditional distribution $\eta$ is Hölder $(A, \alpha)$ smooth in kernel space.
3. The conditional variance $\mathrm{var}(Y \mid X = x)$ satisfies an $(A', \alpha')$ smoothness condition.
4. The maximum diameter of the polytope formed with NNK neighbors for any data point in $\mathcal{C}$ is denoted $\delta = \max_{x \in \mathcal{C}} \mathrm{diam}(\mathrm{NNK}_{\mathrm{poly}}(x))$, where $\mathrm{NNK}_{\mathrm{poly}}(x)$ is the convex polytope around $x$ formed by the $\hat{k}$ neighbors identified by NNK.

Proof. See Section 3.5.6.

Remark 3.1. The first term in the bound corresponds to extrapolation, where the test data falls outside the interpolation area (i.e., outside of the convex hull of points, $\mathcal{C}$), while the last term corresponds to label noise. The remaining terms capture the dependence of the interpolation on the size of the polytope obtained from test data and on the smoothness of the $y_i$'s over this region (i.e., within each polytope). Note that a smaller $\delta$, arising when test samples are covered by smaller polytopes, leads to risk closer to optimal. This is important because NNK leads to a polytope having the smallest diameter or volume among all polytopes obtained with exactly $\hat{k}$ points chosen from the $k$ nearest neighbors (as guaranteed by the conditions of (2.15)); from the theorem, this corresponds to a better risk bound. The bound associated with simplicial interpolation is a special case, where each simplex enclosing the data point is a fixed-size polytope containing $d + 1$ vertices. Thus, in our approach, the number of points (neighbors) forming the polytope is variable (dependent on local data topology).
In contrast, in the simplicial case, it is fixed and depends on the dimension of the space, i.e., $\hat{k} = d + 1$. Though the simplicial bound seems better for high-dimensional datasets, the diameter of the simplex increases exponentially with dimension $d$, making the excess risk sub-optimal compared to NNK: the ratio of diameter to dimension is smaller for NNK.

Corollary 3.3.1. Under the additional assumption that the support of $\mu$ belongs to a convex and bounded region of $\mathbb{R}^d$, the excess mean squared risk is asymptotically upper-bounded:

\limsup_{N \to \infty} \mathbb{E}[(\hat{\eta}(x) - \eta(x))^2] \leq \mathbb{E}[(Y - \eta(x))^2].    (3.17)

Proof. See Section 3.5.6.1.

Remark 3.2. The asymptotic excess risk of the NNK interpolation method in the regression setting is bounded by the Bayes risk, similar to the 1-nearest neighbor method. Note that the rate of convergence of the NNK estimator to this asymptotic bound depends on the kernel function used to define the neighborhood: how similar two data points need to be for them to be indistinguishable depends on the parameters chosen for the kernel.

Now, we focus on a binary classification setting, where the domain of $Y$ is reduced to $\{0, 1\}$. The risk associated with a classifier $\hat{f}$ is defined as

R_{\mathrm{gen}}(\hat{f}) = \mathbb{E}[P(\hat{f}(x) \neq y)].    (3.18)

Similar to regression, this risk can be related to that of the Bayes optimal classifier $f^* = \mathbb{I}(P(Y = 1 \mid X = x) > 0.5)$ as

\mathbb{E}[R(\hat{\eta}, x) - R(f^*(x), x)] \leq \mathbb{E}[P(\hat{f}(x) \neq f^*(x))].    (3.19)

Corollary 3.3.2 presents a bound on the excess risk associated with the plug-in NNK classifier $\hat{f}(x) = \mathbb{I}(\hat{\eta}(x) > 0.5)$ for $\mathcal{D}_{\mathrm{train}} = \{(x_1, y_1), \dots, (x_N, y_N)\}$ in $\mathbb{R}^d \times \{0, 1\}$, using Corollary 3.3.1 and the relationship between classification risk and regression risk [14].

Corollary 3.3.2. A plug-in NNK classifier under the assumptions of Corollary 3.3.1 has excess classification risk bounded as

\limsup_{N \to \infty} \mathbb{E}[R(\hat{f}(x)) - R(f^*(x))] \leq 2\sqrt{\mathbb{E}[(Y - \eta(x))^2]}    (3.20)

Proof. See Section 3.5.6.2.

Remark 3.3. The classification bound presented here makes no assumptions on the margin associated with the classification boundary and is thus only a weak bound. The bound can be improved exponentially, as in [162], under stronger assumptions such as the h-hard margin boundary condition [166].

3.3.2.2 Leave-one-out stability

The leave-one-out (LOO) procedure (also known as the deleted estimate or U-method) is an important statistical measure with a long history in machine learning [167]. Unlike the empirical error, it is almost unbiased [168] and is often used for model (hyperparameter) selection. The LOO error associated with NNK interpolation is given by

R_{\mathrm{loo}}(\hat{\eta} \mid \mathcal{D}_{\mathrm{train}}) = \frac{1}{N} \sum_{i=1}^{N} l(\hat{\eta}(x_i) \mid \mathcal{D}^i_{\mathrm{train}}, y_i),    (3.21)

where, for $x_i$, the NNK interpolation estimator is based on all training points except $x_i$. We focus our attention on LOO in the context of model stability and generalization as defined in [13,167].

Definition 3.1. A symmetric [2] learning algorithm $\hat{\eta}$ is $\beta(\hat{\eta}, \mathcal{D})$ stable when

\mathbb{E}_{\mathcal{D}}\left[\,|l(\hat{\eta}(x) \mid \mathcal{D}, y) - l(\hat{\eta}(x) \mid \mathcal{D}^i, y)|\,\right] \leq \beta(\hat{\eta}, \mathcal{D}).    (3.22)

[2] A symmetric algorithm is one whose outcome is independent of the order of the training data.

This stability definition allows us to relate the generalization error to that of the LOO estimate by

|\mathbb{E}_{\mathcal{D}}(R_{\mathrm{gen}}(\hat{\eta} \mid \mathcal{D}) - R_{\mathrm{loo}}(\hat{\eta} \mid \mathcal{D}))| \leq \beta(\hat{\eta}, \mathcal{D}).    (3.23)

In a more stable learning system, the LOO estimate will be closer to the generalization performance. In Theorem 3.4, we bound the $\beta(\hat{\eta}, \mathcal{D})$ of NNK estimators using earlier theoretical results by Rogers, Devroye, and Wagner [169,170]. In particular, in our method, the number of
neighbors $\hat{k}$ around each data point depends on the local distribution of the data and replaces the fixed $k$ in earlier results by an expected value, $\mathbb{E}[\hat{k}]$.

Theorem 3.4. The leave-one-out performance of the NNK interpolation classifier, given $\gamma$, the maximum number of distinct points in a given discrete set that can have the same nearest neighbor, is bounded as

P(|R_{\mathrm{loo}}(\hat{\eta} \mid \mathcal{D}_{\mathrm{train}}) - R_{\mathrm{gen}}(\hat{\eta})| > \epsilon) \leq 2e^{-N\epsilon^2/18} + 6\,\mathbb{E}\!\left[e^{-N\epsilon^3/(108\hat{k}(2+\gamma))}\right]

Proof. See Section 3.5.7.

Remark 3.4. The value of $\gamma$ in our LOO bound is related to the packing number of the space and depends on the dimension of the space where the data is embedded. The exact evaluation of $\gamma$ is difficult in practice, but bounds exist for this measure in the sphere covering literature [171,172]. The theorem allows us to relate the LOO risk of a model to the generalization error. Unlike the bound based on the hyperparameter $k$ in kNN methods, the bound for NNK is adaptive to the training data, capturing the distribution characteristics of the dataset.

More practically, to characterize the smoothness in a classification problem, we introduce the variation or spread in the LOO interpolation score of the training dataset as

\nabla(x_i) = \frac{1}{\hat{k}} \sum_{j=1}^{\hat{k}} \left(\hat{\eta}(x_i \mid \mathcal{D}^i_{\mathrm{train}}) - \hat{\eta}(x_j \mid \mathcal{D}^j_{\mathrm{train}})\right)^2,    (3.24)

where $\hat{k}$ is the number of non-zero weighted neighbors identified by NNK and $\hat{\eta}(x)$ is the unbiased NNK interpolation estimate of equation (3.14). A smooth interpolation region will have a variation $\nabla(x)$ close to zero, while a spread close to one corresponds to a noisy classification region.

3.3.3 DeepNNK: Neural Networks + NNK Interpolation

The modern view of interpolation suggests that overfitting and generalization can happen simultaneously. This view has spurred renewed interest in estimators with more parameters than the dataset used to train them [173,174]. However, there has been limited discussion of interpolative estimators integrated with a complete neural network, due in part to their complexity. For instance, for the interpolative methods proposed by [162], working with $(d+1)$-simplices would be impractical in terms of complexity in data spaces with high dimension $d$, such as those encountered in neural networks. In contrast, a simpler method such as kNN does not have the same geometric properties (which allow for better theoretical analysis) or robustness as the simplex approach [3].

In the following subsections, we integrate our NNK interpolator with deep neural networks to revisit several machine learning problems. For clarity, we briefly group nearest neighbor methods in machine learning into three categories based on the type of data (Labeled, UnLabeled) at the decision point $x$ and the associated neighbors forming the polytope $\mathrm{NNK}_{\mathrm{poly}}(x)$, summarized in Table 3.3. We provide experimental results demonstrating the advantages of NNK over kNN and other methods in each setting, with representative tasks from each [4]. Through these experiments, we aim to show that simple local methods can achieve good results, and, with additional training or parameter tuning, can be made competitive with the state-of-the-art.

[3] We refer to a robust estimator as one whose properties are minimally affected by the values of its chosen hyperparameters.
[4] Source code for all experiments is available at github.com/STAC-USC

We perform NNK interpolation on features of the data obtained at the embedding space of a deep neural network (DNN), while relying on self-supervised or supervised loss functions for training the network.
This strategy of using a different estimator at the final layer is not uncommon in deep learning [175–177] and is motivated by the intuition that each layer of a neural network corresponds to an abstract transformation of the input data space optimized for the task at hand. Note that, unlike the explicit parametric boundaries defined in deep neural networks trained for classification, local interpolation methods produce boundaries that are implicit, i.e., based on the relative positions of the training data in a transformed space. In other words, the proposed DeepNNK procedure allows us to characterize the deep network by the output classification space rather than relying on a global boundary defined on that space.

Data - Neighbor | Applications                                                               | Reference
L - L           | Generalization, robustness analysis, model selection, curriculum learning | 3.3.2.1, 3.3.2.2, 3.3.3.1
UL - L          | Semi-supervised learning, transductive learning, explainable predictions  | 3.3.3.2, 3.3.3.3, 3.3.3.4
UL - UL         | Clustering, two-sample statistics, distance between datasets              | 3.3.3.5

Table 3.3: Overview of local neighborhood methods based on the availability of labels (Labeled, UnLabeled) at the data point and its corresponding neighbors, with a few applications in each category. The last column links to the relevant material in our work corresponding to the setting of each group.

Similar to [178], we combine kernel definitions with neural networks to take advantage of the expressive power of neural networks. We first transform the input using the non-linear mapping $h_w$ corresponding to the neural network parameterized by $w$ and then apply the cosine kernel on the transformed input:

K_{i,j} = \frac{h_w(x_i)^\top h_w(x_j)}{\|h_w(x_i)\| \, \|h_w(x_j)\|}.    (3.25)

Note that we assume the embeddings are obtained after a ReLU activation, and thus the cosine similarity takes values in $[0, 1]$.
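A minimal sketch of how the kernel in (3.25) can be computed over a batch of embeddings, assuming the embeddings have already been extracted from the network (the `embeddings` array below stands in for $h_w(x)$):

```python
import numpy as np

def cosine_kernel(embeddings, eps=1e-12):
    """Cosine kernel of (3.25) over a batch of DNN embeddings (one row per sample)."""
    Z = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + eps)
    K = Z @ Z.T
    # With post-ReLU (non-negative) embeddings the similarity lies in [0, 1];
    # clipping only guards against floating-point round-off.
    return np.clip(K, 0.0, 1.0)
```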
3.3.3.1 Predicting generalization (L - L)

In this experiment, we evaluate DeepNNK for predicting the generalization capability of a model using LOO. We then show that this can be used for model selection, where one can choose a model with better generalization.

Experimental Setup: We consider a simple 7-layer network comprising 4 convolution layers with ReLU activations, 2 max-pool layers, and 1 fully connected softmax layer to demonstrate model selection. We evaluate the test data performance and the LOO stability on training data for the proposed method and compare it to weighted kNN (wiNN) for different $k$ and to a cross-validated (5-fold) linear SVM [5] for three network settings: (i) Regularized: we use 32 filters for each convolution layer with dropout (probability 0.1) at each convolution layer, and the data is augmented with random horizontal flips. (ii) Underparameterized: we keep the regularized model structure but reduce the number of filters to 16, equivalently halving the number of parameters of the model. (iii) Overfit: to simulate overfitting, we remove data augmentation and dropout regularization while keeping the network structure as in the regularized model. We use the CIFAR-10 dataset (50000 training and 10000 test images of size 32×32) to validate our analysis and intuitions. All models in the experiment are trained for 20 epochs using stochastic gradient descent with momentum (0.9) and learning rate 1e-3. We do not perform any hyperparameter tuning, and the choice of $k$ and of 5-fold cross-validation in the experiment is made arbitrarily [6] to illustrate the behavior of each approach.

[5] Similar to the neighborhood methods, the last layer is replaced and trained for each fold using a LIBLINEAR SVM [179] with minimal $\ell_2$ regularization. We use the default library settings for the other parameters of the SVM.
[6] We observe similar behavioral trends in performance for other choices.

Figure 3.11 shows the difference in performance between our method and weighted kNN (wiNN). In particular, while DeepNNK improves marginally with larger values of $k$, the performance of the wiNN approach degrades for larger $k$. This can be explained by the fact that NNK accommodates new neighbors only if they belong to a new direction in space that improves the approximation of the query, unlike its kNN counterparts, which naively estimate based on all $k$ neighbors, irrespective of the choice of $k$. On the other hand, the cross-validated linear SVM model achieved sub-par performance in all network settings, with an error that follows the same trend as the deep learning model. This suggests that the linear SVM cannot capture the complexity of the labels associated with input data representations in the deeper layers.

Figure 3.11: Misclassification error ($\xi$) using the fully connected softmax classifier model and interpolating classifiers (weighted kNN, NNK) for different values of the $k$ parameter at each training epoch on CIFAR-10. The training data (Top) and test data (Bottom) performance for the three model settings, (a) Underparameterized, (b) Regularized, and (c) Overfit, is shown in each column. NNK classification consistently performs as well as the actual model, with classification error decreasing slightly as $k$ increases. On the contrary, the weighted kNN model error increases for increasing $k$, showing robustness issues. The classification error gap between the DNN model and the leave-one-out NNK interpolator on training data is suggestive of underfitting ($\xi_{NNK} < \xi_{model}$) and overfitting ($\xi_{NNK} > \xi_{model}$). We claim that models are good when their performance on the training data agrees with that of the local NNK model.

Figure 3.12: Histogram (normalized) of the leave-one-out interpolation score (3.24) after 20 epochs with $k = 50$ on CIFAR-10 for the underfit, regularized, and overfit models. While the network performance on the training dataset is considerably different in each setting, we see that the change in the interpolation (classification) landscape associated with the input data is minimal, and, consequently, all networks have a similar test dataset performance. However, the interpolation score spread is shifted towards zero in the regularized model, indicating relatively better generalization or classification performance.
In contrast, while the NNK method performs on par with, if not better than, the deep learning model on the test dataset, its LOO performance on training data is a good indicator of the model’s generalization. One can clearly identify that a model is well regularized and stable by observing the performance obtained with the deep learning model and the LOO estimate using NNK interpolation – both models achieve similar performance. This observation also suggests that alternative classifiers, such as our proposed DeepNNK interpolator, canboostthemodel’sperformanceduringinferenceevenwhennotusedduring the model’s training. In Figure 3.12, we observe that the histogram of interpolation score spread (3.24) for the regularized model is shifted towards zero relative to the under-parameterized and overfit models. Thisisindicativeofthemodel’sgeneralization. Aminimalshiftintheinterpolation score spread is expected as the difference in test error associated with each model is also small. 94 3.3.3.2 A simple few shot framework (UL − L) In few-shot learning (FSL), one is given a set of base data D base = {(x 1 ,y 1 ),...(x N ,y N )} wherey i ∈{1,2,...C base } and a support dataD sup forC novel classes witha small number, m (e.g., m = 1 or 5), of training examples for each class. Few-shot learning aims to construct a model that can perform well on the C novel classes. This setup is called m-shot C novel - way learning system. Due to the limited availability of data in the novel classes, an FSL model needs to exploit the base dataset for training so that it can be successful in transferring to the novel class with good classification performance. One approach to the problem adapted byseveralFSLsystems[3,4,180]istotrainamodelwithD base forfeatureextractionfollowed by a 1-nearest neighbor classification, or other simple linear classifiers, on features extracted onD sup . Inthiswork,westudyasimplefew-shotlearnerbasedonlocalNNKinterpolation,where each unlabeled data point is classified using labeled data neighbors obtained using a deep feature extractor trained on the base dataset. We focus on the transductive FSL setting where unlabeled test data is available during model construction. We iteratively refine the predictions on the unlabeled test data by selecting for each point a pseudo label, i.e., the label for which the prediction has the most confidence, and using these pseudo labels as additional support data. Note that this process does not involve expensive training of the neural network model or fine-tuning of additional parameters to improve performance. Al- gorithm 6 describes the proposed method for transductive classification of test data queries X Q using NNK. For comparison, we also evaluate the proposed algorithm by replacing the NNK interpolation classifier in the framework with a locally weighted kNN classifier. The proposed FSL framework can be adapted to a semi-supervised inductive classifier setting by pseudo-labeling the available unlabeled data (instead of the test data as done for augmenta- tion in the transductive case) followed by the classification of queries using this augmented support dataset. 95 Algorithm 6: NNK Transductive Few Shot Learning Input : Neural Networkh w , DatasetsD base ,D sup , Test queriesX Q , No. 
Algorithm 6: NNK Transductive Few-Shot Learning
Input: Neural network $h_w$, datasets $\mathcal{D}_{\mathrm{base}}$, $\mathcal{D}_{\mathrm{sup}}$, test queries $X_Q$, number of neighbors $k$
Train $h_w$ using $\mathcal{D}_{\mathrm{base}}$
while $X_Q$ is not empty do
    for $x_i \in X_Q$ do
        /* $N_{x_i}$: neighbors of $h_w(x_i)$ in $\mathcal{D}_{\mathrm{sup}}$ */
        $y_i$ = label of $x_i$ using $N_{x_i}$ in the NNK interpolator (3.14)
    end
    /* Pseudo-label confident predictions in each class: $(X^*_Q, Y^*_Q)$ */
    $\mathcal{D}_{\mathrm{sup}} = \mathcal{D}_{\mathrm{sup}} \cup (X^*_Q, Y^*_Q)$
    $X_Q = X_Q - X^*_Q$
end
Output: Class predictions $Y_Q$

Experiment Setup: We apply our proposed FSL framework on two standard benchmark datasets, mini-ImageNet [180] and tiered-ImageNet [181], which are subsets of the ImageNet [182] dataset with 100 and 608 classes, respectively. All images are resized to 84×84 via rescaling and cropping. For few-shot evaluation, we follow a setting similar to [3–7], where we draw random samples for 1- and 5-shot 5-way tasks: each task has 5 novel classes with 1 or 5 labeled (support) data points for each class. The model is then tested on 15 queries per class. We use a wide residual network architecture [183] as our model backbone, with 28 convolutional layers and a widening factor of 10. We do not perform any hyperparameter search and restrict ourselves to the settings from [3] for training the model using $\mathcal{D}_{\mathrm{base}}$. The network is trained in batches of 256 for 90 epochs with data augmentation from [105] and an initial learning rate of 0.1, which is reduced by a factor of 10 at fixed schedules. We perform early stopping using 1-nearest neighbor classification on a randomly sampled set of 5 validation classes. For both the NNK and weighted kNN classifiers, we set a maximum $k$ value and use the entire support set when the number of examples in $\mathcal{D}_{\mathrm{sup}}$ is smaller than $k$.

Table 3.4 presents our results using local neighborhood-based FSL compared to recent FSL learners in the literature. We do not compare to approaches that are semi-supervised (extra unlabeled data per class in $\mathcal{D}_{\mathrm{sup}}$) or perform data augmentation, as such approaches make use of additional data statistics or induce specific biases through augmentation. Further, we note that prior methods report results with various network architectures; to eliminate the effect of the network backbone in FSL models, we compare our framework only with models using the Wide-ResNet-28-10 backbone.

Method            | mini-ImageNet 1-shot | mini-ImageNet 5-shot | tiered-ImageNet 1-shot | tiered-ImageNet 5-shot
SimpleShot [3]    | 63.50 | 80.33 | 69.75 | 85.31
kNN (k=5)         | 74.73 | 81.29 | 76.39 | 84.32
NNK (k=5)         | 73.25 | 80.88 | 79.86 | 86.42
kNN (k=20)        | 66.67 | 76.83 | 70.19 | 79.21
NNK (k=20)        | 74.44 | 85.09 | 80.64 | 88.41
kNN (k=50)        | 51.59 | 63.43 | 55.36 | 65.92
NNK (k=50)        | 74.99 | 85.05 | 80.73 | 88.61
Requiring extra training / hyperparameter tuning:
Fine-tuning [4]   | 65.73 | 78.40 | 73.34 | 85.50
EPNet [5]         | 70.74 | 84.34 | 78.50 | 88.36
LaplacianShot [6] | 74.86 | 84.13 | 80.18 | 87.56
PT+MAP [7]        | 82.92 | 88.82 | 85.41 | 90.44

Table 3.4: 1-shot and 5-shot accuracy (in %, higher is better) for 5-way classification on mini-ImageNet and tiered-ImageNet, averaged over 600 runs. Results from kNN and NNK transductive classification are compared to the inductive method SimpleShot (CL2N) [3] and to the listed performances of recently studied transductive methods [4–7]. We see that NNK outperforms kNN as $k$ increases while achieving robust performance. Further, our simple framework is comparable to, and often better than, recent and more complex transductive FSL algorithms that require additional training, fine-tuning of hyperparameters, or preprocessing as in [7].
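A minimal sketch of the transductive loop in Algorithm 6, assuming features have already been extracted with $h_w$ and that an NNK label-interpolation routine (such as the `nnk_interpolate` sketch shown earlier, here wrapped as a hypothetical `nnk_class_scores` returning per-class scores) is available. The per-class most-confident selection rule below is an illustrative choice, not necessarily the exact criterion used in the thesis experiments.

```python
import numpy as np

def transductive_nnk_fsl(support_feats, support_labels, query_feats,
                         nnk_class_scores, num_classes, per_class=1):
    """Sketch of Algorithm 6: iteratively pseudo-label the most confident queries.

    nnk_class_scores(S_feats, S_labels, Q_feats) -> (n_queries, num_classes) array of
    NNK interpolation scores (3.14), one column per class.
    """
    S_x, S_y = support_feats.copy(), support_labels.copy()
    remaining = np.arange(len(query_feats))
    predictions = np.full(len(query_feats), -1)

    while len(remaining) > 0:
        scores = nnk_class_scores(S_x, S_y, query_feats[remaining])
        preds, conf = scores.argmax(axis=1), scores.max(axis=1)
        # Pick the most confident prediction(s) per class and add them as pseudo-labeled support.
        chosen = []
        for c in range(num_classes):
            idx = np.where(preds == c)[0]
            if len(idx) > 0:
                chosen.extend(idx[np.argsort(conf[idx])[-per_class:]])
        chosen = np.array(sorted(set(chosen)))
        predictions[remaining[chosen]] = preds[chosen]
        S_x = np.vstack([S_x, query_feats[remaining[chosen]]])
        S_y = np.concatenate([S_y, preds[chosen]])
        remaining = np.delete(remaining, chosen)
    return predictions
```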
3.3.3.3 Instance-based explainability (UL - L)

Figure 3.13: Two test examples (the first image in each set) with their identified NNK neighbors from CIFAR-10 for $k = 50$. We show the assigned and predicted label for the test sample, and the assigned label and NNK weight for the neighboring (and influential) training instances. Though we were able to identify the correct label for the test sample, one might want to question such duplicates in the dataset for downstream applications.

We next present a few interpretability results, showing our framework's ability to capture training instances that are influential in a prediction. Neighbors selected from the training data for interpolation by DeepNNK can be used as examples to explain the neural network decision. This interpretability can be crucial in problems with transparency requirements, since it allows an observer to interpret the region around a test representation as a possible explanation of the decision made by the model.

In Figure 3.13, we show examples in the training dataset that are responsible for a prediction using the simple regularized model defined previously. Machine models may contain biases if the datasets used for training contain biases, such as repeated instances of data points with small perturbations, often introduced to obtain class-balanced datasets. These repeated instances are often undesirable for applications where fairness is important. The DeepNNK framework can help understand and eliminate sources of bias by allowing practitioners to identify some of these limitations, such as duplicate training instances and spurious correlations, in their model. Figure 3.14 shows another application of NNK, where the fragile nature of a model on certain training images is brought to light using the interpolation spread of equation (3.24). These experiments show the potential of the DeepNNK framework to be used as a debugging tool in deep learning.

Figure 3.14: Two training set examples (the first image in each set) observed to have maximum classification error in the LOO NNK interpolation score, and their respective neighbors, for $k = 50$. We show the assigned and predicted label for the image being classified, and the assigned label and NNK weight for the neighbors. These instances exemplify the possible brittleness of the classification model, which can better inform a user about the limits of the trained model.

Figure 3.15: Selected black-box adversarial examples (the first image in each set) and their NNK neighbors from the CIFAR-10 training dataset with $k = 50$. Though changes in the input image are imperceptible to a human eye, one can better characterize a prediction using NNK by observing the interpolation region of the test instance.
Finally, we present an experimental analysis of generative and adversarial images from the perspective of NNK interpolation. We study these methods using our DeepNNK framework applied to a Wide-ResNet-28-10 [183] architecture trained with AutoAugment [184] [7].

[7] DeepNNK achieves 97.3% test accuracy on CIFAR-10, similar to that of the original network.

Generative and adversarial examples leverage pockets in the classification space where a model (a discriminator in the case of generative images, or a classifier in the case of black-box attacks) is influenced by a smaller number of neighboring points. This is made evident in Figure 3.16, where we see that the number of neighbors in the case of generative and adversarial images is, on average, smaller than that of real images. We conjecture that this is a property of the interpolative nature of deep learning models, where realistic images can be obtained in neighborhoods with compact support, and where perturbations along extrapolating or mislabeled sample directions lead to adversarial images. Though the adversarial perturbations in the input image space are visually indistinguishable, the change in the embedding of the adversarial image in the classification space is significantly larger, in some cases, as in Figure 3.15, belonging to regions completely different from its class.

Figure 3.16: Histogram (normalized) of the number of neighbors for (a) generated images [9], (b) black-box adversarial images [10], and actual CIFAR-10 images. We see that generated and adversarial images on average have fewer neighbors than real images, suggesting that these examples often fall in interpolating regions where few training images span the space. An adversarial image is born when these areas of interpolation belong to unstable regions of the classification surface.

3.3.3.4 Evaluating self-supervised learning models (UL - L)

Standard protocols for benchmarking self-supervised learning models involve using a linear probe or weighted kNN on features extracted from the learned model. However, both evaluations are sensitive to hyperparameters, complicating evaluation and comparison. For example, in linear evaluations, one often applies selected augmentations to the input to train the linear classifier on top of the feature extractor, in addition to the hyperparameters used for training the classifier. A weighted kNN classifier, on the other hand, avoids data augmentations but still suffers from the selection of the hyperparameter $k$. One observes a large variance in accuracy in both of these evaluation frameworks depending on the hyperparameters/augmentations chosen, making the comparison of feature extractors ambiguous. Ideally, one would want an evaluation protocol that does not require augmentations or hyperparameter tuning and can be run quickly, given the features of the data.

Method       | Linear | kNN  | NNK
Supervised   | 76.1   | 74.5 | 75.4
MoCov2 [185] | 71.1   | 60.3 | 64.9
DCv2 [186]   | 75.2   | 65.8 | 70.7
SwAV [186]   | 74.3   | 63.2 | 68.7
DINO [11]    | 75.3   | 65.6 | 71.1

Table 3.5: Top-1 Linear, weighted kNN, and NNK classification accuracy on ImageNet. We report performance on the validation dataset with ResNet-50 models trained using different self-supervised training strategies. The kNN and NNK evaluations were done in the VISSL framework using officially released model weights with the number of neighbors k = 50. We do not run linear evaluations, but list the performance as reported in the corresponding works for comparison.
Code for the experiment is available at github.com/shekkizh.

Experiment Setup: Table 3.5 lists the Top-1 Linear, weighted kNN, and NNK classification accuracy on ImageNet [182] using a fixed backbone architecture, ResNet-50 [105], trained using different self-supervised training strategies. The evaluation protocol follows a standard setup, where one evaluates performance on the validation dataset based on the labeled training dataset. The kNN and NNK evaluations were done using the VISSL framework [187] with officially released model weights and setting the parameter $k$, the number of neighbors, to 50. We see that NNK consistently outperforms weighted kNN for all models, with performance comparable to linear classifiers. Note that linear classifiers in this setup require additional compute resources, fine-tuning, and data augmentations.

To evaluate the robustness of NNK relative to kNN, we evaluated different values of $k$ using a recently introduced self-supervised model, DINO (distillation with no labels) [11], as our baseline and compared top-1 performance on ImageNet. As can be seen in Figure 3.17, NNK not only consistently outperforms a weighted kNN classifier but does so in a robust manner. Further, unlike kNN, whose performance decreases as $k$ increases, NNK improves with $k$. This can be explained by the fact that NNK accommodates new neighbors only if they belong to a new direction in space that improves the sparse approximation, whereas kNN interpolates with all $k$ neighbors. The NNK classifier in this setup achieves performance on par with, if not better than, the linear classifier model. The small ViT model achieves an ImageNet top-1 accuracy of 79.8%, the best performance by a non-parametric classifier in conjunction with self-supervised models when [11] was published.

Figure 3.17: Weighted kNN vs. the proposed NNK evaluation of self-supervised models from [11]. Self-supervised vision transformers and their distilled versions trained with patch sizes 8 (Left) and 16 (Right) are evaluated for different values of $k$.

Figure 3.18: (Left) CIFAR-10 and (Right) CIFAR-100: Wide-ResNet-28-10 model accuracy vs. NNK distance (3.26) between the clean dataset and five different noise levels of various corruptions (brightness, contrast, saturation, JPEG compression, elastic transform, Gaussian noise, impulse noise, defocus blur, motion blur, frost). The dashed line denotes the accuracy on the clean dataset, and the scatter point size corresponds to the standard deviation of the terms within the summation in each distance. We see that the proposed distance NNK($\mathcal{D}_{\mathrm{clean}} \mid \mathcal{D}_{\mathrm{corrupt}}$) is indicative of the model's ability to transfer, with performance decreasing as distance increases.

3.3.3.5 Distance between datasets (UL - UL)

The empirical performance of deep neural networks in transfer learning and domain adaptation settings has generated renewed interest in the field [188,189]. In these scenarios, a model is trained on a (possibly labeled) dataset and then applied directly to (or fine-tuned for) unseen new data with characteristics different from those of the initial training set. In this context, it would be useful to develop a practical tool to identify in advance when and if a model will transfer well to a particular dataset.
In this section, we introduce an asymmetric metric to characterize the distance between datasets as a first step toward capturing the likelihood of success in model transfer. The distance measure is label-independent and can be obtained for any two datasets (different modalities and domains), provided a kernel can be defined to quantify the similarity of samples across datasets. In words, the distance is the error in approximating points in one dataset using their NNK neighbors from another dataset. The asymmetric nature of our distance is justified by the fact that transfer from simple to difficult data is harder than the other way around.

Definition 3.2. Given dataset samples $\mathcal{D}_1 = \{x_1, x_2, \dots, x_M\}$ and $\mathcal{D}_2 = \{y_1, y_2, \dots, y_N\}$, the NNK distance between the datasets for a given kernel $K \in [0, 1]$ is defined as

\mathrm{NNK}(\mathcal{D}_1 \mid \mathcal{D}_2) = \frac{1}{M} \sum_{x_i \in \mathcal{D}_1} \min_{\theta_S \geq 0} \; K_{i,i} - 2\theta_S^\top K_{S,i} + \theta_S^\top K_{S,S} \theta_S,    (3.26)

where the set $S = \{y_{s_1}, \dots, y_{s_k}\}$ corresponds to the $k$ nearest neighbors of $x_i$ from $\mathcal{D}_2$, and $K_{S,i}$ corresponds to the kernel similarity evaluated between these neighbors and $x_i$.

Let $\phi_a(x_i)$ denote the approximation obtained with the NNK data interpolation of (3.15). Then, the minimization objective $J(x_i)$ associated with a data point $x_i$ in (3.26) can be rewritten as

J(x_i) = K_{i,i} - 2\theta_S^\top K_{S,i} + \theta_S^\top K_{S,S} \theta_S = \|\phi(x_i) - \Phi_S \theta_S\|^2 = \|\phi(x_i) - \phi_a(x_i)\|^2,    (3.27)

where the set $S$ corresponds to the set of neighbors from $\mathcal{D}_2$ used to estimate $x_i$ in $\mathcal{D}_1$. Thus, the NNK distance is the error between the actual observation and its approximation based on another set of samples. Intuitively, the average value of $J(x_i)$ captures the extent to which a secondary dataset fits the primary dataset.

Proposition 3.4. The NNK distance of Definition 3.2 is asymmetric, non-parametric, and satisfies
1. Positivity: $\mathrm{NNK}(\mathcal{D}_1 \mid \mathcal{D}_2) \geq 0$
2. Identity: $\mathrm{NNK}(\mathcal{D}_1 \mid \mathcal{D}_2) = 0 \iff \mathcal{D}_1 \subseteq \mathcal{D}_2$
3. Triangle inequality: $\mathrm{NNK}(\mathcal{D}_1 \mid \mathcal{D}_2) \leq \mathrm{NNK}(\mathcal{D}_1 \mid \mathcal{D}_3) + \mathrm{NNK}(\mathcal{D}_3 \mid \mathcal{D}_2)$

Proof. See Section 3.5.8.

Note that a similar definition of distance using kNN is not straightforward, given that $k$ and the weights in kNN are not explicitly chosen to minimize an approximation objective. For example, consider a kNN distance definition where we use the interpolation of (3.27), replacing the weights obtained from the NNK optimization with the relative similarity between $x_i$ and its kNN neighbors, i.e., $\theta_{\mathrm{kNN}} = K_{S,i} / (\mathbf{1}^\top K_{S,i})$:

\mathrm{kNN}(\mathcal{D}_1 \mid \mathcal{D}_2) = \frac{1}{M} \sum_{x_i \in \mathcal{D}_1} K_{i,i} - 2\theta_{\mathrm{kNN}}^\top K_{S,i} + \theta_{\mathrm{kNN}}^\top K_{S,S} \theta_{\mathrm{kNN}}.    (3.28)

$\mathrm{kNN}(\mathcal{D}_1 \mid \mathcal{D}_2)$ lacks basic properties of a distance. For example, it is easy to see that it grows as $k$ increases (since we add additional terms corresponding to points that are farther away). Consider the case where both datasets are the same: ideally, we would want the distance between them to be 0, but $\mathrm{kNN}(\mathcal{D}_1 \mid \mathcal{D}_2)$ is nonzero for all values of $k > 1$. This also hints that it may not be possible to design a suitable distance without optimizing the set of neighboring points and their weights (as is the case for NNK).
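A minimal sketch of how the distance in (3.26) could be computed, assuming precomputed feature matrices for the two datasets and a cosine kernel; it reuses the same NNLS reduction as the earlier NNK interpolation sketch and a brute-force nearest-neighbor search, whereas the thesis experiments rely on the NNK implementation and approximate neighbor-search tools.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.optimize import nnls

def nnk_dataset_distance(D1, D2, k=50, reg=1e-10):
    """NNK(D1 | D2) of (3.26): mean error approximating each x in D1 by NNK neighbors in D2."""
    Z1 = D1 / np.linalg.norm(D1, axis=1, keepdims=True)
    Z2 = D2 / np.linalg.norm(D2, axis=1, keepdims=True)
    K12 = Z1 @ Z2.T                              # cosine kernel between the two datasets
    errors = []
    for i in range(len(Z1)):
        S = np.argsort(-K12[i])[:k]              # k nearest neighbors of x_i in D2
        K_Si, K_SS = K12[i, S], Z2[S] @ Z2[S].T
        L = cholesky(K_SS + reg * np.eye(len(S)), lower=True)
        theta, _ = nnls(L.T, solve_triangular(L, K_Si, lower=True))
        # J(x_i) = K_ii - 2 theta^T K_Si + theta^T K_SS theta, with K_ii = 1 for the cosine kernel
        errors.append(1.0 - 2 * theta @ K_Si + theta @ K_SS @ theta)
    return float(np.mean(errors))
```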
Experiment Setup: We evaluate the proposed distance metric on the CIFAR-10 and CIFAR-100 datasets [190] and their corrupted variants from [191]. As in the FSL experiment, we use a Wide-ResNet-28-10 architecture to construct a deep learning model for each CIFAR dataset. The network is trained using only the training dataset, with data augmentation, in batches of 128 for 200 epochs with learning rate 0.1 and weight decay 5e-4, as in [184]. We use features extracted at the penultimate layer of the trained network to measure the NNK distance between the clean test dataset and various corrupted versions of the test dataset. We observe a power-law relationship between the proposed distance metric and the model performance on both datasets, with accuracy decreasing as the NNK distance increases.

Figure 3.18 shows the relationship between the defined NNK distance and the model's performance on the datasets. The distance allows us to understand the predictive performance of a pre-trained model on a new dataset (the corrupted test dataset $\mathcal{D}_{\mathrm{corrupt}}$), given its performance on the original dataset ($\mathcal{D}_{\mathrm{clean}}$). Such a measure could potentially improve the transfer learning process, where one can choose, from a pool of pre-trained models, the model most adaptive to the new dataset for fine-tuning.

3.4 Discussion and Open questions

We discussed theoretical and practical ideas, from image-graph representations and data summarization to interpolation and generalization in non-parametric estimation. Underlying these applications is a single common tool, a local polytope neighborhood definition, whose support is determined adaptively based on the input data. In particular, we investigated

1. an adaptation of the NNK graph framework for image representation in Section 3.1. The proposed approach leads to sparse and scalable image graphs with better energy compaction in their spectral bases than previously used window-based graph methods. Our experiments on benchmark images show that the NNK image graphs perform better, with an average 20% gain in SSIM on images with significant noise.

2. a geometric dictionary learning framework that can be used for data summarization applications. The proposed NNK-Means overcomes limitations that prevented the use of previous dictionary learning methods for summarization. NNK-Means learns atoms similar to kMeans centroids and leverages neighborhood search tools to perform sparse coding efficiently and adaptively represent data using the learned summary. Experiments demonstrated that our approach has runtimes similar to kMeans while learning dictionaries that can provide better discrimination than competing methods.

3. the generalization and stability of interpolative estimation using NNK neighborhoods. Further, we demonstrated NNK interpolation for model selection, few-shot learning, and other machine learning tasks in conjunction with deep neural networks. We showed that NNK offers competitive performance and provides a way to incorporate recent theoretical works on interpolation into practical settings. This approach is instrumental in characterizing deep networks from the point of view of the dataset they are trained on.

We now conclude with a few open questions and thoughts.

• NNK image graphs present a way to incorporate prior knowledge about the data to define local neighborhoods. This approach can be extended so that other image-filtering kernels, e.g., those used in non-local means, BM3D, and kernel regression, can be used in the graph definition. Moreover, the framework presents a useful way to think about image graphs via sparse representations, which could be considered in future research. For example, one can leverage progress in sparse approximation problems with corrupted data to define image graphs where the input image is corrupted or has missing pixel values.

• With NNK-Means (Section 3.2), we moved from defining graphs and neighborhoods based on the entire dataset to an adaptive, representative data subset.
This generalization of the NNK algorithm from Section 2.3 allows us to explore ideas of representation learning through neighborhoods and sparse representations via dictionaries. In the future, we plan to study the trade-offs associated with dictionary size and representation capacity, and the implications of this perspective for data-driven learning systems.

• In Section 3.3, we used LOO as a form of perturbation to understand the stability of a learning system. We believe other kinds of perturbation could help better understand neural networks statistically and numerically. Prototypes, sub-sampling methods, and approximation schemes are crucial to scaling neighborhood-based interpolation to datasets of size greater than $10^6$. As an open question, we leave the study of NNK-Means-based interpolation for a scalable geometric understanding of large neural networks and of the properties of the model that are affected during learning.

3.5 Proofs

3.5.1 Proof of Theorem 3.1

Proof. Denote $d_{i,j} = \|x_i - x_j\|_2$ and similarly $f_{i,j} = \|f_i - f_j\|_2$. Thus the bilateral filter weights can be rewritten as

K_{i,j} = \exp\!\left(-\frac{d_{i,j}^2}{2\sigma_d^2}\right)\exp\!\left(-\frac{f_{i,j}^2}{2\sigma_f^2}\right) = \exp\!\left(-\frac{d_{i,j}^2}{2\sigma_d^2} - \frac{f_{i,j}^2}{2\sigma_f^2}\right).    (3.29)

Now, the contraposition of the KRI theorem (2.15) gives a necessary and sufficient condition for an edge ($\theta_{i,k}$) to be disconnected:

\frac{K_{i,j}}{K_{i,k}} \geq \frac{1}{K_{j,k}}.    (3.30)

Substituting the bilateral weight kernel,

\frac{K_{i,j}}{K_{i,k}} = \exp\!\left(-\frac{d_{i,j}^2}{2\sigma_d^2} - \frac{f_{i,j}^2}{2\sigma_f^2}\right) \Big/ \exp\!\left(-\frac{d_{i,k}^2}{2\sigma_d^2} - \frac{f_{i,k}^2}{2\sigma_f^2}\right) = \exp\!\left(-\frac{d_{i,j}^2 - d_{i,k}^2}{2\sigma_d^2} - \frac{f_{i,j}^2 - f_{i,k}^2}{2\sigma_f^2}\right).    (3.31)

Thus, condition (3.30) simplifies to

\exp\!\left(-\frac{d_{i,j}^2 - d_{i,k}^2}{2\sigma_d^2} - \frac{f_{i,j}^2 - f_{i,k}^2}{2\sigma_f^2}\right) \geq \exp\!\left(\frac{d_{j,k}^2}{2\sigma_d^2} + \frac{f_{j,k}^2}{2\sigma_f^2}\right).

Taking logarithms on both sides and rearranging the terms corresponding to intensity and location, we obtain

-\frac{d_{i,j}^2 - d_{i,k}^2}{2\sigma_d^2} - \frac{f_{i,j}^2 - f_{i,k}^2}{2\sigma_f^2} \geq \frac{d_{j,k}^2}{2\sigma_d^2} + \frac{f_{j,k}^2}{2\sigma_f^2}

-\frac{f_{j,k}^2 + f_{i,j}^2 - f_{i,k}^2}{2\sigma_f^2} \geq \frac{d_{j,k}^2 + d_{i,j}^2 - d_{i,k}^2}{2\sigma_d^2}

f_{i,j}^2 + f_{j,k}^2 - f_{i,k}^2 \leq -\frac{\sigma_f^2}{\sigma_d^2}\left(d_{i,j}^2 + d_{j,k}^2 - d_{i,k}^2\right).    (3.32)

Using the simplification from Lemma 3.4.1 to replace $d_{i,j}^2 + d_{j,k}^2 - d_{i,k}^2$ and $f_{i,j}^2 + f_{j,k}^2 - f_{i,k}^2$ leads to (3.4) and concludes the proof:

2(f_j - f_k)^\top (f_j - f_i) \leq -\left(\frac{\sigma_f}{\sigma_d}\right)^2 2(x_j - x_k)^\top (x_j - x_i)

(f_j - f_k)^\top (f_j - f_i) \leq -\left(\frac{\sigma_f}{\sigma_d}\right)^2 (x_j - x_k)^\top (x_j - x_i).    (3.33)

3.5.1.1 Lemma

Lemma 3.4.1.

d_{i,j}^2 + d_{j,k}^2 - d_{i,k}^2 = 2(x_j - x_k)^\top (x_j - x_i)    (3.34)

f_{i,j}^2 + f_{j,k}^2 - f_{i,k}^2 = 2(f_j - f_k)^\top (f_j - f_i)    (3.35)

Proof.

d_{i,j}^2 = \|x_i - x_j\|^2 = (x_i - x_j)^\top (x_i - x_j) = x_i^\top x_i - 2 x_i^\top x_j + x_j^\top x_j.    (3.36)

Using the above $\ell_2$-norm expression, we have

d_{i,j}^2 + d_{j,k}^2 = x_i^\top x_i - 2 x_i^\top x_j + 2 x_j^\top x_j - 2 x_j^\top x_k + x_k^\top x_k

d_{i,j}^2 + d_{j,k}^2 - d_{i,k}^2 = -2 x_i^\top x_j + 2 x_j^\top x_j - 2 x_j^\top x_k + 2 x_i^\top x_k = 2\left(x_j^\top (x_j - x_i) - x_k^\top (x_j - x_i)\right) = 2(x_j - x_k)^\top (x_j - x_i).

3.5.2 Proof of Proposition 3.1

Proof. Equation (2.13) of Proposition 2.3 implies that

\forall j \in \beta: \quad \sum_{l \in \beta} K_{j,l}\,\theta_{i,l} - K_{i,j} = 0,    (3.37)

where $\beta$ is the set of NNK neighbors with non-zero neighborhood weights. Thus,

\theta_{i,j} + \sum_{l \in \beta - \{j\}} K_{j,l}\,\theta_{i,l} - K_{i,j} = 0 \quad (\text{since } K_{j,j} = 1)

\theta_{i,j} = K_{i,j} - \sum_{l \in \beta - \{j\}} K_{j,l}\,\theta_{i,l}.

Now, combined with the fact that the terms $K_{j,l}$, $\theta_{i,l}$, and $\theta_{i,j}$ are non-negative quantities, this results in the proposition, i.e.,

\theta_{i,j} < K_{i,j} \quad \text{since } \theta_{i,l} > 0,\; K_{j,l} > 0.    (3.38)

3.5.3 Proof of Proposition 3.2

Proof. Let $W = [e_{i_1} \dots e_{i_N}] \in \mathbb{R}^{M \times N}$, where the indices $i_1, \dots, i_N \in \{1, 2, \dots, M\}$.
Now, rewriting the product $WW^\top$ as a sum of outer products involving the columns of the matrix $W$, we have

WW^\top = \sum_{n=1}^{N} e_{i_n} e_{i_n}^\top = \sum_{m=1}^{M} \sum_{n=1}^{N} \mathbb{I}(i_n = m)\, e_m e_m^\top.

Let $\Sigma = \mathrm{diag}(\Sigma_1, \Sigma_2, \dots, \Sigma_M)$, where $\Sigma_m$ denotes the number of columns in $W$ equal to the basis vector $e_m$. In other words, $\Sigma_m$ is the number of input data points that are represented using a given dictionary atom. Thus, we have

WW^\top = \sum_{m=1}^{M} \Sigma_m\, e_m e_m^\top = \Sigma.    (3.39)

Consequently, the dictionary update of equation (3.12) for a 1-sparse coding representation is given by

A = W^\top (WW^\top)^{-1} = W^\top \Sigma^{-1}.

3.5.4 Proof of Theorem 3.2

Proof. The NNK-Means algorithm minimizes the objective

J = \|\Phi - \Phi A W\|_F^2 = \mathrm{Trace}(K + W^\top A^\top K A W - 2 K A W) = \sum_{i=1}^{N} \|\phi_i - \Phi A W_i\|_2^2.

For a fixed $A$, the minimization over $W$ for each data point is the same as in equation (2.6) of NNK. Thus, the solution obtained using sparse coding via the NNK algorithm (Algorithm 4) is guaranteed to minimize the objective, as evidenced by the KKT conditions (Section 2.8.1). Now, given a sparse non-negative representation matrix $W$, the proposed dictionary update corresponds to the optimal solution of the objective minimization. To observe this, consider the KKT fixed-point condition, i.e.,

\nabla_A(J) = 2 K A W W^\top - 2 K W^\top = 0 \;\Longrightarrow\; A = W^\top (WW^\top)^{-1}.    (3.40)

Thus the objective is monotonically non-increasing under both update steps of the algorithm. The minimization objective is bounded below by 0, and thus the algorithm converges.

3.5.5 Proof of Proposition 3.3

Proof of Proposition 3.3. Let $\Phi_P$ correspond to the matrix containing the $\hat{k}$ neighbors with non-zero data interpolation weights and $y_P$ the associated labels. The kernel space linear interpolation estimator is obtained by solving

\tilde{\alpha} = \arg\min_{\alpha} \sum_{i=1}^{\hat{k}} (y_i - \alpha^\top \phi(x_i))^2 = \Phi_P K_{P,P}^{-1} y_P.    (3.41)

Therefore, using equation (3.42) resulting from Lemma 3.4.2, the estimate $y$ at $x$ is obtained as

y = \tilde{\alpha}^\top \phi(x) = y_P^\top K_{P,P}^{-1} \Phi_P^\top \phi(x) = y_P^\top K_{P,P}^{-1} K_{P,*} = y_P^\top \theta_P = \sum_{i=1}^{\hat{k}} \theta_i y_i.

3.5.5.1 Active set Lemma

Lemma 3.4.2. Proposition 2.3 states that the NNK optimization problem (3.15) satisfies an active constraint set, i.e., given a partition $\{\theta_P, \theta_{\bar{P}}\}$, where $\theta_P > 0$ (inactive) and $\theta_{\bar{P}} = 0$ (active), the solution $[\theta_P\; \theta_{\bar{P}}]^\top$ is the optimal solution provided

K_{P,P}\,\theta_P = K_{P,*} \quad \text{and} \quad K_{P,\bar{P}}^\top\,\theta_P - K_{\bar{P},*} \geq 0.

Moreover, the set $P$ corresponds to the non-zero support of the constrained problem if and only if $K_{P,P}$ is full rank and $\theta_P > 0$ [192]. Thus, the solution to (3.15) is obtained as

\theta_S = [\theta_P\; \theta_{\bar{P}}]^\top = [(K_{P,P})^{-1} K_{P,*}\;\; 0].    (3.42)

3.5.6 Proof of Theorem 3.3

Proof. The proof follows a similar argument as the simplicial interpolation bound in [162]. The expected excess mean squared risk can be partitioned based on disjoint sets as [8]

\mathbb{E}[(\hat{\eta}(x) - \eta(x))^2] = \mathbb{E}[(\hat{\eta}(x) - \eta(x))^2 \mid X \notin \mathcal{C}]\,P(X \notin \mathcal{C}) + \mathbb{E}[(\hat{\eta}(x) - \eta(x))^2 \mid X \in \mathcal{C}]\,P(X \in \mathcal{C}) \leq \mathbb{E}[(\hat{\eta}(x) - \eta(x))^2 \mid X \notin \mathcal{C}]\,P(X \notin \mathcal{C}) + \mathbb{E}[(\hat{\eta}(x) - \eta(x))^2 \mid X \in \mathcal{C}].    (3.43)

[8] All expectations in this proof are conditioned on $\mathcal{D}_{\mathrm{train}}$. We first derive a bound for a given $X = x$ with an average over $Y$ and later take the expectation over $X$. For the sake of brevity, we do not make the conditioning and the random variable associated with each expectation explicit in our statements.

For points outside the convex hull, NNK extrapolates labels and no guarantees can be made on the regression without further assumptions. Thus, $(\hat{\eta}(x) - \eta(x))^2 \leq 1$, which reduces the first term of equation (3.43) to the corresponding term of the theorem.

Let $\theta_{\hat{k}}$ be the solution to the NNK interpolation objective (3.15), and let $w_i = \theta_i / \sum_{i=1}^{\hat{k}} \theta_i$ denote the weight-normalized values. The normalized weights follow a Dirichlet$(1, 1, \dots, 1)$ distribution with $\hat{k}$ concentration parameters.
\hat{\eta}(x) - \eta(x) = \sum_{i=1}^{\hat{k}} w_i (y_i - \eta(x)) = \sum_{i=1}^{\hat{k}} w_i (y_i - \eta(x_i) + \eta(x_i) - \eta(x)) = \sum_{i=1}^{\hat{k}} w_i \epsilon_i + \sum_{i=1}^{\hat{k}} w_i b_i,    (3.44)

where $\epsilon_i = y_i - \eta(x_i)$ corresponds to the Bayes estimator errors in the training data and $b_i = \eta(x_i) - \eta(x)$ is related to bias. By the smoothness assumption on $\eta$ we have

|b_i| = |\eta(x_i) - \eta(x)| \leq A\,\|\phi(x_i) - \phi(x)\|^{\alpha} \leq A\,\delta^{\alpha}.    (3.45)

Since $b_i$ and $\epsilon_i$ are independent, we have

\mathbb{E}\!\left[\left(\hat{\eta}(x) - \eta(x)\right)^2 \mid X \in \mathcal{C}\right] = \mathbb{E}\!\left[\Big(\sum_{i=1}^{\hat{k}} w_i \epsilon_i\Big)^2 \mid X \in \mathcal{C}\right] + \mathbb{E}\!\left[\Big(\sum_{i=1}^{\hat{k}} w_i b_i\Big)^2 \mid X \in \mathcal{C}\right].    (3.46)

By Jensen's inequality, $\big(\sum_{i=1}^{\hat{k}} w_i b_i\big)^2 \leq \sum_{i=1}^{\hat{k}} w_i b_i^2$, and using the bound in equation (3.45),

\mathbb{E}\!\left[\Big(\sum_{i=1}^{\hat{k}} w_i b_i\Big)^2 \mid X \in \mathcal{C}\right] \leq \mathbb{E}\!\left[\sum_{i=1}^{\hat{k}} w_i b_i^2 \mid X \in \mathcal{C}\right] \leq \mathbb{E}\!\left[\sum_{i=1}^{\hat{k}} w_i A^2 \delta^{2\alpha} \mid X \in \mathcal{C}\right] = A^2 \delta^{2\alpha}.    (3.47)

Let $\nu(x) = \mathrm{var}(Y \mid X = x)$. Under the independence assumption on the noise, the term with $\epsilon$ in equation (3.46) can be rewritten as

\mathbb{E}\!\left[\Big(\sum_{i=1}^{\hat{k}} w_i \epsilon_i\Big)^2 \mid X \in \mathcal{C}\right] = \mathbb{E}\!\left[\sum_{i=1}^{\hat{k}} w_i^2 \epsilon_i^2 \mid X \in \mathcal{C}\right] = \sum_{i=1}^{\hat{k}} \mathbb{E}[w_i^2 \mid X \in \mathcal{C}]\,\mathbb{E}[\epsilon_i^2 \mid X \in \mathcal{C}] = \frac{2}{(\hat{k}+1)\hat{k}} \sum_{i=1}^{\hat{k}} \nu(x_i) \leq \frac{2}{(\hat{k}+1)\hat{k}} \sum_{i=1}^{\hat{k}} \big(\nu(x) + |\nu(x_i) - \nu(x)|\big),

where we use the fact that $w_i$ follows a Dirichlet distribution. Now, the smoothness assumption on $\mathrm{var}(Y \mid X)$ allows us to bound

|\nu(x_i) - \nu(x)| \leq A'\,\|\phi(x_i) - \phi(x)\|^{\alpha'} \leq A'\,\delta^{\alpha'}    (3.48)

\Longrightarrow\; \mathbb{E}\!\left[\Big(\sum_{i=1}^{\hat{k}} w_i \epsilon_i\Big)^2 \mid X \in \mathcal{C}\right] \leq \frac{2}{\hat{k}+1}\left(\nu(x) + A'\,\delta^{\alpha'}\right).    (3.49)

Combining with equation (3.47), the risk bound for points within the convex hull of the training data is obtained as

\mathbb{E}[(\hat{\eta}(x) - \eta(x))^2 \mid X \in \mathcal{C}] \leq A^2 \delta^{2\alpha} + \frac{2}{\hat{k}+1}\left(\nu(x) + A'\,\delta^{\alpha'}\right).    (3.50)

Taking the expectation over $X$ in equation (3.50), together with the reduction obtained earlier for points outside the convex hull $\mathcal{C}$, gives the excess risk bound and concludes the proof.

3.5.6.1 Proof of Corollary 3.3.1

Proof. The nearest neighbor convergence lemma of [25] states that for an i.i.d. sequence of random variables $\mathcal{D} = \{x_1, x_2, \dots, x_N\}$ in $\mathbb{R}^d$, the nearest neighbor of $x$ from the set $\mathcal{D}$ converges in probability, $\mathrm{NN}(x) \to_p x$. Equivalently, this corresponds to convergence in the kernel representation of the data points. Thus, the solution to the NNK data interpolation objective reduces to 1-nearest neighbor interpolation with $\mathbb{E}_K[\hat{k}] = 1$ and $\limsup_{N \to \infty} \delta = 0$. Now, under the assumption that $\mathrm{supp}(\mu)$ belongs to a bounded and convex region in $\mathbb{R}^d$, the first term on the right of equation (3.16), corresponding to NNK extrapolation, vanishes, i.e., $\limsup_{N \to \infty} \mathbb{E}[\mu(\mathbb{R}^d \setminus \mathcal{C})] = 0$. Applying the asymptotic vanishing of $\delta$ and $\mathbb{E}[\mu(\mathbb{R}^d \setminus \mathcal{C})]$ in Theorem 3.3 gives the result of Corollary 3.3.1.

3.5.6.2 Proof of Corollary 3.3.2

Proof. The excess classification risk associated with a plug-in NNK classifier is related to the regression risk (see Theorem 17.1 in [14]) as

\mathbb{E}[R(\hat{f}(x)) - R(f^*(x))] \leq \mathbb{E}[\mathbb{I}(\hat{f}(x) \neq f^*(x))] \leq 2\,\mathbb{E}[|\hat{\eta}(x) - \eta(x)|].    (3.51)

From Corollary 3.3.1, we have $\limsup_{N \to \infty} \mathbb{E}[(\hat{\eta}(x) - \eta(x))^2] \leq \mathbb{E}[(Y - \eta(x))^2]$. Using Jensen's inequality,

\limsup_{N \to \infty} \left(\mathbb{E}[|\hat{\eta}(x) - \eta(x)|]\right)^2 \leq \limsup_{N \to \infty} \mathbb{E}[(\hat{\eta}(x) - \eta(x))^2].    (3.52)

Combining with equation (3.51) gives the required risk bound.

3.5.7 Proof of Theorem 3.4

Proof. The proof is based on the k-nearest neighbor result from Theorem 1 in [170], which states that

P(|R_{\mathrm{loo}}(\hat{\eta} \mid \mathcal{D}_{\mathrm{train}}) - R_{\mathrm{gen}}(\hat{\eta})| > \epsilon) \leq 2e^{-N\epsilon^2/18} + 6e^{-N\epsilon^3/(108 k (2+\gamma))}.
As in [170], where the result is extended based on the 1-nearest neighbor, it suffices here to replace $k$ by $\hat{k}$, using the argument that $\hat{k}\gamma + 2 \leq \hat{k}(\gamma + 2)$, and to take the expectation over the distribution of all data points $X$.

3.5.8 Proof of Proposition 3.4

The NNK distance between two datasets $\mathcal{D}_1$ and $\mathcal{D}_2$, for a kernel $K \in [0, 1]$, is defined as

\mathrm{NNK}(\mathcal{D}_1 \mid \mathcal{D}_2) = \frac{1}{M} \sum_{x_i \in \mathcal{D}_1} \min_{\theta_S \geq 0} \; K_{i,i} - 2\theta_S^\top K_{S,i} + \theta_S^\top K_{S,S} \theta_S = \frac{1}{M} \sum_{x_i \in \mathcal{D}_1} \min_{\theta_S \geq 0} \|\phi(x_i) - \Phi_S \theta_S\|^2.

The distance is a sum of non-negative terms and thus satisfies positivity. Now, the distance measure between a dataset and itself is given by

\mathrm{NNK}(\mathcal{D}_1 \mid \mathcal{D}_1) = \frac{1}{M} \sum_{x_i \in \mathcal{D}_1} \min_{\theta_S \geq 0} \|\phi(x_i) - \Phi_S \theta_S\|^2.

Here, the set $S$ contains the nearest neighbors of $x_i$ in $\mathcal{D}_1$ and thus contains the data point $x_i$ itself (the kernel self-similarity of any data point is $\kappa(x_i, x_i) = 1$). Thus, each $x_i$ in the distance summation can be perfectly approximated, resulting in zero error and, consequently, zero distance (Identity). To prove the triangle inequality of the distance, we make use of the properties of the $\ell_2$ norm and the linearity of the summation, as follows. Let $\mathcal{D}_1 = \{x_1, x_2, \dots, x_{N_1}\}$, $\mathcal{D}_2 = \{y_1, y_2, \dots, y_{N_2}\}$, and $\mathcal{D}_3 = \{z_1, z_2, \dots, z_{N_3}\}$. Consider the distance between $\mathcal{D}_1$ and $\mathcal{D}_2$, i.e.,

\mathrm{NNK}(\mathcal{D}_1 \mid \mathcal{D}_2) = \frac{1}{N_1} \sum_{x_i \in \mathcal{D}_1} \min_{\theta_S \geq 0} \|\phi(x_i) - \Phi_S \theta_S\|^2 = \frac{1}{N_1} \sum_{x_i \in \mathcal{D}_1} \min_{\theta \geq 0} \|\phi(x_i) - \theta^\top \phi(\mathcal{D}_2)\|^2,

where we consider the set $S$ to be the entire dataset $\mathcal{D}_2$. Now,

\mathrm{NNK}(\mathcal{D}_1 \mid \mathcal{D}_2) = \frac{1}{N_1} \sum_{x_i \in \mathcal{D}_1} \min_{\theta \geq 0} \|\phi(x_i) - \theta^\top \phi(\mathcal{D}_2)\|^2
= \frac{1}{N_1} \sum_{x_i \in \mathcal{D}_1} \min_{\theta \geq 0} \|\phi(x_i) - \theta^\top \phi(\mathcal{D}_3) + \theta^\top \phi(\mathcal{D}_3) - \theta^\top \phi(\mathcal{D}_2)\|^2
\leq \frac{1}{N_1} \sum_{x_i \in \mathcal{D}_1} \min_{\theta \geq 0} \|\phi(x_i) - \theta^\top \phi(\mathcal{D}_3)\|^2 + \|\theta^\top \phi(\mathcal{D}_3) - \theta^\top \phi(\mathcal{D}_2)\|^2
\leq \frac{1}{N_1} \sum_{x_i \in \mathcal{D}_1} \min_{\theta \geq 0} \|\phi(x_i) - \theta^\top \phi(\mathcal{D}_3)\|^2 + \frac{1}{N_3} \sum_{z_j \in \mathcal{D}_3} \min_{\theta' \geq 0} \|\phi(z_j) - \theta'^\top \phi(\mathcal{D}_2)\|^2
= \mathrm{NNK}(\mathcal{D}_1 \mid \mathcal{D}_3) + \mathrm{NNK}(\mathcal{D}_3 \mid \mathcal{D}_2),

where we make use of the triangle property of the $\ell_2$ norm, Proposition 3.1, and the range of the kernel.

Chapter 4
Geometry of deep learning

Deep learning approaches have achieved unprecedented success in many application domains. One of the driving factors in their success is the use of massive models, whose number of parameters often exceeds the size of the training dataset. It is well understood that in this overparameterized regime, deep neural networks (DNNs) are highly expressive and have the capacity to (over)fit arbitrary training data and, surprisingly, still exhibit good generalization, i.e., performance on unseen data [193,194]. Often, the main (and in some cases only) justification for a specific choice of model in a system is simply that it works well (in terms of accuracy or other performance metrics) on the data selected for evaluation. While this is a very practical perspective that has led to significant advances, a better understanding of deep learning systems is needed, not only for applications where safety is critical (e.g., self-driving vehicles) but also to understand their limitations in the real world, where these systems can be exposed to data very different from what was available at training [161,195].

Conventional theoretical frameworks cannot explain this gap in understanding [196–198], which arises from design choices that are hard to explain. This lack of understanding consequently causes reliability and trustworthiness concerns. To develop network architectures and training procedures that overcome these challenges, it is therefore critical to develop new theories of deep learning, for example, ones that make explicit the properties of the models [154,162,199].
Figure 4.1: Left: Data-driven view of the geometry of the embedding manifold using a graph. Because neighborhoods and graphs are intrinsically independent of the exact data position, we can compare fundamentally heterogeneous observations (such as representations from different dimensional spaces or models). Right: Progressive transformation of the input space over successive layers of a DNN. The samples in the dataset are the same, and thus their attributes (e.g., labels) are the same, but their position in feature space, and hence the graph constructed, changes as the model is optimized for a particular task.

In this section, we introduce a data-driven framework for studying deep learning by characterizing the geometry of the data manifold in the embedding spaces, using computationally efficient and principled graph constructions from data (see Chapter 2). Our proposed framework is motivated by the observation that, while DNNs involve complex non-linear mappings, the induced transformations (from one layer to the next) and the structure of the representation space can be inferred from the data samples by observing their relative positions in that space (Figure 4.1). Under this perspective, we propose to develop a set of Manifold Graph Metrics (MGMs) that provide intuition and quantitative data to characterize the data geometry at any layer of the DNN.

An important aspect of this approach is that it enables us to compare the space surrounding a given data point in different layers and at any stage (training, inference, or transfer). Moreover, it can also be used to compare the representations given by different models, which may have different characteristics (e.g., providing representations with different dimensions). In contrast to existing techniques that consider mathematical approximations of the parameters, individual components, or optimization landscape of a DNN [200,201], our framework explores the geometric properties of the transformations and attributes of the model using data, i.e., we abstract away the explicit functions that induce the mappings in a DNN and understand its properties through the effect those functions have on the data.

4.1 Background

Graph and local neighborhood methods have played a significant role in machine learning tasks such as manifold learning [202], semi-supervised learning [98,203], and, more recently, graph-based analysis of deep learning [164,177,204]. Typically, their use in this context is motivated by their ability to represent data with irregular positions in space, (x_i)_{i=1}^N, rather than directly modeling the distribution of the data P(x). This type of DNN analysis starts by constructing a good neighborhood representation [23,24], which is also a crucial first step in our framework. The most common approaches to defining a neighborhood are k-nearest neighbor (kNN) and ϵ-neighborhood. However, these approaches select points in a neighborhood based only on the distance to the query point and do not consider their relative positions. They also rely on ad hoc procedures to select parameter values (e.g., k or ϵ). For this reason, we use non-negative kernel regression (NNK) to define neighborhoods and graphs for our manifold analysis (see Section 2.3). Unlike kNN, which can be seen as a thresholding approximation, NNK can be interpreted as a form of basis pursuit [205], which leads to better neighborhood construction (see Section 2.3.4) and robust local estimation performance in several machine learning tasks [37,38].
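To make the construction concrete, the following is a minimal sketch of an NNK neighborhood for a single query point, assuming a Gaussian kernel and the NumPy/SciPy stack; the function names and kernel choice are illustrative, not the exact implementation used in this thesis.

import numpy as np
from scipy.optimize import nnls

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel between the rows of A and the rows of B.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def nnk_neighbors(x, X, k=10, sigma=1.0, tol=1e-8):
    """Indices and weights of the NNK neighborhood of query x.

    Candidates are the k nearest neighbors of x in X (kNN initialization);
    the weights solve min_{theta >= 0} ||phi(x) - Phi_S theta||^2 in kernel space.
    """
    cand = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    K_SS = gaussian_kernel(X[cand], X[cand], sigma)              # k x k
    K_Sx = gaussian_kernel(X[cand], x[None, :], sigma).ravel()   # k
    # The quadratic objective theta^T K_SS theta - 2 K_Sx^T theta becomes a
    # non-negative least squares problem after a Cholesky factorization.
    C = np.linalg.cholesky(K_SS + 1e-10 * np.eye(k)).T           # K_SS = C^T C
    b = np.linalg.solve(C.T, K_Sx)
    theta, _ = nnls(C, b)
    keep = theta > tol        # only non-zero weights define the NNK polytope
    return cand[keep], theta[keep]

# Example: NNK neighborhood of one point among 200 random 5-D points.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
idx, w = nnk_neighbors(X[0], X[1:], k=15)

Note that the number of retained (non-zero) weights is typically smaller than k and adapts to the local geometry, which is the property exploited throughout this chapter.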
Of particular importance for our analysis framework is the fact that NNK neighborhoods have a geometric interpretation. An NNK neighborhood can be described as a convex polytope approximation around x_i, denoted P(x_i), whose size and shape are determined by the local geometry of the data available around x_i (see Section 2.4). This geometric property is particularly important for non-uniformly sampled data that lies on a lower-dimensional manifold in a high-dimensional vector space, which is typical of feature embeddings in DNN models. Note that NNK uses kNN as an initialization step, after which it has only a modest additional runtime requirement (Section 2.3.3). Thus, we can scale our analysis to large datasets by speeding up the initialization using computational tools developed for kNN [90,206]. We use cosine kernels for similarity in our NNK construction. Since the encoder representations are obtained after a ReLU, the similarity kernel takes values in the interval [0,1].

Figure 4.2: MGMs in feature space are based on NNK polytopes. We display data samples (red and blue dots) and two NNK polytopes (P(x_i), P(x_j)) in the encoder space that approximate the underlying manifold of the deep learning model (gray surface). Our proposed MGMs capture invariance (polytope diameter), manifold curvature (angle between neighboring polytopes), and local intrinsic dimension (number of vertices in a polytope) of the output manifold for a given DNN.

4.2 Manifold Graph Metrics

We use the NNK polytope, denoted P(x_i), to characterize the neighborhood of x_i and the local and global geometry of the data manifolds. Our approach, summarized in Figure 4.4, is motivated by the observation that, while DNNs involve complex non-linear mappings, the induced transformations and the structure of the representation space can be inferred from the data samples by observing their relative positions in that space. In particular, we propose a set of Manifold Graph Metrics (MGMs) that provide intuition and quantitative data to characterize the geometry of a DNN (Figure 4.2). We focus on manifold properties that we consider important for understanding feature embeddings and their generalization capabilities: (i) the smoothness or variation of a graph signal on the manifold, (ii) the level of invariance (or equivariance) with respect to a given perturbation or augmentation, (iii) the curvature of the manifold, and (iv) the local intrinsic dimension of the manifold. These are introduced in what follows.

4.2.1 Graph signal variation

Given a graph constructed on input data {x_i}_{i=1}^N with edge weights captured in its adjacency matrix W and a signal y (e.g., the label of each data point), the graph signal variation (GSV) [21,79] is defined as

GSV(\{x_i, y_i\}_{i=1}^N) = y^\top L y = \sum_{i=1}^N \sum_{j \in N(i)} W_{i,j} \|y_i - y_j\|^2,   (4.1)

which captures the smoothness of the signal on the graph. In (4.1), N(i) corresponds to the NNK neighbors of input x_i and L represents the combinatorial Laplacian of the graph, L = D - W, where D is a diagonal matrix with D_{i,i} = \sum_{j \in N(i)} W_{i,j}. Note that a smaller value of y^\top L y corresponds to a smoother signal, i.e., for a label signal y, an increase in the label similarity of the data in a neighborhood is commensurate with a decrease in y^\top L y, and vice versa.
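A minimal sketch of the GSV computation is given below, assuming the NNK adjacency matrix has already been symmetrized and the labels are one-hot encoded; the helper name is illustrative.

import numpy as np

def graph_signal_variation(W, Y):
    """Laplacian quadratic form y^T L y for a symmetric adjacency W and a
    signal Y of shape (N, c), e.g., one-hot labels. Smaller values indicate a
    smoother signal; for symmetric W this equals 0.5 * sum_{i,j} W_ij ||y_i - y_j||^2
    (eq. (4.1), up to the edge-counting convention)."""
    L = np.diag(W.sum(axis=1)) - W          # combinatorial Laplacian D - W
    return float(np.trace(Y.T @ L @ Y))

# Example: two of three mutually connected samples share a class label.
W = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
Y = np.eye(3)[[0, 0, 1]]
print(graph_signal_variation(W, Y))         # larger than for identical labels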
We now present a theoretical result (Theorem 4.1) relating the difference in label smoothness between the input and output features of a single layer in a neural network to the complexity of the layer measured by its \ell_2-norm [199,207].

Theorem 4.1. Consider the features corresponding to the input and output of a layer in a neural network, denoted by x_out = \phi(W x_in), where \phi(x) is a slope-restricted nonlinearity applied along each dimension of x. Suppose that the smoothness of the labels y in the feature space is proportional to the smoothness of the data x. Then,

y^\top L_{out} y \le c \|W\|_2^2 \, y^\top L_{in} y,   (4.2)

where L corresponds to the graph Laplacian obtained using NNK in the corresponding feature space. The constant c > 0 depends only on constants related to the data smoothness and the slope of the nonlinearity.

Proof. Let N_{in}^{(i)}, N_{out}^{(i)} be the sets of NNK neighbors of data point x^{(i)} in the input and output feature spaces, respectively. Then,

y^\top L_{out} y = \sum_i \sum_{j \in N_{out}^{(i)}} \theta_{ij} |y^{(i)} - y^{(j)}|^2.   (4.3)

Now, by assumption, the smoothness of the labels is proportional to the data smoothness. Thus,

y^\top L_{out} y = \sum_i \sum_{j \in N_{out}^{(i)}} c_{out} \theta_{ij} \|x_{out}^{(i)} - x_{out}^{(j)}\|^2,   (4.4)

where c_{out} > 0 is the proportionality constant. N_{out}^{(i)} corresponds to the optimal NNK neighbors, and therefore any other neighbor set will have a larger value of the label smoothness, i.e.,

y^\top L_{out} y \le c_{out} \sum_i \sum_{j \in N_{in}^{(i)}} \theta_{ij} \|x_{out}^{(i)} - x_{out}^{(j)}\|^2
= c_{out} \sum_i \sum_{j \in N_{in}^{(i)}} \theta_{ij} \|\phi(W x_{in}^{(i)}) - \phi(W x_{in}^{(j)})\|^2
\le c_{out} \sum_i \sum_{j \in N_{in}^{(i)}} \beta \theta_{ij} \|W x_{in}^{(i)} - W x_{in}^{(j)}\|^2,

where \beta > 0 corresponds to the upper bound on the slope of the nonlinear activation function. Since \|W(x_{in}^{(i)} - x_{in}^{(j)})\|^2 \le \|W\|_2^2 \|x_{in}^{(i)} - x_{in}^{(j)}\|^2, using the label-smoothness expression computed with the input features to the layer and gathering all the positive constants as c = \beta c_{out} / c_{in}, we obtain

y^\top L_{out} y \le c \|W\|_2^2 \, y^\top L_{in} y.   (4.5)

Remark 4.1. Theorem 4.1 states that the change in label smoothness between the input and output spaces of a network layer is indicative of the complexity of the mapping induced by that layer, i.e., a large change in label smoothness corresponds to a large transformation of the feature space.

Remark 4.2. Theorem 4.1 does not make any assumption on the model architecture. Its main assumption concerns the relationship between the smoothness of the data and that of the labels. The slope restriction on the nonlinearity is satisfied by activation functions commonly used in practice; for example, the ReLU function is slope-restricted between 0 and 1 [208,209].

4.2.2 Invariance to augmentations

Given input data {x_i}_{i=1}^N, we apply the NNK neighborhood definition to the feature-space representations f(x_i) of the input to obtain N convex polytopes. We define the diameter of an NNK polytope as the maximum distance between the nodes forming the polytope, i.e.,

Diam(P(f(x_i))) = \max_{k,l \in N(f(x_i))} \|\hat{f}(x_k) - \hat{f}(x_l)\|_2,   (4.6)

where \hat{f} corresponds to the \ell_2-normalized feature embedding of a given input. These polytope diameters take values in the range [0,2] and quantitatively measure how much a set of related input samples has been contracted or dilated by a DNN backbone. Thus, a constant or collapsed mapping, where multiple x_i are mapped to the same f, would lead to a degenerate polytope with a diameter equal to zero. Instead, a diameter of 2 corresponds to a mapping where the neighbors are maximally scattered. Now, by considering as inputs data points and their augmented or perturbed versions, the diameter of the NNK polytope captures the level of invariance of the DNN to this specific augmentation or perturbation: a smaller diameter corresponds to greater invariance of the DNN mapping to that augmentation, whereas a larger diameter corresponds to representations that are not invariant (i.e., equivariant). Alternatively, suppose we constrain ourselves to input data with the same class label. In that case, the polytope diameter indicates the level of invariance of the representation to samples that belong to that specific class.
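A sketch of the polytope diameter in (4.6) follows, assuming the rows of the input array are the embeddings of the NNK neighbors of a sample; the function name is illustrative.

import numpy as np

def polytope_diameter(neighbor_feats):
    """Polytope diameter, eq. (4.6): the maximum pairwise distance between the
    l2-normalized embeddings of the NNK neighbors of a sample.

    neighbor_feats : array of shape (n_i, d), one row per NNK neighbor.
    Values lie in [0, 2]; ~0 indicates a collapsed (invariant) mapping and
    values near 2 indicate maximally scattered (equivariant) neighbors.
    """
    F = neighbor_feats / np.linalg.norm(neighbor_feats, axis=1, keepdims=True)
    pairwise = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=-1)
    return float(pairwise.max())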
4.2.3 Curvature

We study the curvature of a manifold by comparing the orientations of NNK polytopes corresponding to neighboring input samples. That is, given two samples x_i and x_j that are NNK neighbors, and their respective polytopes P(f(x_i)) and P(f(x_j)), we evaluate the angle between the subspaces spanned by the neighbors N(f(x_i)) and N(f(x_j)) that make up the polytopes. Concretely, we use the concept of affinity between subspaces [210] to define the affinity between two polytopes as a quantity in the interval [0,1] given by

Aff(N(f(x_i)), N(f(x_j))) = \sqrt{\frac{\cos^2(\theta_1) + \cdots + \cos^2(\theta_{n_i n_j})}{n_i n_j}},   (4.7)

where \theta_k, k = 1, \dots, n_i n_j, are the principal angles between the two subspaces spanned by the vectors in N(f(x_i)) and N(f(x_j)), containing all the NNK neighbors of x_i and x_j, respectively. Intuitively, this metric equals zero when the subspaces are orthogonal (polytopes oriented in perpendicular directions), while a value of one corresponds to zero curvature, i.e., subspace alignment.

4.2.4 Intrinsic Dimension

The local intrinsic dimension of a manifold can be estimated as the number of neighbors selected by NNK. It was shown in [32,41,92] that the number of neighbors per polytope, i.e., n_i, correlates with the local dimension of the manifold around a data point i. This observation is consistent with geometric intuition: the number of NNK neighbors is determined by the availability of data spanning orthogonal directions, i.e., the local subspace of the manifold:

ID(f(x_i)) = Card(N(f(x_i))),   (4.8)

where Card denotes the cardinality operator.
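The two metrics above can be sketched as follows using SciPy's principal-angle routine. Normalizing by the number of angles returned by the routine is a simplification of eq. (4.7), and the function names are illustrative.

import numpy as np
from scipy.linalg import subspace_angles

def polytope_affinity(neigh_i, neigh_j):
    """Affinity between two NNK polytopes (cf. eq. (4.7)).

    neigh_i, neigh_j : arrays of shape (n_i, d) and (n_j, d) whose rows are the
    embeddings of the NNK neighbors spanning each polytope. Returns a value in
    [0, 1]: 1 means aligned subspaces (locally flat manifold), 0 means
    orthogonal subspaces (high curvature).
    """
    # subspace_angles expects each subspace as the column span of its argument.
    angles = subspace_angles(neigh_i.T, neigh_j.T)
    return float(np.sqrt(np.mean(np.cos(angles) ** 2)))

def local_intrinsic_dimension(nnk_weights, tol=1e-8):
    """Local intrinsic dimension, eq. (4.8): the number of neighbors retained
    (non-zero weights) by the NNK optimization."""
    return int(np.sum(np.asarray(nnk_weights) > tol))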
Validating MGMs with Synthetic Data

This section presents a study in controlled manifold settings to demonstrate the ability of the proposed metrics to capture the geometry of a manifold. We use data points (N = 1024) sampled from three 10-dimensional manifolds embedded in a 128-dimensional feature space (Figure 4.3). Further, we generate images corresponding to each manifold using an ImageNet pre-trained generative model, BigGAN, as in [211], to show that the estimated MGMs reflect the geometry of the manifold. We restrict the generation to a single ImageNet class, namely basenji. The generated images are of size 128 x 128 x 3. This setup is equivalent to knowing the manifold and observing the images corresponding to it. The three manifold settings used to generate the synthetic data are:

• Random samples: we sample 1024 embeddings from a 10-dimensional multivariate Gaussian to simulate an embedding space without explicit geometry.
• Linear subspace with uniformly spaced samples: we start with 8 seed embeddings (equivalent to unique image inputs) and then create 128 embeddings (equivalent to augmentations of the images) obtained by uniformly spaced translations of each seed embedding.
• Spherical linear interpolated (SLERP) samples: we start with pairs of 8 seed embeddings and then create 128 augmentation embeddings obtained via a symmetric weighted sum of each pair of endpoints.

Note that in the latter two settings, the manifold, and the data obtained from it, by design have high affinity owing to their smooth curvature. Further, each data point is sampled uniformly from the manifold, so all samples have similar invariance (diameter). In contrast, the geometry of the randomly sampled data will have low affinity and high diameter. This is because neighboring polytopes will not be aligned, with the neighbors within each polytope possibly spanning the entire subspace.¹

Figure 4.3: BigGAN images corresponding to samples from the three manifolds. Observed NNK diameter (invariance) and angle between neighboring NNK polytopes (affinity/curvature) for the three embedding manifold settings: Random (left), Linear (middle), and Spherical (right).

As expected, the randomly sampled examples produce neighboring polytopes oriented almost orthogonally with respect to each other, indicative of the absence of a smooth manifold. Further, the large diameter in this setting shows that the examples are locally distinct, i.e., there is no collapse in the representation of the neighbors that form the polytope. In contrast, we see that the embeddings from the linear and spherical subspaces have polytopes that are closely related to each other and are locally invariant.

The image generation using BigGAN presents a unique opportunity to test and observe the relationship between the embedding manifold and the images. Of course, one is typically interested in the reverse case, where one has input images and would like to observe the embeddings, but this experiment allows us to explicitly study the relationship between different manifolds and the estimated MGM metrics.

¹ In this experiment, we present the angle between the polytopes instead of the affinity. These two metrics are directly related in that one is the cosine of the other. This choice makes the local curvature property explicit for readers unfamiliar with the notion of affinity used in the subspace literature.

4.3 Case study: Self-supervised learning models

Self-supervised learning (SSL) for computer vision applications has empowered deep neural networks (DNNs) to learn meaningful representations of images from unlabeled data [212–222]. SSL models are trained to learn a feature-space embedding invariant to data augmentations (e.g., cropping, translation, color jitter) by maximizing the agreement between representations from different augmentations of the same image. This is achieved using a cascade of two networks: the backbone encoder, from which the representation to be used for a downstream task is extracted, and the projection head (or projector), whose output is fed into the SSL loss function.

The main risk in an SSL training approach is the so-called feature collapse phenomenon [223–225], where the learned representations become invariant to input samples that belong to different manifolds. To reduce the risk of feature collapse, multiple SSL algorithms have been proposed. Broadly, SSL models can be categorized into contrastive [159,213], non-contrastive [226,227], and prototype/clustering [186,228] methods. Contrastive SSL methods (e.g., simCLR-v1 [159], simCLR-v2 [213], MoCo-v1 [229], MoCo-v2 [185], Infomin [222], PIRL [230], InsDis [231]) make use of negative pairs to push apart the embeddings of data points that do not form an augmented pair during training. Non-contrastive SSL methods (BYOL [215], DINO [11]) utilize a teacher-student approach in which the teacher's weights are an exponential moving average of the student's weights. Prototype/clustering SSL methods (SeLa-v2 [232], DeepCluster-v2 [186], SwAV [186], PCL-v1 and PCL-v2 [228]) enforce consistency between cluster assignments obtained from different transformations of the same image, without negative pair comparisons.
128 Background Recently,researchershavefocusedondevelopinganunderstandingofspecific components of SSL models by studying the loss function used to train the models and its impactonthelearnedrepresentations. Forinstance,[223]analyzescontrastivelossfunctions and the dimensional collapse problem. [233] also analyzes contrastive losses and describes the effect of augmentation strength and the importance of a non-linear projection head. [234] quantifies the importance of data augmentations in a contrastive SSL model via a distance-based approach. [235] explores a graph formulation of contrastive loss functions withgeneralizationguaranteesontherepresentationslearned. In[236],theimportanceofthe temperatureparameterusedintheSSLlossfunctionanditsimpactonlearningisexamined. [224]performsaspectralanalysisofDNN’smappinginducedbynon-contrastivelossandthe momentum encoder approach. However, all these studies cannot provide a unified analysis of the myriad of existing SSL models. Besides, these theoretical approaches only provide insights into the embedding obtained after the projection head, while in practice, it is the mapping provided by the encoder that is actually used for transfer learning. Moreover, besidesthetraininglossfunction, severaldifferencesexistinSSLmodels, even among those from the same category. For example, popular SSL models available as pre- trained networks [187] can differ in terms of training parameters (loss function, optimizer, learning rate), architecture (DNN backbone, projection head, momentum encoder), model initialization (weights, batch-normalization parameters, learning rate schedule), etc. Our approach centers on understanding the properties through the analysis of data geometry, unlike mathematical tools that make specific approximations on the DNN mappings (e.g. [223,233]), and, thus, allow for comparison across all these different SSL model settings. Another data-driven approach [237] suggests the use of classification performance by an SSL model on the ImageNet dataset as an indicator of its transfer performance on new tasks and datasets. This idea is effective for transfer datasets similar to ImageNet, but it cannot be generalized to all transfer learning problems [238]. In fact, if ImageNet performance were highlycorrelatedtogeneraltransferlearningperformance,thenitisunclearwhySSLmodels 129 wouldbeneeded,asonecouldsimplyusesupervisedtrainingwithImageNettoobtainimage representations for transfer learning. Furthermore, these empirical evaluations only provide a somewhat coarse and partial understanding of SSL models. For example, they do not provide insights into the level of invariance to specific augmentation in an SSL model relates to its performance on a given downstream task. Since it has been observed that invariance to some augmentations can be beneficial in some cases and harmful in others [239], our goal is to develop a more direct and quantitative understanding of augmentation invariance and how this invariance determines performance. To achieve this goal, we propose a geometric perspective to understand SSL models and their transfer capabilities. Our approach analyzes the manifold properties of SSL models using a data-driven graph-based method to characterize data geometry and their augmen- tations, as illustrated in Figure 4.4 (left). 
Specifically, we leverage our proposed MGMs to quantify the geometric properties of existing SSL models – their similarities, differences (Figure 4.4, right), and ability to transfer based on the characteristics of the target task. Because our approach can be applied directly to sample data points and their augmenta- tions,ithasseveralimportantadvantages. First,itisagnostictospecifictrainingprocedures, architecture, and loss functions; Second, it can be applied to the data embeddings obtained at any layer of the SSL model, thus alleviating the challenge induced by the presence of projection heads; Third, it enables us to compare different feature representations of the same data point, even if these representations have different dimensions. Focus of study and Outcomes We are interested in using our approach to answer the following questions about SSL models: • What are the geometric differences between the feature spaces of various SSL models? • What geometric properties allow for better transfer learning capability for a specific task? 130 WeperformourMGMevaluationon14SSLmodelsunder5augmentationsettingsandshow their impact on 8 downstream tasks comprising of 18 datasets in total. We make use of use pre-trained SSL models based on ResNet50(1x) backbone encoder architecture as well as a supervised model available with the Pytorch library [240]. We also compare the geometrical differences between a ResNet50 and ViT architecture both at initialization and after SSL training using DINO loss [11]. Our conclusions from this study are summarized as follows: • MGMsarecapableofcapturingimportantgeometricalaspectsofSSLrepresentations,such as the degree of equivariance-invariance, the curvature, and the intrinsic dimensionality (section 4.2). • Theproposedmetricsallowustoexplorethegeometricdifferencesandsimilaritiesbetween SSLmodels. AsillustratedinFigure4.4(right),weshowthatSSLmodelscanbeclustered using these geometric properties into three main groups that are not entirely aligned with the paradigm upon which they were trained (auto4.3.1). • We analyze the geometric differences between a Vision Transformer (ViT) and a convo- lutional network (ResNet). We show that while ResNet is biased towards a collapsed representation at initialization, ViTs are not. This initialization bias leads to different ge- ometricalbehavior(attraction/repulsionofrepresentations)betweenthetwoarchitectures when training under an SSL regime (Section 4.3.1). • We demonstrate that the observed MGMs are a strong indicator of the transfer learning capabilitiesofSSLmodelsformostdownstreamtasks. Thisshowsthatspecificgeometrical properties are crucial for a given transfer learning task (Section 4.3.2). Experimental settings: For all SSL models, we analyze their equivariance-invariance, subspace curvature, and intrinsic dimension in a set of controlled and interpretable exper- iments. While in this work we consider as inputs the validation set of ImageNet [241], the framework applies to any dataset (with or without labels). Our experimental setup is as 131 Figure 4.4: (Left) Approach toward the analysis of SSL models. For each model, we use as input of the DNN the images in the validation set of ImageNet and their augmented versions. The output of the backbone encoder is used to quantify the properties of the man- ifold induced by the SSL algorithm. Specifically, we develop Manifold Graph Metrics (section 4.2) that capture manifold properties known to be crucial for transfer learning. 
The MGMs allow us to capture the specificity of each SSL model (Section 4.3.1) and to charac- terize their transfer learning capability (Section 4.3.2). (Right) We provide the dendrogram of the SSL models considered in this chapter based on our proposed MGMs. Although the underlying hyper-parameters, loss functions, and SSL paradigms are different, the manifolds induced by the SSL algorithms can be categorized into three types. An important observa- tion is that the resulting clusters are not necessarily aligned with the different classes of SSL algorithms. This result shows that although some training procedures appear more similar, one must provide a deeper analysis to understand in what aspect SSL models differ. follows: for each input sample, (i) select an augmentation setting, (ii) sample T = 50 aug- mentations of the image, (iii) compute the similarity graph using NNK neighborhoods, and (iv) extract the proposed MGMs (Sec.4.2). We limit ourselves to 5 augmentation types: (i) Semantic augmentations (Sem.) , where we consider all the samples belonging to each class as augmented versions of each other in the semantic direction of the manifold, (ii) Augmentations (Augs.) which corresponds to the sequential application of various augmen- tations used during most SSL training process (random horizontal flip, color jitter, random grayscale, Gaussian blur, and random cropping), (iii) Crop and (iv) Colorjitter (Colorjit.), specific augmentations that are part of the augmentation policies used to train the SSL, ( v) Rotate, an augmentation that was not used for training SSL but is considered important for some transfer tasks. 132 Therefore, for each sample and each MGM we obtain a value (local manifold analysis), while the collection of these per-sample MGMs for the entire validation set gives a distribu- tion (such as the one displayed in Figure 4.5). In order to extract differences between SSL models and highlight which geometric properties favor specific transfer learning tasks, we extract two statistics from these MGM distributions (global manifold analysis): the mean (denotedbythemetricname)andthespread(referredtoasmetricspread). Intotal,foreach SSL model, we obtain 26 geometric features, namely,{Sem., Augs., Crop, Colorjit., Rotate} ×{ Equivariance, Equivariance spread, Affinity, Affinity spread, Nb. of neighbors }, as well as the affinity and spread between Sem.-Augs, namely, { Sem.-Augs. Affinity, Sem.-Augs Affinityspread }. Additionaldetailsabouteachmetricandtheirvariabilityacrossallmodels are given in Appendix 4.5. In Section 4.3.2, we evaluate whether the MGMs can characterize the transfer learning capability of SSL models. While we highlight here the overall setting of the transfer learning task, details regarding the datasets can be found in Appendix 4.5. 
We consider down- stream tasks as in [238]: FewShotKornblith and ManyShotLinear correspond to few/many- shot classification using SSL features extracted on 11 image classification datasets [237]; ManyShotFinetune isrelatedtotheclassificationperformanceasintheprevioussetting, but with entire network along with the feature extractor updated; FewShotCDFSL corresponds to few-shot transfer performance in cross-domain image datasets such as CropDiseases, Eu- roSAT, ISIC, and ChestX datasets [242]; DetectionFinetune and DetectionFrozen refer to object detection task evaluated on PASCAL VOC dataset [243]; DenseSNE corresponds to dense surface normal estimation evaluated on NYUv2 [244]; and DenseSeg refers to dense segmentation task evaluated on ADE20K dataset [245]. 4.3.1 Geometry of SSL models In this section, we aim to characterize the geometric properties of the manifolds of various SSL models, as a way to highlight the differences and similarities between models. To 133 Table 4.1: (Left)Projection of SSL models MGMs onto principal components. We observe three distinct clusters based on the MGMs, which represent the geometric similar- ities and differences between the various models. Note that these clusters are not neces- sarily aligned with the underlying SSL training paradigm, i.e., contrastive, non-contrastive, prototype-clustering based. (Right) MGMs that make up the principal components and their values for each model, where we indicate by the two yellow boxes (a) and (b) the MGMs that capture the maximum variation of the models along the principal directions. do so, we extract the MGMs proposed in section 4.2 for 14 SSL models and use these MGMs to quantify model equivariance-invariance, curvature, and intrinsic dimension for each augmentation manifold. Clustering of SSL models: The similarity between SSL models leads to the clustering illustrated by the dendrogram of Figure 4.4. In Table 4.4 (Appendix 4.5) we provide the de- tailsofeachSSLmodelandhighlightthestructuraldifferencesthatleadtoourMGMs-based clustering. To further analyze these clusters, we consider the sparse principal component analysis (PCA) of SSL models having as features the 26 MGMs. We project the MGMs of each SSL model onto the two main components and observe three clusters (see Table 4.1). We also provide the MGMs that were selected by the sparse PCA and their associated importance in the principal components (see Figure 4.11 in Appendix 4.5). We note that both simCLR-v1 and simCLR-v2 have a large variation with respect to the features selected by the first principal component. Interestingly, the geometrical property 134 that characterizes both ssimCLR-v1 and simCLR-v2 is the angle between the semantic and augmented manifold (as indicated by the yellow box (a) in Table 4.1). This implies that their ability to project the augmented samples onto the data manifold varies significantly with respect to other SSL models. Specifically, while all the models have a low-affinity value between these two directions (orthogonality), both simCLR versions have a high-affinity score(≈ .8),meaningthatthesubspacesspanningthesemanticdirectionandtheaugmented direction are more aligned. This shows that the dimensional collapse effect observed and analyzedinSimCLR’sprojectionhead[223,233,234]appearstoimpactthebackboneencoder representationaswell. Specifically, simCLR-v1andsimCLR-v2aretheonlymodelsprojecting theaugmentationmanifoldontothedatamanifoldsuchthattheyarehardlydistinguishable. 
Another distinction of SimCLR models is the low intrinsic dimensionality (see Table 4.5), again showing that, compared to other SSL models, the SimCLR backbone encoder tends to project the data onto a lower-dimensional subspace. The two other clusters, composed of models based on different paradigms, are mainly characterizedbytheirdifferencesintermsofsemanticequivariancespread. Themodelsinthe green cluster (bottom) have a low semantic equivariance spread. This implies that InsDis, MoCO-v1, PIRL, SeLa-v2, BYOL, DC-v2, SwAV have the same invariance across the input data manifold, i.e., same across all classes. For instance, InsDis is highly equivariant to the semantic directions (semantic equivariance = 1.23, as shown in the 4 th column), therefore, given its semantic equivariance spread, we know that it is highly equivariant with respect to all ImageNet images. Another observation from Figure 4.1 is that when considering v1 and v2 versions of PCL, simCLR, and MoCO, only the two MoCo versions belong to different clusters. The main difference between these two models is their semantic equivariance spread. While MoCo-v1 hasalowspreadintermsofsemanticequivariance, MoCo-v2hasalargerspreadindicatingthe equivarianceisdependentonthedatasampleathand. Notethatthemaindifferencebetween MoCo-v1 and MoCo-v2 is the utilization of an MLP projection head for the latter. Therefore, 135 Figure 4.5: Histogram of observed equivariance (Semantic, Augmentation) for convolutional encoder backbone (ResNet50) and a vision transformer backbone (ViT-S) at (Left) initialization and (Right) after training with same SSL procedure (DINO [11]). We observethatatinitializationtheinductivebiasinconvolutionalnetworksleadstoaninvariant representation with respect to both semantic and augmentation manifold. However, a ViT leads to a more scattered representations of input images belonging to the semantic as well as augmentation manifolds. After training both architectures converge to a similar representation where the marginal variation in spread between the two models impact the performance in downstream tasks. thisobservationwouldsuggestthataddingaprojectionheadmakestheequivarianceproperty ofthebackboneencodermoredependentontheinputdata. Thisresultseemsintuitivegiven the fact that the projection head will absorb most of the invariance induced by the SSL loss function. We also note that some models are strongly invariant to rotations, e.g., simCLR-v1, simCLR-v2, while others are strongly equivariant, e.g., PCL-v1, PCL-v2. This is particularly surprising as rotation is typically not part of the augmentation policy used to train SSL models. Similarities in SSL models: From Table 4.1, we also deduce the geometric properties thatdonotvaryacrossSSLmodels. Inparticular,thesemanticandcolorjitteraffinities,i.e., the curvature of the data manifold along the semantic (label) direction and the color jitter augmentation direction, show similar behavior across all models studied. This observation shows that the linearization capability of different trained SSL models in these two manifold directions is very similar. We defer further analysis to Appendix 4.5 and summarize the observations in Figure 4.9. 136 ViT vs ResNet: Vision transformers (ViT) [246], built with an architecture inherited from natural language processing [247], have recently emerged as a desirable alternative to convolutional neural networks (CNN) [186] in SSL [11]. 
While ViTs are appealing, as they provideamoregeneral-purposearchitecture,thereasonsthatexplainthebenefitsoftheViT representationsoverthoseobtainedfromCNNsremainunclear. Wecomparetherepresenta- tions learned by these architectures, focusing on ViT-Small and ResNet50 architectures, as these share similar model capacity (21M vs 23M) and throughput (1007/s vs 1237/s). Fig- ure 4.5 depicts a geometric comparison of the two architectures using two of our proposed equivariance metrics at initialization and after SSL training using DINO [11]. We observe that the two architectures lead to very different representations at initialization on both the semantic and augmentation manifold directions. While ResNet50 is biased toward a col- lapsed representation for all inputs, ViT-S corresponds to a more spread-out distribution at initialization. However, aftertrainingthetwoarchitecturesconvergetoasimilarequivariant representation, with ViT-S showing better class separability on the semantic manifold. This observation shows that convolutional networks learn by repulsion, where representations are dispersed from a collapsed starting point. On the contrary, ViT shows a different behav- ior where it starts from a scattered representation organizing representations by attraction. This could explain the observed robustness and generality of the representation learned by ViT [11,248]. 4.3.2 Transfer performances of SSL models In this section, we investigate the question of determining which geometrical properties are crucial for SSL models to perform better on specific transfer learning tasks. First, we show that the clustering results of SSL models (Section 4.3.1) based on their geometry mainly coincidewiththeobservedper-clustertransferperformances. WethenshowthatourMGMs can provide intuition regarding which geometric properties are desirable to perform well on specific transfer learning tasks (e.g., classification tasks gain by having rotation invariant 137 Figure 4.6: Dendrogram of SSL models based on their transfer learning accu- racy(Left) and proposed MGMs (Right). We observe that the groupings obtained are highly correlated with the clustering performed using our MGMs, thus showing an intrinsic connection between the geometric properties of the SSL model and the transfer learning performances. representations). Finally, we explore which manifold properties are the most informative for the transfer performance of SSL models in a given task. Transfer accuracy based clustering of SSL models: In Figure 4.6 we depict the hierarchical clustering of SSL models based on their transfer learning accuracy in various tasks. SSL models are clustered if they achieve similar accuracy, good or bad, on a set of tasks. When comparing the clusters obtained using (i) MGMs (Figure 4.4), and (ii) transfer learning performances (Figure 4.6), we observe that PIRL, InsDis, and MoCo-v1 belong to the same group in both cases. Similarly, SwAV and BYOL, and SimCLR models are similar for both clustering results. This confirms our hypothesis that there is a correlation between the geometrical properties of the SSL models and their transfer learning accuracy. Recovering geometric properties for transfer learning: Intuitively, it is known that to generalize well on image recognition, an SSL model should be invariant to augmentations such as rotation, while to perform well on dense tasks, i.e., pixel-level prediction tasks, it should be equivariant to rotation [239]. 
Our proposed MGMs in Figure 4.7 confirm this 138 Figure 4.7: Correlation between equivariance and transfer learning performances. We display the Pearson coefficient ρ at the top left of each subplot. We observe that the equivariance to semantic (inputs with same class labels) and rotation augmentations of the SSL model are negatively correlated with its capability to perform well in few-shot learning tasks (small domain distance). However, these quantities are positively correlated with the accuracy of the DNN on a dense surface normal estimation task. This observation confirms common intuitions regarding the properties that an embedding should have to transfer accurately on these two tasks. The p− values for all results are≤ 0.01. intuition. We show that the semantic, rotation equivariances correlate as expected with the transfer accuracy for two tasks, namely, FewShotKornblith and DenseSNE. In particular, we find that for ( i) FewShotKornblith (few-shot learning with small domain distance), the greaterthemodelinvariancetosemanticdirectionandrotation,thebettertheaccuracy,and (ii) DenseSNE (dense surface normal estimation) the higher the equivariance of the model the better. Exploring geometric properties for transfer learning: We now propose to explore some possibly counter-intuitive manifold properties of the backbone encoder that correlate with specific transfer learning tasks. Our method is based on quantifying how well each MGM can explain the per-task transfer capability of SSL models by using simple regression methods capable of performing feature selection. For each task, we regress the MGMs onto the transfer learning accuracy. Details and illustrations of the results are shown in Appendix 4.5. We show in Figure 4.8 the correlation between the selected MGMs and the per-task transfer learning accuracy. We observe that in the case of many shot (linear and fine-tune), 139 Figure 4.8: Correlation between MGMs and transfer performances. We display the Pearson coefficient ρ at the top left of each subplot. We observe that for most transfer learning tasks, there exists an MGM that correlates with the per-task transfer performance. We recover some intuitive results, such as for many shot learning linear, higher invariance in the SSL model corresponds to better transfer learning capability. We also note that dense segmentation appears not to be correlated to the considered geometrical metrics of SSL models. The p− values for each result was ≤ 0.01 except for the last plot where the p-value=0.34. few shot (small domain distance and and large domain distance), there exist a negative cor- relation with intuitive geometric property (Figure 4.8 Top): semantic equivariance, rotation affinity, augmentation affinity, and spread of augmentation equivariance. For the cases of ManyShotLinear, ManyShotFinetune, and FewShotKornblith, we recover geometric proper- ties that were previously considered important for classification [210,249]. However, in the case of FewShotCDFSL, where the datasets used differ largely from ImageNet, the transfer performances are correlated with second-order statistics, i.e., the spread, and less on the differencesbetweentheaverageoftheMGMdistributions. In DetectionFinetune, Detection- Frozen, DenseSNE, and DenseSeg tasks, the geometric properties that are correlated with the performances are also second-order statistics (Figure 4.8 Bottom). 
Specifically, there ex- ists a positive correlation between the transfer performance and the spread of crop affinity, semantic equivariance, colorjitter affinity. Therefore, we observe an implicit relationship between the geometric properties of SSL models and their transfer learning capabilities. Specifically, for tasks with small transfer 140 distance relative to ImageNet, we observe that the performances of an SSL model are tied to well-known geometric properties, such as invariance and linearization capabilities. In contrast, in thecase oflarge transferdistance, we notethat thetransfer learning capabilities rely on higher-order statistics of the geometry of the backbone encoder, i.e., capturing how the model maps different inputs. 4.4 Discussion and Open questions We show that the geometry of DNN models can be efficiently captured by leveraging graph- based metrics. In particular, our data-driven approach provides a way to compare DNN models that can differ in architecture and training paradigms. Our analysis provides in- sights into the landscape of SSL algorithms, highlighting their geometrical similarities and differences. Further, the proposed geometrical metrics capture the transfer learning capa- bility of SSL models. Our approach motivates the design of transfer-learning-specific SSL training paradigms that take into account the geometric requirement of the downstream task. In this work, we performed our geometric analysis using the ImageNet validation dataset on SSL models with a ResNet50 backbone and one ViT architecture with approximately the same number of parameters. Our results are based on the feature space embedding of the models, and we do not foresee any issues when scaling up to larger models and datasets or even different modalities, but this remains to be tested and is left open for future work. We also restricted our analysis to 5 augmentation settings, which we considered to be of practical relevance, but further exploration is required to better understand other augmen- tation and perturbation properties of the embeddings. Although we empirically show the correlation between geometrical properties and several downstream tasks, we did not ex- plorethetheoreticalimplicationofhavingacertaingeometryanditsrelationshiptotransfer generalization. We emphasize that our formulation, unlike the accuracy metric previously 141 studied [237,238], is amenable to theoretical study using spectral and graph theoretical con- cepts. We note that our work can be leveraged in conjunction with previously developed approaches, such as [250,251], to induce a particular embedding geometry depending on the transfer learning application at hand. Our focus in this work was to provide a big picture tool for understanding features obtained with SSL models using geometry/graphs and hope it leads to new ideas for understanding and training deep learning models. 
4.5 Appendix

Algorithm 7: MGMs for SSL
Input: Dataset D = {x_i}_{i=1}^N, pre-trained SSL backbone g, data augmentation policy T
Output: MGMs for the input SSL model and dataset
1: for i = 1 to N do
2:   E_i = {}
3:   for t = 1 to T do
4:     Perform a random augmentation of x_i: x_i^(t) = T(x_i)
5:     Extract the representation of the augmented data x_i^(t): E_i <- E_i ∪ {g(x_i^(t))}
6:   for t = 1 to T do
7:     Compute the NNK graph of the augmented data in the encoder space: G_i^(t) = NNK(g(x_i^(t)), E_i)
8:     Compute local MGM values: Diam(G_i^(t)), ID(G_i^(t)), Subspace(G_i^(t))
9:     Compute the curvature MGM: {Aff(Subspace(G_i^(t)), Subspace(G_i^(t'))), t = 1,...,T, t' ∈ G_i^(t)}
10: Return MGM(g, D, T) using the local MGMs obtained: {Diam_i, ID_i, Aff_i, i = 1,...,N}

In Algorithm 7, we present the pseudo-code for the geometric evaluation of an SSL model using the proposed MGMs defined in Section 4.2, namely invariance, affinity, and the number of neighbors. We first obtain local MGM values computed using the NNK neighborhood G_i^(t) corresponding to each augmented input, which together form the graph G_i. A summary of each MGM, its geometric association, and its estimation is presented in Table 4.3. The curvature MGM is evaluated for each augmented data point and its neighbors and thus produces one value per edge in the graph, while the invariance and dimension metrics result in one value per constructed graph. For each metric (MGM) and each direction (T), we obtain a distribution over the entire manifold, MGM(g, D, T). We summarize this global manifold information by considering the first two moments of the metric distribution (mean and spread) across the dataset. Note that the type of augmentation considered reflects the manifold information corresponding to a specific direction in the feature space, e.g., using rotation augmentation allows us to characterize the embedding manifold induced by that augmentation. For each data point, we thus obtain a metric that describes the geometric property of the manifold in the considered direction. Cross-augmentation MGMs (e.g., Sem.-Augs. affinity) are obtained by comparing graphs corresponding to different augmentation policies in line 9 of the algorithm, i.e., Subspace(G_i^(t)) obtained with augmentation policy T_1 and Subspace(G_i^(t')) obtained with augmentation policy T_2. Consequently, for evaluating cross-augmentation policies, the loop over the number of augmentations (lines 3-8) is run twice, once with T_1 and once with T_2.

In our experiments, we evaluate 5 augmentation policies (T) with T = 50 augmented samples for each image in the validation set of ImageNet (N = 50000 images). This corresponds to constructing 50000 graphs per augmentation policy, with each graph containing 50 nodes, except for the Semantic setting, where we obtain 1000 graphs with 50 nodes each. Note that each graph construction involves 50 NNK neighborhood optimizations, one for each node.

Data augmentations: We follow the data augmentation policy used in [11,252], which is a typical setting for most of the self-supervised learning models considered in this work. The augmentations are performed using the default settings of the corresponding PyTorch functions [253] with the following parameters: random cropping (size=224, interpolation=bicubic), horizontal flip (p=0.5), color jitter (brightness=0.4, contrast=0.4, saturation=0.2, hue=0.1) randomly applied with p=0.8, grayscale (p=0.2), Gaussian blur (p=1.0), and rotation (degrees=90). The augmentation composition for each setting used in our experiments is presented in Table 4.2.
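For reference, a sketch of the "Augs." composition described above is given below using torchvision transforms; the Gaussian-blur kernel size and the ImageNet normalization statistics are standard choices assumed here rather than values stated in the text.

from torchvision import transforms

# Sketch of the "Augs." policy (parameters from the text; kernel_size for
# GaussianBlur and the normalization statistics are assumed defaults).
ssl_augmentation = transforms.Compose([
    transforms.RandomResizedCrop(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply(
        [transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.2, hue=0.1)],
        p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),  # applied with p = 1.0
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

The single-augmentation settings (Crop, Colorjit., Rotate) would keep only the corresponding transform, e.g., transforms.RandomRotation(degrees=90) for the Rotate setting.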
Setting    | RandomCrop | HorizontalFlip | Colorjitter | GrayScale | GaussianBlur | Rotation
Sem.       |            |                |             |           |              |
Augs.      | ✓          | ✓              | ✓           | ✓         | ✓            |
Crop       | ✓          |                |             |           |              |
Colorjit.  |            |                | ✓           |           |              |
Rotate     |            |                |             |           |              | ✓

Table 4.2: Evaluation settings and composed augmentation functions (sequentially applied in the order listed from left to right). All images are resized to 224 (bicubic interpolation) if not randomly cropped to that size, and are mean-centered and standardized along each color channel based on statistics obtained from the ImageNet training set.

Manifold Graph Metrics

Metric            | Geometric property      | Estimation
Polytope diameter | Invariance/Equivariance | Maximum distance between the neighbors of an NNK polytope: max_{k,l ∈ N(f(x_i))} \|\hat{f}(x_k) - \hat{f}(x_l)\|_2
Affinity          | Local curvature         | Cosine similarity of the principal components of two neighboring NNK polytopes: \sqrt{(\cos^2\theta_1 + \cdots + \cos^2\theta_{n_i n_j}) / (n_i n_j)}
No. of neighbors  | Intrinsic dimension     | Number of non-zero weighted neighbors identified by NNK: Card(N(f(x_i)))

Table 4.3: Proposed Manifold Graph Metrics, their relationship to the geometric property of the manifold, and the method of estimation using the observed embeddings. Note that a similar diameter evaluation with k-nearest neighbors would depend explicitly on the choice of k, while using the 1-nearest neighbor would reduce the local geometry to a single direction. Thus, a neighborhood definition that adapts to the local geometry of the data is crucial for successfully observing the properties of the manifold.

Interpreting the MGM mean and spread: The local MGM values form a distribution over the entire dataset. As noted earlier, we compute the mean and the standard deviation (referred to as the spread) of this distribution to capture the characteristics of an SSL model. The interpretation of the mean is straightforward: it can be seen as the (average) equivariance, curvature, or dimension of the augmentation manifold embedded by the SSL model on a dataset. The spread of the metric captures the variation of the geometric property with respect to differences in the input image. For example, one SSL model can be equally invariant to all input images in the dataset (small equivariance spread), while another might be more invariant for some classes than for others (large equivariance spread) yet have the same average equivariance across the dataset. In this work, we do not perform per-class analysis, since our goal is to study the impact of global characteristics on downstream performance, which can often be characterized by coarse dataset-level measures. However, we believe that the proposed MGMs can be extended to class- or attribute-level analysis. Furthermore, we expect that such studies can help capture representational heteroscedasticity in an SSL model, or how a model adapts to domain shift during transfer (e.g., from sketch images to real images of an object), to name a few possibilities.

Variability in SSL models: In Figure 4.9, for each of the MGMs studied, we depict the observed variability of these metrics across all models. The affinity between the semantic manifold and the augmentations manifold (Sem.-Augs.) is the MGM capturing the main differences across models. This metric can be intuitively understood as the angle between the natural image manifold and the manifold produced by the data augmentation policy. This implies that the ability of an SSL model to project the augmented samples onto the data manifold varies significantly. The second important feature capturing the dissimilarity between SSL models is the equivariance spread along the semantic and color jitter directions.
It implies that while some SSL models have a lot of variation in their invariance, others show equal invariance across the entire manifold. Finally, another geometrical feature that contributes strongly to the dissimilarity between SSL models is the equivariance to rotation. Interestingly, some models are strongly invariant to rotations while others are strongly equivariant. Note that rotation is not part of the augmentation policy of the models we evaluate.

We also visualize in Figure 4.9 the geometrical properties that do not vary across SSL models. The main ones are the semantic and color jitter affinities, i.e., the curvature of the data manifold along the semantic (label) direction as well as the color jitter augmentation direction. This observation shows that the linearization capability of SSL models along these two manifold directions is highly similar.

[Figure 4.9 shows, for each augmentation direction (Sem., Augs., Crop, Rotate, Colorjit.), the distribution across models of the equivariance, equivariance spread, affinity, affinity spread, and number of neighbors.]

Figure 4.9: Manifold graph metrics variability across SSL models. We depict the variation in the evaluated MGMs across the 14 pre-trained SSL models considered in this chapter. We note that the differences between the SSL model embeddings can be summarized using five manifold properties, namely, (i) Sem. equivariance, (ii) Rotate equivariance, (iii) Sem. equivariance spread, (iv) Colorjit. equivariance spread, and (v) Sem.-Augs. affinity. These are the manifold properties with the largest variation across the different SSL models.

MGM analysis: We used the Python SciPy library's linkage module to perform hierarchical agglomerative clustering. The method we used to compute the distance is based on the Voor Hees Algorithm. In Figure 4.6, we use the transfer learning accuracy of each model with respect to all the tasks defined in the experimental settings in Section 4.3. That is, the features of the clustering algorithm are the per-model accuracies on the tasks. The per-task performances are not reported here, as they were already presented in the referenced paper [238]. We propose this result to display the similarity between the clusters based on transfer learning performances and the clusters based on the manifold graph metrics.

[Figure 4.10 plots the standard deviation of each of the 26 MGMs across SSL models, sorted in decreasing order, with Sem.-Augs. affinity, Sem. equivariance spread, Colorjit. equivariance spread, Rotate equivariance, and Sem. equivariance at the top.]

Figure 4.10: (Sorted) Evaluation of manifold graph metrics variability across SSL models. We compute the standard deviation of the normalized manifold graph metrics to highlight the manifold properties that vary across SSL models. We observe that five manifold properties explain most of the differences between SSL models: (i) affinity between the semantic and augmentations manifolds, (ii) spread of semantic equivariance, (iii) spread of color jitter equivariance, (iv) rotation equivariance, and (v) semantic equivariance.

Figure 4.11: Absolute value of the principal components of the per-feature manifold graph metrics matrix.
We display here the absolute value of the two principal components obtained after sparse PCA of the matrix of dimension (number of models) x (number of manifold graph metrics). These are the principal components used to visualize the distribution of the SSL models in the two-dimensional plane in Figure 4.4. We observe that a few manifold graph metrics encapsulate most of the variance across SSL models, namely: Semantic Equivariance, Augmentations Equivariance, Crop Equivariance, Rotation Equivariance, Semantic Equivariance Spread, Augmentations Equivariance Spread, Rotation Equivariance Spread, Colorjitter Equivariance Spread, and Semantic-Augmentations Affinity.

Figure 4.12: Dendrogram of SSL models. We compute the dendrogram of the MGMs of the SSL models. This shows that there are in fact three main classes of SSL models in terms of manifold properties, as well as the proximity between different models. This result also confirms the clustering based on PCA visualized in Figure 4.4. In particular, simCLR-v1 and simCLR-v2 appear to be the models most distant from all the SSL paradigms tested here. Interestingly, the clustering obtained here does not correspond to the usual classes of SSL algorithms: contrastive, non-contrastive, cluster-based, and memory-based.

Table 4.4: SSL model details and the associated dendrogram split learned by our geometric metrics. We highlight the structural differences between SSL models that could plausibly correspond to the geometric differences and similarities we observe. It appears that the main difference between the SimCLR models and all the other SSL models could be explained by input normalization. The batch size also appears to affect the geometry drastically. Finally, we observe on a finer scale that the momentum encoder, the presence of a projection head, and the memory bank could also lead to geometric differences. We believe that the current classification of SSL approaches (contrastive, non-contrastive, cluster-based) is not sufficient to capture their geometric differences. All the parameters described here should be taken into account, as we believe there exists a complex interplay between these choices and the induced geometry of the trained SSL model.

Transfer performance and MGMs: We present the MGM values for each SSL model evaluated in our study in Table 4.5. We further identify, via a decision tree and sparse regression (LASSO), the MGMs that best predict the transfer performance of an SSL model in each downstream task. We present our findings in Figures 4.13 and 4.14.

Table 4.5: Proposed MGMs observed for different SSL models. We display here the values of each MGM for each SSL model. These correspond to the 26 MGMs considered in all the analyses provided in this chapter.

[Figure 4.13 legend: Equivariance, Equiv. Spread, Affinity, Affinity Spread, # Neighbors.]

Figure 4.13: Per transfer learning task MGM importance - Decision Tree. For each transfer learning task, we exploit the feature selection of decision trees (depth = 5) to visualize which MGMs are crucial for characterizing the transfer learning accuracy. To do so, we fit the MGMs to the transfer learning accuracies. In this figure, we highlight the MGMs that best explain the per-task transfer learning accuracy. Note that we are not interested in the regression error but in the importance of each MGM for predicting each transfer learning task. While intuitive, this result shows that mainly the invariance and curvature of the DNN characterize its transfer learning capability.
Figure 4.13: Per-transfer-learning-task MGM importance - Decision tree. For each transfer learning task, we exploit the feature selection of decision trees (depth = 5) to visualize which MGMs are crucial to characterize the transfer learning accuracy. To do so, we fit the MGMs to the transfer learning accuracies. The figure highlights the MGMs that best explain the per-task transfer learning accuracy; note that we are not interested in the regression error but in the importance of each MGM for predicting each transfer learning task. While intuitive, this result shows that mainly the invariance and curvature of the DNN characterize its transfer learning capability. (Panels: Equivariance, Equiv. Spread, Affinity, Affinity Spread, # Neighbors.)

The first observation is that the intrinsic dimension of the DNN (displayed in green as # Neighbors) does not allow one to characterize any task-specific transfer learning accuracy. Depending on the task, different geometrical properties matter. For many-shot linear evaluation, the equivariance to the semantic direction of the data manifold is crucial. For few-shot tasks with a small transfer domain distance, the linearization capability of the DNN with respect to the augmentations captures most of the information regarding the transfer learning capability. While most tasks appear to be explained by a single or a few MGMs, in the case of frozen detection and dense segmentation, multiple MGMs are required to explain the transfer learning capability of SSL models.

Figure 4.14: Per-transfer-learning-task MGM importance - LASSO. For each transfer learning task, we exploit the feature selection of the LASSO method to visualize which MGMs are crucial to characterize the transfer learning accuracy. As intuitively expected, we observe that, depending on the transfer learning task, specific manifold properties are more important. For instance, for dense detection, the equivariance spread of the semantic manifold highly correlates with the transfer learning accuracy, indicating that the variation of equivariance/invariance induced by the DNN manifold w.r.t. the input data is critical to accomplish dense detection. Similarly, for dense segmentation, the curvature of the semantic manifold appears to be the most important feature.

Datasets

Many-shot datasets: FGVC Aircraft [254], Caltech-101 [255], Stanford Cars [256], CIFAR10 [190], DTD [257], Oxford 102 Flowers [258], Food-101 [259], Oxford-IIIT Pets [260], SUN397 [261], and Pascal VOC2007 [243].

Few-shot datasets: Few Shot-Kornblith corresponds to the same datasets used for many-shot learning, except for VOC2007. For CDFSL, we use the few-shot learning benchmark introduced by [242]. It consists of four datasets that exhibit increasing dissimilarity to natural images: CropDiseases [262], EuroSAT [263], ISIC2018 [264, 265], and ChestX [266].

Detection: For detection, we use the PASCAL VOC dataset [243].

References

[1] M. Aharon, M. Elad, and A. M. Bruckstein, “K-SVD and its non-negative variant for dictionary design,” in Wavelets XI, vol. 5914, Intl. Society for Optics and Photonics, 2005.
[2] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” J. of Mach. Learning Research, vol. 11, no. 1, 2010.
[3] Y. Wang, W.-L. Chao, K. Q. Weinberger, and L. van der Maaten, “Simpleshot: Revisiting nearest-neighbor classification for few-shot learning,” arXiv:1911.04623, 2019.
[4] G. S. Dhillon, P. Chaudhari, A. Ravichandran, and S. Soatto, “A baseline for few-shot image classification,” in Int. Conf. on Learning Representations, 2020.
[5] P. Rodríguez, I. Laradji, A. Drouin, and A. Lacoste, “Embedding propagation: Smoother manifold for few-shot classification,” in Proc. of the European Conf. on Computer Vision, 2020.
[6] I. Ziko, J. Dolz, E. Granger, and I. B. Ayed, “Laplacian regularized few-shot learning,” in Int. Conf. on Machine Learning, PMLR, 2020.
[7] Y. Hu, V. Gripon, and S. Pateux, “Leveraging the feature distribution in transfer-based few-shot learning,” arXiv:2006.03806, 2020.
[8] D. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129–150, 2011.
[9] Y. Song and S.
Ermon, “Generative modeling by estimating gradients of the data distribution,” in Advances in Neural Inf. Process. Syst., 2019. [10] F.CroceandM.Hein, “Reliableevaluationofadversarialrobustnesswithanensemble of diverse parameter-free attacks,” in Int. Conf. on Mach. Learn., 2020. [11] M. Caron, H. Touvron, I. Misra, H. J´ egou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” arXiv:2104.14294, 2021. [12] S. Dasgupta, C. F. Stevens, and S. Navlakha, “A neural algorithm for a fundamental computing problem,” Science, vol. 358, no. 6364, pp. 793–796, 2017. 154 [13] L. Devroye, L. Gy¨ orfi, and G. Lugosi, A probabilistic theory of pattern recognition. Springer, 2013. [14] G. Biau and L. Devroye, Lectures on the Nearest Neighbor method. Springer, 2015. [15] G.ChenandD.Shah,Explaining the success of nearest neighbor methods in prediction. Now Publishers, 2018. [16] C. S. Calude and G. Longo, “The deluge of spurious correlations in big data,” Foun- dations of science, vol. 22, pp. 595–612, 2017. [17] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, pp. 436–444, 2015. [18] W. L. Hamilton, R. Ying, and J. Leskovec, “Representation Learning on Graphs: Methods and Applications,” 2017. [19] A. Ortega, P. Frossard, J. Kovaˇ cevi´ c, J. Moura, and P. Vandergheynst, “Graph signal processing: Overview, challenges, and applications,” Proceedings of the IEEE, 2018. [20] I. Chami, S. Abu-El-Haija, B. Perozzi, C. R´ e, and K. Murphy, “Machine learning on graphs: A model and comprehensive taxonomy,” Journal of Machine Learning Research, vol. 23, no. 89, pp. 1–64, 2022. [21] A. Ortega, Introduction to graph signal processing. Cambridge University Press, 2022. [22] A.Hogan,E.Blomqvist,M.Cochez,C.d’Amato,G.D.Melo,C.Gutierrez,S.Kirrane, J. E. L. Gayo, R. Navigli, S. Neumaier, et al., “Knowledge graphs,” ACM Computing Surveys (Csur), vol. 54, no. 4, pp. 1–37, 2021. [23] M. Maiker, U. von Luxburg, and M. Hein, “Influence of graph construction on graph- based clustering measures,” Advances in Neural Info. Processing Systems, 2009. [24] C. A. R. De Sousa, S. O. Rezende, and G. Batista, “Influence of graph construc- tion on semi-supervised learning,” in Joint European Conf. on Machine Learning and Knowledge Discovery in Databases, pp. 160–175, Springer, 2013. [25] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE Trans. on Info. Theory, 1967. [26] E. A. Nadaraya, “On estimating regression,” Theory of Probability & Its Applications, 1964. [27] G. S. Watson, “Smooth regression analysis,” Sankhy¯ a: The Indian J. of Statistics, Series A, 1964. [28] C. J. Stone, “Consistent nonparametric regression,” The Annals of Statistics, 1977. [29] R. J. Samworth et al., “Optimal weighted nearest neighbour classifiers,” The Annals of Statistics, 2012. 155 [30] M.Hein,J.Y.Audibert,andU.vonLuxburg,“Graphlaplaciansandtheirconvergence on random neighborhood graphs,” Journal of Machine Learning Research, 2007. [31] D. Ting, L. Huang, and M. Jordan, “An analysis of the convergence of graph lapla- cians,” Intl. Conf. on Machine Learning, 2010. [32] S. Shekkizhar and A. Ortega, “Graph construction from data by Non-Negative Ker- nel Regression,” in Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020. [33] S.ShekkizharandA.Ortega,“Graphconstructionfromdatausingnonnegativekernel regression (nnk graphs),” 2019. [34] S. Shekkizhar and A. Ortega, “Efficient graph construction for image representation,” in Intl. Conf. 
on Image Processing (ICIP), pp. 1956–1960, IEEE, 2020. [35] S. Shekkizhar and A. Ortega, “NNK-Means: Data summarization using dictionary learningwithnon-negativekernelregression,”in30th European Signal Processing Con- ference (EUSIPCO), IEEE, 2022. [36] S. Shekkizhar and A. Ortega, “DeepNNK: Explaining deep models and their general- ization using polytope interpolation,” arXiv:2007.10505, 2020. [37] S. Shekkizhar and A. Ortega, “Model selection and explainability in neural networks using a polytope interpolation framework,” in 55th Asilomar Conf. on Signals, Sys- tems, and Computers, IEEE, 2021. [38] S. Shekkizhar and A. Ortega, “Revisiting local neighborhood methods in machine learning,” in Data Science and Learning Workshop (DSLW), IEEE, 2021. [39] K. Nonaka, S. Shekkizhar, and A. Ortega, “Graph-based deep learning analysis and instance selection,” in IEEE 22nd International Workshop on Multimedia Signal Pro- cessing (MMSP), IEEE, 2020. [40] D.Bonet,A.Ortega,J.Ruiz-Hidalgo,andS.Shekkizhar,“Channel-wiseearlystopping without a validation set via nnk polytope interpolation,” in Asia-Pacific Signal and Info. Processing Association Annual Summit and Conf. (APSIPA ASC), IEEE, 2021. [41] D. Bonet, A. Ortega, J. Ruiz-Hidalgo, and S. Shekkizhar, “Channel redundancy and overlap in convolutional neural networks with channel-wise nnk graphs,” in IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 4328–4332, IEEE, 2022. [42] R. Cosentino, S. Shekkizhar, M. Soltanolkotabi, S. Avestimehr, and A. Ortega, “The geometry of self-supervised learning models and its impact on transfer learning,” arXiv:2209.08622, 2022. [43] P. Das, S. Shekkizhar, and A. Ortega, “Towards a geometric understanding of spatio temporal graph convolution networks,” 2022. 156 [44] M. Pelillo, “Alhazen and the nearest neighbor rule,” Pattern Recognition Letters, vol. 38, pp. 34–37, 2014. [45] E.FixandJ.L.Hodges,“Discriminatoryanalysis-nonparametricdiscrimination: Small sample performance,” tech. rep., California Univ. Berkeley, 1952. [46] D.WettschereckandT.G.Dietterich,“Locallyadaptivenearestneighboralgorithms,” Advances in Neural Info. Processing Systems, pp. 184–184, 1994. [47] L. Baoli, L. Qin, and Y. Shiwen, “An adaptive k-nearest neighbor text categorization strategy,” ACM Trans. on Asian Language Info. Processing (TALIP), vol. 3, no. 4, pp. 215–226, 2004. [48] A. Balsubramani, S. Dasgupta, Y. Freund, and S. Moran, “An adaptive nearest neigh- bor ruleforclassification,” Advances in Neural Info. Processing Systems, vol.32, 2019. [49] A. K. Ghosh, “On nearest neighbor classification using adaptive choice of k,” Journal of computational and graphical statistics, vol. 16, no. 2, pp. 482–502, 2007. [50] S. Mullick, S. Datta, and S. Das, “Adaptive learning-based k-nearest neighbor classi- fierswithresiliencetoclassimbalance,” IEEETrans.onNeuralNetworksandLearning Systems, 2018. [51] O. Anava and K. Levy, “k ∗ -nearest neighbors: From global to local,” in Advances in Neural Inf. Processing Systems, 2016. [52] D. L. Donoho, “Compressed sensing,” IEEE Trans. on Info. Theory, vol. 52, no. 4, pp. 1289–1306, 2006. [53] F. Wang and C. Zhang, “Label propagation through linear neighborhoods,” IEEE Trans. on Knowledge and Data Engineering, 2008. [54] M.KarasuyamaandH.Mamitsuka,“Adaptiveedgeweightingforgraph-basedlearning algorithms,” Machine Learning, 2017. [55] V. Kalofolias and N. Perraudin, “Large scale graph learning from smooth signals,” in Intl. Conf. on Learning Representations, 2019. [56] S. Shekkizhar and A. 
Ortega, “NNK-Means: Data summarization using dictionary learningwithnon-negativekernelregression,”in30thEuropeanSignalProcessingConf. (EUSIPCO), IEEE, 2022. [57] R. Cosentino, S. Shekkizhar, M. Soltanolkotabi, S. Avestimehr, and A. Ortega, “The geometry of self-supervised learning models and its impact on transfer learning,” arXiv:2209.08622, 2022. [58] L. Qiao, L. Zhang, S. Chen, and D. Shen, “Data-driven graph construction and graph learning: A review,” Neurocomputing, 2018. 157 [59] T. Hofmann, B. Sch¨ olkopf, and A. J. Smola, “Kernel methods in machine learning,” The annals of statistics, 2008. [60] M. A. ´ Alvarez, L. Rosasco, and N. D. Lawrence, “Kernels for vector-valued functions: A review,” Foundations and Trends® in Mach. Learn., 2012. [61] F. Jakel, B. Scholkopf, and F. A. Wichmann, “Similarity, kernels, and the triangle inequality,” Jour. of Mathematical Psychology, 2008. [62] J. Mercer, “Functions of positive and negative type, and their connection with the theory of integral equations,” Philosophical Trans. of the Royal society of London. Series A, 1909. [63] N. Aronszajn, “Theory of reproducing kernels,” Trans. of the American mathematical society, 1950. [64] B. Kulis, “Metric learning: A survey,” Foundations and trends in machine learning, vol. 5, no. 4, pp. 287–364, 2012. [65] F.WangandJ.Sun,“Surveyondistancemetriclearninganddimensionalityreduction in data mining,” Data mining and knowledge discovery, vol. 29, no. 2, pp. 534–564, 2015. [66] A. Kapoor, H. Ahn, Y. Qi, and R. Picard, “Hyperparameter and kernel learning for graph based semi-supervised classification,” in Advances in Neural Info. Processing Systems, 2005. [67] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, 2000. [68] T. Jebara, J. Wang, and S. Chang, “Graph Construction and b-Matching for Semi- Supervised Learning,” Intl. Conf. on Machine Learning, 2009. [69] S. Daitch, J. Kelner, and D. Spielman, “Fitting a Graph to Vector Data,” Intl. Conf. on Machine Learning, 2009. [70] B.Cheng, J.Yang, S.Yan, Y.Fu, andT.S.Huang, “Learningwithl1-graphforimage analysis,” IEEE Trans. on Image Processing, 2010. [71] S. Han and H. Qin, “Structure aware l1 graph for data clustering,” in AAAI, 2016. [72] V. Kalofolias, “How to learn a graph from smooth signals,” in Artificial Intelligence and Statistics, pp. 920–929, PMLR, 2016. [73] J. Ham, D. Lee, S. Mika, and B. Sch¨ olkopf, “A kernel view of the dimensionality reduction of manifolds,” in Intl. Conf. on Machine Learning, p. 47, 2004. [74] D. Kong, C. Ding, H. Huang, and F. Nie, “An iterative locally linear embedding algorithm,” in Proceedings of the 29th Intl. Conf. on Machine Learning, pp. 931–938, 2012. 158 [75] K. Murphy, Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. [76] V. Patel and R. Vidal, “Kernel sparse subspace clustering,” in IEEE Intl. Conf. on Image Processing (ICIP), 2014. [77] P. Febrer, A. Zamora, and G. Giannakis, “Matrix completion and extrapolation via kernel regression,” IEEE Trans. on Signal Processing, 2019. [78] D.Romero, M.Ma, andG.Giannakis, “Kernel-basedreconstructionofgraphsignals,” IEEE Trans. on Signal Processing, 2017. [79] X. Zhu, Z. Ghahramani, and J. Lafferty, “Semi-supervised learning using Gaussian fields and harmonic functions,” Intl. Conf. on Machine Learning, 2003. [80] I. Daubechies, Ten lectures on wavelets. SIAM, 1992. [81] J. A. Tropp and A. C. Gilbert, “Signal recovery from random measurements via or- thogonal matching pursuit,” IEEE Trans. on Info. 
theory, 2007. [82] B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Reg- ularization, Optimization, and Beyond. MIT Press, 2001. [83] S. Chen and D. Donoho, “Basis pursuit,” in Proceedings of 28th Asilomar Conf. on Signals, Systems and Computers, vol. 1, pp. 41–44, IEEE, 1994. [84] S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency dictionaries,” IEEE Trans. on Signal Proc., 1993. [85] D. L. Donoho, Y. Tsaig, I. Drori, and J. Starck, “Sparse solution of underdetermined systems of linear equations by stagewise orthogonal matching pursuit,” IEEE Trans. on Info. Theory, vol. 58, no. 2, pp. 1094–1121, 2012. [86] D. Gale, H. Kuhn, and A. W. Tucker, “Linear programming and the theory of games - chapter xii,” 1951. [87] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression,” The Annals of statistics, vol. 32, pp. 407–499, 2004. [88] I. Daubechies, R. DeVore, M. Fornasier, and C. S. G¨ unt¨ urk, “Iteratively reweighted least squares minimization for sparse recovery,” Communications on Pure and Applied Mathematics: A journal issued by the Courant Inst. of Mathematical Sciences, vol. 63, pp. 1–38, 2010. [89] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996. [90] J.Johnson, M.Douze, andH.J´ egou, “Billion-scalesimilaritysearchwithgpus,” IEEE Trans. on Big Data, 2019. [91] S.BoydandL.Vandenberghe,Convex optimization. Cambridgeuniversitypress,2004. 159 [92] C. Hurtado, S. Shekkizhar, J. Ruiz-Hidalgo, and A. Ortega, “Study of manifold geom- etry using multiscale non-negative kernel graphs,” in Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023. [93] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins, “Text clas- sification using string kernels,” Journal of machine learning research, vol. 2, no. Feb, pp. 419–444, 2002. [94] P. O. Hoyer, “Non-negative sparse coding,” in Proceedings of the 12th IEEE workshop on neural networks for signal processing, pp. 557–565, IEEE, 2002. [95] T. T. Nguyen, J. Idier, C. Soussen, and E. Djermoune, “Non-negative orthogonal greedy algorithms,” IEEE Trans. on Signal Processing, 2019. [96] A. Martinez and R. Benavente, “The AR face database: CVC technical report, 24,” 1998. [97] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, “From few to many: Il- lumination cone models for face recognition under variable lighting and pose,” IEEE Trans. on pattern analysis and machine intelligence, vol. 23, no. 6, pp. 643–660, 2001. [98] A. Gadde, A. Anis, and A. Ortega, “Active semi-supervised learning using sampling theory for graph signals,” in Proceedings of the 20th ACM SIGKDD Intl. Conf. on Knowledge discovery and data mining, pp. 492–501, ACM, 2014. [99] J. Rijn, B. Bischl, L. Torgo, B. Gao, V. Umaashankar, S. Fischer, P. Winter, B. Wiswedel, M. R. Berthold, and J. Vanschoren, “OpenML: A collaborative science platform,” in Joint European Conf. on machine learning and knowledge discovery in databases, pp. 645–649, Springer, 2013. [100] P. Ram and K. Sinha, “Federated nearest neighbor classification with a colony of fruit-flies,” in AAAI Conf. on Artificial Intelligence , 2022. [101] J. H. Friedman, “Multivariate adaptive regression splines,” The annals of statistics, vol. 19, no. 1, pp. 1–67, 1991. [102] J. Batson, D. Spielman, N. Srivastava, and S. Teng, “Spectral sparsification of graphs: Theory and algorithms,” Communications of the ACM, 2013. 
[103] M.BelkinandP.Niyogi, “Laplacianeigenmapsandspectraltechniquesforembedding and clustering,” in Intl. Conf. on Neural Info. Processing Systems, 2001. [104] D. Bonet, “Improved neural network generalization using channel-wise nnk graph con- structions,” 2021. [105] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. of the Conf. on Computer Vision and Pattern Recognition, 2016. [106] A. V. Little, M. Maggioni, and L. Rosasco, “Multiscale geometric methods for esti- mating intrinsic dimension,” Proc. SampTA, vol. 4, no. 2, 2011. 160 [107] P. Campadelli, E. Casiraghi, C. Ceruti, and A. Rozza, “Intrinsic dimension estima- tion: Relevant techniques and a benchmark framework,” Mathematical Problems in Engineering, 2015. [108] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” in Sixth International Conference on Computer Vision, pp. 839–846, IEEE, 1998. [109] A.Buades,B.Coll,andJ.Morel,“Anon-localalgorithmforimagedenoising,”inIEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 60–65, 2005. [110] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3d transform-domain collaborative filtering,” IEEE Transactions on image processing, vol. 16, no. 8, pp. 2080–2095, 2007. [111] H. Takeda, S. Farsiu, and P. Milanfar, “Kernel regression for image processing and reconstruction,” IEEE Transactions on Image Processing, vol. 16, no. 2, pp. 349–366, 2007. [112] D.ZoranandY.Weiss,“Fromlearningmodelsofnaturalimagepatchestowholeimage restoration,” in International Conference on Computer Vision (ICCV), pp. 479–486, IEEE, 2011. [113] S. Gu, L. Zhang, W. Zuo, and X. Feng, “Weighted nuclear norm minimization with applicationtoimagedenoising,”inIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2862–2869, 2014. [114] P. Milanfar, “A tour of modern image filtering: New insights and methods, both practical and theoretical,” IEEE Signal Processing Magazine, vol. 30, no. 1, pp. 106– 128, 2012. [115] G. Peyr´ e, “Image processing with nonlocal spectral bases,” Multiscale Modeling & Simulation, vol. 7, no. 2, pp. 703–730, 2008. [116] A.Gadde,S.K.Narang,andA.Ortega,“Bilateralfilter: Graphspectralinterpretation and extensions,” in IEEE International Conference on Image Processing, pp. 1222– 1226, 2013. [117] G. Cheung, E. Magli, Y. Tanaka, and M. K. Ng, “Graph spectral image processing,” Proceedings of the IEEE, vol. 106, no. 5, pp. 907–930, 2018. [118] Y.Tanaka,Y.C.Eldar,A.Ortega,andG.Cheung,“Samplingsignalsongraphs: From theory to applications,” IEEE Signal Processing Magazine, vol. 37, no. 6, pp. 14–30, 2020. [119] A. Gadde, M. Xu, and A. Ortega, “Sparse inverse bilateral filters for image process- ing,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1437–1441, 2017. 161 [120] H. E. Egilmez, E. Pavez, and A. Ortega, “Graph learning from data under laplacian and structural constraints,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 6, pp. 825–841, 2017. [121] P. Milanfar, “Symmetrizing smoothing filters,” SIAM Journal on Imaging Sciences, vol. 6, no. 1, pp. 263–284, 2013. [122] M.ZhangandB.K.Gunturk, “Multiresolutionbilateralfilteringforimagedenoising,” IEEE Transactions on Image Processing, vol. 17, no. 12, pp. 2324–2333, 2008. [123] S. Deutsch, A. Ortega, and G. 
Medioni, “Manifold denoising based on spectral graph wavelets,” in IEEE International Conference on Acoustics, Speech and Signal Process- ing (ICASSP), pp. 4673–4677, 2016. [124] J. W. Tukey et al., Exploratory data analysis, vol. 2. Addison Wesley Publishing Company, 1977. [125] A. K. Jain, “Data clustering: 50 years beyond k-means,” Pattern recognition letters, vol. 31, no. 8, 2010. [126] J.Leskovec,A.Rajaraman,andJ.D.Ullman, Mining of massive data sets. Cambridge university press, 2020. [127] V. Feldman and C. Zhang, “What neural networks memorize and why: Discovering the long tail via influence estimation,” Advances in Neural Information Processing Systems, vol. 33, pp. 2881–2891, 2020. [128] J. M. Phillips, “Coresets and sketches,” in Handbook of discrete and computational geometry (C. D. Toth, J. O’Rourke, and J. E. Goodman, eds.), CRC press, 2017. [129] A. Munteanu and C. Schwiegelshohn, “Coresets-methods and history: A theoreticians design pattern for approximation and streaming algorithms,” KI-K¨ unstliche Intelli- genz, vol. 32, no. 1, 2018. [130] S. Lloyd, “Least squares quantization in PCM,” IEEE Trans. on Information theory, vol. 28, no. 2, 1982. [131] R. Gray, “Vector quantization,” IEEE ASSP Magazine, vol. 1, no. 2, 1984. [132] M. Kleindessner, P. Awasthi, and J. Morgenstern, “Fair k-center clustering for data summarization,” in Intl. Conf. on Mach. Learning, 2019. [133] G. Gan, C. Ma, and J. Wu, Data clustering: theory, algorithms, and applications. SIAM, 2020. [134] K. Engan, S. O. Aase, and J. H. Husoy, “Method of optimal directions for frame design,” in Intl. Conf. on Acoustics, Speech, and Signal Processing., vol. 5, IEEE, 1999. 162 [135] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing over- complete dictionaries for sparse representation,” IEEE Trans. on Signal processing, vol. 54, no. 11, 2006. [136] L. Zhang, W.-D. Zhou, P.-C. Chang, J. Liu, Z. Yan, T. Wang, and F.-Z. Li, “Ker- nel sparse representation-based classifier,” IEEE Trans. on Signal Processing, vol. 60, no. 4, 2011. [137] H. Van Nguyen, V. M. Patel, N. M. Nasrabadi, and R. Chellappa, “Kernel dictionary learning,” in Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2012. [138] H. Van Nguyen, V. M. Patel, N. M. Nasrabadi, and R. Chellappa, “Design of non- linear kernel dictionaries for object recognition,” IEEE Trans. on Image Processing, vol. 22, no. 12, 2013. [139] B.DumitrescuandP.Irofti,Dictionary learning algorithms and applications. Springer, 2018. [140] I.Ramirez,P.Sprechmann,andG.Sapiro,“Classificationandclusteringviadictionary learningwithstructuredincoherenceandsharedfeatures,”inIEEEConf.onComputer Vision and Pattern Recognition, 2010. [141] T. H. Vu and V. Monga, “Fast low-rank shared dictionary learning for image classifi- cation,” IEEE Trans. on Image Processing, vol. 26, 2017. [142] D. Feldman, M. Feigin, and N. Sochen, “Learning big (image) data via coresets for dictionaries,” J. of Mathematical imaging and vision, vol. 46, no. 3, 2013. [143] A. Golts and M. Elad, “Linearized kernel dictionary learning,” J. of Selected Topics in Signal Processing, vol. 10, no. 4, 2016. [144] Z. Borsos, M. Mutny, and A. Krause, “Coresets via bilevel optimization for continual learning and streaming,” Advances in Neural Information Processing Systems, vol. 33, pp. 14879–14890, 2020. [145] M. Bevilacqua, A. Roumy, C. Guillemot, and M.-L. A. Morel, “K-web: Nonnegative dictionary learning for sparse image representations,” in 2013 IEEE Intl. Conf. 
on Image Processing, pp. 146–150, IEEE, 2013. [146] Q. Pan, D. Kong, C. Ding, and B. Luo, “Robust non-negative dictionary learning,” in 28th AAAI Conf. on Artificial Intelligence , 2014. [147] B. Hosseini, F. H¨ ulsmann, M. Botsch, and B. Hammer, “Non-negative kernel sparse coding for the analysis of motion data,” in Intl. Conf. on Artificial Neural Networks , pp. 506–514, Springer, 2016. [148] Y. Zhang, T. Xu, and J. Ma, “Image categorization using non-negative kernel sparse representation,” Neurocomputing, 2017. 163 [149] J. Zhou, S. Zeng, and B. Zhang, “Kernel nonnegative representation-based classifier,” Applied Intelligence, 2021. [150] D. Maclaurin, D. Duvenaud, and R. Adams, “Gradient-based hyperparameter opti- mizationthroughreversiblelearning,”inIntl. Conf. on Mach. learning,pp.2113–2122, PMLR, 2015. [151] T. Wang, J.-Y. Zhu, A. Torralba, and A. A. Efros, “Dataset distillation,” arXiv preprint arXiv:1811.10959, 2018. [152] I. Sucholutsky and M. Schonlau, “Soft-label dataset distillation and text dataset dis- tillation,” in 2021 Intl. Joint Conf. on Neural Networks (IJCNN), pp. 1–8, IEEE, 2021. [153] B.Zhao, K.R.Mopuri, andH.Bilen, “Datasetcondensationwithgradientmatching,” in Intl. Conf. on Learning Representations, 2021. [154] A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and gen- eralization in neural networks,” Advances in neural information processing systems, vol. 31, 2018. [155] T.Nguyen, Z.Chen, andJ.Lee, “Datasetmeta-learningfromkernelridge-regression,” arXiv preprint arXiv:2011.00050, 2020. [156] P. K. Agarwal, S. Har-Peled, K. R. Varadarajan, et al., “Geometric approximation via coresets,” Combinatorial and computational geometry, vol. 52, no. 1, 2005. [157] D. Feldman and M. Langberg, “A unified framework for approximating and clustering data,” in Proceedings of the forty-third annual ACM symposium on Theory of comput- ing, pp. 569–578, 2011. [158] R. Rubinstein, M. Zibulevsky, and M. Elad, “Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit,” tech. rep., Computer Science Department, Technion, 2008. [159] T.Chen,S.Kornblith,M.Norouzi,andG.Hinton,“Asimpleframeworkforcontrastive learning of visual representations,” in International conference on machine learning, pp. 1597–1607, PMLR, 2020. [160] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learn- ing requires rethinking generalization,” in Int. Conf. on Learn. Representations, 2017. [161] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar, “Do Cifar-10 classifiers generalize to Cifar-10?,” arXiv:1806.00451, 2018. [162] M. Belkin, D. J. Hsu, and P. Mitra, “Overfitting or perfect fitting? Risk bounds for classificationandregressionrulesthatinterpolate,”in Advances in Neural Inf. Process. Syst., 2018. 164 [163] M. Belkin, A. Rakhlin, and A. B. Tsybakov, “Does data interpolation contradict sta- tistical optimality?,” in The 22nd Int. Conf. on Artificial Intelligence and Statistics , PMLR, 2019. [164] N. Papernot and P. McDaniel, “Deep k-nearest neighbors: Towards confident, inter- pretable and robust deep learning,” CoRR, 2018. [165] E. Wallace, S. Feng, and J. Boyd-Graber, “Interpreting neural networks with nearest neighbors,” in Proc. of the EMNLP Workshop BlackboxNLP, 2018. [166] P. Massart, ´E. N´ ed´ elec, et al., “Risk bounds for statistical learning,” The Annals of Statistics, 2006. [167] A. Elisseeff and M. 
Pontil, “Leave-one-out error and stability of learning algorithms with applications,” NATO Science Series III: Computer and Systems Sciences, 2003. [168] A. Luntz, “On estimation of characters obtained in statistical procedure of recogni- tion,” Technicheskaya Kibernetica, 1969. [169] W. H. Rogers and T. J. Wagner, “A finite sample distribution-free performance bound for local discrimination rules,” The Annals of Statistics, 1978. [170] L. Devroye and T. Wagner, “Distribution-free inequalities for the deleted and holdout error estimates,” IEEE Trans. on Inf. Theory, 1979. [171] C. Rogers, “Covering a sphere with spheres,” Mathematika, 1963. [172] L. Chen, “New analysis of the sphere covering problems and optimal polytope approx- imation of convex bodies,” J. of Approximation Theory, 2005. [173] T. Liang and A. Rakhlin, “Just interpolate: Kernel “ridgeless” regression can gener- alize,” Annals of Statistics, 2020. [174] T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani, “Surprises in high- dimensional ridgeless least squares interpolation,” arXiv:1903.08560, 2019. [175] P. W. Koh and P. Liang, “Understanding black-box predictions via influence func- tions,” in Int. Conf. on Mach. Learn., 2017. [176] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” in Advances in Neural Inf. Process. Syst., 2017. [177] M. Bontonou, C. Lassance, G. B. Hacene, V. Gripon, J. Tang, and A. Ortega, “Intro- ducing graph smoothness loss for training deep learning architectures,” in IEEE Data Science Workshop, 2019. [178] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing, “Deep kernel learning,” in Artificial Intelligence and Statistics , 2016. 165 [179] R.-E.Fan,K.-W.Chang,C.-J.Hsieh,X.-R.Wang,andC.-J.Lin,“Liblinear: Alibrary for large linear classification,” J. of Mach. Learn. research, 2008. [180] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al., “Matching networks for one shot learning,” in Advances in Neural Inf. Process. Syst., 2016. [181] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. Tenenbaum, H. Larochelle, andR.Zemel,“Meta-learningforsemi-supervisedfew-shotclassification,”in Int.Conf. on Learning Representations, 2018. [182] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpa- thy, A. Khosla, M. Bernstein, et al., “ImageNet: Large Scale Visual Recognition Chal- lenge,” Int. J. of Computer Vision, 2015. [183] S. Zagoruyko and N. Komodakis, “Wide residual networks,” in British Mach. Vision Conf., 2016. [184] E.D.Cubuk,B.Zoph,D.Mane,V.Vasudevan,andQ.V.Le,“Autoaugment: Learning augmentation strategies from data,” in Proc. of the Conf. on Computer Vision and Pattern Recognition, 2019. [185] X. Chen, H. Fan, R. Girshick, and K. He, “Improved baselines with momentum con- trastive learning,” arXiv:2003.04297, 2020. [186] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” 2020. [187] P.Goyal,Q.Duval,J.Reizenstein,M.Leavitt,M.Xu,B.Lefaudeux,M.Singh,V.Reis, M. Caron, P. Bojanowski, A. Joulin, and I. Misra, “VISSL,” 2021. [188] G. Csurka, “Domain adaptation for visual applications: A comprehensive survey,” arXiv:1702.05374, 2017. [189] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, “A compre- hensive survey on transfer learning,” Proceedings of the IEEE, 2020. [190] A. Krizhevsky et al., “Learning multiple layers of features from tiny images,” 2009. 
[191] D.HendrycksandT.Dietterich,“Benchmarkingneuralnetworkrobustnesstocommon corruptions and perturbations,” Int. Conf. on Learning Representations, 2019. [192] T. T. Nguyen, J. Idier, C. Soussen, and E.-H. Djermoune, “Non-negative orthogonal greedyalgorithms,” IEEE Transactions on Signal Processing,vol.67,no.21,pp.5643– 5658, 2019. [193] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learn- ing requires rethinking generalization,” in International Conference on Learning Rep- resentations (ICLR), 2017. 166 [194] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learn- ing (still) requires rethinking generalization,” Communications of the ACM, vol. 64, no. 3, pp. 107–115, 2021. [195] B.Recht,R.Roelofs,L.Schmidt,andV.Shankar,“Doimagenetclassifiersgeneralizeto imagenet?,” in International conference on machine learning, pp. 5389–5400, PMLR, 2019. [196] M. Anthony, P. L. Bartlett, P. L. Bartlett, et al., Neural network learning: Theoretical foundations, vol. 9. cambridge university press Cambridge, 1999. [197] O. Bousquet and A. Elisseeff, “Algorithmic stability and generalization performance,” Advances in Neural Information Processing Systems, vol. 13, 2000. [198] F. Bauer, S. Pereverzev, and L. Rosasco, “On regularization algorithms in learning theory,” Journal of complexity, vol. 23, no. 1, pp. 52–72, 2007. [199] S.Gunasekar,J.D.Lee,D.Soudry,andN.Srebro,“Implicitbiasofgradientdescenton linear convolutional networks,” Advances in Neural Information Processing Systems, vol. 31, 2018. [200] M. Soltanolkotabi, A. Javanmard, and J. D. Lee, “Theoretical insights into the op- timization landscape of over-parameterized shallow neural networks,” IEEE Transac- tions on Information Theory, vol. 65, no. 2, pp. 742–769, 2018. [201] S. Arora, S. Du, W. Hu, Z. Li, and R. Wang, “Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks,” in International Conference on Machine Learning, pp. 322–332, PMLR, 2019. [202] M.Belkin,Problems of Learning on Manifolds. PhDthesis,TheUniversityofChicago, 2003. [203] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric frame- work for learning from labeled and unlabeled examples,” The Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006. [204] C. Lassance, M. Bontonou, G. B. Hacene, V. Gripon, J. Tang, and A. Ortega, “Deep geometric knowledge distillation with graphs,” in Int. Conf. on Acoustics, Speech and Signal Process., IEEE, 2020. [205] J.A.Tropp, Topics in sparse approximation. TheUniversityofTexasatAustin, 2004. [206] P. Indyk and R. Motwani, “Approximate nearest neighbors: towards removing the curse of dimensionality,” in Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp. 604–613, 1998. [207] G. Ongie, R. Willett, D. Soudry, and N. Srebro, “A function space view of bounded norm infinite width ReLU nets: The multivariate case,” in International Conference on Learning Representations, 2019. 167 [208] M. Fazlyab, M. Morari, and G. J. Pappas, “Safety verification and robustness analysis of neural networks via quadratic constraints and semidefinite programming,” IEEE Transactions on Automatic Control, 2020. [209] M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. Pappas, “Efficient and accu- rate estimation of lipschitz constants for deep neural networks,” Advances in Neural Information Processing Systems, 2019. 
[210] M.Soltanolkotabi, E.Elhamifar, andE.J.Candes, “Robustsubspaceclustering,” The annals of Statistics, vol. 42, no. 2, pp. 669–699, 2014. [211] P.Pope,C.Zhu,A.Abdelkader,M.Goldblum,andT.Goldstein,“Theintrinsicdimen- sion of images and its impact on learning,” in International Conference on Learning Representations, 2020. [212] M. Caron, H. Touvron, I. Misra, H. J´ egou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” CoRR, vol. abs/2104.14294, 2021. [213] T.Chen,S.Kornblith,K.Swersky,M.Norouzi,andG.E.Hinton,“Bigself-supervised modelsarestrongsemi-supervisedlearners,”Advancesinneuralinformationprocessing systems, vol. 33, pp. 22243–22255, 2020. [214] T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton, “A simple framework for con- trastive learning of visual representations,” CoRR, vol. abs/2002.05709, 2020. [215] J. Grill, F. Strub, F. Altch´ e, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. ´ A.Pires,Z.D.Guo,M.G.Azar,B.Piot,K.Kavukcuoglu,R.Munos,andM.Valko, “Bootstrap your own latent: A new approach to self-supervised learning,” CoRR, vol. abs/2006.07733, 2020. [216] I. Misra and L. van der Maaten, “Self-supervised learning of pretext-invariant repre- sentations,” CoRR, vol. abs/1912.01991, 2019. [217] D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman, “With a little helpfrommyfriends: Nearest-neighborcontrastivelearningofvisualrepresentations,” CoRR, vol. abs/2104.14548, 2021. [218] A. Bardes, J. Ponce, and Y. LeCun, “Vicreg: Variance-invariance-covariance regular- ization for self-supervised learning,” CoRR, vol. abs/2105.04906, 2021. [219] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny, “Barlow twins: Self-supervised learning via redundancy reduction,” CoRR, vol. abs/2103.03230, 2021. [220] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” CoRR, vol. abs/1807.05520, 2018. [221] J.Li,P.Zhou,C.Xiong,R.Socher,andS.C.H.Hoi,“Prototypicalcontrastivelearning of unsupervised representations,” CoRR, vol. abs/2005.04966, 2020. 168 [222] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola, “What makes for good views for contrastive learning,” CoRR, vol. abs/2005.10243, 2020. [223] L. Jing, P. Vincent, Y. LeCun, and Y. Tian, “Understanding dimensional collapse in contrastive self-supervised learning,” arXiv preprint arXiv:2110.09348, 2021. [224] Y. Tian, X. Chen, and S. Ganguli, “Understanding self-supervised learning dynam- ics without contrastive pairs,” in International Conference on Machine Learning, pp. 10268–10278, PMLR, 2021. [225] T. Hua, W. Wang, Z. Xue, Y. Wang, S. Ren, and H. Zhao, “On feature decorrelation in self-supervised learning,” CoRR, vol. abs/2105.00470, 2021. [226] J.-B. Grill, F. Strub, F. Altch´ e, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, et al., “Bootstrap your own latent-a new approach to self-supervised learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 21271–21284, 2020. [227] X. Chen and K. He, “Exploring simple siamese representation learning,” in Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758, 2021. [228] J. Li, P. Zhou, C. Xiong, and S. C. Hoi, “Prototypical contrastive learning of unsuper- vised representations,” arXiv preprint arXiv:2005.04966, 2020. [229] K. He, H. Fan, Y. Wu, S. Xie, and R. 
Girshick, “Momentum contrast for unsupervised visual representation learning,” 2020. [230] I.MisraandL.v.d.Maaten, “Self-supervisedlearningofpretext-invariantrepresenta- tions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707–6717, 2020. [231] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non- parametric instance discrimination,” in Proceedings of the IEEE conference on com- puter vision and pattern recognition, pp. 3733–3742, 2018. [232] Y.M.Asano,C.Rupprecht,andA.Vedaldi,“Self-labellingviasimultaneousclustering and representation learning,” arXiv preprint arXiv:1911.05371, 2019. [233] R. Cosentino, A. Sengupta, S. Avestimehr, M. Soltanolkotabi, A. Ortega, T. Willke, and M. Tepper, “Toward a geometrical understanding of self-supervised contrastive learning,” arXiv preprint arXiv:2205.06926, 2022. [234] W. Huang, M. Yi, and X. Zhao, “Towards the generalization of contrastive self- supervised learning,” arXiv preprint arXiv:2111.00743, 2021. [235] J.Z.HaoChen,C.Wei,A.Gaidon,andT.Ma,“Provableguaranteesforself-supervised deeplearningwithspectralcontrastiveloss,” Advances in Neural Information Process- ing Systems, vol. 34, 2021. 169 [236] F.WangandH.Liu,“Understandingthebehaviourofcontrastiveloss,”in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2495– 2504, 2021. [237] S.Kornblith,J.Shlens,andQ.V.Le,“Dobetterimagenetmodelstransferbetter?,”in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2661–2671, 2019. [238] L. Ericsson, H. Gouk, and T. M. Hospedales, “How well do self-supervised models transfer?,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pp. 5414–5423, 2021. [239] T. Xiao, X. Wang, A. A. Efros, and T. Darrell, “What should not be contrastive in contrastive learning,” arXiv preprint arXiv:2008.05659, 2020. [240] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. K¨ opf, E. Z. Yang, Z. DeVito, M. Raison, A.Tejani, S.Chilamkurthy, B.Steiner, L.Fang, J.Bai, andS.Chintala, “Pytorch: An imperativestyle, high-performancedeeplearninglibrary,” CoRR,vol.abs/1912.01703, 2019. [241] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchicalimagedatabase,”in2009 IEEE conference on computer vision and pattern recognition, pp. 248–255, Ieee, 2009. [242] Y. Guo, N. C. Codella, L. Karlinsky, J. V. Codella, J. R. Smith, K. Saenko, T. Ros- ing, and R. Feris, “A broader study of cross-domain few-shot learning,” in European Conference on Computer Vision, pp. 124–141, Springer, 2020. [243] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascalvisualobjectclasses(VOC)challenge,”Internationaljournalofcomputervision, vol. 88, no. 2, pp. 303–338, 2010. [244] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in European conference on computer vision, pp. 746–760, Springer, 2012. [245] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, “Seman- tic understanding of scenes through the ADE20k dataset,” International Journal of Computer Vision, vol. 127, no. 3, pp. 302–321, 2019. [246] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. 
Gelly, et al., “An image is worth 16x16 words: Transformersforimagerecognitionatscale,” arXiv preprint arXiv:2010.11929, 2020. [247] A.Vaswani,N.Shazeer,N.Parmar,J.Uszkoreit,L.Jones,A.N.Gomez, L.Kaiser,and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017. 170 [248] Y. Bai, J. Mei, A. L. Yuille, and C. Xie, “Are transformers more robust than CNNs?,” Advances in Neural Information Processing Systems, vol. 34, 2021. [249] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” arXiv preprint arXiv:1803.07728, 2018. [250] S. Rifai, G. Mesnil, P. Vincent, X. Muller, Y. Bengio, Y. Dauphin, and X. Glorot, “Higher order contractive auto-encoder,” in Joint European conference on machine learning and knowledge discovery in databases, pp. 645–660, Springer, 2011. [251] R.Cosentino,R.Balestriero,R.Baranuik,andB.Aazhang,“Deepautoencoders: From understanding to generalization guarantees,” in Mathematical and Scientific Machine Learning, pp. 197–222, PMLR, 2022. [252] P.H.Richemond, J.-B.Grill, F.Altch´ e, C.Tallec, F.Strub, A.Brock, S.Smith, S.De, R. Pascanu, B. Piot, et al., “Byol works even without batch statistics,” arXiv preprint arXiv:2010.10241, 2020. [253] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N.Gimelshein,L.Antiga,etal.,“Pytorch: Animperativestyle,high-performancedeep learning library,” Advances in neural information processing systems, vol. 32, 2019. [254] S. Maji, E. Rahtu, J. Kannala, M. B. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” CoRR, vol. abs/1306.5151, 2013. [255] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few trainingexamples: Anincrementalbayesianapproachtestedon101objectcategories,” in 2004 Conference on Computer Vision and Pattern Recognition Workshop, pp. 178– 178, 2004. [256] J. Krause, J. Deng, M. Stark, and L. Fei-Fei, “Collecting a large-scale dataset of fine- grained cars,” 2013. [257] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” CoRR, vol. abs/1311.3618, 2013. [258] M.-E.NilsbackandA.Zisserman,“Automatedflowerclassificationoveralargenumber of classes,” in 2008 Sixth Indian Conference on Computer Vision, Graphics Image Processing, pp. 722–729, 2008. [259] L. Bossard, M. Guillaumin, and L. V. Gool, “Food-101–mining discriminative compo- nents with random forests,” in European conference on computer vision, pp. 446–461, Springer, 2014. [260] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, “Cats and dogs,” in 2012 IEEE conference on computer vision and pattern recognition, pp. 3498–3505, IEEE, 2012. 171 [261] J.Xiao,J.Hays,K.A.Ehinger,A.Oliva,andA.Torralba,“Sundatabase: Large-scale scene recognition from abbey to zoo,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492, 2010. [262] S. P. Mohanty, D. P. Hughes, and M. Salath´ e, “Using deep learning for image-based plant disease detection,” Frontiers in plant science, vol. 7, p. 1419, 2016. [263] P. Helber, B. Bischke, A. Dengel, and D. Borth, “Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 12, no. 7, pp. 2217–2226, 2019. [264] N.Codella, V.Rotemberg, P.Tschandl, M.E.Celebi, S.Dusza, D.Gutman, B.Helba, A. Kalloo, K. Liopyris, M. 
Marchetti, et al., “Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic),” arXiv preprint arXiv:1902.03368, 2019. [265] P.Tschandl,C.Rosendahl,andH.Kittler,“TheHAM10000dataset,alargecollection of multi-source dermatoscopic images of common pigmented skin lesions,” Scientific data, vol. 5, no. 1, pp. 1–9, 2018. [266] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers, “Chestx-ray8: Hospital-scalechestx-raydatabaseandbenchmarksonweakly-supervisedclassification and localization of common thorax diseases,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2097–2106, 2017. 172
Abstract
This dissertation advances the algorithmic foundations and applications of neighborhood- and graph-based data processing by introducing (i) a novel sparse signal approximation perspective and an improved method, non-negative kernel regression (NNK), for neighborhood and graph construction; (ii) geometric and scalable extensions of the proposed NNK method for image representation, summarization, and non-parametric estimation; and (iii) a graph framework that forms the basis for a geometric theory of deep neural networks (DNNs).
Neighborhoods are data representations based on sparse signal approximations. This perspective leads to a new optimality criterion for neighborhoods. Conventional kNN and epsilon-neighborhood methods are sparse approximations based on thresholding, with their hyperparameters k/epsilon used to control the sparsity level. We introduce NNK, an improved and efficient approach inspired by basis pursuit methods. We derive the polytope geometry of neighborhoods defined with NNK and basis pursuit-based approaches. In particular, we show that, unlike earlier approaches, NNK accounts for the relative position of the neighbors in their definitions. Our experiments demonstrate that NNK produces robust, adaptive, and sparse neighborhoods with superior performance in several signal processing and machine learning tasks.
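The neighborhood definition summarized above can be illustrated with a short sketch: pre-select the k nearest candidates (the usual thresholding step) and then solve a non-negative kernel regression, keeping only the candidates that receive a nonzero weight. The code below is a minimal illustration under assumed choices, namely a Gaussian kernel, a fixed k and bandwidth sigma, and a Cholesky-based reformulation solved with SciPy's nnls; it is not the exact implementation developed in the dissertation.

```python
import numpy as np
from scipy.optimize import nnls
from scipy.spatial.distance import cdist


def nnk_neighborhood(X, i, k=10, sigma=1.0):
    """Illustrative NNK-style neighborhood for point i of the data matrix X (n x d)."""
    # Gaussian kernel similarity between x_i and every point in X.
    d2 = cdist(X[i:i + 1], X, "sqeuclidean").ravel()
    k_all = np.exp(-d2 / (2 * sigma ** 2))
    k_all[i] = -np.inf                          # exclude the query itself
    S = np.argsort(k_all)[::-1][:k]             # k most similar candidates (kNN step)

    # Kernel matrix among the candidates and kernel vector to the query.
    K_SS = np.exp(-cdist(X[S], X[S], "sqeuclidean") / (2 * sigma ** 2))
    k_Si = np.exp(-d2[S] / (2 * sigma ** 2))

    # NNK objective: min_{theta >= 0} theta^T K_SS theta - 2 k_Si^T theta.
    # With K_SS = L L^T (Cholesky), this equals, up to a constant, the NNLS
    # problem min_{theta >= 0} || L^T theta - L^{-1} k_Si ||^2.
    L = np.linalg.cholesky(K_SS + 1e-8 * np.eye(k))
    theta, _ = nnls(L.T, np.linalg.solve(L, k_Si))

    keep = theta > 1e-10                        # NNK neighbors: nonzero weights
    return S[keep], theta[keep]


# Toy usage on random data (shapes and parameters are arbitrary).
X = np.random.default_rng(0).normal(size=(200, 5))
neighbors, weights = nnk_neighborhood(X, i=0, k=15, sigma=1.0)
print(neighbors, weights)
```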
We then study the application of NNK in three domains. First, we use the properties of images to obtain a scalable NNK graph construction algorithm. Image graphs obtained with our approach are up to 90% more sparse and possess better energy compaction and denoising ability than conventional methods. Second, we extend the NNK algorithm for data summarization using dictionary learning. The proposed NNK-Means algorithm has runtime and geometric properties similar to the kMeans approach. However, unlike kMeans, our approach leads to a smooth partition of the data and superior performance in downstream tasks. Third, we demonstrate the use of NNK for interpolative estimation. We bound the estimation risk of proposed estimators based on the geometry of NNK and show, empirically, its superior performance in classification, few-shot learning, and transfer learning.
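To complement the summary of the data summarization application, here is a simplified sketch in the spirit of NNK-Means: it alternates a non-negative kernel coding step (each point is coded over its closest dictionary atoms) with a weighted-mean dictionary update, mirroring the kMeans assign/update loop. The kernel choice, initialization, and update rule shown here are illustrative assumptions and may differ from the algorithm developed in the dissertation.

```python
import numpy as np
from scipy.optimize import nnls
from scipy.spatial.distance import cdist


def nnk_means_sketch(X, n_atoms=16, k=5, sigma=1.0, n_iter=10, seed=0):
    """Simplified NNK-Means-style summarization (illustrative, not the thesis code)."""
    rng = np.random.default_rng(seed)
    A = X[rng.choice(len(X), n_atoms, replace=False)].copy()  # initial atoms

    for _ in range(n_iter):
        W = np.zeros((len(X), n_atoms))
        for i, x in enumerate(X):
            d2 = cdist(x[None], A, "sqeuclidean").ravel()
            S = np.argsort(d2)[:k]                              # k closest atoms
            K_SS = np.exp(-cdist(A[S], A[S], "sqeuclidean") / (2 * sigma ** 2))
            k_Si = np.exp(-d2[S] / (2 * sigma ** 2))
            # Non-negative kernel coding, as in the NNK neighborhood sketch.
            L = np.linalg.cholesky(K_SS + 1e-8 * np.eye(k))
            theta, _ = nnls(L.T, np.linalg.solve(L, k_Si))
            W[i, S] = theta
        # Weighted-mean update of each atom (kMeans-like, but with soft sparse weights).
        mass = W.sum(axis=0)
        for j in range(n_atoms):
            if mass[j] > 0:
                A[j] = (W[:, j] @ X) / mass[j]
    return A, W


# Toy usage (shapes and parameters are arbitrary).
X = np.random.default_rng(1).normal(size=(300, 5))
atoms, codes = nnk_means_sketch(X, n_atoms=16, k=5)
print(atoms.shape, codes.shape)
```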
Finally, we present a data-driven graph framework to understand and improve DNNs. In particular, we propose a geometric characterization of DNN models by constructing NNK graphs on the feature embeddings induced by the sequential mappings in the DNN. Our proposed manifold graph metrics provide insights into the similarities and disparities between different DNNs, their invariances, and the transfer learning performances of pre-trained DNNs.
Asset Metadata
Creator
Shekkizhar, Sarath
(author)
Core Title
Neighborhood and graph constructions using non-negative kernel regression (NNK)
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Degree Conferral Date
2023-08
Publication Date
07/14/2023
Defense Date
04/10/2023
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
clustering,deep learning,dictionary learning,epsilon-neighborhood,graph image processing,graphs,interpolation estimators,k-nearest neighbor,local linearity,nearest neighbors,neural networks,OAI-PMH Harvest,self-supervised learning
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Ortega, Antonio (committee chair), Shahabi, Cyrus (committee member), Soltanolkotabi, Mahdi (committee member)
Creator Email
sarath0210@gmail.com,shekkizh@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113279238
Unique identifier
UC113279238
Identifier
etd-Shekkizhar-12084.pdf (filename)
Legacy Identifier
etd-Shekkizhar-12084
Document Type
Dissertation
Format
theses (aat)
Rights
Shekkizhar, Sarath
Internet Media Type
application/pdf
Type
texts
Source
20230717-usctheses-batch-1068 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu