Neuroscience Inspired Algorithms for Lifelong Learning and Machine Vision

by

Amanda Sofie Rios

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(NEUROSCIENCE)

August 2022

Copyright 2022 Amanda Sofie Rios

Acknowledgements

First and foremost, I would like to deeply thank my advisor Dr. Laurent Itti and my co-advisor Dr. Bartlett Mel for teaching, mentoring and inspiring me through my PhD program. I would also like to thank my Committee Chair, Dr. Judith Hirsch for all her continued support. Furthermore, I want to acknowledge my internship supervisor Dr. Omesh Tickoo and my research collaborators Dr. Nilesh Ahuja, Utku Genc and Dr. Ibrahima Ndiour all of whom really helped enrich my PhD experience. I also want to thank my lab member and collaborator Jong Woo Nam. I am also very grateful to all my lab members for their friendship and stimulating conversations. I would also like to thank the NGP staff for their help and support. Lastly, I would like to thank my husband, parents and sisters for believing, loving and taking care of me through the good and bad moments.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Motivation
  1.2 Neuroscience of Continual Learning
    1.2.1 Origins of the Brain Learning Machinery - Developmental Learning
    1.2.2 Cellular Basis of Memory
    1.2.3 Memory Management at a Systems Level
    1.2.4 Integrating Novel Memories Into Knowledge
  1.3 This Dissertation

Chapter 2: Closed Loop Memory GAN for Continual Learning
  2.1 Abstract
  2.2 Introduction
  2.3 Prior Work
  2.4 Closed Loop Memory GAN
    2.4.1 Model Overview
    2.4.2 Model Architecture
    2.4.3 Closed-Loop Training with Replay
    2.4.4 Image Filtering
    2.4.5 Dynamic Memory Buffer
    2.4.6 Continual Learning Baselines
  2.5 Experiments
    2.5.1 Buffer Selection
    2.5.2 Incremental Learning
      2.5.2.1 Average Continual Performance
      2.5.2.2 Stochastic Up-Sampling
      2.5.2.3 Per Task Performance
  2.6 Conclusion

Chapter 3: Lifelong Learning Without a Task Oracle
  3.1 Abstract
  3.2 Introduction
  3.3 Background
    3.3.1 Task-Independent Continual Learning
    3.3.2 Task-Dependent Continual Learning
  3.4 Methods - Learning without a Task Oracle
    3.4.1 Incremental Unsupervised Task Mappers
      3.4.1.1 Nearest Means Classifier (NMC)
      3.4.1.2 Gaussian Mixture Model Classifier (GMMC)
      3.4.1.3 Fuzzy ART Classifier
    3.4.2 Supervised Prototype Mapping (ARTMAP)
    3.4.3 Perceptron with Coreset Replay (PCR)
      3.4.3.1 Coreset Building
      3.4.3.2 Perceptron Training
    3.4.4 Perceptron + Replay of Task-Specific Embeddings (PCR-E)
    3.4.5 Task-Mapper Baselines and Baseline Modifications
      3.4.5.1 Baseline - Entropy
      3.4.5.2 Baseline - Expert Autoencoder Gates (AE-gates)
      3.4.5.3 Baseline - Clustering in Multi-Head Outputs (KM-heads)
      3.4.5.4 Baseline Modification - Clustering in Multi-Head or Shared-Head with PSP-BD Task-Partitioning (KM-heads Ours)
    3.4.6 Task-Independent Baselines
  3.5 Experiment Descriptions - Datasets
    3.5.1 8 Datasets Experiment (Inter-Dataset)
    3.5.2 Permuted MNIST (Inter-Dataset)
    3.5.3 Sequence of 10 Cifar100 Superclasses (Intra-Dataset)
  3.6 Results
    3.6.1 Task Estimation - Parameter Dependency
    3.6.2 Task-Dependent vs Task-Independent Performances
    3.6.3 Incremental Inter x Intra Dataset Learning
  3.7 Conclusion

Chapter 4: incDFM: Incremental Deep Feature Modeling for Continual Novelty Detection
  4.1 Abstract
  4.2 Introduction
  4.3 Background and Motivation
  4.4 Methodology for Continual Novelty Detection
    4.4.1 incDFM Model
      4.4.1.1 Deep Feature Modeling
      4.4.1.2 Knowledge Consolidation and Storage
      4.4.1.3 Novelty Detection and Selection: Incremental Recruitment
    4.4.2 Full Pipeline: Unsupervised Class-Incremental Learning Using incDFM for Continual Novelty Detection
  4.5 Experiments
    4.5.1 Baselines
    4.5.2 Architecture and Training Parameters
  4.6 Results
    4.6.1 Preliminary Offline Evaluation of incDFM and Baselines
    4.6.2 Continual Novelty Detection
      4.6.2.1 Intra-Dataset OOD: Class-Incremental Novelty Detection
      4.6.2.2 Inter-Dataset OOD: Dataset-Incremental Novelty Detection
    4.6.3 Full Pipeline Results
    4.6.4 Ablation and Hyper-Parameter Sensitivity Study in incDFM
      4.6.4.1 Error Propagation in Continual OOD Detection
      4.6.4.2 Incremental Recruitment Sensitivity in incDFM
      4.6.4.3 Mixing Ratio of New/Old in Each Task
  4.7 Conclusion

Chapter 5: Shape Encoding for Object Recognition in Artificial Agents
  5.1 Abstract
  5.2 Introduction
  5.3 Background
    5.3.1 Human Object Recognition Relies on Shape Information
    5.3.2 Deep Network Object Recognition Does Not Necessarily Follow Human-Like Patterns
  5.4 Benchmarks to Study Shape Reliance in Object Recognition Models
    5.4.1 Towards a Pure Shape Recognition Benchmark via Randomizations of Non-Shape Cues - iLabShape Benchmark
    5.4.2 Measuring Shape Recognition Capacity Quantitatively Using Nearest Neighbor Matching - ShapeY Benchmark
      5.4.2.1 miniShapeY Image Set
      5.4.2.2 ShapeY Benchmark
  5.5 Experiments and Results
    5.5.1 Enforcing Shape-Exclusive Object Recognition in Conventional DNNs
      5.5.1.1 Lack of Concept Generalization Between a DNN Trained on Real-World Images (Imagenet) and Tested on Toy Vehicles (iLab)
      5.5.1.2 Training a DNN with Regular Unaltered iLab Images and Testing on iLabShape Images
      5.5.1.3 Training a DNN with Randomized iLabShape Images and Testing on Original Unperturbed iLab Images
    5.5.2 Using Contour Descriptors (Explicit Object Boundaries) to Train DNNs
      5.5.2.1 Comparing the Contour Descriptor Input to RGB Using Networks Matched for Number of Parameters
      5.5.2.2 Comparing Shape-Rich Contour Input to RGB Input Using a Custom Residual Network (RESNET) Adaptable to Both Contour and RGB Inputs
    5.5.3 Beyond Basic Contour Descriptors
      5.5.3.1 A Custom Hierarchical Shape Feature Extractor - ShapeRNet
      5.5.3.2 ShapeRNet Input Layer
      5.5.3.3 Higher-Order Layers: Detection of High-Order Conjunctions and Invariance Pooling
    5.5.4 Evaluating the Invariance Profile of High-Order Shape Descriptors from ShapeRNet Using ShapeY
    5.5.5 Comparing the Shape Invariance Profiles of ShapeRNet and Conventional DNNs (Resnet 50) Using ShapeY
  5.6 Conclusion

Chapter 6: General Conclusion

References

Appendices
  A Supplementary Materials - Closed Loop Memory GAN for Continual Learning
    A.1 Per Task Performance for MNIST and FASHION
    A.2 Stochastic Up-Sampling for MNIST and FASHION
    A.3 EWC Complementary Results and Training Details
    A.4 Image Filtering - Rejection Sampling
      A.4.1 Soft Rejection Filtering - SRF
      A.4.2 Discriminator Rejection Sampling - DRS
      A.4.3 Filtering Results
    A.5 Conditional Convolutional VAE
      A.5.1 Closed-Loop Replay
      A.5.2 Implementation
    A.6 CloGAN Network and Data Parameters
  B Supplementary Materials - Lifelong Learning Without a Task Oracle
    B.1 Permanent Memory Usage Across Models
    B.2 Choice of Feature Extraction Embedding
    B.3 Coreset Replay Task Mapper Parameter Variations
    B.4 AE-gates Implementation
    B.5 Training and Optimization
  C Supplementary Materials - Incremental Deep Feature Modeling for Continual Novelty Detection
    C.1 Intra-Dataset Novelty Detection Results
    C.2 Inter-Dataset Novelty Detection Results
    C.3 Estimating the Stopping Point for Incremental Novelty Recruitment in incDFM
    C.4 Thresholding in Baselines - Hyperparameter Sweep
    C.5 Feature Extraction Network
    C.6 End-to-End Unsupervised Class-Incremental Classification Pipeline
      C.6.1 Memory Coreset
      C.6.2 Experience Replay
    C.7 Inter-Dataset Novelty Detection Using the 8 Datasets

List of Tables

2.1 Displays % Correct As A Function Of Buffer Selection
2.2 Performance after continual learning of all tasks. For CloGAN, memory allotment is in parenthesis.
3.1 Fine-Grained Classification Performance and Memory Usage for each Task-Mapper and Baseline Method. Task(%) is task mapping accuracy and Main(%) fine-grained classification performance. The first set of results are task-dependent with PSP-BD backbones. The (ours) indicates a task-mapper we propose. The second set includes task-independent baselines. Best results in bold.
3.2 Best Task Mapper with respect to Oracle. Percentages taken from the best versions of our full model (GMMC/NMC + PSP-BD) with respect to the upper-bound Oracle + PSP-BD.
4.1 AUROC scores for offline OOD estimation.
4.2 Inter-dataset continual learning (8-dataset) with and without Task Oracle.
4.3 AUPR scores with task data imbalanced towards more old samples (Cifar10).
5.1 Performance of Alexnet trained on regular iLab and tested on randomized iLab images.
5.2 Performance of Alexnet trained on randomized iLab images and tested on original iLab images.
6.1 EWC training parameters.
6.2 % Correct As A Function Of Closed-Loop Filtering.
6.3 CCVAE Architecture.
6.4 AC-GAN Architecture.
6.5 Datasets.
6.6 Permanent Memory Usage for Task Mappers.
6.7 Permanent Memory Usage for Other Model Components.
6.8 Embedding Network.
6.9 PCR Hidden Layers*.
6.10 Coreset Building Techniques - Task Classification*.
6.11 Full Pipeline Training.
6.12 incDFM F1 scores, averaged across all tasks in intra-dataset class-incremental experiments, when varying P_val.
6.13 AUROC scores for offline OOD estimation.

List of Figures

2.1 Diagram of the CloGAN model used for cumulative and continual learning. Past data is sampled from the generator and filtered by the embedded classifier. Old data is a combination of the fresh stochastic generator output and a small memory buffer used to "smoothen" the old data distribution for quality output.
2.2 CloGAN training algorithm. Procedure Train is described in Section 2.4.3; Filter in 2.4.4; BufferConstruct in 2.4.5.
2.3 Graphs of maximum average accuracies across continually learned tasks. A) MNIST; B) Fashion; C) SVHN; D) E-MNIST. The dashed lines indicate the start of a new task represented by a disjoint set of classes. We illustrate the performance of CloGAN as memory sizes are varied (memory allotment is in parentheses). In red we show catastrophic forgetting when fine-tuning by gradient descent. In salmon we show multi-task training until convergence with the full datasets, starting from scratch at every task switch (MT-Full). We also show results for EWC, DGR and MeRGAN.
2.4 Stochastic up-sampling of CloGAN graphs. A) SVHN. B) E-MNIST. The contribution of upsampling is indicated by a positive gap between CloGAN and Frozen-CloGAN as more tasks are learned. We also compare to the MT condition in which training is re-started at each task, eliminating forward transfer of possible shared task features.
2.5 Accuracies-per-task graphs for EMNIST (A) and SVHN (B). Generated images for SVHN and EMNIST (C). CloGAN preserves performance for early tasks throughout training. MeRGAN has degenerated performance and image generation for early tasks. In EMNIST, degradation is a darkening in the first tasks (a,...,f) in contrast to the last task (v,w,x).
3.1 (A) Task-Independent (B) Task-Dependent model outlines. Here we show the most common architectural schemes for the two modalities. Task-dependent models require task labels as input. Task-independent models do not.
3.2 Outline of our proposed pipeline. We employ parameter superposition (PSP - Cheung et al., 2019) with Beneficial Biases (BD - Wen et al., 2020) as a state-of-the-art task-dependent backbone for fine-grained classification. We then substitute the oracle input for an end-to-end trainable task-mapper. We experiment with different low-memory task-mapper variants. In all the variants, at test time, the task-mapper predicts task assignments which are then used to activate task-specific PSP context keys and BD biases.
3.4 Schematic of our incremental learning experiments, which can be divided into two main categories, Inter-Dataset and Intra-Dataset. For Inter-Dataset, each task is a complete dataset: (A) A sequence of 8 datasets containing natural images [58]. (B) A sequence of 25 datasets, each a permutation of MNIST. In the Intra-Dataset modality, each task is a subset of one dataset. Here we use 10 different superclasses of Cifar-100, each as a separate task.
3.5 Parameter dependencies of the different task-mappers. The vertical axes (task accuracy) have been scaled differently between subplots so that variations to critical parameters are amplified. The green markers denote the best memory-performance tradeoff according to score = A - αM. Most mappers achieve optimum memory-performance tradeoffs at an elbow point. However, for the first column (ART/ARTMAP), the vertical red line is a virtual memory wall, meaning we did not increase vigilance further because memory usage was already up to 10-fold larger than for all other task-mappers. Overall best memory-performance tradeoff is obtained by inc-GMMC and inc-NMC.
3.6 Fine-grained classification performance versus the model's memory usage. Squares are task-dependent models and triangles, task-independent. The absolute upper bound is Oracle + PSP-BD. Our full pipeline, with GMMC, NMC or PCR + PSP-BD classifier, is the closest to the optimal upper-left corner.
4.1 incDFM estimates novelty incrementally per task. A task's unlabeled data mixture is shown here with ID/old samples in blue and OOD/novel samples in orange. At each iteration within one novel task, incDFM recruits the top most "certain" novel samples (in red) according to the evaluation function S_i. It then removes them from the unlabeled pool. At iteration 1 we can see new and old distributions are entangled but tend to separate later, as incDFM improves its estimate of novelty.
4.2 Procedures KnowledgeScores and SelectTop are described in Section 4.4.1.3; Consolidate in 4.4.1.2.
4.3 Full Pipeline - unsupervised class-incremental learning with incDFM.
4.4 Intra-dataset novelty detection: (a) AUROC scores per task for novelty detection using detected samples as train/fit data for the model update. (b) Average AUROC and AUPR scores after all tasks.
4.5 Unsupervised incremental classification pipeline - (a) Average incremental classification accuracy over tasks. (b) Final classification accuracy after all tasks.
4.6 (a) Error propagation from using the estimated OOD_t (yes) vs. the ground-truth OOD_t (no). (b) incDFM iterations and recruitment % (Cifar10, averaged across tasks).
5.1 Hue and polarity transformations used to establish a shape-exclusive classification benchmark.
5.2 Viewpoint exclusion in the miniShapeY chair category - positive match candidates (PMCs) for view 8 (blue box) out of 11 in the series with CVT = 'pw'. Rows show all 8 series containing 'pw'. The difficulty of the matching task is controlled by excluding positive match candidates in the "vicinity" of the reference view in viewpoint space. The "exclusion zone" shown (red shading) is for an exclusion radius r_e = 2. This image is a courtesy of lab member Jong Woo Nam.
5.3 Contrast exclusion in the ShapeY benchmark - from a query image, one is only allowed to match to contrast-reversed images within the same object category. Other objects (distractors) have the same background as the query. This will examine whether the encoding is more sensitive to the background or the underlying shape. This image is a courtesy of lab member Jong Woo Nam.
5.4 Histogrammed top-5 outputs of Imagenet-trained Alexnet when presented with toy car test samples from iLab. Under those conditions, the most frequent class predicted was syringe, followed by measuring cup and envelope, none of which bear any sensible shape resemblance to toy cars. Moreover, no car or vehicle class stands within the top-5.
5.5 [left] A contrast image - contains pure contour information projected back to 2D image space. [right] Visualization of the high-dimensional shape-rich contour input, eliminating the orientation information and keeping only the locations of contours in the pyramidal scale format.
5.6 Performance of CNNs trained with raw iLabShape RGB images (RGB), pre-processed contours (Contours), as well as contrast images. Different CNNs were used for each of the inputs, but they were adjusted to keep model complexity constant, e.g., the same number of parameters.
5.7 Dependency of the 3 contour angles in defining the final first-order feature orientation. Here, zero-degree straight-like contours non-intuitively contain individual contours with middle angles of 30 degrees, for example.
5.8 Example pictures generated by the visualization code developed to aid in hyperparameter tuning and model design. In this case, we plot the prevalent contour orientation occurring per super-pixel of space and orientation. The images correspond to a 256x256 fine-grained scale (left) and a coarser 128x128 scale (right). Note that different scales have non-redundant information about shape.
5.9 Detection of 4th-order features. Conjunctions are formed between 2nd-order features if located at appropriate approximate distances. In the scheme, we illustrate the detection of a fourth-order conjunction formed by two pairs of 2nd-order features at 16x16 spatial pools spaced by 26 units in the horizontal external orientation (zero degrees). Note that the 2nd-order features themselves include extensive pooling.
5.10 Top-1 nearest neighbor matching error over ShapeY, plotted against the exclusion radius r_e. Results use ShapeRNet 4th-order shape descriptors. ShapeRNet was invariant to contrast exclusion. Note that our system is very robust to planar translation (x,y) but can struggle with depth rotations (p,w) after an exclusion distance of 3 (~30 degrees). The effective exclusion distance value is actually larger than 30 degrees because series walk the same number of steps in each dimension. For example, a distance of 2 along X can lead to distances larger than 2 along series including X and another dimension, e.g. XY, XYW, from which the distance of 2 is enforced along all dimensions of each series.
5.11 Top-1 nearest neighbor matching error over ShapeY for both Resnet50 (dashed circle) and ShapeRNet (bold triangle). The color code separates exclusion transformations involving either 1 (red), 2 (green), or 3 (blue) transformation dimensions. (a) Does not include contrast reversal. (b) Includes contrast reversal.
6.1 Accuracies per task for MNIST (A) and FASHION (B). Our method, CloGAN, is shown using a memory size of 0.16% (100 images) for MNIST and FASHION. We compare to the performance of MeRGAN, EWC and FGD.
6.2 Stochastic up-sampling of CloGAN for MNIST (A) and FASHION (B). CloGAN-Frozen corresponds to training the AC-GAN continuously with a memory but with no stochastic generation. We also compare to the MT condition in which training is re-started at each task, eliminating forward transfer of possible shared task features.
6.3 EWC accuracies per task with Permuted MNIST and Incremental MNIST. For the Permuted experiment, we reproduce the qualitative results of the original paper [Kirkpatrick et al., 2017], whereas our Incremental Single-Headed MNIST experiment causes EWC performance to quickly derail whenever a new task is learned.
6.4 A) Average accuracies comparing CloGAN and VAE-Loop. B,C) Generated images for CloGAN and VAE-Loop, respectively. The denomination "frozen" refers to training without closed-loop replay, only with the memory buffer. Note that while CloGAN exhibits a significant gap between the closed-loop and frozen variants, the same does not occur for the VAE.
6.5 AE-gates task mapper performance: task classification accuracy is plotted as a function of the autoencoder latent dimension and the resulting mapper memory size in kilobytes (KB).
6.6 Intra-dataset novelty detection - AUROC scores per task using the test set. The test set is equivalent, in proportions (ratio old:new), to the unlabeled train data used to fit to detected old samples. Also, in the case of incDFM, this is evaluated after all iterations are performed on the unlabeled train data.
6.7 Inter-dataset novelty detection (8-dataset sequence) - AUROC scores per task using a test set equivalent, in proportions (ratio old:new), to the unlabeled train data. In the case of incDFM, this is evaluated after all iterations are performed on the unlabeled train data.
6.8 Average F1 scores for baselines across tasks, when varying the validation threshold used during the OOD_t estimate - (a) Cifar10, (b) Cifar100, (c) emnist, (d) iNaturalist. The threshold is set as a percentile P_val of the validation set, the latter containing only ID data.

Abstract

In this thesis we present four research projects that draw inspiration from neurobiology to design artificial lifelong learning and machine vision algorithms. The first project, inspired by biological hippocampal replay, develops CloGAN, an algorithm for class-incremental continual learning that employs hybrid generative and stored experience replay to mitigate forgetting. Our second project addresses hierarchical continual learning, where a lifelong agent benefits from having a separate algorithm for coarse-level object classification and specialist partitions of a deep neural network capable of fine-grained class discrimination. In our third project, we develop a continual novelty detector and integrate it into end-to-end unsupervised class-incremental learning. incDFM, our novelty detector model, functions incrementally to gradually build confidence and improve its novelty predictions. Lastly, in our final project, we explore topics complementary to lifelong learning for machine vision, such as shape representation for artificial object recognition. Overall, the work developed in this thesis provides novel contributions to the field of artificial lifelong learning, which is in turn crucial to enabling the next generation of real-world artificial intelligence agents.

Chapter 1
Introduction

1.1 Motivation

The human brain is a hallmark of continual learning. While interacting with new environments, our brain learns from and adapts to a wide range of situations, tasks, and problems while still preserving core knowledge through time. In contrast, most machine learning (ML) approaches to date only learn one sophisticated but fixed mapping between inputs and outputs, with data drawn from a stationary distribution. Whenever the condition of stationarity is destroyed, most ML approaches drastically underperform: (1) first, ML models struggle to reliably detect distribution shifts (detecting novelties) and, second, they also fail to successfully adapt to these shifts without sacrificing important prior knowledge (preserving memory).
In fact, state-of-the-art deep neural networks (DNN) – a widespread ML approach – are known to undergo a phenomenon termed "catastrophic forgetting", which describes a sharp decline in the performance of the model on previously learned tasks as soon as new knowledge (e.g., new classes) is introduced. The field of Continual Learning (CL) in Artificial Intelligence seeks to tackle the balance between adaptation and memory, with broad relevance to industry and modern technology domains. Evolving distributions appear ubiquitously in several industry-relevant settings such as robotics, self-driving cars, and recommender systems, amongst many others. Notably, the current limitations to learning continuously form a critical bottleneck to the development of real-world artificial intelligence.

Artificial Intelligence (AI) design will constantly draw inspiration from Neuroscience; after all, much of what AI seeks to enable is already achieved biologically in the human brain. Take continual learning: humans can amass an impressive repertoire of knowledge through their lifespans, yet AI is still very far from this. Consequently, understanding the biological basis of CL is crucial to inspiring the next wave of artificial learners. This thesis will focus on continual learning for artificial intelligence, drawing inspiration from some of neuroscience's learning and memory mechanisms. The biological underpinnings of continual learning summarized in this chapter can provide insight into designing artificial intelligence lifelong knowledge integration.

1.2 Neuroscience of Continual Learning

The human brain enables continual learning through a diverse set of neurophysiological mechanisms that differ greatly in scope – i.e., they range from cellular and molecular components to complex systems-level and brain-wide processes. At these different scopes, the brain balances between preserving and stabilizing knowledge and adapting to novel knowledge via plasticity. We commonly refer to this as the stability-plasticity dilemma, rooted in the assumption that even an exhaustive machinery such as the human brain has finite capacity.

1.2.1 Origins of the Brain Learning Machinery - Developmental Learning

Differently from a randomly initialized artificial DNN, the human brain is not a "tabula rasa". Genetics and pre-defined developmental trajectories shape much of our brain morphology and learning potential prior to sensorimotor experience. In most mammals, the brain architecture and capabilities are shaped during the developmental period, which encompasses roughly two stages, in-utero and postnatal. The in-utero stage is the "first phase" and is largely genetically pre-programmed, with strong modulation by the maternal environment and epigenetics (modulation of protein transcription and DNA read-out). The second stage occurs postnatally and comprises a pattern of strong sensorimotor-driven plasticity [1]. In fact, the seminal work of [2] on the emergence of ocular dominance showed the importance of experience on the development of normal patterns of cortical organization in the postnatal period. Plasticity, including neurogenesis and synaptogenesis, drastically declines post-development and more so in adulthood. In general, it can be said that during development the human brain acquires most of its core "feature extraction" capabilities.

1.2.2 Cellular Basis of Memory

Arguably, the smallest element ("scope") of learning and memory in the brain is cellular and molecular.
Before we can begin addressing circuits and brain regions, we must understand how individual neurons manifest learning and in what way it impacts inter-neuron communication. The most basic theory of cellular (neuronal) learning is Hebbian plasticity. The theory postulates that when one neuron drives the activity of another connecting neuron, the connection between them is strengthened. This postulate is encapsulated in the seminal phrase "neurons that fire together wire together". The simplest widespread mathematical description of Hebb's rule considers a synaptic strength w (a synapse here being the point of connection between two neurons) that is modulated by the pre-synaptic activity x and the post-synaptic activity y, with learning rate η, as in Eq. 1.1:

w(t+1) = w(t) + \eta \, x \, y \quad (1.1)

Nevertheless, this simplistic formulation of Hebbian plasticity is known to be unstable [3, 4]. Mathematically, stability can be achieved by adding additional constraints, such as upper limits to synaptic weights or to overall neuronal activity, assuming fixed capacities [3, 5, 6]. In general, the underlying molecular basis for Hebbian-like learning is profoundly complex and is yet to be exhaustively mathematically described. The most well-known form of synaptic plasticity is called long-term potentiation (LTP) and is closely linked to memory storage. LTP induces long-term strengthening of a synaptic connection, which in turn causes larger synaptic potentials thereafter. The inverse cellular mechanism to LTP is the weakening of a synaptic connection, long-term depression (LTD). Timing has been shown to be critical for whether a synaptic connection is strengthened (LTP) or weakened (LTD), in what is generically termed spike-timing-dependent plasticity (STDP). For instance, LTP arises either via high-frequency electrical stimulation of the neuronal circuit or via repeated pairings in which the presynaptic neuron fires before the postsynaptic neuron. When the inverse occurs, i.e., the postsynaptic precedes the presynaptic, LTD occurs. Broadly, LTP elicits numerous physical and electrophysiological changes that together form a basis for short- and long-term memory storage, with effects persisting for up to a year. Most forms of LTP are glutamatergic and induced following activation of the N-methyl-D-aspartate (NMDA) receptor. For long-term changes to the synapse and spine, LTP is also known to modulate gene transcription and selective protein recruitment via synaptic tagging. LTP and LTD have also been linked to structural modifications of spines (enhancing growth in the case of LTP and shrinkage for LTD) [7, 8]. Furthermore, we can connect these molecular and cellular learning mechanisms to early brain development: during the brain-shaping developmental stage, Hebbian or more complex cellular plasticity mechanisms are manifested at their maximum level, giving rise to optimal functional patterns of brain connectivity with the capacity for refinement through adulthood. Beyond modifications at the spine level, dendritic and axonal growth and movement are also associated with experience-dependent memory formation and are especially prominent in developmental phases.

1.2.3 Memory Management at a Systems Level

Synaptic plasticity occurs throughout different brain regions linked to memory storage. Most manifestations of synaptic plasticity are accomplished within a few hours after encoding.
On the other hand, systems consolidation – referring to memory encoding through vast neuronal networks and different brain regions – uses synaptic consolidation as a subroutine for encoding and stabilizing memory representations. As such, systems memory consolidation involves the transformation and redistribution of synaptic connectivity across distributed brain areas, a lengthy process that can take up to years to stabilize.

The complementary learning systems (CLS) theory provides a framework within which to characterize the organization of systems-level learning and memory in the brain [9]. The CLS theory refers roughly to two core brain regions linked to memory storage: (1) the hippocampus and (2) the neocortex. Both regions display LTP and LTD manifestations of Hebbian-like synaptic plasticity [10]. According to the CLS theory, the hippocampus enables fast short-term memory formation, encoding episodic memories of one-time occurrences. Hippocampal lesion studies have documented that an intact hippocampus is required for the acquisition of new episodic memories [11]. Overall, the circuitries of the different hippocampal subregions seem specialized for episodic memory formation: fast single-trial registration, along with representation of space and time, enables a cohesive encoding of a vivid temporospatial episodic event. In contrast, the neocortex is conceptualized as a slow learner that gradually builds structured semantic long-term knowledge representations that have high stability but are of a more abstract nature. Most importantly, the CLS theory posits that long-term memory consolidation arises from the interplay of the hippocampus and neocortex: the hippocampus initiates memory formation and then, over time, replays memory activation patterns to the neocortex for progressive long-term stabilization.

Sleep is known to be crucial for long-term memory consolidation. Sleep offers an optimal environment for memory consolidation because of the absence of external stimuli, altered neurotransmitter levels, and brain-wide oscillation patterns that enforce enhanced hippocampal-neocortical interplay. When an episode is experienced and initially encoded, different sensory neocortical brain areas are activated. The initial hippocampal memory encoding is thought to bind these across-brain activation components into an episodic memory representation. Then, mostly during sleep, hippocampal neuronal representations are reactivated and replayed, in turn also reactivating the sensory neocortical components that were initially tied to that hippocampal memory trace. The coactivation of neocortical components is believed to trigger progressive synaptic plasticity in the neocortex, to the effect of slowly integrating episodic experiences into a complex and abstract network of semantic knowledge. In one study, reorganization was followed for a period of over 6 months after the initial post-encoding night of sleep, emphasizing the potentially long timescale of corticalization of memories [12]. Lastly, there is the important question of which experiences are most likely to be vividly encoded by the hippocampus and then replayed to the cortex. In mouse studies of place cells in the hippocampus, reactivation and replay are stronger for locations an animal has explored more intensely or for environments that are more novel. Additionally, replay is also enhanced for locations associated with an aversive emotional response or a reward [13].
1.2.4 Integrating Novel Memories Into Knowledge

Novelty in the context of learning and memory encoding can refer to different stimulus characteristics. One widespread notion of novelty is that of a stimulus that lacks a pre-existing representation. This is also referred to as absolute novelty. Additionally, there is contextual novelty, arising from a mismatch between components of an encountered stimulus-context pairing. Context can be understood as the spatiotemporal information that accompanies a given stimulus, contributing to the stimulus' representation.

The predictive coding theory posits that the brain is constantly using an inner model to produce predictions about incoming sensory information based on contextual cues. As such, novelty would arise from a mismatch between the probability assignment and the observation of a given stimulus. By inner model, we refer to the stored memory knowledge base and the relational processing machinery of the brain. Relating this back to machine learning, novelty would bring about prediction errors in the loss function and a subsequent network update. Of course, the brain's biological novelty mechanisms are more complex than loss-guided gradient updates. Nevertheless, the attribute "novelty" does seem to also guide biological memory encoding. For instance, as previously described, the hippocampus and parahippocampal cortex are very sensitive to the "novelty" attribute of a stimulus, prioritizing its reactivation and replay thereafter [14]. Additionally, the CA3 region of the hippocampus is known to be involved in stimulus pattern completion, enabled by its neuronal network pattern characterized by profuse recurrent synaptic connections. Novelty could be assessed at this biological level by the degree of success of pattern completion. Stimuli that are unsuccessfully "completed" are of a more "novel" nature and should elicit novel memory formation. Ultimately, the detection of novelty can elicit dopamine release in the hippocampus and facilitate LTP in the involved synapses [15, 16].

1.3 This Dissertation

This thesis aims to take inspiration from the neurobiological underpinnings of lifelong memory to develop artificial continual learning and machine vision approaches. While we understand that the complexity of the human brain by far exceeds and differs from deep neural networks or other machine learning statistical tools, we believe the phenomenology of learning and memory in biology can shed light on how to enhance and bias our current modeling tools.

For a continual learning algorithm, the most prominent question is how to handle memory storage and prevent forgetting when incorporating novel information. As discussed in Section 1.2.3, replay is a mechanism employed by the brain via the hippocampus-neocortex interplay that is believed to enable long-term memory consolidation. Inspired by this, in the first project of this thesis (Chapter 2), we develop an algorithm for continual learning that is based on hybrid generative and stored-experience replay: CloGAN, "Closed-Loop Memory GAN for Continual Learning" [17]. The core of our model was a conditional generative adversarial network that at each batch generated class-conditioned fake training images that were then mixed with stored images and used to train its next iteration. The stored images were dynamically managed to enforce sample diversity and act as ground-truth anchors for the GAN.
By leveraging both generative and stored replay, our approach yielded state-of-the-art results for both continual conditional image generation and supervised, task-free, class-incremental object learning.

In the second project (Chapter 3), we continued focusing on memory preservation but addressed it in a hierarchical continual learning setting, where the lifelong agent had a separate algorithm for coarse-level classification which then triggered a specialist partition of a deep neural network capable of fine-grained discrimination (e.g., fish could be the coarse label and codfish the fine label) [18]. We find that by subdividing the problem in this way and proposing a very cheap but consistently performing coarse classifier, we could outperform task-independent approaches, yielding less forgetting in problems suited to hierarchical learning. Moreover, our proposed best-performing coarse classifier can be appended to any task-dependent fine-grained CL classifier.

In the third project (Chapter 4), we move beyond tackling forgetting and into novelty detection and open-set learning. After all, forgetting is one side of the coin, but in more realistic AI environments we also need to recognize what we do not know and incorporate it into knowledge, so as not to treat it as novel subsequently. We seek to bridge the divide between the continual learning and novelty detection fields by addressing novelty detection in continual learning, a much more challenging evaluation and deployment paradigm. We are the first to directly address error propagation resulting from erroneous novelty detection during continuous novelty integration. Our proposed approach, incDFM ("incremental deep feature modeling for continuous novelty detection") [Rios et al., 2022, under review], is an incremental novelty detector, loosely inspired by recurrent brain mechanisms associated with pattern completion and novelty detection in the hippocampal CA3, as described in Section 1.2.4. Roughly, we believe that to solve very difficult novel vs. non-novel partitions, we should incrementally build confidence and broaden the scope of our novelty predictions, prioritizing first the most evident novelty, which then contributes further to the "novelty representation" model at the next iterations. We show that incDFM achieves state-of-the-art continuous novelty detection even in completely unsupervised scenarios.

Lastly, in the fourth project (Chapter 5), we temporarily digress from lifelong learning to explore complementary aspects of human and computer vision. Specifically, there is ample experimental evidence to suggest that the underlying classification strategies used by DNNs differ greatly from those employed by the human visual system for object recognition. The former is heavily biased to encode texture/color features, whereas the latter is known to rely mostly on object shape features for decisions. We argue that the lack of a shape bias in trained DNNs contributes to the limited generalizability of deep features across datasets and to their susceptibility to image perturbations, failure modes not exhibited by biological vision. In fact, we argue that if a machine vision system can build more shape-biased, rich, generalizable features, the problem of catastrophic forgetting could be more easily ameliorated: as data distributions change in different continual tasks, generalizable features can be more easily reused across them, and continual learning is left only to the high-level decision layers, instead of the entire network.
We believe that, overall, a rich visual embedding should be generalizable to almost any task, with the possibility of modulation only. As a reference, the human visual system after development does not retrain itself and relearn all features every time a new object category (or new stimulus) is learned (refer to Section 1.2.1); instead, features are reused. With all of this in consideration, we contend that endowing machine vision systems with more shape bias can contribute to their generalizability and robustness. As such, we open an avenue of research in employing domain knowledge to engineer shape features that can be used in conjunction with DNNs or similar statistical machine learning tools, thus enabling more generalizable and robust object category learning. Finally, to evaluate shape bias, we propose several experiments and benchmarks to evaluate and measure the degree and quality of the shape encoding of a model, including invariance of the viewpoint representation.

Chapter 2
Closed Loop Memory GAN for Continual Learning

2.1 Abstract

Sequential learning of tasks using gradient descent leads to an unremitting decline in the accuracy of tasks for which training data is no longer available, termed catastrophic forgetting. Generative models have been explored as a means to approximate the distribution of old tasks and bypass storage of real data. Here we propose a cumulative closed-loop memory replay GAN (CloGAN) provided with external regularization by a small memory unit selected for maximum sample diversity. We evaluate incremental class learning using a notoriously hard paradigm, "single-headed learning," in which each task is a disjoint subset of classes in the overall dataset, and performance is evaluated on all previous classes. First, we show that when constructing a dynamic memory unit to preserve sample heterogeneity, model performance asymptotically approaches training on the full dataset. We then show that using a stochastic generator to continuously output fresh new images during training increases performance significantly further, while also generating quality images. We compare our approach to several baselines including fine-tuning by gradient descent (FGD), Elastic Weight Consolidation (EWC), Deep Generative Replay (DGR) and Memory Replay GAN (MeRGAN). Our method has a very low long-term memory cost (the memory unit), as well as negligible intermediate memory storage.

2.2 Introduction

Since early development and throughout life, humans are constantly faced with unknowns in the environment, which demand a persistent adaptation and expansion of past knowledge. In addition, as knowledge is expanded, learning is often facilitated, since objects and tasks are often closely related and interconnected. For instance, during development, infants learn to categorize animals according to dimensions such as size, texture, shape, and sound, among others. However, subsequent addition of new species rarely corrupts classification performance on the already learned categories. In fact, learning broad species domains can aid finer species discriminatory capability [19]. Nonetheless, recreating human-like lifelong continual learning remains a central challenge in Artificial Intelligence. State-of-the-art deep neural networks (DNN) trained to perform supervised continual learning are known to undergo a phenomenon termed "catastrophic forgetting", which describes a sharp decline in the performance of the model on previously learned tasks as soon as a new task is introduced [20–22].
This behavior does not come as a surprise if one recalls that in DNNs, learning an input-output mapping implies parameterizing the network with an optimal weight set through loss minimization. Thus, if training data is unavailable for previous tasks, there is no longer a loss term for the old data, and the weight parametrization may blatantly deviate from the previous optimal state, incurring severe memory erasure.

2.3 Prior Work

In the recent literature, several methods have been proposed aiming to ameliorate catastrophic forgetting. They can be roughly subdivided into three groups: regularization, network-growing and replay approaches. With regularization methods, one constrains the change of learnable parameters to prevent "overwriting" what was previously encoded. For instance, [23] perform distillation between multiple realizations of a network at distinct time points, ensuring that the new weights do not shift significantly from the old. In a similar vein, [24] operate within a single network model and use a Fisher information matrix computed with saved samples drawn from past tasks, which then acts as a regularizer preserving highly correlated weights. Similarly, [25] use path integrals of loss derivatives to constrain weights crucial to past tasks, yielding an intermediate parameterization with minimal combined loss.

Alternatively, in network-growing algorithms, the architecture itself is altered to accommodate new tasks, followed by retraining. For instance, [26] freeze the most important paths in the network, therefore forcefully preventing forgetting, and incrementally add new network chunks to incorporate new tasks.

Lastly, in replay methods, the models no longer preserve a key pathway or weights. In these algorithms, one estimates the distribution of the old data either by saving a small fraction of the original dataset into a memory buffer or by training a generator to mimic the lost data and labels. At each new task, these methods learn by presenting the network with both new images and replay of estimated or buffered old images, reverting the continual framework into a multi-task setting and thus alleviating forgetting [27]. Other works have built on the idea of using a buffer of real data to approximate the past distribution [28–30].

Yet, despite a growing number of appealing solutions, catastrophic forgetting is not a solved issue. Regularization methods have been shown to perform poorly in single-headed incremental class learning [31, 32], and here we reproduce this limitation in our own results for Elastic Weight Consolidation [24]. On the other hand, network-growing approaches, while usually providing a clean solution for constrained incremental problems, can quickly become memory expensive since they require both an architectural expansion and the storage of at least a portion of old data for retraining.

Likewise, replay methods also run into scalability issues. So far, generative replay models learn a data distribution by resorting to intermediate copy states of the generator. In Deep Generative Replay (DGR), an unconditional GAN is trained at each task to cumulatively generate and discriminate images. Since the proposed GAN is unconditional, the authors employ an additional classifier (Solver), which is trained in parallel to classify the generated images and assign corresponding labels [27]. During each task switch, DGR makes a copy of the generator and classifier networks and uses them to generate sample images and labels for the old tasks.
In Memory Replay GAN (MeRGAN) with joint replay, [33] propose a modification of the DGR framework by substituting the unconditional GAN for an AC-GAN, thereby eliminating the need for the additional solver. Copy operations are both expensive and often lead to image quality degrading through consecutive tasks. Moreover, replicating network states successively is not a fully desirable solution since, from the biological perspective, a human brain cannot produce an "intermediate copy" of itself to transfer knowledge. Lastly, methods which rely on small subsets of past data (memory buffers) have been shown to yield good results, but they do not make explicit how much of the performance is due to the algorithm developed and how much is intrinsically due to the variability included in the buffer unit.

2.4 Closed Loop Memory GAN

2.4.1 Model Overview

In this paper, we propose a hybrid approach between memory buffers and deep generative models, aiming specifically to reduce memory costs and maximize both classification performance and generated image quality throughout training. In our model, there is only one generator and embedded classifier trained cumulatively, with no intermediate copy step. In this framework, as a new task is learned, the old data is approximated by continuously sampling from the generator at its present state, forming a closed-loop training paradigm. Of course, since a new task also modifies the parameterization of the generator, this procedure cannot be applied without some verification that the generated images are reasonable approximations of the old distribution that has been lost. Our method tackles this issue by, first, using an image filtering step in which either the classifier or the discriminator is used to assess sample image quality, blocking bad images from entering the training loop. Second, we employ external regularization by constructing a small dynamic memory buffer with real data samples chosen to maximize image heterogeneity and to enforce smoothness in the representation of old classes. The image buffer has a fixed memory allotment.

[Figure 2.1: Diagram of the CloGAN model used for cumulative and continual learning. Past data is sampled from the generator and filtered by the embedded classifier. Old data is a combination of the fresh stochastic generator output and a small memory buffer used to "smoothen" the old data distribution for quality output.]

Therefore, the buffer is not allowed to grow, which requires eliminating some old images to make room for new ones. The sampling for the old data is then always a combination of buffer samples and "on-the-fly" generated samples, which provide a stochastic up-sampling of the memory unit.

2.4.2 Model Architecture

A vanilla GAN consists of two networks, a Generator and a Discriminator, competing with each other in a zero-sum game framework. The core block of our model (CloGAN), see Figure 2.1, is a modified GAN termed Auxiliary Conditional Generative Adversarial Network (AC-GAN) [34]. The AC-GAN is also composed of two networks, but it includes a classifier combined in the same architecture as the discriminator, via an expansion to K+1 output nodes, for K classes plus the original vanilla real/fake discriminator output (a minimal sketch of such a shared head is given below).
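To make the shared discriminator/classifier layout concrete, the sketch below shows one possible way to express the "K+1 outputs" idea in PyTorch. It is an illustration under our own assumptions, not the architecture actually used in this thesis (which is detailed in Appendix A): the class name `ACGANDiscriminator`, the trunk, and the layer sizes are placeholders. The key point is a shared feature trunk feeding two heads, one real/fake logit and K class logits.

```python
import torch
import torch.nn as nn

class ACGANDiscriminator(nn.Module):
    """Illustrative AC-GAN discriminator: a shared convolutional trunk feeding
    (1) a single real/fake logit and (2) K class logits, i.e. K+1 outputs in total.
    Layer sizes are placeholders, not the thesis architecture."""
    def __init__(self, num_classes: int, in_channels: int = 3):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.real_fake_head = nn.Linear(128, 1)        # D_{theta_D}: real vs. fake score
        self.class_head = nn.Linear(128, num_classes)  # C_{theta_C}: class logits

    def forward(self, x):
        h = self.trunk(x)  # shared weights between discriminator and classifier
        return self.real_fake_head(h), self.class_head(h)

# Example usage: a batch of 32x32 RGB images with K = 10 classes
if __name__ == "__main__":
    d = ACGANDiscriminator(num_classes=10)
    rf_logit, cls_logits = d(torch.randn(8, 3, 32, 32))
    print(rf_logit.shape, cls_logits.shape)  # torch.Size([8, 1]) torch.Size([8, 10])
```

Because the two heads share the trunk, the classification loss shapes the same features used for real/fake discrimination, which is the weight sharing described above.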
In an AC-GAN framework the generator is fed a uniform noise z∼ p z appended with a corre- sponding class label c∼ p c . Thus, the conditional generator, described by θ G , generates an image x= G θ G (z,c) and the AC-GAN learns a mapping in which the noise z is independent of the class c, enabling multiple class outputs for a fixed noise input. While the generator θ G is trained to generate images as closely resembling the input image distribution, the discriminator, θ D , is con- versely trained to discriminate these generated images as fake, loss L FT . The embedded classifier, 14 θ C , shares most weights with the discriminator and generates a label prediction which, if incorrect, contributes to the overall loss of both generator and discriminator, L C . Overall, an AC-GAN is eas- ier to train than a conventional vanilla GAN while also producing higher quality images. The loss functions are given as follows in (1) and (4) for generator and discriminator/classifier respectively. θ ∗ G = min θ G (L G FT (θ,X)+ L G C (θ,X)) (2.1) L G FT (θ,X)=− E z∼ p z ,c∼ p c [D θ D (G θ G (z,c))] (2.2) L G C (θ,X)=− E z∼ p z ,c∼ p c [y c log(C θ C (G θ G (z,c)))] (2.3) θ ∗ D ,θ ∗ C = min θ D ,θC (L D FT (θ,X)+ L D C (θ,X)) (2.4) L D FT (θ,X)=E z∼ p z ,c∼ p c [D θ D (G θ G (z,c))]− E (x,c)∼ X [D θ D (x)] (2.5) L D C (θ,X)=− E (x,c)∼ X [C θ C (G θ G (z,c))] (2.6) Note that a plausible alternative to using a GAN would be to use a variational auto encoder (V AE) instead [35]. However, in our testing, we have not been able to achieve results with a V AE as good as those presented here using a GAN. Hence, in the following, we restrict our analysis to approaches based on GAN. Details of the implementation can be found in appendix A. 2.4.3 Closed-Loop Training with Replay In the continual learning setting, our method approximates the likelihood of old data by employing CloGAN to continuously output fresh new images at each mini-batch during training. A combi- nation of image filtering and external regularization by an image memory buffer confer stability to the closed-loop procedure. At each task, our model is trained using an extended dataset which includes real images for the new task, GAN replayed images for old tasks, and memory images, forming an extended training set S t (8), see figure 11. The memory component can be given a weighted importance,λ mem . The network is then trained by minimizing (9,10). 15 Algorithm 1: CloGAN Train Input : Data S real t ,...,S real T ; Require: T : Number of Tasks; I t : Number of iterations; B : Buffer Size; K c : Number of clusters per class; λ mem : Memory importance; 1 θ ∗ G ,θ ∗ D,C ,θ ∗ C ← Train AC-GAN(S t=1 ) for i= 1 to I 1 2 S memory t=1 ← BufferConstruct(K,B,S real t ) 3 for t← 2 to T do 4 S ∗ t = S real t ∪λ mem S memory t− 1 5 for i← 1 to I t do 6 S i∗ t ← Batch(S ∗ t ) 7 S GAN t− 1 ← Forward(G(z,y c t− 1 )) 8 S GAN t− 1 ← Filter (S GAN t− 1 ) 9 S i t ← S i∗ t ∪ S GAN t− 1 10 θ ∗ G ,θ ∗ D,C ,θ ∗ C ← Train AC-GAN(S i t ) 11 S memory t ← BufferConstruct(K,B,S memory t− 1 ,S real t ) Figure 2.2: CloGAN Training Algorithm. Procedure Train is described in section 2.4.3; Filter in 2.4.4; Bu f f erConstruct in 2.4.5 S t = S real t ∪ S GAN t− 1 ∪λ mem S memory t− 1 (2.7) min θ t D ,θ t C (L D R (θ,S t )+ L D C (θ,S t )) (2.8) min θ t G (L G R (θ,S t )+ L G C (θ,S t )) (2.9) 2.4.4 Image Filtering At each mini-batch, the generator outputs fresh images approximating samples from old tasks, with the intent of producing a stochastic up-sampling of the reduced memory core. 
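The assembly of the extended training set in Eq. (2.7) and Algorithm 1 can be illustrated with the following sketch of a single mini-batch step. Names are hypothetical; the AC-GAN parameter updates themselves are not shown, and the class-conditional check used here as the filter is only one of the filtering options discussed next.

import torch

def clogan_minibatch(generator, classifier, x_new, y_new, buffer_x, buffer_y,
                     old_classes, n_replay=64, z_dim=100, lam_mem=1.0):
    """Assemble the extended mini-batch of Eq. (2.7): real new-task data, filtered
    GAN replay of old classes, and memory-buffer samples weighted by lam_mem."""
    # 1) Stochastic replay: sample old-class labels and generate candidate images.
    y_rep = old_classes[torch.randint(len(old_classes), (n_replay,))]
    z = torch.randn(n_replay, z_dim)
    with torch.no_grad():
        x_rep = generator(z, y_rep)
        # 2) Filter: keep only samples that the embedded classifier assigns
        #    to their conditioning label (class-conditional filtering).
        keep = classifier(x_rep).argmax(dim=1) == y_rep
    x_rep, y_rep = x_rep[keep], y_rep[keep]

    # 3) Extended set S_t = real_new U GAN_replay U lam_mem * memory.
    x = torch.cat([x_new, x_rep, buffer_x])
    y = torch.cat([y_new, y_rep, buffer_y])
    w = torch.cat([torch.ones(len(x_new) + len(x_rep)),
                   lam_mem * torch.ones(len(buffer_x))])   # per-sample loss weights
    return x, y, w   # fed to the usual AC-GAN generator/discriminator updates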
However, since these images are then used as training data in a closed loop, they have to be of the best quality possible to minimize error propagation. Thus, at each generation step, images are assessed for their quality and ”filtered” out if they do not correspond to the standard. 16 Here, we use the embedded classifier in CloGAN to generate a prediction for the conditional image. If this prediction does not match the conditioning label, the image is filtered out. When old images are generated for closed-loop replay, they are sampled from a model which has already previously converged for generation and classification of old tasks. The rationale behind this eval- uation is that images which are missclassified have a higher probability of being distorted because of the ongoing training of the new task, and of deviating too grossly from the original distribution. We term this method Class-Conditioned Filtering (CFM). In addition to CFM, we implemented a more complex procedure, ”Discriminator Rejection Sampling” (DRS) proposed in [36]. The latter employs the discriminator of an AC-GAN to ap- proximately correct errors in the GAN generated distribution. Details of the implementation can be found in appendix A.We compare both to a baseline case for DRS which rejects a sample if the output from its discriminator logit layer has a score below some threshold, Soft rejection Filtering (SRF) [37]. Overall, we found that CFM, DRS and SFR perform equivalently well. A table with comparisons is included in the appendix A. Hence, since CFM has a much faster running time, we opted for carrying out only class conditional filtering in our final model. 2.4.5 Dynamic Memory Buffer We fill a small memory buffer with samples and labels of original past data to perform external regularization. The memory can be seen as a stable reference frame throughout training that en- forces a ”smoothness” in the representation for each class. At each task, a selection method is employed to choose the samples from the new task which will go into the buffer, with the aim to maximize sample heterogeneity. Also, since a buffer has fixed size, this selection method is further used to determine which of the old task samples will be removed to make space for the incoming new data, employing again the heuristics of sample heterogeneity. Several buffer selection strate- gies were initially experimented but the best selection scheme was K-means clustering per class, both at image insertion and removal. In more details, the construction scheme is as follows: at the end of each current task, a k-centers algorithm is run per each class in the current tasks’s training 17 labels, super-labeling each image as one of K clusters. At the time of insertion into the memory buffer, we select equal numbers of image samples from each class-specific cluster. Additionally, if the buffer is full we compute the space needed for new images and remove an equivalent num- ber of old images. We do this by assessing their stored super-cluster labels and removing equal amounts of samples per cluster, thereby preserving heterogeneity. By storing the per-class, cluster assignment superlabels we also avoid repeating the clustering operation. 2.4.6 Continual Learning Baselines We evaluate other continual learning algorithms as baseline comparisons. We implement Elastic Weight Consolidation (EWC; [24]), Deep Generative Replay (DGR; [38]) and Memory Replay GAN (MeRGAN; [33]). 
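A rough sketch of the per-class K-means buffer maintenance described above is given below, using scikit-learn's KMeans. The quota-based eviction is a simplification of the per-cluster removal heuristic, and all names are hypothetical.

import numpy as np
from sklearn.cluster import KMeans

def buffer_construct(buffer, x_new, y_new, buffer_size, k_clusters=5):
    """Per-class K-means buffer update. `buffer` is a list of (image, class, cluster_id)
    tuples; insertion and removal keep roughly equal counts per class-specific
    cluster so that the fixed-size buffer stays heterogeneous."""
    # Super-label each new-task image with its within-class cluster id.
    new_entries = []
    for c in np.unique(y_new):
        xc = x_new[y_new == c].reshape((y_new == c).sum(), -1)
        clusters = KMeans(n_clusters=k_clusters, n_init=10).fit_predict(xc)
        for img, k in zip(x_new[y_new == c], clusters):
            new_entries.append((img, int(c), int(k)))

    # Quota per (class, cluster) cell once the new classes are included.
    n_cells = len({(y, k) for _, y, k in buffer + new_entries})
    quota = max(1, buffer_size // n_cells)

    # Keep at most `quota` samples per cell; surplus old samples are dropped,
    # which approximates the "remove equal amounts per cluster" rule.
    kept, counts = [], {}
    for entry in new_entries + buffer:            # new-task samples inserted first
        cell = (entry[1], entry[2])
        if counts.get(cell, 0) < quota and len(kept) < buffer_size:
            kept.append(entry)
            counts[cell] = counts.get(cell, 0) + 1
    return kept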
With DGR, to make our implementation a fair comparison, we use an unconditional GAN with the same architecture and complexity as our CloGAN, except that it has only one Real/Fake output node. For both EWC and DGR, we use a classifier with identical architecture as our embedded classifier/discriminator, but with one fewer output node since a pure classifier does not evaluate Real/Fake attribution. Finally, for MeRGAN we implement an AC- GAN with identical architecture as our CloGAN. 2.5 Experiments 2.5.1 Buffer Selection We experimented with several buffer selection schemes but they under-performed class-specific K-centers. In the other selection methods, we extracted the logit or softmax layer of the discrim- inator/classifier network and computed measures such as Kurtosis and Peak-Difference to assess sample heterogeneity. The latter measure corresponds to the difference between softmax scores of the most probable and second most probable class for a given image. As such, we ranked the im- ages according to each measure and kept the images with a probability proportional to their score. In other words, we performed a roulette weighting procedure such as in genetic selection [39]. 18 Table 2.1 contains performance metrics for 3 buffer selection schemes and no selection (none) dur- ing CloGAN incremental class learning using the FASHION dataset with memory buffer of size 0.16%. Method CloGAN Class-Kcenter 75.87 +/- 0.43 Kurtosis 64.52 +/- 0.73 Peak Difference 57.74 +/- 0.61 None 71.03 +/- 1.4 Table 2.1: Displays % Correct As A Function Of Buffer Selection 2.5.2 Incremental Learning We evaluate continual learning as accumulating knowledge of a growing number of disjoint classes, termed incremental learning. Furthermore, we make use of a challenging variation of incremental learning, “single headed learning”. Here, each task is a disjoint subset of classes from the overall dataset. Performance is evaluated for all previous classes, resulting in a 1/K chance level, where K is the number of classes accumulated to that point. We evaluate incremental class learning in 4 datasets: MNIST [40], FASHION [41], SVHN [42] and E-MNIST [43]. The first 3 were subdivided in disjoint subsets of 2 classes per task, with a total of 5 tasks to cover all the label types. E-MNIST, a larger dataset, was divided into tasks of 3 classes, covering 24 different classes in 8 consecutive tasks. To account for the growing number of classes, we create extra output nodes which are incrementally used, which allows us a single head for all tasks. We distinguish our procedure from Multi-Headed learning [44] in which prediction is con- strained to classes within each task. For instance, a multi-headed version of our MNIST test would use and re-use only two output nodes. After training on full disjoint MNIST with 5 tasks of 2 classes each, when evaluating the first task (digits 0 and 1), a multi-headed would only have to decide between digit 0 vs 1, as opposed to a one in ten decision for single-headed. This typically 19 Figure 2.3: Graphs of Maximum Average Accuracies across continually learned tasks. A) MNIST; B) Fashion; C) SVHN; D) E-MNIST. The dashed lines indicate start of a new task represented by a disjoint set of classes. We illustrate the performance CloGAN as memory sizes are varied (memory allotment is in parenthesis). In red we show catastrophic forgetting when fine-tuning by gradient descent. 
In salmon we show multi-task training until convergence with the full datasets starting from scratch at every task switch (MT-Full). We also show results for EWC, DGR and MeRGAN. leads to much higher accuracies in part because an output node never becomes completely dis- abled, since it is always used for the last task. Finally note that a multi-headed network with only 2 output nodes provides an output that needs to be further disambiguated by knowing the task. 2.5.2.1 Average Continual Performance Figure 2.3 displays the average performance of CloGAN when varying memory buffer size. Our method avoids catastrophic forgetting even with very small buffer sizes such as 0.08% (50 images) and 0.16% (100 images), for both MNIST and FASHION. For the more challenging E-MNIST and SVHN, buffer requirement becomes more demanding. Nonetheless, we obtain superior per- formance over the competing methods with still very reduced memory sizes: only 0.5% (576 images) and 1% (492 images). 20 Table ?? compares maximum average accuracies after training all tasks, for all methods tested. First, when no memory or GAN sampling is performed, the ”FGD” condition which contains only fine-tuning with gradient descent, catastrophic forgetting occurs. Second, EWC accuracy rapidly declines, asymptotically reaching the catastrophic forgetting curve. EWC has already been shown to behave poorly in incremental single-headed paradigms [31, 44]. To further confirm that this degradation of performance was not particular to our implementation, we replicated the permuted- MNIST experiment proposed in the original EWC paper by Kirkpatrick, 2017; and verified that in this learning paradigm EWC performs very well. This discrepancy between the experiments is likely due to the difference in output mapping, see A. Method Mnist Fashion Svhn Emnist MT-full 98.29 86.48 84.43 89.41 CloGAN 98.03 (1.6%) 85.25 (1.6%) 79.30 (5%) 83.50 (5%) CloGAN 92.26 (0.16%) 76.15 (0.16%) 73.08 (1%) 79.14 (1%) MeRGAN 98.25 65.62 31.94 61.92 DGR 94.90 62.11 46.83 42.35 EWC 29.19 26.52 22.55 23.76 FGD 19.97 20.22 19.56 14.28 Table 2.2: Performance after continual learning of all tasks. For CloGAN, memory allotment is in parenthesis Lastly, we report the accuracies for the deep replay methods, DGR and MeRGAN. For MNIST, both DGR and MeRGAN perform very well, reaching 94.9 % and 98.25% whereas CloGAN achieves accuracies of 92.26% with memory of 0.16% and 98.03 with (1.6%). However, for all other datasets, which are significantly harder than MNIST, DGR and MeRGAN both underperform CloGAN by significant amounts. For SVHN, the most challenging, both DGR and MeRGAN display degraded performance after the first task. This behavior likely has cause in a persistent 21 Figure 2.4: Stochastic Up-Sampling of CloGAN graphs. A) SVHN. B) E-MNIST. The contribu- tion of upsampling is indicated by a positive gap between CloGAN and Frozen-CloGAN as more tasks are learned. We also compare to the MT condition in which training is re-started at each task, eliminating forward-transfer of possible shared task features. degradation of generated image quality throughout training. Both methods represent old data ex- clusively by replayed images from an intermediate generator copy. If the generator cannot produce images which represent the original distribution with high fidelity, the gap in representation capac- ity can be enlarged and propagated through successive GAN transfer (copy) operations. 
CloGAN alleviates GAN representation degeneration because it is trained from an extended set containing both replay images from the generator and real images in the buffer. The real images never de- generate and act as an anchor to keep smoothness and quality in the subsequent generated images. DGR has another disadvantage over CloGAN: it does not generate conditioned images, requiring a separate classifier to produce old image labels during training. If that classifier does not have perfect performance, it will inevitably misslabel some images, contributing to error propagation. In the A we tested a new copy-CloGAN that copies the generator at each task switch. However, the copy operation did not provide an advantage if comparing memory usage. 2.5.2.2 Stochastic Up-Sampling We confirm that CloGAN performs an upsampling of the memory buffer selection by comparing our method to two variations in which the AC-GAN is trained only from a memory buffer, both in continual (Frozen-CloGAN) and multi-task settings (MT). For the latter two conditions there is no closed-loop replay of GAN samples. Furthermore, in the MT setting we re-start training at each task switch. The results reported in figure 2.4 correspond to the maximum accuracies achieved 22 for each task for all 3 variations. We verify that stochastic generation in CloGAN provides an upsampling of the buffer and achieves superior performance to Frozen-CloGAN and MT. We show results for the more challenging datasets, E-MNIST and SVHN. Upsampling is indicated when a positive gap between CloGAN and Frozen-CloGAN increases as more tasks are added. For SVHN, the last task shows clear gaps between CloGAN Frozen- CloGAN as well as MT (maximum gap of 11.39% at task 5). Similarly, E-MNIST shows a clear gap in the last two tasks, 7 and 8 (maximum gap of 10.41% at task 8). Additionally, we show that MT under-performs starting in early tasks due to lack of forward transfer since the networks are re-started from scratch at each task switch. Similar upsampling behavior was observed in MNIST and FASHION, with maximum gaps of 6.84% and 9.35% respectively. Additional figures can be found in the appendix A. 2.5.2.3 Per Task Performance In figure 2.5A,B), we exhibit per task accuracies along time. Here, CloGAN is shown to produce stable performance throughout consecutive tasks. For both E-MNIST and SVHN all past tasks maintain high accuracies consistently throughout learning of new classes. For example in EM- NIST, task 1 preserves its accuracy at 84.33% despite the learning of 7 other tasks in succession. Likewise, SVHN task 1 has an accuracy of 83.87%. The results are significantly higher when com- pared to the baseline of catastrophic forgetting and EWC. Moreover, we also display performance for MeRGAN. In E-MNIST, MeRGAN accuracies for tasks 1 and 2 are clearly underperforming CloGAN at the end of training, likely due to image degradation from GAN to GAN transfer. In figure 2.5C), we show generated images by CloGAN and MeRGAN for both SVHN and E- MNIST. We list images taken after training of all tasks. For SVHN we list all classes cumulatively learned. For E-MNIST, since there are 24 classes, we limit the display to the two first tasks as well as the last task (8th). For CloGAN, we find that images are sharp even when using small memory sizes, 1% - SHVN and 0.5% - EMNIST. This is true for beginning tasks as well as latter tasks. In 23 Figure 2.5: Accuracies per task graphs for EMNIST (A) and SVHN (B). Generated images for SVHN and EMNIST (C). 
CloGAN preserves performance for early tasks throughout training. MeRGAN has degenerated performance and image generation for early tasks. In EMNIST, degra- dation is a darkening in the first tasks (a,...,f) in contrast to the last task (v,w,x). contrast, in MeRGAN former taks are sharply more degenerated than latter ones. In EMNIST this can be seen by an overall darkening of letters a through f . 2.6 Conclusion In conclusion, we have shown how using very small buffers in conjunction with stochastic replay can give rise to superior performance compared to simple gradient descent, EWC or other replay methods. In our model, CloGAN, the memory buffer acts as an external regularization for the 24 generator, counteracting image degradation through time. Our approach is relatively easy to im- plement and necessitates only low computation (no full retraining) and memory (small buffer), making it ideal to enable life-long learning on resource-constrained mobile (at the edge) devices. Acknowledgments This work was supported by the National Science Foundation (grant numbers CCF-1317433 and CNS-1545089), C-BRIC (one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA), and the Intel Corporation. The authors affirm that the views expressed herein are solely their own, and do not represent the views of the United States government or any agency thereof. 25 Chapter 3 Lifelong learning Without a Task oracle 3.1 Abstract Supervised deep neural networks are known to undergo a sharp decline in the accuracy of older tasks when new tasks are learned, termed “catastrophic forgetting”. Many state-of-the-art solu- tions to continual learning rely on biasing and/or partitioning a model to accommodate successive tasks incrementally. However, these methods largely depend on the availability of a task-oracle to confer task identities to each test sample, without which the models are entirely unable to perform. To address this shortcoming, we propose and compare several candidate task-assigning mappers which require very little memory overhead:(1) Incremental unsupervised prototype assignment using either nearest means, Gaussian Mixture Models or fuzzy ART backbones;(2) Supervised in- cremental prototype assignment with fast fuzzy ARTMAP; (3) Shallow perceptron trained via a dynamic coreset. Our proposed model variants are trained either from pre-trained feature extrac- tors or task-dependent feature embeddings of the main classifier network. We apply these pipeline variants to continual learning benchmarks, comprised of either sequences of several datasets or within one single dataset. Overall, these methods, despite their simplicity and compactness, per- form very close to a ground truth oracle, especially in experiments of inter-dataset task assignment. Moreover, best-performing variants only impose an average cost of 1.7% parameter memory in- crease. 26 3.2 Introduction Throughout life, humans are presented with unknowns in the environment, to which they must adapt and learn from. Yet, lifelong novelty integration must always coexist with strong mechanisms of protection against interference to consolidated learning, the stability-plasticity balance. Lifelong Learning remains a persistent challenge for artificial intelligence. State of the art deep neural networks are known to undergo a phenomenon termed “catastrophic forgetting”, which describes a drastic decline in the performance of the model on previously learned tasks as soon as a novel task is introduced [20, 21]. 
A straightforward reason is that if data used for acquiring previous knowledge is no longer present when assimilating a new task, gradient descent will optimize the model’s weights under an objective that pertains only to the new samples. This may lead to a parameterization that grossly deviates from previous tasks’ optimal states, and subsequent memory erasure. In recent literature, several methods have been proposed to mitigate catastrophic forgetting, many of which rely on partitioning or biasing a network to accommodate successive tasks incre- mentally [45–48]. However, these approaches require an oracle to assign task identities to incom- ing samples and re-tune the model to an appropriate task-dependent response. In fact, we show that if an oracle is removed from a task-dependent model, performance starkly declines. Such a dependency is a latent impediment towards more realistic lifelong learning settings where task assignments would ideally be inferred in an end-to-end manner or would not be required at all. Methods for substituting a task oracle have not been extensively studied or discussed in the field. Incipient proposals and discussions have occurred in [49–51] but their scalability to difficult benchmarks is poor. To fill this gap, here we propose and compare several models that impose only a very restricted parameter memory increment with respect to the base task-dependent fine-grained classification model. Moreover, we do so in an equalized context, using the same architecture back- bones among task mappers and task-independent baselines, while weighing memory-performance trade-offs of each approach. Lastly, emphasizing that task mapping is itself subject to catastrophic 27 forgetting, we show that task assignment can have largely varying degrees of difficulty depend- ing on the incremental learning paradigm used. For this we distinguish between two overarching domains of incremental learning protocols, (1) learning categories within one dataset versus (2) learning categories over different datasets. Overall, we find that when using our best performing task mapper coupled with a state-of-the- art fine-grained classifier, we can perform better than current task-independent methods tested and at much lower relative memory expenditure. 3.3 Background We focus on a continual learning paradigm where a single neural network must incrementally learn from a series of tasks and, each time a new task is learned, access to previous tasks’ data is limited or absent. We assume that all tasks are unique and clearly separated. Within this framework, we can interpret most of the recent continual learning (CL) literature as being grouped into models which rely on information of task identity at test time, termed “task-dependent”, and those which do not, “task-independent” (Figure 3.1). 3.3.1 Task-Independent Continual Learning This class of algorithms does not require task labels at test time and typically operates on a single- head architecture, i.e., the final layer has as many nodes as classes over all tasks and each task trains on shifted labels of its original classes. Nonetheless, task-independent models can also operate with a single shared output head [52]. Within task-independent methods a major subclass are regularization approaches, which con- strain the change of learnable parameters to prevent ”overwriting” what was previously encoded [32, 53]. 
For instance, Elastic Weight Consolidation (EWC) [24] computes a Fisher information matrix at each task switch to penalize changes to highly correlated weights. In a similar vein, Synaptic Intelligence [25] uses path integrals of loss derivatives and a minimum combined loss to 28 Feature Extraction Head 1 Head 2 Head 3 Feature Extraction Extended Head Task Independent Estimation A) Task Dependent Estimation B) Each input is a pair containing an image and an oracle task- assignment which is used to select/train task- specific parameters Each task receives a shifted output label so that at test-time, prediction is task- independent Task Figure 3.1: (A) Task-Independent (B) Task-Dependent model outlines. Here we show the most common architectural schemes for the two modalities. Task-dependent models require task labels as input. Task-independent do not. constrain crucial parameters. Alternatively, Learning Without Forgetting (LWF) [23] perform dis- tillation using soft labels from a saved snapshot of the network as a means of regularizing weight change. Another important subclass are replay algorithms. These methods purposely approximate the continual learning framework to a multi-task setting by estimating past data distribution either via storage of a select memory coreset [28, 29] or by reconstruction via training a generative model to sample past data and labels [17, 33, 38]. Overall, because task-independent models do not make use of task-specific parameters at test time, the classification endeavour is naturally more challenging. For example, regularization ap- proaches such as EWC have been shown to perform poorly in single-headed scenarios [32]. Replay approaches can quickly escalate to expensive memory usage. Generative replay methods are also limited by the output quality of the data generators themselves, which are, furthermore, subject to forgetting. Finally, task-independent algorithms may still be converted to task-dependent if task- specific elements are added. An example is using regularization in the main network but adding independent output heads. In such cases, even though the algorithms themselves do not require task-IDs, the added task-specific elements will. 29 3.3.2 Task-Dependent Continual Learning Task-Dependent algorithms rely on oracle-generated task labels both for training and testing. Be- cause of this, models can explore a range of task-specific components. For instance, most task- dependent models include one separate output head per task and employ various mechanisms to share the remainder of the network. A common strategy is to make use of task-dependent con- text [45, 47] or mask matrices [48] to partition the original network for each task. In general, partitioning methods come at the cost of large storage overheads [47]. To counteract memory limitations, Cheung et al [45] propose diagonal binary context matrices and perform parameter superposition (PSP) at very low memory cost. Finally, Wen et al [46] at- tempt to minimize performance decay over cumulative tasks by computing task-specific layer-wise beneficial directions (BD) that bias the network towards correct classification. BD’s are inspired by an inverse of the commonly known adversarial directions, in this case to force apart classifi- cation boundaries which might otherwise become tangled during incremental learning. BD is not meant as a standalone method, but provides significant boost in performance when paired with other models such as PSP. 
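Because parameter superposition with binary context keys and beneficial biases serve as the task-dependent backbone used throughout this chapter, a compact sketch of the idea may be useful. This is a simplified stand-in, not the implementations of Cheung et al. or Wen et al.; in particular, the learnable per-task output bias below only approximates the role of beneficial directions.

import torch
import torch.nn as nn

class PSPLinear(nn.Module):
    """Linear layer with per-task diagonal binary context keys (superposition)
    and a per-task additive bias standing in for a 'beneficial bias'."""
    def __init__(self, in_dim, out_dim, n_tasks):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim)                # shared weights
        # One fixed random +/-1 key per task, applied element-wise to the input
        # (equivalent to multiplying by a diagonal binary context matrix).
        self.register_buffer("keys", torch.sign(torch.randn(n_tasks, in_dim)))
        # One learnable task-specific output bias per task.
        self.task_bias = nn.Parameter(torch.zeros(n_tasks, out_dim))

    def forward(self, x, task_id):
        x = x * self.keys[task_id]                              # task-specific context
        return self.weight(x) + self.task_bias[task_id]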
3.4 Methods - Learning without a Task Oracle The major limitation of task-dependent methods is their inability to perform without oracle input. In fact, determining task ID’s is a learning procedure itself subject to catastrophic forgetting. We propose several task-mapper algorithms to substitute the oracle input required for task-dependent continual learning. Our models prioritize simplicity and low memory usage insofar as they are meant as an add-on to a base fine-grained classifier. However, we show that our best mappers are sufficient in obtaining very good task estimation accuracies. 30 Feature Extraction Task Mapper Head 1 Head 2 Head 3 Beneficial Bias PSP Key Frozen or Shared Head A task mapper is trained to substitute the oracle Figure 3.2: Outline of our proposed pipeline. We employ parameter superposition (PSP- Cheung et al, 2019) with Beneficial Biases (BD - Wen et al, 2020) as a state-of-the-art task-dependent backbone for fine-grained classification. We then substitute the oracle input for an end-to-end trainable task-mapper. We experiment with different low-memory task-mapper variants. In all the variants, at test time, the task-mapper predicts task assignments which are then used to activate task-specific PSP context keys and BD biases. To evaluate our task mappers and baselines, we use parameter superposition (PSP - [45]) with Beneficial Biases (BD - [46]) as a standard state-of-the-art task-dependent backbone for fine- grained classification. At test time, the task mapper predicts a task label which activates task- specific PSP context keys and BD biases (Figure 3.2). Both our fine-grained classifier and our task mappers receive as input a shared fixed-feature representation from a CNN (Resnet or other) pretrained on ImageNet. One reason is that, biologically, low-level visual features are thought to be optimized during evolution and early development and later re-utilized, i.e., are task nonspe- cific and do not need to be constantly re-learned. In contrast, higher-level visual features are often task-specific and build upon a recombination of early-level input [54]. In fact, many recent works on meta and adaptive learning have shown state-of-the-art performance when re-utilizing a fixed common feature embedding for all novel tasks [55]. Figure 3.3 shows our proposed methods. 31 A ) Incremental Unsupervised Task Mapper: GMMC; NMC; ART (Ours) Task 1 Task N B ) Supervised Prototype Task Mapper: ARTMAP (Ours) C ) Supervised Replay Task Mapper: PCR (Ours) Coreset of Fixed Size Task 1 Task 2 F ) KM-Heads (Gepperth et al, 2018) – (Baseline) Shallow Perceptron Features FC Layers Task-Specific Clusters from each head Map Field inputs are associated to prototypes H(T) ! = argmin ) (+,-./012 3 ! ,5 ! ) C(T) Output Field Task (…) Coreset is always used up to full capacity. Coreset Replay Incremental Mapping of prototypes to tasks Store Mapping Coreset Replay FC Layers Shared Head or Multi-head PSP-BD Parameter Activation For each task from fine-grained Classifier E(T) D ) Supervised Replay Embedding: PCR-E (Ours) Features FC Layers Shared Head or Multi-head Fine-Grained Classifier Task-Specific Clusters, from each PSP-BD embedding E(T) ! = argmin ) (+,-./012 7 ! ,5 ! ) C(T) E ) KM-Heads with PSP-BD (Modified - Ours) Fine-Grained Classifier G ) Entropy (Oswald et al, 2020) – (Baseline) Heads Features FC Layers Shared Head or Multi-head Fine-Grained Classifier Task-Specific PSP-BD embeddings E(T) ! = argmin ) (70.89:; 7 ! ) Task Stored Features Task E(1) E(2) Shallow Perceptron < = :! 
= ,< ? :! = , < @ :! = < A :! A ,< AB= :! A , < AB? :! A 1 2 3 N N+2 N+1 Features Autoencoder Gates Task 1 Task 2 Task 3 MSE(1) MSE(2) MSE(3) ! = argmin ) (CD7 ! ) H ) Autoencoder Gates (Aljundi et al, 2017) – (Baseline) Figure 3.3: Task-Mapping Models. Mappers receive features taken from a feature extractor net- work that is shared with the pipeline’s fine-grained classifier. A) Incremental versions of ART, Gaussian Mixture Classifier (GMMC) and Nearest Means (NMC) - prototype-based networks which all employ a dictionary mapping of prototypes to task labels. B) Fast ARTMAP archi- tecture trained to form feature to task mapping directly. C) (PCR): maps features to task by a one layer perceptron with aid of a very small coreset for replay. D)(PCR-E) - we sift through PSP-BD keys and biases to collect all task-specific output responses which are then concatenated and used as input to a shallow perceptron. Features are stored in a coreset similar to (C). E) (KM-heads Ours) is a proposed modification to the baseline model in Gepperth et al - we sift through PSP- BD task activated embeddings to perform task-conditioned K-means and the task is assigned as the closest prototype’s task label. F) (KM-heads, Gepperth et al) is similar to (E) except, there are no PSP-BD elements and output is necessarily multi-head. G) (Entropy, oswald et al, 2020) - task assignment is given by the index of the PSP-BD embedding that yields lowest predictive uncertainty. H) (AE-Gates, Aljundi et al) Separate Autoencoders (AEs) are trained for each task. During testing, a sample image is reconstructed by each of the task-dependent AEs and the final task assignment is given by the AE which yielded the smallest mean squared reconstruction error (MSE). 3.4.1 Incremental Unsupervised Task Mappers Different unsupervised prototype-based networks are employed to cluster pre-extracted features from the current task. Prototype-based networks have as an advantage that they encode information much more locally than conventional deep networks (DNs), i.e., in each individual prototype. DNs rely on a highly distributed mapping between thousands of network weights. In incremental learning, localization naturally minimizes inter-task disruption. In fact, for single-headed continual learning, most approaches employing DNs have been shown to require a minimum degree of data replay to prevent overwriting [32]. Nonetheless, prototype networks underperform DNs in classic 32 supervised classification. Thus, we leverage prototype-based networks only for incremental task- mapping, a form of coarse-level identification. By coupling them to an efficient task-dependent DN fine-grained classifier, we seek a complementarity that enables more efficient lifelong learning. 3.4.1.1 Nearest Means Classifier (NMC) In our simplest prototype-based approach, to learn a new task we start with a fixed embedding of this task and perform K-means clustering. The K resulting prototypes all receive an attached super-label equal to the current task’s identity. As new tasks pile up, to perform task prediction at any given time, we keep a running dictionary, D map , of the cluster m i to task t mapping: D map ={(m 1 : 1),...,(m K : 1),...,(m Kt : t)} (3.1) Since the feature representation is fixed, this mapping will not change over time. 
For any given sample, we find the closest stored prototype and use its task label as the predicted task: Task= D map [min i (||m i − x|| 2 2 )] (3.2) 3.4.1.2 Gaussian Mixture Model Classifier (GMMC) Here we employ Gaussian Mixtures (GMs) to perform task-wise incremental prototype generation. The advantages of using GMs over nearest means is that they additionally encode variance infor- mation, and, furthermore, perform soft-assignment, which renders them more robust to outliers during clustering. The incremental task-mapping algorithm is very similar to NMC, differing only in how the prototypes are generated. For each new task, K GM prototypes are computed from the extracted fixed embedding of that task, with overall distribution: f t (x)= K ∑ i=1 w i · N i (x|m i ,Σ i ) (3.3) 33 where m i ,Σ i are mean and covariance of each of the K Gaussian distributions. Moreover, we also generate K Gaussian weights, w i , per task. In order to allow for continual growth of our model we compute absolute weights by multiplying w i to the number of samples at that new task. We store the non-normalized weights W ∗ i . The task mapper then becomes a dictionary containing assignments from GM parameters to task super-labels: D map ={(N i : 1),...,(N kt : t)} (3.4) where N i refers to one of the Gaussians in the accumulated task-mapping leading up to K· T gaussians at a task t. The probability of sample x belonging to the k th gaussian is given by: f(x,i= k)= w i=k N(x|µ i=k ,Σ i=k ) ∑ KT i=1 w i N(x|µ i ,Σ i ) (3.5) Task prediction follows as: Task= D map [argmax i ( f i (x))] (3.6) 3.4.1.3 Fuzzy ART Classifier In this variant we generate the incremental prototypes with an unsupervised fuzzy ART network [56, 57]. ART networks were initially proposed to overcome the stability-plasticity dilemma by accepting and adapting a stored prototype only when an input is sufficiently similar to it. In ART, when an input pattern is not sufficiently close to any existing prototype, a new node is created with that input as prototype template. Similarity depends on a vigilance parameter ρ, with 0<ρ < 1. When ρ is small, the similarity condition is easier to achieve, resulting in a coarse categorization with few prototypes. A ρ close to 1 results in many finely divided categories at the cost of larger memory consumption. One further specification of ART is that an input x of dimension D under- goes a pre-processing step called complement coding, which doubles its dimension to 2D while 34 keeping a constant norm, x ∗ =[x, ⃗ 1− x]. This procedure prevents category proliferation [56] but makes each prototype also occupy double amount of space. During learning of each task, if for x a prototype w i is sufficiently similar by satisfying: ||min(x,w i )|| 1 ||x|| 1 >ρ (3.7) then w i can be updated according to: w i (t+ 1)=(1− β)w t (t)+β(min(x,w i (t))) (3.8) To adapt an unsupervised ART for incremental task classification, as a new task is learned, we setρ= 1 for all prototypes of the already learned tasks. By doing this we allow updates only to the current tasks’ freshly created prototypes, shielding previous tasks’ information from interference. Similarly to sections A-1 and A-2, we also keep a running dictionary, D map with task mappings between prototypes and their corresponding task super-labels. Task is predicted as: Task= D map (argmax i ( ||min(x,w i )|| 1 α+||w i || 1 )) (3.9) where α is a regularization hyperparameter that penalizes larger weights. We found α = 0.001 worked best in our experiments. 
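A minimal sketch of the GMMC mapper follows, built on scikit-learn with diagonal covariances and hypothetical class names. The NMC variant is analogous, replacing the Gaussian components with K-means centroids and nearest-centroid lookup.

import numpy as np
from sklearn.mixture import GaussianMixture

class GMMCTaskMapper:
    """Incremental Gaussian-mixture task mapper. Each new task contributes K
    Gaussians; mixture weights are rescaled by the task's sample count so that
    old and new components remain comparable across tasks."""
    def __init__(self, k_per_task=5):
        self.k = k_per_task
        self.means, self.covs, self.weights, self.task_of = [], [], [], []

    def add_task(self, feats, task_id):
        gm = GaussianMixture(n_components=self.k, covariance_type="diag").fit(feats)
        self.means.append(gm.means_)
        self.covs.append(gm.covariances_)
        self.weights.append(gm.weights_ * len(feats))    # absolute (unnormalized) weights
        self.task_of += [task_id] * self.k

    def predict(self, feats):
        means = np.concatenate(self.means)
        covs = np.concatenate(self.covs)
        w = np.concatenate(self.weights)
        # Diagonal-Gaussian log densities for every stored component, plus log weight.
        log_p = -0.5 * (((feats[:, None, :] - means) ** 2) / covs
                        + np.log(2 * np.pi * covs)).sum(-1) + np.log(w)
        best = log_p.argmax(axis=1)
        return np.array(self.task_of)[best]              # task label of the winning component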
3.4.2 Supervised Prototype Mapping (ARTMAP) This variant is a natural extension to our unsupervised ART model, which employs a supervised fuzzy ART (ARTMAP) architecture [57]. The advantage of using an ARTMAP for task mapping is it they naturally allows for incremental learning without interference to previous prototypes. Whereas in ART we imposed a prototype-specific ρ parameter freezing, in ARTMAP this is no longer necessary since updates only occur if a prototype has the same label as the training sample. 35 Otherwise, in case of a category mismatch, the vigilance is adjusted temporarily, called match tracking: ρ temp = ||min(x,w i )|| 1 ||x|| 1 +ε (3.10) This can be repeated as many times as necessary until finding a label match or until ρ temp = 1, at which point a new sample x is drawn, the current one discarded, andρ temp =ρ. At test time, the task predicted is the label of the closest prototype. 3.4.3 Perceptron with Coreset Replay (PCR) In this model, we incrementally map feature arrays to task assignments using a shallow perceptron aided by replay from a fixed-size memory coreset. The features used as inputs are obtained via transfer-learning from a frozen feature extractor. The pre-trained feature extractors were selected to generate feature arrays with a much lower dimensionality then the original input images, reducing the memory burden on coreset size as well as network size. We hypothesized that using coreset replay only to perform coarse level classification, i.e., task- mapping, instead of fine-grained incremental classification, could radically cut costs of memory storage overall. The intuition is that covering coarse-level data statistics inside a small coreset can be comparably a much easier endeavour. Thus, in a full pipeline, we could obtain a better performance-to-memory ratio overall. 3.4.3.1 Coreset Building At the time of insertion into the memory buffer, we select a number of feature vectors from the new task equal to N/T(t), where N is size of coreset and T(t) is total number of tasks learned until then. Additionally, since the coreset is always filled to full capacity, at each task-switch we re-compute the per-task allowance and remove an equivalent number of old feature vectors per task, maintaining a homogeneous task representation in the coreset at all times. We experimented 36 with other coreset building techniques such as homogeneous coreset sampling according to center means and ART prototypes, but they did not perform as well (additional results in supplementary - B). We hypothesize that this was because the coresets used were very small, and it was more important to guarantee homogeneous task sampling than feature-level prototype diversity. 3.4.3.2 Perceptron Training At each task, our model is trained using an extended dataset which includes feature vectors from the new task and those from past tasks that are contained in the memory coreset, forming an extended training set. The final loss is optimized by stochastic gradient descent: Loss t = Loss new t ∪λ mem Loss memory t− 1 (3.11) whereλ mem weighs the importance of old tasks relative to new. We found that dynamically setting λ mem = T memory T all during training worked best. T memory refers to the number of tasks in memory and T all = T memory +T new . We train in single-head mode: one output node per task. During training, the perceptron is not reinitialized, to enable forward transfer of knowledge. 
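The coreset bookkeeping and the lambda_mem-weighted replay loss of PCR can be sketched as follows. Names are hypothetical, and the coreset is assumed to hold only previous tasks' features while the current task is being trained.

import torch
import torch.nn.functional as F

def rebalance_coreset(coreset, new_feats, new_task_id, capacity):
    """coreset: dict task_id -> feature tensor. After adding the new task,
    keep an equal allowance of roughly N/T samples per task."""
    coreset[new_task_id] = new_feats
    per_task = capacity // len(coreset)
    for t in coreset:
        coreset[t] = coreset[t][:per_task]     # trim every task to its new allowance
    return coreset

def pcr_train_step(perceptron, opt, new_feats, new_task_id, coreset):
    """One PCR update: new-task loss plus a lambda_mem-weighted coreset replay loss,
    with lambda_mem = T_memory / (T_memory + 1)."""
    t_mem = len(coreset)                       # coreset holds previous tasks only here
    lam_mem = t_mem / (t_mem + 1)
    y_new = torch.full((len(new_feats),), new_task_id, dtype=torch.long)
    loss = F.cross_entropy(perceptron(new_feats), y_new)
    if t_mem > 0:
        x_old = torch.cat(list(coreset.values()))
        y_old = torch.cat([torch.full((len(v),), t, dtype=torch.long)
                           for t, v in coreset.items()])
        loss = loss + lam_mem * F.cross_entropy(perceptron(x_old), y_old)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()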
For architecture, we found that a 1-layer perceptron outperformed memory-equivalent deeper perceptrons (supplementary - B). 3.4.4 Perceptron + replay of Task-Specific Embeddings (PCR-E) Previously proposed methods [49, 50] use the final layer task-specific embedding as basis for a task-mapping heuristic. Their hypothesis is that the last-layer activity space for the correct task will differentiate itself statistically from the others. In their case, this is measured via entropy or prototype distances, but here we also propose to map last-layer task-specific embeddings to task predictions via a shallow perceptron. As such, we use a PSP-BD fine-grained classifier and sift through PSP-BD keys and biases to collect all task-specific output responses which are then concatenated and used as input to a shallow perceptron (one hidden layer). At each task, the 37 perceptron learns to map this concatenated array to the correct task label. To not forget previous task mappings, we keep a coreset with previous tasks’ feature arrays as previously described in section C-1. These samples are processed in the same way as new inputs, by also sifting through task-specific keys and collecting the concatenated output responses. 3.4.5 Task-Mapper Baselines and Baseline-Modifications Few works in the literature have addressed task mapping, and their coverage of more difficult benchmarks is very limited. We compare our approaches to three recently proposed methods [En- tropy - [50]; Autoencoder Gates - [51]; Multi-Head KM - [49]]. We also include upper and lower bound baselines which consist of training the task-dependent fine-grained classifier with ground truth task labels and random task assignments, respectively. Finally, we set as an additional base- line a modified version of Multi-Head KM which includes other relevant task-specific elements such as PSP partitioning and BD biases. For all baselines we always employ the same fine-grained classifier architecture and optimizer. The only differences appear in the layout of the output heads which depend on the particulars of each proposed method. 3.4.5.1 Baseline - Entropy This algorithm is adapted from [50] who use an output layer’s predictive uncertainty to establish task identity. In our implementation, given an input pattern, we sift through PSP task partitions and task-specific BD biases for all tasks seen so far, obtaining multiple task-conditioned embeddings. We predict task assignment by the index of the task embedding, f(x,θ t ), which yields lowest predictive uncertainty, i.e., lowest output entropy: task= argmin t (Entropy( f(x,θ t ))) (3.12) 38 3.4.5.2 Baseline - Expert Autoencoder Gates (AE-gates) We adapt the algorithm proposed in [51]. Our implementation details and parameters can be found in B. In this model, one single-layer undercomplete autoencoder (AE) is trained separately for each task, capturing shared task statistics in it’s latent space encoding. The authors refer to these task-specific autoencoders as task gates. At test time, we pass a test sample through all AE’s and compute for each AE the mean squared reconstruction error (MSE). The final task prediction is given by the AE with lowest MSE: task= argmin t (MSE(Autoencoder(x))) (3.13) 3.4.5.3 Baseline - Clustering in Multi-Head Outputs (KM-heads) We adapt the algorithm in [49]. In this model, one separate output head is created for each task, for a total of T heads. 
After supervised training of current task t, a forward pass with the current tasks’s data gives an embedding of head H(t) which is then clustered via k-means. The resulting prototypes are stored. This procedure is repeated after each task is learned, resulting, at time t, in a number T(t) of different embeddings as well as T(t)· N prototypes. At test time, task is predicted by running a sample through the network and, at each head, computing the minimum distance from that sample to the closest head-specific prototype. The overall task prediction is given by the absolute minimum distance from all heads. The winner head is then used to perform fine-grained classification: task= argmin t [min n (∥(H t (x)− P(n))∥)] (3.14) 3.4.5.4 Baseline Modification - Clustering in Multi-Head or Shared-Head with PSP-BD Task-Partitioning (KM-heads Ours) In KM-heads the only task-specific elements are heads and the remainder of the network is shared. Here we modify the algorithm by adding other task-specific elements such as PSP partitioning and 39 Inter-Dataset Learning (A) 8 Datasets Flowers Scenes Birds Cars Aircraft Actions Letters SVHN … P- 1 P- 25 (B) Permuted MNIST Intra-Dataset Learning (C) CIFAR-100 Super-Classes 102 Cumulative number of classes 169 369 565 10 250 635 645 707 717 Cumulative number of classes Aquatic Mammals 5 Fish Flowers Food Containers Fruits and vegetables Household Electronics Household Furniture Insects Large Carnivores Large Outdoor Monuments 10 15 25 20 30 35 40 45 50 Each Task has 5 classes and 3000 images Each Task has 10 classes and 60K images 102 classes; 8,189 images 67 classes; 15,620 images 200 classes; 11,788 images 196 classes; 16,185 images 70 classes; 10,200 images 10 classes; 3,334 images 62 classes; 62,992 images 10 classes; 99,289 images Figure 3.4: Schematic of our incremental learning experiments, which can be divided into two main categories, Inter-Dataset and Intra-dataset. For Inter-dataset, each task is a complete dataset: (A) A sequence of 8 datasets containing natural images [58]. (B) A sequence of 25 datasets, each a permutations of MNIST. In the Intra-Dataset modality, each task is a subset of one dataset. Here we use 10 different super-classes of Cifar-100 each as a separate task. BD biases. By adding these components we enable two versions of readout, one in which multiple heads are used and the other in which all tasks share the same head. 3.4.6 Task Independent Baselines We evaluate how our pipeline (task mapper + PSP-BD fine-grained classifier) performs compared to task-independent methods. We implement EWC - [24]) and GEM – [29] in a completely task- independent manner. Additionally, in reference to our coreset-replay task-mapper of section C, we propose task-independent vanilla replay in which we create a coreset as in C-1, but populate it homogeneously among classes. During training, coreset samples are interleaved with the new classes. For all baselines, we test versions where the output layer contains one node per class versus having a shared head with as many nodes as number of classes of the most populous task. 3.5 Experiment Descriptions - Datasets Our experiments are divided into two categories: Inter and Intra-dataset. In Inter-dataset, each task is a complete dataset whereas in Intra-Dataset, a task is a subset of one dataset. 40 3.5.1 8 Datasets experiment (Inter-Dataset) We consider a sequence of eight object recognition datasets (Figure 3.4), with a total of 227,597 pictures in 717 classes as in [58]. 
In this experiment, the frozen feature extractor corresponds to convolutional layers of Alexnet pretrained on Imagenet. For fine-grained classification, we use the complete embedding which is a 256x6x6 feature array, but, for task-mapping, we perform complete spatial pooling to only 256 units. This pooling operation did not significantly impact performance but greatly reduced storage requirements. For classification, we use the two last fully-connected layers of Alexnet, 4096 units each. We include one separate output layer per task as in [58]. 3.5.2 Permuted MNIST (Inter-Dataset) This experiment is formed by 25 datasets generated from randomly permuted handwritten MNIST digits. Each new task (dataset) has 10 classes. Since the datasets here are simple enough, our task mappers are fed permuted images directly, without previous feature extraction. We use a two hidden-layer fully connected perceptron of 256 hidden units for a PSP fine-grained classifier. Tasks share a head with 10 output nodes. 3.5.3 Sequence of 10 Cifar100 superclasses (Intra-Dataset) In this experiment, a task is equivalent to one super-class from Cifar100 and we use a total of 10 tasks. Each super-class contains 5 different sub-classes that are semantically linked, i.e., different fish species are sub-classes of the fish super-class. Assigning a task label thus constitutes a high- level categorization of sub-classes. For the frozen feature extraction we use resnet-34 up to the penultimate layer, pre-trained on Imagenet. The resnet embedding size is 512. For fine-grained classification, we use a MLP with two hidden layers of 7680 and 4096 units, followed by a shared head of 5 output nodes. 41 Figure 3.5: Parameter Dependencies of the different task-mappers. The vertical axes (task ac- curacy), have been scaled differently between each subplot so that variations to critical parame- ters are amplified. The green markers denote the best memory-performance tradeoff according to score= A− αM. Most mappers achieve optimum memory-performance tradeoffs on an elbow point. However, for the first column (ART/ARTMAP), the vertical red line is a virtual memory wall, meaning we did not increase vigilance further because memory usage was already up to 10- fold larger than for all other task-mappers. Overall best memory-performance tradeoff is obtained by inc-GMMC and inc-NMC. 3.6 Results 3.6.1 Task Estimation - Parameter Dependency We analyze our task-mappers’s performance-parameter dependency in figure 3.5, specifically for inc-GMMC, inc-NMC, PCR, PCR-E, inc-ART and ARTMAP. For inc-ART/ARTMAP the actual number of prototypes per task is not determined a priori and varies according to the natural vari- ance of each task. Instead, we vary vigilance parameterρ, which controls how many prototypes are formed. The closest ρ is to 1, the more prototypes are generated, reaching an upper bound where 42 prototypes become copies of the inputs. For GMMC and NMC we explicitly vary the number of prototypes per task. In the case of PCR and PCR-E we vary coreset size as well as the number of units in the hidden layer. We select the best parameter configuration for each task mapper according to the memory- performance tradeoff as measured by: score= A− αM, where A stands for task classification accuracy, M for memory storage usage and α is the weight of memory usage, which was set to 10 − 6 . We define memory usage as bytes required to store parameters and any appended data which will be used by the task mapper at all times (permanent). 
This is different then transient RAM usage during training. Detailed memory calculations in B. GMMC and NMC provided the best memory-performance tradeoff overall. In the 8-dsets and Cifar100 experiments, these methods used only 5 prototypes. For 8-dsets, GMMC occupied 82 Kilobytes while achieving 93.9% task accuracy. For Permuted, the best mapper is an NMC with merely one prototype per task and virtually 100% task-determination accuracy. Other methods such as inc-ART and ARTMAP show a poor absolute memory-performance tradeoff, where more prototypes naturally lead to better performance, but come at a huge memory cost. For example, withρ = 0.9, inc-ART forms 246 prototypes (575 KB) but at only 67.4% task accuracy. Our task mappers all build upon input from an ImageNet-pretrained feature extractor. The best feature embedding was the one to yield best performance in the fine-grained oracle-PSP-BD upper bound baseline. For the cifar100 experiment, the preferred feature extractor architecture was a Resnet-34, whereas for 8-dsets, it was Alexnet (Supplementary - B). 3.6.2 Task-Dependent Vs Task-Independent Performances To analyze our full pipeline, task-mapper + PSP-BD classifier, we keep only the best parameter configuration of each task-mapper according to the memory-performance tradeoff formula. Table 3.1 contains the fine-grained classification performance of our task-mappers when combined with the fine-grained PSP-BD classifier. We compare our model variants with task-dependent (KM- Heads, AE-Gates and Entropy) and task-independent baselines (EWC, GEM and Vanilla-Replay) 43 Table 3.1: Fine-Grained Classification Performance and Memory Usage for each Task-Mapper and Baseline Method. Task(%) is task mapping accuracy and Main(%) fine-grained classification performance. The first set of results are task-dependent with PSP-BD backbones. The (ours) indicates a task-mapper we propose. The second set includes task-independent baselines. Best results in bold. Method 8-dsets Permuted-MNIST-25 Cifar100 Task-Dependent Memory(KB) Task(%) Main(%) Memory(KB) Task(%) Main(%) Memory(KB) Task(%) Main(%) Oracle 166 100.0 58.0 399 100.0 91.2 207 100.0 76.9 Random 166 13.0 10.9 399 3.9 13.3 207 10.9 26.0 Entropy 166 27.1 25.2 399 78.6 76.9 207 34.3 44.9 AE-gates 1,804 79.6 45.4 1,183 100.0 91.2 2,255 56.6 53.4 KM-Heads† 287 32.0 20.7 345 26.7 18.1 757 21.7 32.1 KM-Heads (ours) 453 3.1 12.5 499 63.9 66.6 227 20.9 32.5 NMC (ours) 207 92.9 55.3 475 100.0 91.2 309 66.8 59.1 GMMC (ours) 248 93.9 55.8 556 100.0 91.2 411 68.0 59.4 ART (ours) 741 67.4 36.5 4,181 93.2 83.5 2,033 34.1 44.6 ARTMAP (ours) 1,304 62.5 35.8 5,567 83.6 79.5 9,554 45.8 47.3 PCR (ours) 376 91.6 54.5 982 98.8 86.0 709 57.3 54.6 PCR-E (ours) 2,932 75.2 45.8 845 66.3 69.2 959 48.7 47.0 Task-Independent (KB) (%) (%) (KB) (%) (%) (KB) (%) (%) EWC 0 - 15.2* 0 - 45.8* 0 - 35.3** Vanilla-Replay 66,060 - 19.3** 20,070 - 20.0** 4,719 - 31.3** GEM 66,060 - 10.0* 20,070 - 79.3* 4,719 - 32.5* Task-Independent: best results were obtained by *using a shared output head; **using one node per class. † has no PSP-BD and is multi-head. with respect to performance and memory storage usage. Memory-performance tradeoff for fine- grained classification is shown in Figure 3.6. The upper bound is the Oracle + PSP-BD. For full pipelines, memory is measured as space occupied by additional parameters needed to ensure remembering; this excludes the original weights of the backbone network. 
For instance, in oracle + PSP-BD the 160 KB memory usage refers exclusively to the PSP and BD components. Despite the simplicity of our best mappers (GMMC, NMC and PCR), when combined with the strong PSP-BD classifier they enable much better performance than the task-independent and task-dependent baselines. In Figure 3.6, the GMMC, NMC and PCR mappers cluster closely to the oracle baseline as well as to the optimal upper-left corner. In contrast, all task-independent methods (EWC, GEM and Vanilla-Replay) require far more memory and achieve lower performance. From Figure 3.6, we also note that 8-dsets and Cifar100 are in general much harder benchmarks than Permuted-MNIST. In the latter, most models achieve above 70% accuracy after 25 tasks, including GEM, KM-Heads and Entropy, which fall drastically to below 40% and 20% in Cifar-100 and 8-dsets, respectively.
[Figure 3.6 plots: panels A) 8-datasets, B) Permuted-MNIST-25, C) CIFAR-100; Fine-Grained Accuracy (%) versus Memory (KB) for all methods.]
Figure 3.6: Fine-grained classification performance versus model memory usage. Squares are task-dependent models and triangles task-independent. The absolute upper bound is Oracle + PSP-BD. Our full pipeline, with GMMC, NMC or PCR + PSP-BD classifier, is the closest to the optimal upper-left corner.
When comparing PCR and PCR-E, we observe that using features from a fixed extractor as input to the task mappers worked much better in general than feeding task-specific logit embeddings. Similarly, the task-dependent baselines KM-heads and Entropy, which also use logit embeddings as input to a task classifier, perform poorly. Additionally, AE-gates performed better than the other task-dependent baselines (Entropy and KM-heads). Yet, in the harder benchmarks of 8-dsets and Cifar-100, AE-gates still significantly underperformed our best proposed task mappers (GMMC, NMC and PCR) both in terms of accuracy and memory usage. For instance, when comparing AE-gates to GMMC on 8-dsets, the former obtains a task accuracy of 79.6% at a cost of 1,804 KB, whereas our GMMC model obtains 93.9% task accuracy at only 248 KB. In Table 3.2 we report the percentage decrease in performance and increase in memory of our best task-mapper + PSP-BD models relative to the oracle + PSP-BD upper bound.
3.6.3 Incremental Inter vs. Intra-Dataset Learning
Our approach is shown to work particularly well for Inter-dataset incremental learning, i.e., 8-dsets and Permuted MNIST (Table 3.2). Inter-dataset variability is in general larger than intra-dataset, making task mapping easier. Several factors contribute to this, one being the greater diversity of low-level image statistics between different datasets. Thus, even in a difficult benchmark like 8-dsets, we achieve very high task classification performances.
3.6.3 Incremental Inter x Intra Dataset learning

Our approach is shown to work particularly well for inter-dataset incremental learning, i.e., 8-dsets and Permuted MNIST (Table 3.2). Inter-dataset variability is in general larger than intra-dataset variability, making task mapping easier. Several factors contribute to this, one being the greater diversity of low-level image statistics between different datasets. Thus, even in a difficult benchmark like 8-dsets, we achieve very high task classification performance. For instance, in 8-dsets, while tasks such as VOC and ACTIONs are close both semantically and in low-level statistics, we also have much more distant task pairs, such as VOC versus SVHN. Meanwhile, in the intra-dataset modality (Cifar100), while each superclass enforces some semantic grouping, e.g., fish versus flowers, the low-level statistics among tasks are very similar since all images are sampled from the same source.

Table 3.2: Best task mapper with respect to Oracle. Percentages are taken from the best versions of our full model (GMMC/NMC + PSP-BD) relative to the upper-bound oracle + PSP-BD.
                            8-dsets   Permuted   Cifar100
Performance Decrease (%)    3.79      0.22       22.75
Memory Increase (%)         0.03      5.37       0.09

3.7 Conclusion

We propose and compare several task-mapping models that impose only very modest memory storage increments when coupled to a fine-grained classification model. We find that when using our best-performing task-mapper with a state-of-the-art fine-grained classifier, we perform better than with baseline mappers and far better than task-independent methods, both in terms of accuracy and memory usage. As such, our results suggest that a hierarchical two-step incremental learning approach, combining cost-efficient coarse-level task classification with fine-grained class identification, can be more advantageous than mapping class and task simultaneously as is done in task-independent models. Recent work in meta and adaptive learning suggests that a crucial element for generalizable learning is a good representation [55]. In fact, we found that for task mapping, inputs from a shared, robust, fixed extractor enabled much better performance than inputs from task-specific embeddings. By exploiting a good representation, we show that simple methods of task classification can work well both for incorporating new information and for not forgetting past information. A pivotal question for the field is then how to generate the most generalizable visual embedding, from which simpler and more memory-efficient continual learning algorithms can arise.

Acknowledgments

This work was supported by the National Science Foundation (grant numbers CCF-1317433 and CNS-1545089), C-BRIC (one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA), and the Intel Corporation. The authors affirm that the views expressed herein are solely their own, and do not represent the views of the United States government or any agency thereof.

Chapter 4
incDFM: Incremental Deep Feature Modeling for Continual Novelty Detection

4.1 Abstract

Novelty detection is a key capability for practical machine learning in the real world, where models operate in non-stationary conditions and are repeatedly exposed to new, unseen data. Yet, most current novelty detection approaches have been developed exclusively for static, offline use. They scale poorly under more realistic, continual learning regimes in which data distribution shifts occur. To address this critical gap, this paper proposes incDFM (incremental Deep Feature Modeling), a self-supervised continual novelty detector. The method builds a statistical model over the space of intermediate features produced by a deep network, and utilizes feature reconstruction errors as uncertainty scores to guide the detection of novel samples. Most importantly, incDFM estimates the statistical model incrementally (via several iterations within a task), instead of in a single shot.
Each time it selects only the most confident novel samples which will then guide subsequent re- cruitment incrementally. For a certain task where the ML model encounters a mixture of old and novel data, the detector flags novel samples to incorporate them to old knowledge. Then the de- tector is updated with the flagged novel samples, in preparation for a next task. To quantify and benchmark performance, we adapted multiple datasets for continual learning: CIFAR-10, CIFAR- 100, SVHN, iNaturalist, and the 8-dataset. Our experiments show that incDFM achieves state of 48 the art continual novelty detection performance. Furthermore, when examined in the greater con- text of continual learning for classification, our method is successful in minimizing catastrophic forgetting and error propagation. 4.2 Introduction Deep Neural network models excel at learning complex mappings between inputs and outputs, so long as the data is drawn from a stationary distribution. Yet, when these models are deployed in the real-world, they may encounter out-of-distribution (OOD, ”novel”) inputs, i.e. input data that does not resemble the training data (in-distribution, ”ID”), prompting misleading predictions. This is a strong limitation because many real world applications require handling non-stationary data. Models deployed in self-driving cars, for instance, will inevitably encounter novel out-of- distribution data (e.g. new terrains, objects, weather) that they have to adapt to. Hence, continual novelty detection is critical for operating in real-world, non-stationary conditions. However, most novelty detection methods were developed for and evaluated against a single fixed split of ID/OOD data. They do not integrate the detected OOD data into the learnt knowledge and perform poorly in dynamic, non-stationary conditions. On the other hand, most approaches in continual learning (CL) focus on mitigating catastrophic forgetting, a phenomenon in which training a neural network on a new task with novel data typically destroys the fixed mapping learned from the previous tasks. Most importantly, they use an oracle to identify novel data, leaving the question of continual novelty detection largely unaddressed. We seek to bridge the divide between the continual learning and novelty detection fields by ad- dressing novelty detection in continual learning, a much more challenging evaluation and deploy- ment paradigm. Specifically, we focus on the task-incremental continual learning setting, where the model increasingly encounters new, additional classes of data without significant distribution shift for the already-seen classes of previous tasks. In this setting, a novelty detector is presented with several OOD/ID separation tasks through time. This can bring about several challenges: (1) 49 Novelty consolidation: integrating detected novel samples to knowledge (to avoid treating them as novel in subsequent tasks) (2) Catastrophic forgetting: remembering this cumulative knowledge through tasks, and (3) Error propagation: minimizing the number of samples falsely flagged as novel to avoid impairing knowledge consolidation. Contribution: We propose a novelty detection algorithm, “incremental Deep Feature Model- ing” (incDFM) that addresses these three challenges. It is trained using only ID data and designed to operate under the continual learning setting. 
incDFM builds a per-class or per-task statistical model over the space of intermediate features produced by the deep network and computes a fea- ture reconstruction score to flag the OOD samples. Most importantly, with the goal of minimizing continual error propagation, incDFM estimates this statistical model incrementally (via several it- erations within a task) for each novel task. At each iteration within a novel task, it recruits the top most ”certain” novel samples that will then improve subsequent recruitments incrementally. incDFM can be used to substitute the novelty oracle used in traditional supervised CL. Finally, we show that incDFM achieves state of the art novelty detection performance when evaluated on mul- tiple datasets adapted for task-based continual learning, such as CIFAR-10, CIFAR-100, SVHN, iNaturalist and the 8-dataset. 4.3 Background and Motivation Novelty Detection: Also known as outlier or out-of-distribution (OOD) detection, novelty detec- tion is a very active research area. It is typically performed by making the network provide an uncertainty score (along with the output) for each input. Common methods include the Softmax score [59] and its temperature-scaled variants such as ODIN [60]. Bayesian neural networks [61] and ensembles of discriminative classifiers [62] can generate high quality uncertainty, but at the cost of complex model representations, and substantial compute and memory. Deep generative models learn distributions over the input data, and then evaluate the likelihood of new inputs with respect to the learnt distributions [63], [64], [65]. Gradient-based characterization of abnormality 50 in autoencoders is highlighted in [66]. Finally, there are methods [67], [68] that learn parametric class-conditional probability distributions over the features and use the likelihoods (w.r.t the learnt distributions) as uncertainty scores. Continual Learning: This paper focuses on task incremental learning, a paradigm where a model continually learns from a sequence of tasks that each introduce novel data but with no or lim- ited access to past, labeled data. The majority of the CL literature has focused on catastrophic forgetting [20, 32] while mostly offloading the task transition detection duty to a so-called nov- elty oracle. Overall, [18] proposes that current continual learning algorithms can be grouped into task-dependent and task-independent models by their reliance on task labels at test time. Task- independent algorithms do not require task labels and typically employ a single shared classifica- tion layer which has as many output nodes as the number of learned classes over all tasks. One subclass consists of regularization-based approaches which aim to mitigate forgetting by con- straining the change of learnable parameters. Alternatively, replay-based algorithms approximate the CL problem to a multi-task setting by either storing [28, 29] or learning to generate [17, 33, 38] past data. Broadly, task-independent models solve a more challenging CL formulation since task- specific parameters are not exploited for test time performance. Task-dependent methods, on the other hand, require the availability of task labels which are usually provided by a task oracle and utilize this information by employing task-specific classification heads and other task-dependent parameters to share the rest of the network for different tasks, e.g. partitioning with context [45, 47, 69] or mask matrices [48, 70]. 
Dependence on an oracle limits their applicability, as determining tasks and detecting task transitions is challenging and also prone to forgetting [18].

Continual Novelty Detection: The problem of novelty detection in the continual learning setting has not been extensively studied or discussed in the literature. Most CL literature has assumed the use of a novelty oracle to indicate fully-labeled task transitions. Incipient proposals and discussions for novelty detection have occurred in [71–73]. Yet, most of these works do not propose novel OOD algorithms; rather, they adapt existing OOD approaches, with limited success. For instance, the closest work, by [72], compares several existing OOD detectors. However, their best results occur under a task-oracle-dependent continual learning setting and use task-dependent CL algorithms to mitigate novelty detection forgetting. We argue that this limits real-world applicability, since it is not realistic to assume that at deployment, unlabeled test samples will be accompanied by their respective task IDs, which would obviate the need for OOD detection in the first place. To our knowledge, no work has yet proposed a reliable OOD detection solution for oracle-less continual learning over several tasks, a more realistic but also more challenging setting.

4.4 Methodology for Continual Novelty Detection

To bridge the gap between the CL and OOD fields, we first establish a novelty detection methodology suited to continual learning. In our framework, each incoming task t is an unsupervised mixture of unseen ID_t samples ("old" classes) and OOD_t samples ("new", unseen classes):

$ID_t = \{\, u_{old,unseen} \mid u_{old,unseen} \sim D_k \,\}, \quad k = 1, \dots, t-1$    (4.1)
$OOD_t = \{\, u_{new,unseen} \mid u_{new,unseen} \sim D_t \,\}$    (4.2)

ID_t comprises unseen samples that were never used in training but come from the same source distributions D_k, k = 1, ..., t−1, that were used to train past tasks, while OOD_t consists of samples from an entirely new distribution D_t. In our experiments, we simulate this by using 80% of the original training samples as novel data at each task and leaving the remainder for introduction at later tasks (at which point they will be old ID data). The goal for the novelty detector is to accurately differentiate between ID_t and OOD_t to produce an estimate of the novel samples, which we denote as $\widehat{OOD}_t$. This then becomes the training data to consolidate knowledge of the novel samples.
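As a concrete illustration of this evaluation protocol, the sketch below builds the unlabeled mixture for task t from per-class sample pools: roughly 80% of a class is introduced as novel when the class first appears, and the held-out remainder is re-injected in later tasks as unseen "old" ID data. Function names and the exact pool-management details are illustrative assumptions.

```python
# Sketch (assumed protocol, illustrative names): constructing the per-task
# unlabeled mixture x_t = OOD_t U ID_t used to evaluate continual novelty detection.
import numpy as np

def split_new_class(samples, rng, novel_frac=0.8):
    """When a class first appears: ~80% become OOD_t now; the remaining ~20%
    are held out to reappear in later tasks as unseen old/ID data."""
    idx = rng.permutation(len(samples))
    cut = int(novel_frac * len(samples))
    return samples[idx[:cut]], samples[idx[cut:]]

def build_task_mixture(new_samples, old_holdouts, n_old_per_class, rng):
    """Unlabeled pool for task t: the newly introduced class (OOD_t) plus
    unseen holdout samples from every previously seen class (ID_t)."""
    parts = [new_samples]
    for pool in old_holdouts:                    # one holdout pool per old class
        take = rng.choice(len(pool), size=min(n_old_per_class, len(pool)),
                          replace=False)
        parts.append(pool[take])
    x_t = np.concatenate(parts, axis=0)
    return x_t[rng.permutation(len(x_t))]        # shuffled and unlabeled
```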
This methodology leads to the additional challenges of catastrophic forgetting and error propagation (alluded to in Section 4.2) that are not present in conventional offline OOD detection. First, as more and more classes/tasks are encountered, incDFM has to increasingly add to its stored representation of what is ID and remember the cumulative {D_k}, k ≤ t, going forward. If past D_k's are not properly remembered and represented in knowledge, this can result in catastrophic forgetting and failure to identify incoming old samples as ID. incDFM addresses this by building a per-class or per-task statistical model to detect novel samples at each task. The per-task parameters, once stored, are not interfered with in future tasks, minimizing forgetting (refer to Section 4.4.1.2). Second, as already mentioned, whenever the novelty detector finalizes its selection of novel samples $\widehat{OOD}_t$, these are then used as training data to consolidate knowledge and expand what is considered as ID_{t+1} for the following task. However, since $\widehat{OOD}_t$ can contain misclassified samples, this could result in an inaccurate representation of D_t during consolidation, which leads to error propagation that grows progressively worse. Cumulatively, these two aspects can lead to severe performance degradation. We show that incDFM's incremental recruitment strategy (Section 4.4.1.3) minimizes error propagation.

Lastly, we also evaluate continual OOD detection in an inherently more difficult experimental paradigm where ID and OOD sets are drawn from different splits of the same dataset (intra-dataset). In particular, we propose experiments of intra-dataset class-incremental learning where, at each task, only one novel class is introduced, until all classes of a dataset are covered. ID and OOD splits sampled from the same dataset tend to be close and harder to disentangle [18]. In contrast, most offline OOD detection literature has focused on OOD/ID splits between different datasets (inter-dataset, e.g., CIFAR-10 as ID vs. SVHN as OOD); these are typically comprised of highly divergent data distributions, causing the model to first exploit accidental low-level statistical differences instead of more meaningful semantic variances. Overall, combining a naturally harder ID/OOD setting per task with having to remember what is ID through time makes most conventional OOD detectors underperform. incDFM's iterative estimation and recruitment algorithm is better suited to such continual and challenging ID/OOD splits.

[Figure 4.1 schematic: two panels of an example task (iteration i = 1 vs. i >> 1); legend: blue = ID/old, orange = OOD/new, red = selected samples; the most confident samples are stored at each iteration and used to update the consolidated incDFM parameters.]
Figure 4.1: incDFM estimates novelty incrementally per task. A task's unlabeled data mixture is shown here with ID/old samples in blue and OOD/novel samples in orange. At each iteration within one novel task, incDFM recruits the top most "certain" novel samples (in red) according to the evaluation function S_i. It then removes them from the unlabeled pool. At iteration 1, new and old distributions are entangled, but they tend to separate at later iterations as incDFM improves its estimate of novelty.

4.4.1 incDFM Model

4.4.1.1 Deep feature Modeling

incDFM is built upon the OOD detection technique proposed in [67], based on probabilistic modeling of deep features. Consider a deep neural network (DNN) trained on an N-class classification problem. For an input x, let u ≜ F_l(x) denote the output at an intermediate layer l of the network. In [67], class-conditional probability densities are learnt on this set of intermediate deep features, and the likelihood scores from these are used to discriminate between ID and OOD samples. A principal component analysis (PCA) transformation, T : H → L, is simultaneously learnt to map the high-dimensional features onto an appropriate lower-dimensional subspace, dim(L) ≪ dim(H), prior to density estimation. The PCA transformations are also learnt on a per-class basis. For incDFM, this implies that a separate PCA transformation, T_t, is learnt for each task t. In [74], it was shown that the feature reconstruction error (FRE) score, defined as

$FRE(u, \mathcal{T}) = \lVert u - (\mathcal{T}^{\dagger} \circ \mathcal{T})\, u \rVert_2$    (4.3)

is highly effective at discriminating between ID and OOD samples, where T† is the inverse PCA transformation (computed as the Moore-Penrose pseudo-inverse of T). The intuition behind FRE is that OOD samples will lie outside the subspace of ID samples and will hence result in higher FRE scores.
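A minimal sketch of the FRE score of Eq. 4.3 is given below, assuming scikit-learn's PCA as the subspace transform T (the specific library and the retained-variance fraction are assumptions, not the dissertation's exact choices): fit PCA on ID deep features for one class/task, then score any feature vector by the norm of its reconstruction residual.

```python
# Sketch of the FRE score (Eq. 4.3): large FRE => the sample lies far from the
# ID subspace and is likely OOD. Assumes scikit-learn PCA as the transform T.
import numpy as np
from sklearn.decomposition import PCA

def fit_subspace(id_features, var_ratio=0.995):
    """Learn T for one class/task: PCA keeping (e.g.) 99.5% of the variance."""
    return PCA(n_components=var_ratio).fit(id_features)     # id_features: (N, D)

def fre_score(features, pca):
    """FRE(u, T) = || u - (T_dagger o T) u ||_2 for each row u of `features`."""
    recon = pca.inverse_transform(pca.transform(features))  # back to feature space
    return np.linalg.norm(features - recon, axis=1)
```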
4.4.1.2 Knowledge consolidation and storage:

To obtain deep features, incDFM employs a frozen feature extractor pre-trained via unsupervised contrastive learning on an independent large dataset, e.g., ImageNet. Using a frozen pre-trained deep feature extractor showed superior performance to fine-tuning, which is in line with recent findings in the adaptive learning field [54, 55]. At each task t, we process all unlabeled samples x_t = OOD_t ∪ ID_t through the feature extractor and collect deep features u_t = F_l(x_t), which are used as input to the main incDFM algorithm. Further, as mentioned earlier, we learn and store the parameters of T_t for each task separately (Procedure Consolidate in the algorithm box of Fig. 4.2). This consolidation approach has two advantages for continual learning. First, by modeling OOD_t via isolated per-task (per-class) parameters, we minimize catastrophic interference when new classes are introduced later on. The consolidated per-class parameters are never altered and so cannot actually be "forgotten", assuming no distribution shift for old tasks. In deep neural networks (DNNs), by contrast, the majority, if not all, of the parameters are shared between classes, and the per-class importance of each weight is not as easily assessed. As such, when new classes are introduced, it is naturally much more difficult to isolate inter-class interference in DNN weight space. This is one of the reasons most CL approaches tackling single-headed classification require a replay strategy to avoid "forgetting", which can quickly escalate in memory usage. This brings us to the second advantage: our consolidation approach is both fast and memory-efficient. More specifically, it is fast because it requires a single PCA fitting operation per task. Additionally, it entails low memory usage, since it only retains the PCA transformation T_t per task, which is almost always less memory-expensive than storing raw image samples for replay, as is typical in task-independent CL approaches.

Algorithm 2: incDFM - Incremental Novelty Recruitment per Task
    Input: u_t - deep features of the current task
    Require: I - maximum number of iterations; R - recruitment per iteration
    Initialize: S_old ← KnowledgeScores(X_t, {T_k} for k < t); i ← 1;
                N_left^new ← length(X_t); S_1 ← S_old;
                Indices_t ← [1, ..., length(X_t)]; Indices^new ← [ ]
    // Select the most certain novel samples per iteration until the stopping criterion
    while (i < I) and (N_left^new > 0) do
        Indices_i^new, N_left^new ← SelectTop(S_i, R)
        // Concatenate the newly selected indices to those previously selected
        Indices^new ← [Indices_i^new, Indices^new]
        // Remove the selected indices from the unlabeled pool
        Indices_t ← Indices_t − Indices_i^new
        T_i ← Consolidate(Indices^new, X_t)
        S_i^new ← FRE(u_t, T_i)
        S_{i+1} ← S_old / (λ · S_i^new)
        i ← i + 1
    {T_k}, k = 1, ..., t ← Store(T_{t,I})

Figure 4.2: Procedures KnowledgeScores and SelectTop are described in Section 4.4.1.3; Consolidate in Section 4.4.1.2.

4.4.1.3 Novelty Detection and Selection: Incremental recruitment

When a new task arrives, the stored consolidation parameters ({T_k} for k = 1, ..., t−1) are used to initialize an incremental recruitment of novel samples. We express the unlabeled deep features from the incoming task t as u_t = F_l(x_t), x_t = ID_t ∪ OOD_t.
For all unlabeled samples u_t, we first compute the FRE scores for all t−1 stored sets of transforms and then take the minimum FRE:

$S^{old}(u_t) = \min_{k} \, FRE(u_t, \mathcal{T}_k), \quad k = 1, \dots, t-1$    (4.4)

Intuitively, this indicates which of the older classes/tasks each sample is closest to. We sort the set of unlabeled samples by their FRE scores, with the intuition that ID_t samples will tend to yield lower values of FRE than OOD_t samples. We could presumably set a threshold and select the samples whose scores exceed it to constitute $\widehat{OOD}_t$, and then estimate T_t from those. A relaxed threshold could result in $\widehat{OOD}_t$ containing a large number of ID samples misclassified as novel, whereas a high threshold might result in very few novel samples being available for the computation of T_t. Either way, this could lead to poor estimates of T_t, and the resulting error would propagate and progressively worsen over subsequent tasks. Hence, we propose an iterative method to estimate the novel samples in an incremental fashion, as outlined in Fig. 4.2. In the first iteration, i = 0, we compute S^old(u_t) as described in Equation 4.4 (Procedure KnowledgeScores) and select only the highest R percent of the S^old scores, corresponding to the most confident "novel" samples so far (farthest from old). These samples constitute a first estimate, $\widehat{OOD}_{t,0}$, of what is OOD. They are used to consolidate knowledge by computing T_{t,0} and using the latter to obtain S^new_0 ≜ FRE(u_t, T_{t,0}). For all subsequent iterations i ≥ 1, we compute a composite evaluation score S_i which combines S^old and the previous iteration's S^new_{i−1}:

$S_i = \dfrac{S^{old}}{\lambda \, S^{new}_{i-1}}, \qquad S^{new}_{i-1} \triangleq FRE(u_t, \mathcal{T}_{i-1})$    (4.5)

and use this composite score to select the next top R percent, Indices_i^new (Procedure SelectTop in Algorithm 2), which are then concatenated with the indices from all previous iterations, Indices^new, and used to compute the next estimate of T_{t,i}. The idea behind this algorithm is to increasingly separate hard ID/OOD splits (Fig. 4.1). At each iteration, OOD (novel) samples will tend to have low scores S^new_{i−1} and high S^old, resulting in the highest composite S_i values. To minimize errors, we set R conservatively, to recruit only the most confident OOD detections. Moreover, as more and more confidently OOD-estimated samples are recruited, i.e., as Indices^new grows in size, the better the subsequent estimate of the PCA parameters T_i becomes. This in turn yields progressively more reliable S^new_i scores. To estimate a stopping point for incremental recruitment, we set a maximum total number of iterations and employ a small validation set containing only in-distribution (old) samples, {V_k}, k < t, to estimate whether there is still a probability of having non-recruited novel samples left (suppl. C). In practice, at each task we reserve a small percentage of detected novel samples for validation and do not use them for fitting any parameters. For fairness, the same validation set is used across all baselines that we compare with.
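The sketch below condenses this incremental recruitment loop (Algorithm 2), reusing the fit_subspace and fre_score helpers from the FRE sketch above. It assumes the composite score is the ratio of Eq. 4.5; the iteration count, recruitment rate R, λ, and the omission of the validation-based stopping criterion are all simplifying assumptions for illustration.

```python
# Condensed, illustrative sketch of incDFM's per-task incremental recruitment.
# Reuses fit_subspace() / fre_score() from the FRE sketch above.
import numpy as np

def incremental_recruitment(u_t, stored_transforms, max_iters=10, R=0.2, lam=1.0):
    """u_t: (N, D) deep features of the current unlabeled task mixture.
    stored_transforms: per-task PCA transforms T_1..T_{t-1} (already consolidated).
    Returns indices flagged as novel and the newly fitted transform T_t."""
    # S_old: distance to the closest previously consolidated class/task (Eq. 4.4).
    s_old = np.min(np.stack([fre_score(u_t, T) for T in stored_transforms]), axis=0)
    unlabeled = np.arange(len(u_t))
    selected = np.array([], dtype=int)
    scores = s_old.copy()                       # the first iteration uses S_old alone
    T_new = None
    for _ in range(max_iters):
        if len(unlabeled) == 0:
            break
        k = max(1, int(R * len(unlabeled)))     # recruit the top R% most-novel samples
        top = unlabeled[np.argsort(scores[unlabeled])[::-1][:k]]
        selected = np.concatenate([selected, top])
        unlabeled = np.setdiff1d(unlabeled, top)
        T_new = fit_subspace(u_t[selected])     # re-consolidate on all recruits so far
        s_new = fre_score(u_t, T_new)
        scores = s_old / (lam * s_new + 1e-12)  # composite score (Eq. 4.5)
    return selected, T_new
```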
[Figure 4.3 block diagram: raw unlabeled train data → frozen feature extractor → deep features → incDFM novelty detection (filter: only novel) → pseudo-label y = #classes + 1 → long-term-memory DNN classifier trained with an experience-replay coreset.]
Figure 4.3: Full Pipeline - unsupervised class incremental learning with incDFM.

4.4.2 Full Pipeline: unsupervised class-incremental learning using incDFM for continual novelty detection

We show that incDFM can be coupled to an unsupervised class-incremental classification pipeline (Figure 4.3). We take the same experimental setting previously described, where at each task we have a mixture of holdout samples of old classes and one new class at a time, all unlabeled. Over tasks, we keep a counter of how many novelties have been introduced so far, C_t (equivalent to the number of classes in this case). At each task, after incDFM has selected a final estimate of novel samples $\widehat{OOD}_t$, these are pseudo-labeled as C_{t−1} + 1 and the counter is incremented. As the classifier, we use a perceptron on top of the frozen feature extractor that is also shared with incDFM, similarly to [18, 46]. The detected novel samples are then used to train the classifier with the pseudo-labels as targets, and are stored in a coreset for replay at future tasks. We employ a fixed-size coreset with the same building strategy as in [17]. Thus, at each task, the classifier is trained on the current task's detected $\widehat{OOD}_t$ samples and on the samples in the coreset, using experience replay to mitigate forgetting (suppl. C).
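To make this bookkeeping concrete, the sketch below shows one plausible implementation of the pseudo-labeling and replay-coreset steps around the detector; the class-balanced trimming strategy and all helper names are illustrative assumptions rather than the dissertation's exact coreset-building procedure.

```python
# Illustrative sketch of the unsupervised class-incremental bookkeeping:
# pseudo-label detected novelties and maintain a fixed-size replay coreset.
import numpy as np

class ReplayCoreset:
    def __init__(self, capacity):
        self.capacity = capacity
        self.feats = None                      # (M, D) stored deep features
        self.labels = None                     # (M,) stored pseudo-labels

    def add(self, feats, labels):
        if self.feats is None:
            self.feats, self.labels = feats, labels
        else:
            self.feats = np.concatenate([self.feats, feats])
            self.labels = np.concatenate([self.labels, labels])
        if len(self.labels) > self.capacity:   # trim, keeping classes balanced
            classes = np.unique(self.labels)
            per_class = self.capacity // len(classes)
            keep = np.concatenate([np.where(self.labels == c)[0][:per_class]
                                   for c in classes])
            self.feats, self.labels = self.feats[keep], self.labels[keep]

def process_task(u_t, stored_transforms, coreset, class_counter):
    """Flag novel samples, pseudo-label them as the next class, store for replay."""
    novel_idx, T_t = incremental_recruitment(u_t, stored_transforms)  # sketch above
    stored_transforms.append(T_t)              # consolidate the new task transform
    pseudo_y = np.full(len(novel_idx), class_counter + 1, dtype=int)
    coreset.add(u_t[novel_idx], pseudo_y)
    return class_counter + 1
```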
4.5 Experiments

Intra-dataset class-incremental experiments: For intra-dataset experiments, we consider four datasets: CIFAR-10 (10 classes), CIFAR-100 (super-class level, 20 classes) [75], EMNIST (26 classes) [43] and iNaturalist21 (phylum level, 9 classes) [76]. We adapt all datasets for class-incremental learning by starting with 2 classes for the first task and adding one class at each incremental task until all classes are covered.

Inter-dataset Experiment: In this experiment, the novelty per task is an entire novel dataset (with multiple new classes). This is a CL version of the conventional ID/OOD setting. We compare how much easier it is to detect ID/OOD shifts in this CL inter-dataset paradigm versus the previous CL intra-dataset class-incremental experiment. We consider a sequence of eight tasks, each being one of 8 object recognition datasets (Flowers [77]; Scenes [78]; Birds [79]; Cars [80]; Aircrafts [81]; VOC Actions [82]; Letters [83]; SVHN [42]), as in [58]. The full 8-dataset sequence contains a total of 227,597 pictures in 717 classes (suppl. C).

4.5.1 Baselines:

We compare and benchmark our method against several commonly used offline OOD detectors: (i) the Mahalanobis-based OOD detector [68]; (ii) the Softmax-based OOD detector [59], which uses the softmax output as a confidence score; and (iii) Generalized ODIN [84], which introduces a decomposed softmax scoring function as an improvement over Softmax. Note that while Softmax and ODIN both rely on classification layers to detect novelties, Mahalanobis relies on distance scores computed from intermediate features of a DNN. For ODIN and Softmax we use the same classifier architecture as in our full pipeline (section 4.6.3), i.e., a perceptron (MLP) on top of the frozen feature extractor.

Since these baselines were developed for offline OOD detection, we make the necessary adaptations to use them in continual learning. First, for Mahalanobis we keep a coreset with select past ID samples to estimate the joint covariance needed for the metric. Second, because the MLP classifier in both Softmax and ODIN is plastic and updated continually, catastrophic forgetting is expected to have a degrading effect on continual novelty detection performance unless an alleviation mechanism is employed. In the intra-dataset class-incremental experiments, we apply coreset-based experience replay [85], the same CL strategy as in our full pipeline. Task-dependent algorithms cannot be applied in this case since each task is only one class. Alternatively, to mitigate forgetting in the inter-dataset experiment, we used PSP [45] and separate readout heads per task, similar to [18]. PSP is a state-of-the-art task-dependent CL algorithm which partitions the weights of an MLP by projecting inputs layer-wise onto orthogonal multidimensional directions, thus minimizing interference between tasks. The original PSP formulation requires a task oracle. Hence, we also propose a version of PSP that is oracle-less: we loop through all PSP task-conditioned MLP partitions and output heads, collecting task-dependent Softmax/ODIN scores, and then select the task-dependent score yielding maximum certainty among them as the final task-independent score. Finally, we also compare against a direct implementation of DFM, which uses the same per-task knowledge consolidation strategy as described in Section 4.4.1.2 but does not employ our proposed incremental recruitment algorithm. This serves as an ablated version of incDFM. For all four baselines, we select $\widehat{OOD}_t$ by applying a single threshold per task on the corresponding uncertainty scores. The threshold is chosen based on a validation set containing ID samples. For fairness, we employ the same validation set {V_k}, k < t, used by incDFM (see Section 4.4.1.3). For all baselines we perform a hyperparameter sweep over thresholds and report the best results (suppl. C).

4.5.2 Architecture and Training Parameters:

For all methods considered, including ours, we use a ResNet50 [86] backbone pre-trained on ImageNet using SwAV [87], a contrastive learning algorithm. For OOD methods which rely on classification (ODIN, Softmax), and also for the end-to-end class-incremental learning pipeline, we use an MLP with a 4096-dimensional hidden layer as the classifier. The backbone is kept frozen for all tasks and only the classifier is fine-tuned over the course of an experiment. We compared against fine-tuning the backbone continually using experience replay, but the reported frozen-backbone approach worked best for incDFM and the baselines (see suppl. C), in line with the results reported in [88]. We optimize using ADAM [89] with a learning rate of 0.001 and decrease the learning rate when on a plateau (i.e., ReduceLROnPlateau). Finally, for all methods requiring a coreset, e.g., our end-to-end incremental learning pipeline (Section 4.6.3), ODIN, Softmax and Mahalanobis, we keep between 5-10% of the dataset, converted to deep embeddings (outputs of the frozen feature extractor), in a fixed-size coreset. For the end-to-end pipeline, we train for 20 epochs per task, with the exception of ODIN, which we train for 40 epochs per task in all experiments since it takes longer to converge.
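The following sketch mirrors this shared setup, assuming PyTorch: a frozen SwAV-pretrained ResNet-50 backbone, an MLP head with a 4096-dimensional hidden layer, ADAM at 0.001 and ReduceLROnPlateau. Loading the SwAV weights through torch.hub is an assumption (any equivalent checkpoint would do), and the output width would grow as new classes are detected.

```python
# Sketch (assumed PyTorch) of the shared architecture/training setup described above:
# frozen SwAV ResNet-50 features + a trainable MLP classifier head.
import torch
import torch.nn as nn

# Assumption: the SwAV repo exposes this torch.hub entry point; any SwAV
# ResNet-50 checkpoint could be substituted here.
backbone = torch.hub.load('facebookresearch/swav:main', 'resnet50')
backbone.fc = nn.Identity()                  # expose the 2048-d penultimate features
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

num_classes = 10                             # grows as new classes are detected
classifier = nn.Sequential(
    nn.Linear(2048, 4096), nn.ReLU(),
    nn.Linear(4096, num_classes),
)

optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min')
criterion = nn.CrossEntropyLoss()

def train_step(feats, labels):
    """One replay-augmented step: `feats` mixes current-task pseudo-labeled
    deep features with coreset features; the backbone stays frozen throughout."""
    optimizer.zero_grad()
    loss = criterion(classifier(feats), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```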
4.6 Results

4.6.1 Preliminary offline evaluation of incDFM and baselines

In Table 4.1 we evaluate incDFM's performance in a conventional offline inter-dataset setting: training on one ID dataset (one task) and evaluating once on another, OOD, dataset. We use CIFAR-10 and CIFAR-100 as ID datasets and SVHN as OOD. We implemented each baseline (DFM, ODIN, Softmax and Mahalanobis) using the same architecture as described in Section 4.5.2: a frozen ResNet50 backbone followed by a trainable MLP, the latter only for the methods that perform classification (i.e., ODIN and Softmax). We show that incDFM outperforms the compared baselines as measured by AUROC scores. AUROC stands for the area under the receiver operating characteristic curve, which plots the true positive rate (TPR) of in-distribution data against the false positive rate (FPR) of OOD data while varying a threshold. It can be regarded as an averaged score.

Table 4.1: AUROC scores for offline OOD estimation.
ID → OOD              incDFM   DFM    Mahal   Softmax   ODIN
CIFAR-10 → SVHN       99.9     93.4   93.1    88.2      95.8
CIFAR-100 → SVHN      99.9     93.6   87.7    83.5      88.4

[Figure 4.4(a): per-task AUROC plots for CIFAR-10 and CIFAR-100.]
Figure 4.4: Intra-dataset Novelty Detection: (a) AUROC scores per task for novelty detection using detected samples as train/fit data for model update. (b) Average AUROC and AUPR scores after all tasks:
            CIFAR-10         iNaturalist      CIFAR-100        EMNIST
            AUROC   AUPR     AUROC   AUPR     AUROC   AUPR     AUROC   AUPR
incDFM      98.2    98.2     91.5    90.4     95.4    95.5     98.7    98.8
DFM         74.0    72.1     61.3    60.5     63.8    62.2     75.4    71.1
Mahal       75.2    72.2     60.9    61.6     58.7    57.7     67.7    65.0
Softmax     66.6    63.5     70.3    66.8     56.4    52.7     61.1    57.6
ODIN        81.6    79.7     75.3    71.6     53.7    53.2     63.3    60.6

4.6.2 Continual Novelty detection

4.6.2.1 Intra-dataset OOD: Class incremental novelty detection

Figure 4.4(a) displays the per-task performance of incDFM when evaluated on intra-dataset class-incremental novelty detection, shown for CIFAR-10 and CIFAR-100. Additionally, Figure 4.4(b) shows the average performance across tasks for all datasets. We evaluate performance using AUROC and AUPR scores; the latter refers to the area under the precision-recall curve with respect to the novelty class. Overall, our approach outperforms the competing methods. incDFM shows consistent performance over tasks, with minimal to no degradation. We can directly observe the advantage of incDFM's incremental recruitment algorithm by comparing it to DFM (our ablated baseline), which employs a single threshold for OOD selection instead. Additionally, we can observe that the performance gap between incDFM and the compared methods is much larger in this class-incremental setting than in the offline setting of Table 4.1. When the ID_t and OOD_t sets are drawn from the same dataset, as is the case in our class-incremental setting, OOD detectors cannot exploit low-level statistics to arrive at a prediction. Instead, the distinction must come from more conceptual, class-defining properties, which is arguably harder. Moreover, in this continual setting, other factors such as forgetting and error propagation pose a further challenge.

4.6.2.2 Inter-dataset OOD: dataset incremental novelty detection

Table 4.2: Inter-dataset continual learning (8-dataset) with and without Task Oracle.
               incDFM           DFM              Mahal            Softmax          ODIN
Task-Oracle    AUROC   AUPR     AUROC   AUPR     AUROC   AUPR     AUROC   AUPR     AUROC   AUPR
Yes            -       -        -       -        -       -        99.9    99.9     99.8    99.6
No             99.9    99.9     95.0    94.5     94.7    94.3     69.4    70.1     64.2    64.0

Table 4.2 shows results for incDFM in a different continual learning setting, where each task now corresponds to a fully novel dataset (experiment described in Section 4.5). In general, all OOD detectors, including incDFM, show higher performance in this experiment than in the previous intra-dataset experiments (refer to Fig. 4.4). This again reaffirms the notion that inter-dataset ID/OOD splits are easier to disentangle than splits within the same dataset. Additionally, we show that the baselines Softmax and ODIN suffer considerably in performance when they do not have access to ground-truth task labels (second row).
This finding is in line with other works that have explored task-oracle substitutions in CL [18]. In fact, having access to task labels for unlabeled ID samples is unrealistic in novelty detection since, if a task label is known, it obviates the need for novelty detection in the first place.

[Figure 4.5(a): average incremental classification accuracy versus training epochs for CIFAR-10 and CIFAR-100 (ODIN plotted on its own epoch axis).]
Figure 4.5: Unsupervised incremental classification pipeline - (a) Average incremental classification accuracy over tasks. (b) Final classification accuracy after all tasks:
            Cifar10   iNaturalist   Cifar100   EMNIST
MT          93.1      91.1          75.6       93.1
Oracle      94.0      85.3          77.2       92.8
incDFM      92.0      74.6          74.7       90.3
DFM         75.2      60.8          46.2       71.3
Mahal       63.6      55.8          35.6       48.5
Softmax     59.9      62.5          36.2       41.7
ODIN        62.5      64.3          36.8       46.1

4.6.3 Full Pipeline Results

Figure 4.5 shows results for our end-to-end pipeline for unsupervised incremental class learning. In incDFM, the experience-replay coreset stores $\widehat{OOD}_t$ samples and their assigned pseudo-labels (see Section 4.4.2). Thus, we propose an upper-bound baseline, Oracle, which employs the same classifier and experience-replay strategy but uses the ground-truth novelties (OOD_t) for training and for populating the coreset. This is equivalent to stopping error propagation. We also compare to the multi-task (MT) upper bound, which trains all classes of the dataset jointly, without continual learning. Firstly, Figure 4.5(b) shows that our experience-replay baseline using ground-truth labels (Oracle, dark gray) is reliably close to the MT upper bound for all datasets, suggesting a consistent mitigation of forgetting through time by using coreset-based replay alone. Yet, most importantly, we see that incDFM (red) is very close to the upper-bound Oracle for all tasks and datasets, despite using only pseudo-labels. In contrast, all other baselines incur a significant drop in classification performance through time. The reason is likely the compounded effect of error propagation, since they provide very suboptimal novelty detection performance across tasks (refer back to Figure 4.4). Poorer $\widehat{OOD}_t$ estimates per task propagate wrong pseudo-labels for training and for coreset storage, adding detrimental noise to the overall training and increasingly hurting performance through time.

4.6.4 Ablation and Hyper-parameter sensitivity study in incDFM

4.6.4.1 Error Propagation in continual OOD detection:

Figure 4.6: (a) Error propagation from using estimated $\widehat{OOD}_t$ (yes) vs. ground-truth OOD_t (no) for knowledge consolidation, AUROC:
                  CIFAR-10         CIFAR-100
Estimated OOD     Yes     No       Yes     No
incDFM            98.2    98.2     95.9    96.1
DFM               74.0    83.0     63.8    65.2
Mahal             75.2    75.8     58.6    58.8
Softmax           66.6    73.8     45.7    67.2
ODIN              81.6    87.3     56.0    69.5
(b) incDFM iterations and recruitment % (Cifar10, averaged across tasks) [plot].

We analyze here the effect of using estimated ($\widehat{OOD}_t$) versus ground-truth (OOD_t) samples for knowledge consolidation, Figure 4.6(a). $\widehat{OOD}_t$ estimates contain a degree of error, i.e., ID samples that are erroneously pseudo-labeled as novel or, conversely, too many OOD samples labeled as old. When this error percentage grows too large (as is often the case for hard OOD/ID splits), it begins to detrimentally and progressively affect the ability to perform OOD detection at subsequent tasks. We call this continual, compounded effect "error propagation"; in incDFM it is largely minimized by incremental recruitment, which keeps prediction errors low throughout tasks. In contrast, we can see that classification-based OOD detectors, e.g., ODIN and Softmax, are particularly vulnerable to error propagation.
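For reference, the AUROC and AUPR values reported throughout this results section can be computed directly from per-sample novelty scores, treating "novel" as the positive class. The generic sketch below assumes scikit-learn; any equivalent implementation of these standard metrics would serve.

```python
# Generic sketch: computing the per-task AUROC / AUPR metrics used above from
# per-sample novelty scores (higher score = more novel).
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def novelty_metrics(scores, is_novel):
    """scores: (N,) novelty/uncertainty scores; is_novel: (N,) 0/1 ground truth."""
    auroc = roc_auc_score(is_novel, scores)
    aupr = average_precision_score(is_novel, scores)  # PR area w.r.t. the novel class
    return auroc, aupr

# Toy usage:
# auroc, aupr = novelty_metrics(np.array([0.1, 0.9, 0.8, 0.2]), np.array([0, 1, 1, 0]))
```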
4.6.4.2 Incremental Recruitment sensitivity in incDFM:

The maximum number of iterations within a task and the percentage of the estimated remaining samples recruited at each iteration are hyperparameters of incDFM. We analyze the sensitivity to each in Figure 4.6(b). Note that when the number of iterations equals 1 on the x-axis, we fall back to single thresholding per task, the same as in our ablated baseline DFM. Iterative recruitment peaks in performance at roughly 5 iterations for CIFAR-10, and we observed a similar trend across all datasets. Moreover, performing two iterations already yields a 22% improvement compared to single thresholding as in simple DFM. Additionally, incDFM is less sensitive to the recruitment percentage and follows an intuitive trend where, for very low recruitment percentages, it takes more iterations to converge (yellow line, 15% recruitment rate). Overall, incDFM with 10 iterations achieves up to a 39.4% improvement over simple DFM.

4.6.4.3 Mixing Ratio of New/Old in each task:

In the previous experiments, each task contained a balanced number of old and new data. However, increasing the ratio of old to new data can have a detrimental effect on precision and recall performance. Old classes can be interpreted as distractors, and more distractors can make novelty detection harder. We show the effect of data imbalance on performance in Table 4.3.

Table 4.3: AUPR scores with task data imbalanced towards more old samples (Cifar10).
New:Old    incDFM   DFM    Mahal   Softmax   ODIN
1:1        98.2     72.1   72.2    63.5      79.7
1:2        97.0     56.6   59.5    46.2      62.0
1:3        95.9     49.3   52.4    39.3      51.7
1:4        95.0     44.2   48.1    34.6      45.1

Overall, incDFM is much more robust to imbalances than the other baseline methods. From a 1:1 to a 1:4 new-to-old ratio in the unlabeled pool, incDFM decreases only 3.3% in performance (AUPR scores), whereas the baselines decrease between 33% (ODIN) and 41.4% (Softmax).

4.7 Conclusion

This paper presented a novel, self-supervised continual novelty detector. In contrast to the prevailing novelty detection approaches that operate in a static setting, we designed a method capable of handling realistic, non-stationary conditions with recurrent exposure to new classes of data. Using cumulative consolidated knowledge of what is in-distribution up until the new task, our method incrementally estimates a statistical novelty detection model associated with the new task by iteratively recruiting the most certain novel samples and updating itself to progressively enable better estimates. Extensive experimentation in the challenging task-incremental continual learning setting shows state-of-the-art performance in continual novelty detection, minimizing catastrophic forgetting and error propagation at each task through time.

Acknowledgments

This work was supported by the National Science Foundation (grant numbers CCF-1317433 and CNS-1545089), C-BRIC (one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA), and the Intel Corporation. The authors affirm that the views expressed herein are solely their own, and do not represent the views of the United States government or any agency thereof.

Chapter 5
Shape Encoding for Object Recognition in Artificial Agents

5.1 Abstract

Basic level object recognition in humans is based primarily on global and local shape.
In contrast, deep neural networks seem to rely much more heavily on other visual features (texture, color, etc) instead of shape. In this work we seek to disentangle the reliance on shape and non-shape cues for different object recognition models. To this end, we propose experimental benchmarks, e.g. iLabShape and ShapeY , to evaluate and measure the degree and quality of the shape encoding of a model including invariance to viewpoint representation. Moreover, we investigate how engineered shape features compare to or enhance traditional end-to-end learned deep features for shape-based categorization. Overall, we show that using a DNN on top of a high-dimensional contour structure containing shape-rich information leads to an enhanced performance when compared to employing conventional RGB input. Expanding on this, we develop a first instantiation of ShapeRNet, a higher-order shape conjunction extractor from contour input. In future research, ShapeRNet can be greatly optimized to then act as a shape-preprocessor for state-of-the-art OR algorithms. 5.2 Introduction Over the last few decades, “Deep neural networks” (DNNs) have occupied a central role in Object recognition (OR) for Computer Vision (CV) [90, 91]. State-of-the-art DNN-based OR systems 68 can now yield high performance on large-scale image classification tasks. Specifically, they now achieve up to 99% top-5 accuracy using the multi-million image dataset, Imagenet, a feat that 10 years ago would have seemed impossible. Yet, there are still numerous challenges for artificial OR. Though loosely inspired by biol- ogy, DNN-based OR systems display certain undesirable performance characteristics that con- trast sharply with human capabilities. For instance, trained DNs are highly susceptible to image perturbations – e.g. noise, blur, weather effects, adversarial perturbations - most of which no human would have any issue with [92]. DNNs are also highly overconfident. To illustrate, a high-performing DNN trained on ImageNet consistently found non-existent structure in random patterns classifying 70% of those random patterns as horses at a high confidence. Additionally, DNNs are susceptible to catastrophic forgetting, where learning new knowledge (e.g., new object classes) can cause old classes to no longer be recognized almost instantaneously. Overall, these behaviors evince the limits of generalizability and robustness of trained DNNs when compared to their biological counterparts. We hypothesize that a major limitation in current OR-DNN mod- els is that they lack the strong representational biases that support object recognition in biological vision systems. In order to perform well on standard benchmark tests, DNs must grow to enor- mous sizes, leading to models that are massively over-parameterized, lack generalizability and can exhibit worrying brittleness. Several studies suggest that the underlying classification strategies used by DNNs differ greatly from those employed by the human visual system and that this also renders them susceptible to fail- ures not found in biological systems. Basic level object recognition in humans is based primarily on global and local shape - as can be found in line drawings of an object, for example [93–96]. By “basic level” we refer to broader OR categories such as “person”, chair”, “horse”, “kitchen”, etc. Consistent with this, texture and color plays almost no role in basic level categorization of objects and scenes for humans. 
However, conventional DNNs trained on natural-image datasets have been shown to under-represent global shape. Instead, DNNs rely most heavily on texture, color, local 69 shape and context cues [97, 98]. In one manifestation of their limited capability at shape represen- tation, most trained DNNs have difficulty recognizing abstract representations of shape, such as line drawings (see [99] for a detailed comparison of human and DN classification behavior). We hypothesize that a main reason for poor performance at shape-based recognition in DNNs is that shape is naturally harder to learn and encode then other recognition cues. To build a line- drawing-like abstraction of an object class first requires layers to reliably extract contours, which is in itself a difficult image processing problem. Shape-defining object contours are difficult to extract from natural images because they are quite distantly – and quite nonlinearly – related to raw image pixels. As a reference, DNNs are known to learn Gabor-like filters in early layers. Such filters are strongly activated by textures, which cover the natural world, yet textures on average carry very little shape information. Moreover, object shape is then described by further highly complex global and local higher constellation of features, built from contours. Naturally then, DNNs empirically prefer to use their vast numbers of parameters to opportunistically exploit simple cues for image classification, such as texture and color [98]. Main Contributions: Current DNN-OR models lack robustness (being fooled by myriad in- put perturbation types), have poor generalization across datasets, and are mostly dependent on non-shape cues. Insofar as shape is critical for human basic-level OR, we hypothesize that many of the current shortcomings arise precisely from the lack of proper shape representational biases in these models. In line with this, we propose that providing OR models with more adequate shape encoding should improve robustness and generalizability of OR tasks. In this work: (1) In order to disentangle shape from non-shape cues, we propose experiments and benchmarks to evaluate and measure the degree and quality of the shape encoding of a model, including invariance to viewpoint representation; (2) To enforce a shape-only classification task, we pre-process RGB im- age inputs using a commercial-grade contour extractor that nominally replicates V1 functionality. The contour extracted input is then fed into a conventional DNN for the latter to extract higher- order shape-based features. Our results suggest that using contour-like input enhances shape-based 70 classification performance when compared to RGB input. The results were obtained using our iL- abShape shape evaluation benchmark; (3) It is likely that the most difficult portion of shape-based learning comes after the detection of basic shape features (contours), i.e., detecting and binding contours into higher-order feature conjunctions and pooling operations which together form the basis of a complex and useful shape-based representation for classification. As such, instead of using a DNN as a higher order shape feature extractor we design a specialized zero-free-parameter higher-order shape extractor, ShapeRNet, on top of the previously mentioned contour extractor. In ShapeRNet we build higher-order feature conjunction detectors based on statistics of natural oc- curring high-order conjunctions that are chosen to be gradually viewpoint-invariant. 
The latter is enforced via careful engineering with specialized domain-knowledge. The guiding principle here is that we will only demand of the network that it produces similar shape codes for inputs that have similar shape content, as defined by our built-in viewpoint variation tolerances. We evaluate ShapeRNet using a carefully constructed shape benchmark, miniShapeY , which is able to quali- tatively assess viewpoint invariance and contrast invariance of an OR representation. Moreover, we also analyzed how our ShapeRNet embedding compares to a conventional pre-trained DNN embedding. 5.3 Background 5.3.1 Human Object Recognition Relies on Shape Information Object recognition in humans is attributed to the ventral temporal cortex (VTC) pathway [100] which begins with simple and complex cell encodings in V1/V2 and extends to neurons in higher upstream areas representing complex shape features containing partial invariance to changes in viewpoint, size and translation [101, 102]. The VTC pathway culminates in a categorical organi- zation in the IT cortex (monkeys) and LOC (humans), where different regions respond selectively to complex stimuli such as faces, bodies, tools, scenes, among other categories. Moreover, in hu- mans, the LOC has been shown to respond to intact objects with clear shape representation and 71 not to scrambled objects. Overall, these pathways can be interpreted as transforming low-level vi- sual input into view-invariant object descriptors that enable perceptual categorization. Such object descriptors would contain high-level shape information that can be compared to stored memory representations and combined with top-down signals such as goal relevance and emotional mean- ing [103]. From a mathematical perspective, early visual areas such as V1/V2/LGN form mani- folds that are highly curved and tangled but through successive invariance operations, the system achieves a representation in IT cortex that allows resulting object manifolds to be separated by simple geometric hyperplanes [104]. Evidence that object classification in the human visual system relies heavily on object shape representation dates to early studies in modern cognitive neuroscience [93, 94, 96, 105]. For in- stance, humans have been shown to rely on shape to recognize objects in the absence of other visual information such as texture and shading [106]. In fact, both adults and children preferen- tially use shape to categorize novel objects when presented with conflicting color and texture cues. Additionally, VTC fMRI responses show correlated selectivity for categories with shared shape features such as among human faces and animal representations, reinforcing a theory of visual- shape encoding. On the whole, human shape representations seem to be significantly robust to changes in viewpoint, contour perturbations as well as simple deformations due to bending and stretching suggesting that our visual system heavily relies on global shape properties and not only on local contour information [107]. Nonetheless, the visual representation of shape is a very nontrivial problem, requiring the re- duction of an essentially infinite-dimensional object (shape geometry) to a few perceptually mean- ingful dimensions. Key information for object recognition has long been attributed to line draw- ings, representations of object shape very distantly and nonlinearly related to original pixel images [93, 96]. 
Note that human infants can efficiently categorize line drawings without any prior expe- rience [99], suggesting that the ability to abstract “form” from bounding contours is likely innate, a product of evolution. Yet, how shape itself is represented for object recognition remains an open 72 question. Studies such as [107], have hypothesized that the visual system incorporates a 3D skele- tal descriptor to determine object identity, showing that such models correlate to object similarity judgements in humans. Other models posit that a higher-order shape representation results from the combined information of several intermediate-stage adjoining local orientations and positions that form higher-order features such as curves or even more complex conjunctions [108]. The lat- ter thus points to a shape representation potentially derived from a very efficient contour encoding system. 5.3.2 Deep Network Object Recognition Does not necessarily follow human- like patterns Deep Neural Networks differ in their classification error patterns when compared to humans. When evaluated on object classification of Imagenet, human errors typically arise from confusions among similar classes (e.g. one breed of dog with another) or from confusions as to which object in a cluttered scene needs to be identified (e.g. in a picture of a table containing books, pens, and paper, the correct label may be only pen). In contrast, DNNs often list among their top-5 choices objects with no seeming similarity, such as a bird and a needle [109] and their misclassifications show a similar lack of pattern. Additionally, DNNs can be made to misclassify samples simply using filters slightly altering coloration or texture of an image, whereas humans are much less likely to be fooled by such modifications. DNNs also struggle to learn more abstract representations such as line drawings [99]. The most prevalent theory of learning in DNNs is that they combine low-level features (e.g. edges – such as in Gabor filters) towards increasingly complex shapes at higher layers. Yet, recent findings suggest that DNNs trained on natural image datasets such as Imagenet rely mostly on texture information instead. For example, they have been shown to classify texturized images very well even when global shape information is eliminated [97, 110]. Using only texture information (Gram matrix) from a DNN trained on ImageNet and applying a classifier on top of this represen- tation results in hardly any loss in classification performance [110]. In contrast, DNN performance 73 suffers drastically from elimination of texture cues [109]. Moreover, recent DNN models such as [97] that use constrained receptive field sizes throughout all layers are able to reach surprisingly high accuracies on ImageNet, even though this limits recognition to only small local patches and inhibits global shape integration. These findings suggest that DNs rely much less on global shape information than previously thought. One may then speculate if this different representational bias limits transferability of knowledge in DNNs, both in the case of cross-dataset transfer and perhaps also underlying semantic drift (overwriting) of knowledge in continual learning. In ac- cordance with this revised vision, very recent work by [98] suggests that eliminating texture cues in fact leads to better transferability of DNN knowledge. This result emphasizes the notion that a shape-based encoding is much more generalizable than a texture-based representation for object categorization. 
5.4 Benchmarks to study Shape reliance in Object recognition models 5.4.1 Towards a pure shape recognition benchmark via randomizations of non-shape cues - iLabShape benchmark Our first attempt at a shape recognition benchmark used the iLab 3-D object dataset [111] which contains 10 object categories of toy vehicles (car, race-car, pick-up truck, airplane, military, heli- copter, monster truck, van and semi). Each object instance has 88 3D views that span 11 azimuth angles and 8 turntable rotations in the plane, besides including several different backgrounds per instance. The iLab dataset allowed for certain control over background and pose availability which were both attractive to formulate a shape-based benchmark. iLabShape benchmark: We developed a benchmark to disentangle shape from the most common shape-extrinsic cues (texture, color, etc) used in OR. In our benchmark, termed “iL- abShape”, we attempted to eliminate common non-shape cues by randomizing both color and 74 Figure 5.1: Hue and Polarity transformations used to establish a shape-exclusive classification benchmark. contrast-polarity of the original iLab RGB images. Applying this procedure generated unpre- dictably colorized images that nonetheless contained all original shape information. The full iL- abShape benchmark contained images with altered hue palettes and randomly flipped polarity. In figure 5.1 we illustrate the effect of both hue rotation and polarity inversion on iLab images. 5.4.2 Measuring Shape Recognition Capacity quantitatively using Nearest Neighbor Matching – ShapeY benchmark To be able to generalize based mostly on shape cues, an OR model should have a robust invariance profile. In other words, it must produce similar internal visual codes when familiar objects or scenes are viewed from different perspectives, under different lighting conditions, and/or with different backgrounds. We wanted to be able to quantitatively assess the degree of shape invariance of an OR model. As such, access to a dataset with 3D viewpoint variation was fundamental. Unfortunately, the iLab dataset lacked precise pose information, which prompted us to create our own custom 3D dataset, ShapeY [112]. Moreover, based on this dataset, we designed a novel shape benchmarking procedure. 75 Figure 5.2: Viewpoint Exclusion in miniShapeY chair category- Positive match candidates (PMCs) for view 8 (blue box) out of 11 in the series with CVT = ’pw’. Rows show all 8 series containing ’pw’. The difficulty of the matching task is controlled by excluding positive match candidates in the ”vicinity” of the reference view in viewpoint space. The ”exclusion zone” shown (red shading) is for an exclusion radius re = 2. This image is a courtesy of lab member Jong Woo Nam. 5.4.2.1 miniShapeY Image Set The ShapeY dataset was created using Blender and publicly available 3D models from Shapenet.org [113]. In this work we use a smaller version of ShapeY , which we call miniShapeY dataset, com- posed of 10 distinct basic level object categories (chair, airplane, plant, etc.). Each category has 321 3D views. In miniShapeY , each 256x256 image depicts a single object, grey in color, against a black or white background. Object views are grouped into ”series” representing different combi- nations of viewpoint transformations (CVTs). Each series is centered on a common ”origin view” of the object, with 5 viewpoint steps moving away from the origin in both directions for a total of 11 views per series. 
Five types of rigid transformation were used (x, y, pitch, roll and yaw; scale changes were excluded so as to preserve object detail that would be lost at smaller scales), leading to 31 possible CVTs (31 = 2^5 - 1, i.e., the 5 transformations chosen 1, 2, 3, 4, or 5 at a time). In each viewpoint step the object was transformed simultaneously along all dimensions in the CVT. For example, in a series combining "x" and "roll", each step in the series came with a horizontal shift of 3.3% of the image width, combined with 9 degrees of image-plane rotation.

Figure 5.3: Contrast exclusion in the ShapeY benchmark – from a query image, one is only allowed to match to contrast-reversed images within the same object category. Other objects (distractors) have the same background as the query. This examines whether the encoding is more sensitive to background or to the underlying shape. This image is courtesy of lab member Jong Woo Nam.

5.4.2.2 ShapeY benchmark

Our ShapeY benchmark seeks to measure the degree of shape recognition performance of an OR model based on nearest-neighbor view matching within the model's embedding space. Given an input image, a response is scored as "correct" if the closest match is another view of the same object and "incorrect" if the closest match is a "distractor" from a different object category. ShapeY evaluates invariance by computing viewpoint exclusion and contrast exclusion sets:

(1) Viewpoint Exclusion - Overall, our benchmark enforces the condition that there should be no single view of any other object that matches an input better than the best-matching same-object view. Thus, given a particular view of a lamp, even if 99 out of 100 of the closest views in the database are images of the same lamp, if the single closest match is a view of a boat, the trial is scored as an error. Moreover, this benchmark allows task difficulty to be finely controlled through the use of "exclusions". In the case of a viewpoint exclusion, we choose an exclusion radius "re", and then eliminate as positive match candidates (PMCs) all same-object views surrounding, and therefore most similar to, the input view, up to "re" steps along a designated set of transformation dimensions. Figure 5.2 illustrates this. Measuring the decline in matching performance as re increases allows us to quantify the degree of 3D viewpoint variation that the shape-representing system can tolerate before false matches to similar-looking distractor objects begin to increase in frequency;

(2) Contrast Exclusion - In addition to ignoring modest changes in viewpoint, a shape-based recognition system must be capable of ignoring changes in non-shape cues, including the colors and textures of objects and backgrounds, changes in lighting conditions, etc. We quantified the ability to cope with these types of changes through the use of "appearance exclusions". An example of an appearance exclusion is a "contrast exclusion", in which object views rendered in the original format with black backgrounds can only be matched to views of themselves with light backgrounds (similar to the polarity transformation in the iLabShape benchmark, Section 5.4.1).
As such, the ShapeY benchmark includes every object view with both white and black backgrounds. Given a "reference" view, all same-object views with dark backgrounds were excluded as match candidates (removed from the PMC set), forcing the system to recognize the same shape despite the change in background. All views of other objects were not subject to the exclusion, however, and were available to match in the original black background only. Figure 5.3 illustrates contrast exclusion.

Overall, we believe the pairwise view matching criterion of ShapeY is a good starting point for evaluating recognition competence, since rating the similarity of two views of an object without knowing the object class is most likely a more fundamental capacity than categorization itself. For instance, it would seem very paradoxical if a system could correctly classify objects while being unable to rate the similarity of two object views.

5.5 Experiments and Results

5.5.1 Enforcing shape-exclusive object recognition in conventional DNNs

Due to the importance of shape in human basic-level object recognition, we propose a series of experiments to disentangle shape cues from shape-extrinsic ones (texture, color, etc.) and to measure the degree of shape-reliance of an OR model.

5.5.1.1 Lack of concept generalization between a DNN trained on real-world images (Imagenet) and tested on toy vehicles (iLab)

In our preliminary experiments, we analyze whether training a conventional DNN on realistic representations of objects (photos) is enough to classify more abstract representations (toy vehicles); in other words, whether the learned representations encompass the abstract notion of a class. We used a DNN (Alexnet) trained on Imagenet and presented it with pictures of toy cars from the iLab dataset [111]. The top-5 predictions were histogrammed and the 20 most frequent classes are shown in Figure 5.4. Imagenet has various vehicle classes. However, note that among the top-5 outputs, none are vehicles, and moreover, many have no understandable similarity to the original input object in terms of shape. This experiment follows a similar rationale to that of [109], who presented pictures of sketches to an Alexnet trained on regular Imagenet. They also reported a complete lack of generalizability from photos to sketches. Overall, these results corroborate the intuition that deep networks have poor cross-dataset generalizability.

Figure 5.4: Histogrammed top-5 outputs of Imagenet-trained Alexnet when presented with toy car test samples from iLab. Under those conditions, the most frequent class predicted was syringe, followed by measuring cup and envelope, none of which bear any sensible shape resemblance to toy cars. Moreover, no car or vehicle class stands within the top-5.

5.5.1.2 Training a DNN with regular unaltered iLab images and testing on iLabShape images

Next, we analyzed how well a pre-trained DNN (pre-trained on Imagenet) could generalize to iLabShape. For these experiments, we used a well-known convolutional neural network, Alexnet [75]. When training Alexnet on a 70K version of the original iLab images, we obtained a performance of 75.61% when tested on an unmodified test set containing original RGB images. However, when applying the manipulations that compose iLabShape, we saw performance drop drastically, particularly with polarity scrambling. These results (Table 5.1) further attest to the lack of knowledge generalizability and suggest a heavy reliance on shape-extrinsic learning strategies.

Randomization at Test Time    None     Hue      Polarity
Accuracy (%)                  75.61    56.25    27.23

Table 5.1: Performance of Alexnet trained on regular iLab and tested on randomized iLab images.
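For reference, the two randomizations behind Table 5.1 can be implemented as simple image transforms. The sketch below is illustrative only; the exact hue-rotation range and the probability of a polarity flip used in iLabShape are assumptions, and the file name is hypothetical.

import numpy as np
from PIL import Image

def random_hue_rotation(img: Image.Image, rng: np.random.Generator) -> Image.Image:
    """Shift the hue channel by a random offset, leaving shape information intact."""
    hsv = np.array(img.convert("HSV"))
    shift = rng.integers(0, 256)                              # random hue offset (wraps around)
    hsv[..., 0] = (hsv[..., 0].astype(np.uint16) + shift) % 256
    return Image.fromarray(hsv.astype(np.uint8), mode="HSV").convert("RGB")

def random_polarity_flip(img: Image.Image, rng: np.random.Generator, p: float = 0.5) -> Image.Image:
    """Invert pixel intensities (contrast polarity) with probability p."""
    if rng.random() < p:
        return Image.fromarray(255 - np.array(img))
    return img

rng = np.random.default_rng(0)
img = Image.open("ilab_example.png").convert("RGB")           # hypothetical iLab image
img = random_polarity_flip(random_hue_rotation(img, rng), rng)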
5.5.1.3 Training a DNN with randomized iLabShape images and testing on original unperturbed iLab images

Randomization During Training    None     Hue      Polarity    Both
Accuracy (%)                     75.61    76.73    76.78       77.21

Table 5.2: Performance of Alexnet trained on randomized iLab images and tested on original iLab images.

Conversely, we show that a network trained on randomized images achieves equivalent or even superior performance (when both transformations are applied), similar to results reported by [98]. These results (Table 5.2) suggest that DNNs can be forced to relinquish non-shape cues and rely more strongly on a shape-based representation. In fact, forcing the network towards a shape-based representation resulted in a small performance boost, suggesting less overfitting to the training data.

Figure 5.5: [left] A contrast image – contains pure contour information projected back to 2D image space. [right] Visualization of the high-dimensional shape-rich contour input, eliminating the orientation information and keeping only the locations of contours in the pyramidal scale format.

5.5.2 Using contour descriptors (explicit object boundaries) to train DNNs

Given the importance of contours for object classification, we hypothesized that pre-processing images with a commercial-grade contour detector that nominally replicates V1 functionality could further benefit the learning task, since it would provide an even stricter shape-only representation. Furthermore, we speculated that by relieving the DNN from the burden of extracting object contours from pixels (something that our custom contour-detection network excels at), the remaining layers of the DNN could be used for other, more complex shape-agnostic features and enhance classification capacity. The custom contour-extraction network generates a multi-dimensional vector of spatial contour probabilities. The full contour descriptor contains 24 orientations x 49 shapes (approximately 1,500 features) per pixel, across 5 pyramidal scales of the original image. Additionally, we also evaluated performance when using a greyscale "contrast image" that contains pure contour information. The contrast image is derived from the high-dimensional structure but projected back onto the original flat image space via a kernel smoothing operation. In these experiments, we employed a 40K image subset of our iLabShape benchmark. The subset was selected to include neutral backgrounds and to keep object instances whose categorization was clear (unambiguous labels). Both the contour structures and the contrast images were generated from this same subset.

5.5.2.1 Comparing the contour descriptor input to RGB using networks matched for number of parameters

The major obstacle in comparing performance across different stimuli (e.g., RGB versus the high-dimensional contour descriptor) was the mismatch in input shape. While an RGB image has a spatial dimension and 3 depth channels, the high-dimensional contour structure is much deeper (about 1,500 channels). To reduce dimensionality, we built a compacted version of the contour descriptor which grouped basic shape information, such as straights versus curves, as well as neighboring orientations. Note that pooling across orientations, in this case reducing from 24 to 12 angular subsets, is not explicitly performed in CNNs, but is likely crucial for embedding angle invariance into representations. Additionally, since the contour features are very sparsely located, we also pooled across space (a schematic sketch of this compaction is given below).
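The sketch below illustrates the compaction, assuming the raw descriptor for one pyramid scale is stored as an (H, W, 24, 49) array of contour probabilities; the pairing of adjacent orientations, the grouping of the 49 shapes into a few coarse classes, and the 2x2 spatial pooling are illustrative assumptions rather than the exact parameters used.

import numpy as np

def compact_contour_descriptor(raw: np.ndarray, shape_groups: list) -> np.ndarray:
    """
    raw: (H, W, 24, 49) per-pixel contour probabilities for one pyramid scale
         (24 orientations x 49 elementary contour shapes).
    shape_groups: index arrays grouping the 49 shapes into coarse classes
                  (e.g. straights vs. increasingly curved contours).
    Returns an (H/2, W/2, 12 * n_groups) compacted descriptor.
    """
    H, W, n_ori, n_shapes = raw.shape
    # 1) Pool neighboring orientations: 24 -> 12 (here, by pairing adjacent angles).
    ori = raw.reshape(H, W, 12, 2, n_shapes).max(axis=3)
    # 2) Group the 49 elementary shapes into a handful of coarse shape types.
    grouped = np.stack([ori[..., g].max(axis=-1) for g in shape_groups], axis=-1)
    # 3) Pool 2x2 spatial neighborhoods (contour features are sparse, so little is lost).
    pooled = grouped.reshape(H // 2, 2, W // 2, 2, 12, len(shape_groups)).max(axis=(1, 3))
    return pooled.reshape(H // 2, W // 2, -1)   # e.g. 12 orientations x 5 groups = 60 channels

With five coarse shape groups, this yields the 60-channel compacted descriptor described next.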
The resulting contour descriptor contained a smaller spatial dimension as well as 60 depth channels, namely 12 orientations x 5 shape types. Our first approach to accommodating the shape mismatch problem was to employ a different CNN for each input type separately, while adjusting layer parameters so as to have an overall match in the number of parameters (roughly 15 million trainable weights). By equating model complexities, we could provide fair ground for comparing across stimuli. Our results showed that learning with contour descriptor inputs led to improved recognition performance when compared to RGB input, with an increase from 60.07% (RGB) to 72.3% (contours); see Figure 5.6. The monochromatic contrast images performed equivalently to RGB. These results suggest that a CNN can be constrained to learn exclusively from high-dimensional shape information and, in fact, surpass the performance of a model trained with only pixel information.

Figure 5.6: Performance of CNNs trained with raw iLabShape RGB images (RGB), pre-processed contours (Contours), as well as contrast images. Different CNNs were used for each of the inputs, but were adjusted to keep model complexity constant, i.e., the same number of parameters.

5.5.2.2 Comparing shape-rich contour input to RGB input using a custom residual network (RESNET) adaptable to both contour and RGB inputs

A valid criticism of the previous approach is that one can never be sure whether equating model complexity by parameter count is enough to ensure a fair comparison. With this in mind, we implemented a residual network of fixed size (RESNET-18) specialized to accommodate deep contours. We addressed the channel disparity between contours and RGB images by stacking copies of the 3 RGB channels (RGB-shape-equivalent) until reaching the same input depth. Here we argued that even though stacked channel copies contain redundant information, they still allow the network to potentially build different feature detectors for each copy. In addition, we padded the spatial dimension of the contour descriptor with zeros to equate sizes. Overall, these modifications made it possible to use the same residual network classifier architecture for both inputs, reducing model-driven differences in the final performance.

We found that the network trained with our contour descriptor input outperformed the RGB-trained network (65.41% vs. 58.12% accuracy), suggesting that the contour input may require fewer modifiable parameters to achieve similar classification performance. We also experimented with combining the RGB and contour descriptor inputs jointly, but due to the disparate nature of the two inputs, performance was inferior to using the contour input alone. These results call for future research to look further into the complementarity between domain-knowledge-inspired shape-rich input and traditional ML tools such as DNNs.

5.5.3 Beyond Basic Contour Descriptors

In Section 5.5.2 we showed that, when using a pure-shape evaluation benchmark (iLabShape), a DNN provided with contour descriptors performed better than with raw RGB input. However, the contour descriptor features in Section 5.5.2 were of a very basic nature. We contend that learning higher-order shape features, i.e., pairings of these more basic contours, may be even harder than extracting the contours themselves. Higher-order feature conjunctions, including pooling operations and long-range contextual effects, form the basis of a complex and useful shape-based representation for classification [102, 108].
In Section 5.5.2 we had left the job of combining these basic contour-derived features to the DNN receiving the contour descriptor input. Yet, is a DNN really guaranteed to be the best higher-order feature extractor? In this section, rather, we speculate that quality high-order shape descriptors can likely be found via unsupervised exploration of natural images instead of using a DNN. The possible advantages of such an approach are varied, including that it does not require labeled samples and, furthermore, that it is built to be generalizable across tasks (it relies only on unsupervised natural-image statistics) rather than specialized to a particular task.

5.5.3.1 A Custom Hierarchical Shape Feature Extractor – ShapeRNet

We developed a higher-order feature conjunction detector, ShapeRNet, based on natural-image contour statistics, creating features that become progressively viewpoint-invariant. A core motivation for this approach, based on unsupervised contour statistics, is that humans learn new concepts with very little supervision; e.g., a child can generalize the concept of an elephant after seeing only a single exemplar in a book. This is in stark contrast to end-to-end deep learning systems, which require hundreds of thousands of training samples and yet still struggle to create representations that are truly abstract and generalizable, e.g., they cannot identify a toy elephant or a sketch of an elephant as representations of the class "elephant".

The gist of ShapeRNet is that it progressively binds and pools contour shape features across several stages to create shape-selective and viewpoint-invariant (to a determined degree) object representations. We use the custom contour detector described in Section 5.5.2 to generate the contour descriptors that serve as input to our higher-order ShapeRNet.

5.5.3.2 ShapeRNet Input Layer

The first layer of ShapeRNet produces features that act as elementary building blocks for higher-order conjunctions. We call them first-order shape features. The major guiding rule for first-order features was to allow for progressive view-invariant encoding. We found that adding too much shape complexity at this layer was in general problematic for viewpoint invariance. As such, we build simple first-order features by pooling, in space and orientation, the oriented contour occurrences provided by the custom contour extractor (see Section 5.5.2). The overall first-order features thus spanned 180 degrees, with edge polarity made redundant. Orientation and space pooling are described in detail below.

Figure 5.7: Dependency of the 3 contour angles in defining the final first-order feature orientation. Here, zero-degree straight-like contours non-intuitively contain individual contours with middle angles of 30 degrees, for example.

Orientation Pooling: An important aspect of forming first-order features is pooling the orientations of individual contour elements. Each of the original 49 contour shapes has a middle orientation defined out of 24 angles (spanning 360 degrees). This middle orientation was initially considered independent of the two flapper angles. However, we later found this to be incorrect, as can be seen in Figure 5.7. For example, if we consider a straight line in a real image, we hoped to ensure that most, if not all, first-order shapes detected were grouped in the straight-like category, or in the same orientation category as the line. Say the line was oriented at 60 degrees; then we wanted contours aligned to 60 degrees to be preponderant.
However, when we considered the middle angle to be independent, we observed that a straight line in an image contained several sigmoidal shapes and even bowl-like shapes. This behavior was undesirable, since it made first-order feature occurrences poorly diagnostic of object shape and also led to drastic noise amplification in later layers of our model. In fact, this initial error led us into a long detour to design a visualization code to aid in the development of model hyper-parameters.

In our current proposed model, we consider the dependencies between the 3 angles to define the actual orientation of the first-order feature. The orientation of each contour is thus defined as the best-fitting angle of a straight line connecting the end-points of the whole contour. For example, in Figure 5.7 we show a contour defined by -30, -30, +30 angles; even though the middle angle is set at -30, we consider the contour as more closely approximated by a zero-degree orientation than by a 30-degree orientation. Overall, our model relies on carefully designed orientation pooling, since we view this as a basis for viewpoint-invariant object recognition. DNNs do not have an explicit orientation-invariant pooling layer, and this omission demands massive dataset augmentation and yields an encoding containing redundant feature detectors. Here, we opt to pool neighboring orientations, resulting in 12 overlapping 30-degree blocks covering 180 degrees. The remaining 180 degrees of orientation are redundant since the features in this case have no polarity.

Spatial Pooling: Spatial pooling is performed akin to standard pooling in DNNs. Spatial pooling parameters are chosen together with second-order feature detection parameters and with scale dimensions in mind. The general idea is that we want to embed translational invariance progressively and co-registered with scale invariance.

First-order features: the elementary elements generated from the orientation and spatial pooling described above. We generate a total of 12 unique elements (generalized orientations) from 0 to 180 degrees. Thus, the output descriptor is a 12-dimensional array over space.

Figure 5.8: Example pictures generated by the visualization code developed to aid in hyper-parameter tuning and model design. In this case, we plot the prevalent contour orientation occurring per super-pixel of space and orientation. The images correspond to a 256x256 fine-grained scale (left) and a coarser 128x128 scale (right). Note that different scales carry non-redundant information about shape.

5.5.3.3 Higher-order Layers: Detection of high-order conjunctions and invariance pooling

The following layers consist of two main steps: Detection and Grouping. (1) Detection: finding and binding conjunctions (pairs) of lower-order features. Detection hyper-parameters are carefully set to guarantee that the detected contours do not overlap in origin, that is, do not belong to the same original detection before all the pooling (grouping) operations. (2) Grouping: a combination of pooling operations intended to ensure a further degree of invariance. The grouping parameters are carefully chosen to create a desired amount of invariance. There is always a balance between preserving discriminatory capacity and ensuring invariance (a sketch of one such stage follows below).
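The sketch below illustrates one detection-and-grouping stage on a list of lower-order feature occurrences; the minimum-separation constraint, the orientation binning, and the exhaustive pairwise search are illustrative stand-ins for the carefully tuned hyper-parameters discussed in the text.

import numpy as np
from itertools import combinations

def detect_and_group(occurrences, min_sep=12, ori_bins=12):
    """
    occurrences: list of (y, x, feature_id) lower-order detections on a pooled spatial grid.
    Returns a dict counting higher-order conjunctions, keyed by
    (feature_id_a, feature_id_b, external_orientation_bin).
    """
    counts = {}
    for (y1, x1, f1), (y2, x2, f2) in combinations(occurrences, 2):
        # Detection: only bind features far enough apart to have distinct origins.
        if np.hypot(y2 - y1, x2 - x1) < min_sep:
            continue
        # External orientation of the pair (the axis connecting the two features).
        theta = np.degrees(np.arctan2(y2 - y1, x2 - x1)) % 180.0
        # Grouping: pool the external orientation into coarse bins (invariance step).
        ori_bin = int(theta // (180.0 / ori_bins))
        key = (min(f1, f2), max(f1, f2), ori_bin)
        counts[key] = counts.get(key, 0) + 1
    return counts

Applied in spirit first to first-order and then to 2nd-order occurrences, this kind of routine yields the 2nd- and 4th-order conjunction counts described next.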
Ultimately, we want the final feature representation to have high shape discriminatory capacity: it should contain a high number of different feature conjunctions that are simultaneously reliable and have a reasonable degree of invariance. We proceed through two sequential (hierarchical) stages of detection and grouping, resulting in 2nd-order features and 4th-order features, respectively. The 4th-order features are used as our final object shape representation. Figure 5.9 illustrates the detection of 4th-order conjunctions from underlying 2nd-order features.

2nd Order Detection and Grouping: Detects (binds) first-order features that co-occur in accordance with pre-set hyper-parameters. The latter are set to try to guarantee that each first-order input belongs to a different origin (not the same underlying raw contour). For example, imagine first-order features that employed 4x4 spatial pools. In this case, we stipulate that a 2nd-order conjunction is formed only if the first-order pools are separated by at least a 12-pool radius; smaller distances may include signals from the same underlying pool, i.e., the same source. The 2nd-order detections define an external orientation parameter, which at this stage is set to a 30-degree discriminability insofar as the first-order features are themselves at a 30-degree resolution. After second-order features are detected, we pool neighboring external orientations by 30 degrees, resulting in only 3 final angle pools of 60 degrees. We also pool across space, similarly to the spatial pooling described in Section 5.5.3.2.

2nd-order Features: At this stage, we have 12² = 144 possible conjunctions per external orientation. Recall that first-order feature orientations were spaced in blocks of 15 degrees, resulting in 12 possible first-order orientations. We keep 12 external orientations as well, spaced by 15 degrees.

4th Order Detection and Grouping: Detects (binds) 2nd-order features that co-occur in accordance with pre-set distance hyper-parameters. The latter are set to try to guarantee that each 2nd-order input belongs to a different origin (not the same underlying raw contour), as illustrated in Figure 5.9. The 4th-order conjunctions have an external orientation, formed by the axis between the underlying 2nd-order features. After detection, we group neighboring 4th-order features according to their external orientations. We can also perform spatial pooling, akin to Section 5.5.3.2.

Figure 5.9: Detection of 4th-order features. Conjunctions are formed between 2nd-order features located at appropriate approximate distances. In the scheme, we illustrate the detection of a fourth-order conjunction formed by two pairs of 2nd-order features at 16x16 spatial pools spaced by 26 units along the horizontal external orientation (zero degrees). Note that the 2nd-order features themselves include extensive pooling.

4th-order Features: Overall, the number of possible 4th-order conjunctions is very large. For example, if we group all external orientations together, to enforce maximum planar rotational invariance, we are left with all possible pairs of 2nd-order features, on the order of 3 million possible features. We found that sub-sampling a random subset of these features (500K) yielded results similar to the full set of possible features.

5.5.4 Evaluating the Invariance Profile of High-order Shape Descriptors from ShapeRNet using ShapeY

We evaluate ShapeRNet using the ShapeY benchmark described in Section 5.4.2.2. As the final shape descriptor, we use ShapeRNet's 4th-order features with complete grouping, i.e., complete orientation and spatial pooling. The representation corresponds to the counts of all unique 4th-order features in a given input image. Since there are millions of unique 4th-order features, we sub-sample 300K unique features. For a given query image, the ShapeY benchmark establishes an exclusion radius "re", removing same-object images from neighboring viewpoints up to a distance of "re", providing a measure of viewpoint invariance.
ShapeY also enforces contrast invariance: each query image can only match to reversed-background views of its own object class, whereas it is allowed to match to the same background for any other class. Overall, ShapeY computes the correlation between the final shape descriptors of all images in the dataset, but excludes views of the same object as the query that are within "re" distance, while simultaneously enforcing contrast reversal of same-object views. The image with the highest correlation is the top-1 match, which is considered "correct" only if it corresponds to a neighboring view of the same object class among the allowable views (beyond "re").

Figure 5.10 shows the top-1 nearest-neighbor matching errors when varying the ShapeY exclusion distance along one transform dimension (planar translations x, y; planar rotation r; depth rotations p, w). Overall, we observed that error rates were worse for depth rotations (p, w) than for image-plane rotations (r), and much worse for rotations than for translations (x, y). ShapeRNet features were invariant to contrast reversal.

Figure 5.10: Top-1 nearest-neighbor matching error over ShapeY, plotted against the exclusion radius re. Results use ShapeRNet 4th-order shape descriptors. ShapeRNet was invariant to contrast exclusion. Note that our system is very robust to planar translation (x, y) but can struggle with depth rotations (p, w) after an exclusion distance of 3 (~30 degrees). The effective exclusion distance is actually larger than 30 degrees because series walk the same number of steps in each dimension. For example, a distance of 2 along X can lead to distances larger than 2 along series including X and another dimension, e.g., XY, XYW, for which the distance of 2 is enforced along all dimensions of each series.

5.5.5 Comparing the shape invariance profiles of ShapeRNet and conventional DNNs (Resnet50) using ShapeY

In this experiment, we evaluate the shape invariance profile of a conventional pre-trained DNN, i.e., ResNet50 pretrained on Imagenet. We compare its invariance profile to that of our proposed ShapeRNet in Figure 5.11 using the ShapeY benchmark. The bold curves with triangles correspond to ShapeRNet, whereas the dashed lines with circles correspond to Resnet50 results. The left plot (a) tests only for positional invariance and does not include contrast reversal; the latter is included in (b). We further separate the results according to the dimensionality of the exclusion distance: transformations involving either 1 (red), 2 (green), or 3 (blue) transformation dimensions, as shown in Figure 5.11. For instance, in the green series, the exclusion distance involved walking jointly along two dimensions, e.g., xy, xr, pw, etc. Naturally, the more dimensions are included in the exclusion transformations, the larger the effective exclusion distance, which explains the larger relative errors for the blue curves in both ShapeRNet and Resnet50.

Figure 5.11: Top-1 nearest-neighbor matching error over ShapeY for both Resnet50 (dashed circles) and ShapeRNet (bold triangles). The color code separates exclusion transformations involving either 1 (red), 2 (green), or 3 (blue) transformation dimensions. (a) Does not include contrast reversal. (b) Includes contrast reversal.
Overall, our first attempt to generate shape features in ShapeRNet had a mixed level of efficacy. (1) Our ShapeRNet features were more invariant to contrast than Resnet50 features. In fact, ShapeRNet features show complete invariance to contrast reversal, as can be seen by comparing the bold-triangle curves in plots (a) and (b) of Figure 5.11, i.e., the curves are practically identical. This robustness was expected, since the contour extractor used at layer 0 of ShapeRNet is not overly sensitive to polarity/contrast. On the other hand, Resnet50 displayed very poor contrast invariance, shown as a large jump between the curves in (a) and (b). For example, in Figure 5.11 (b), with an exclusion distance of 0 (the query can pair to the closest neighbor with flipped polarity), the Resnet50 embedding paired according to background similarity rather than shape similarity 60% of the time (error of 40%). (2) ShapeRNet features were less positionally invariant than Resnet50 features. In fact, we can see from panel (a) of Figure 5.11 that after an exclusion distance of 3, top-1 errors climbed steeply for our ShapeRNet features. If we analyze Figure 5.11 in conjunction with Figure 5.10, it seems likely that the steep error climb is primarily due to non-planar transformations.

In sum, our first-attempt ShapeRNet features lacked the robustness to position changes that we ultimately want to achieve, but were completely invariant to non-shape perturbations such as contrast reversal. We are currently working on optimizing ShapeRNet features to drive down the error rate under positional perturbations. To this end, we need to enhance our understanding of the underlying binding and pooling operations to maximize performance.

5.6 Conclusion

In summary, due to the prominence of shape features in human OR, we analyze to what extent computational object recognition models themselves depend on shape for categorization. In order to disentangle shape from non-shape cues, we propose experiments and benchmarks (iLabShape and ShapeY) to evaluate and measure the degree and quality of a model's shape encoding, including invariance of the representation to viewpoint and to non-shape cues such as contrast reversal. Furthermore, we investigate how engineered shape features compare to, or enhance, traditional end-to-end learned deep features for shape-based categorization. Overall, we show that using a DNN on top of a high-dimensional contour structure containing shape-rich information leads to enhanced performance compared to employing conventional RGB input. Moreover, based on these findings, we created a first version of ShapeRNet that builds higher-order shape conjunctions from these basic contour features. This opens an ample line of research into how to design higher-order shape features of a truly generalizable (non-dataset-specific) nature. In future research, ShapeRNet can be greatly optimized to act as a shape pre-processor for state-of-the-art OR algorithms.

Chapter 6

General Conclusion

The human brain is the most powerful continual learning system known. Yet, there is still much to be discovered about how the brain encodes and manages memory over a lifespan. This dissertation illustrates how neuroscience can inform artificial intelligence and guide modeling choices.
Specifically, this work mostly focused on developing algorithms for artificial lifelong (continual) learning, taking inspiration from the known neurobiological underpinnings of human lifelong memory. In this dissertation, we provided novel theoretical and algorithmic advances to the artificial lifelong learning field. Yet, when compared with the human capacity for lifelong learning, AI still has much catching up to do. There are many outstanding questions regarding how to further improve artificial lifelong learning.

One of our persisting central questions is how to improve continual replay as a strategy for forgetting mitigation in DNNs. In the first chapter of this dissertation, we employed hybrid generative and stored replay in our model CloGAN [17]. Interestingly, even with all recent advances in the AI CL literature, simple stored replay is still the go-to method for most task-agnostic class-incremental learning [18], the most difficult CL formulation. We contend that how the brain manages replay for long-term consolidation can offer valuable inspiration for more efficient algorithmic replay models. For instance, inspired by "dreaming", can we use generative replay or modern contrastive-like techniques to augment stored representations of recent "novel" events; in other words, can we develop a more efficient buffer-augmentation routine for replay and interleaved learning? Moreover, do we have to store the entire experience for replay, or just the most important gist of each experience? Perhaps an AI agent does not need to store original inputs but rather some condensed or abstracted version of them. For reference, in biological hippocampal replay, episodes are replayed in compressed time and can have only partial overlap with the original perceived experience. On another note, we question which past experiences (long-term memories) should be recalled when consolidating novel memories. For AI, this question can be reformulated as: what are the most useful long-term memories (stored or generated samples) to interleave with novel data so as to increase convergence rate and minimize forgetting?

Another central topic for further research is how to create and then modulate generalizable features for continual tasks. In traditional end-to-end DNN training, the features are learnt from scratch for a given task. However, with the progressive availability of extensive natural image datasets, e.g., Imagenet [114], and better training regimes, e.g., contrastive learning [115], pretrained and often frozen feature extractors are becoming more widespread. In very recent work [116], the authors present evidence for the vast reuse capabilities of pretrained deep features without the need for fine-tuning. For continual learning, an essential question is then how to extract the optimal features for solving each continual task, i.e., task-selective reweighting [54]. Algorithmically, does this involve attentional modulation mechanisms acting on pretrained feature-extraction deep networks? Furthermore, how can an improved shape bias for object recognition improve overall embedding generalizability? In sum, we hope to tackle these outstanding questions in future work and to motivate others to do so.

References

1. Kolb, B. & Gibb, R. Brain plasticity and behaviour in the developing brain. Journal of the Canadian Academy of Child and Adolescent Psychiatry 20, 265 (2011). 2. Hubel, D. H. & Wiesel, T. N. Cortical and callosal connections concerned with the vertical meridian of visual fields in the cat.
Journal of neurophysiology 30, 1561–1573 (1967). 3. Abbott, L. F. & Nelson, S. B. Synaptic plasticity: taming the beast. Nature neuroscience 3, 1178–1183 (2000). 4. Bienenstock, E. L., Cooper, L. N. & Munro, P. W. Theory for the development of neu- ron selectivity: orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience 2, 32–48 (1982). 5. Miller, K. D. & MacKay, D. J. The role of constraints in Hebbian learning. Neural compu- tation 6, 100–126 (1994). 6. Song, S., Miller, K. D. & Abbott, L. F. Competitive Hebbian learning through spike-timing- dependent synaptic plasticity. Nature neuroscience 3, 919–926 (2000). 7. Engert, F. & Bonhoeffer, T. Dendritic spine changes associated with hippocampal long-term synaptic plasticity. Nature 399, 66–70 (1999). 8. Matsuzaki, M., Honkura, N., Ellis-Davies, G. C. & Kasai, H. Structural basis of long-term potentiation in single dendritic spines. Nature 429, 761–766 (2004). 9. McClelland, J. L., McNaughton, B. L. & O’Reilly, R. C. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review 102, 419 (1995). 10. Rudy, J. W. & O’Reilly, R. C. Conjunctive representations, the hippocampus, and contextual fear conditioning. Cognitive, Affective, & Behavioral Neuroscience 1, 66–82 (2001). 11. Himmer, L., Sch¨ onauer, M., Heib, D. P. J., Schabus, M. & Gais, S. Rehearsal initiates sys- tems memory consolidation, sleep makes it last. Science advances 5, eaav1695 (2019). 12. Klinzing, J. G., Niethard, N. & Born, J. Mechanisms of systems memory consolidation during sleep. Nature neuroscience 22, 1598–1610 (2019). 13. Kumaran, D., Hassabis, D. & McClelland, J. L. What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in cognitive sciences 20, 512–534 (2016). 14. Kafkas, A. & Montaldi, D. How do memory systems detect and respond to novelty? Neuro- science letters 680, 60–68 (2018). 15. Lisman, J. E. & Grace, A. A. The hippocampal-VTA loop: controlling the entry of informa- tion into long-term memory. Neuron 46, 703–713 (2005). 16. Shohamy, D. & Adcock, R. A. Dopamine and adaptive memory. Trends in cognitive sciences 14, 464–472 (2010). 95 17. Rios, A. & Itti, L. Closed-loop memory GAN for continual learning. Proceedings of the 28th International Joint Conference of Artificial Intelligence (IJCAI), 2019. (2019). 18. Rios, A. & Itti, L. Lifelong Learning Without a Task Oracle in 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI) (2020), 255–263. 19. Flesch, T., Balaguer, J., Dekker, R., Nili, H. & Summerfield, C. Comparing continual task learning in minds and machines. Proceedings of the National Academy of Sciences 115, E10313–E10322 (2018). 20. French, R. M. Catastrophic forgetting in connectionist networks. Trends in cognitive sci- ences 3, 128–135 (1999). 21. McCloskey, M. & Cohen, N. J. in Psychology of learning and motivation 109–165 (Elsevier, 1989). 22. Robins, A. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science 7, 123–146 (1995). 23. Li, Z. & Hoiem, D. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence 40, 2935–2947 (2017). 24. Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114, 3521–3526 (2017). 25. 
Zenke, F., Poole, B. & Ganguli, S. Continual learning through synaptic intelligence in In- ternational Conference on Machine Learning (2017), 3987–3995. 26. Fernando, C., Banarse, D., Blundell, C., Zwols, Y ., Ha, D., Rusu, A. A., et al. Pathnet: Evo- lution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734 (2017). 27. Shin, H., Lee, J. K., Kim, J. & Kim, J. Continual Learning with Deep Generative Replay. Advances in Neural Information Processing Systems (2017). 28. Rebuffi, S.-A., Kolesnikov, A., Sperl, G. & Lampert, C. H. icarl: Incremental classifier and representation learning in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (2017), 2001–2010. 29. Lopez-Paz, D. & Ranzato, M. Gradient episodic memory for continual learning. Advances in neural information processing systems 30 (2017). 30. Nguyen, C. V ., Li, Y ., Bui, T. D. & Turner, R. E. Variational continual learning. arXiv preprint arXiv:1710.10628 (2017). 31. Kemker, R. & Kanan, C. Fearnet: Brain-inspired model for incremental learning. arXiv preprint arXiv:1711.10563 (2017). 32. Parisi, G. I., Kemker, R., Part, J. L., Kanan, C. & Wermter, S. Continual lifelong learning with neural networks: A review. Neural Networks 113, 54–71 (2019). 33. Wu, C., Herranz, L., Liu, X., van de Weijer, J., Raducanu, B., et al. Memory replay gans: Learning to generate new categories without forgetting. Advances in Neural Information Processing Systems 31 (2018). 34. Odena, A., Olah, C. & Shlens, J. Conditional image synthesis with auxiliary classifier gans in International conference on machine learning (2017), 2642–2651. 35. Kingma, D. P., Mohamed, S., Jimenez Rezende, D. & Welling, M. Semi-supervised learning with deep generative models. Advances in neural information processing systems 27 (2014). 36. Azadi, S., Olsson, C., Darrell, T., Goodfellow, I. & Odena, A. Discriminator rejection sam- pling. arXiv preprint arXiv:1810.06758 (2018). 96 37. MacKay, D. J., Mac Kay, D. J., et al. Information theory, inference and learning algorithms (Cambridge university press, 2003). 38. Shin, H., Lee, J. K., Kim, J. & Kim, J. Continual learning with deep generative replay. Advances in neural information processing systems 30 (2017). 39. Goldberg, D. E. & Deb, K. in Foundations of genetic algorithms 69–93 (Elsevier, 1991). 40. LeCun, Y ., Bottou, L., Bengio, Y . & Haffner, P. Gradient-based learning applied to docu- ment recognition. Proceedings of the IEEE 86, 2278–2324 (1998). 41. Xiao, H., Rasul, K. & V ollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017). 42. Netzer, Y ., Wang, T., Coates, A., Bissacco, A., Wu, B. & Ng, A. Y . Reading digits in natural images with unsupervised feature learning (2011). 43. Cohen, G., Afshar, S., Tapson, J. & Van Schaik, A. EMNIST: Extending MNIST to hand- written letters in 2017 international joint conference on neural networks (IJCNN) (2017), 2921–2926. 44. Farquhar, S. & Gal, Y . Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733 (2018). 45. Cheung, B., Terekhov, A., Chen, Y ., Agrawal, P. & Olshausen, B. Superposition of many models into one. Advances in neural information processing systems 32 (2019). 46. Wen, S., Rios, A., Ge, Y . & Itti, L. Beneficial Perturbation Network for designing general adaptive artificial intelligence systems. IEEE Transactions on Neural Networks and Learn- ing Systems (2021). 47. Zeng, G., Chen, Y ., Cui, B. & Yu, S. 
Continual learning of context-dependent processing in neural networks. Nature Machine Intelligence 1, 364–372 (2019). 48. Mallya, A., Davis, D. & Lazebnik, S. Piggyback: Adapting a single network to multiple tasks by learning to mask weights in Proceedings of the European Conference on Computer Vision (ECCV) (2018), 67–82. 49. Gepperth, A. & Gondal, S. A. Incremental learning with deep neural networks using a test- time oracle. in ESANN (2018). 50. V on Oswald, J., Henning, C., Sacramento, J. & Grewe, B. F. Continual learning with hyper- networks. arXiv preprint arXiv:1906.00695 (2019). 51. Aljundi, R., Chakravarty, P. & Tuytelaars, T. Expert gate: Lifelong learning with a network of experts in Proceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition (2017), 3366–3375. 52. Van de Ven, G. M. & Tolias, A. S. Three scenarios for continual learning. arXiv preprint arXiv:1904.07734 (2019). 53. Chen, Z. & Liu, B. Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 12, 1–207 (2018). 54. Petrov, A. A., Dosher, B. A. & Lu, Z.-L. The dynamics of perceptual learning: an incremen- tal reweighting model. Psychological review 112, 715 (2005). 55. Dhillon, G. S., Chaudhari, P., Ravichandran, A. & Soatto, S. A baseline for few-shot image classification. arXiv preprint arXiv:1909.02729 (2019). 56. Carpenter, G. A., Grossberg, S. & Rosen, D. B. Fuzzy ART: Fast stable learning and cate- gorization of analog patterns by an adaptive resonance system. Neural networks 4, 759–771 (1991). 97 57. Vakil-Baghmisheh, M.-T. & Paveˇ si´ c, N. A fast simplified fuzzy ARTMAP network. Neural processing letters 17, 273–316 (2003). 58. Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M. & Tuytelaars, T. Memory aware synapses: Learning what (not) to forget in Proceedings of the European Conference on Computer Vision (ECCV) (2018), 139–154. 59. Hendrycks, D. & Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks (2017). 60. Liang, S., Li, Y . & Srikant, R. Enhancing the reliability of out-of-distribution image detec- tion in neural networks (2018). 61. Gal, Y . & Ghahramani, Z. Dropout as a bayesian approximation: Representing model un- certainty in deep learning in international conference on machine learning (2016), 1050– 1059. 62. Lakshminarayanan, B., Pritzel, A. & Blundell, C. Simple and scalable predictive uncer- tainty estimation using deep ensembles in Advances in neural information processing sys- tems (2017), 6402–6413. 63. Hendrycks, D., Mazeika, M. & Dietterich, T. Deep Anomaly Detection with Outlier Expo- sure in International Conference on Learning Representations (2019). 64. Van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. Conditional image generation with pixelcnn decoders in Advances in neural information processing systems (2016), 4790–4798. 65. Ren, J., Liu, P. J., Fertig, E., Snoek, J., Poplin, R., Depristo, M., et al. Likelihood ratios for out-of-distribution detection in Advances in Neural Information Processing Systems (2019), 14707–14718. 66. Kwon, G., Prabhushankar, M., Temel, D. & AlRegib, G. Novelty Detection Through Model- Based Characterization of Neural Networks in IEEE International Conference on Image Processing (2020), 3179–3183. 67. Ahuja, N. A., Ndiour, I. J., Kalyanpur, T. & Tickoo, O. Probabilistic modeling of deep features for out-of-distribution and adversarial detection in Bayesian Deep Learning work- shop, NeurIPS (2019). 68. 
Lee, K., Lee, K., Lee, H. & Shin, J. A simple unified framework for detecting out-of- distribution samples and adversarial attacks in Advances in Neural Information Processing Systems (2018), 7167–7177. 69. Yoon, J., Yang, E., Lee, J. & Hwang, S. J. Lifelong Learning with Dynamically Expandable Networks (2018). 70. Du, X., Charan, G., Liu, F. & Cao, Y . Single-net continual learning with progressive seg- mented training in 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA) (2019), 1629–1636. 71. Sun, J., Yang, L., Zhang, J., Liu, F., Halappanavar, M., Fan, D., et al. Gradient-based Novelty Detection Boosted by Self-supervised Binary Classification. arXiv preprint arXiv:2112.09815 (2021). 72. Aljundi, R., Reino, D. O., Chumerin, N. & Turner, R. E. Continual Novelty Detection. arXiv preprint arXiv:2106.12964 (2021). 73. Mundt, M., Hong, Y . W., Pliushch, I. & Ramesh, V . A wholistic view of continual learn- ing with deep neural networks: Forgotten lessons and the bridge to active and open world learning. arXiv preprint arXiv:2009.01797 (2020). 98 74. Ndiour, I., Ahuja, N. A. & Tickoo, O. Out-Of-Distribution Detection With Subspace Tech- niques And Probabilistic Modeling Of Features. arXiv preprint arXiv:2012.04250 (2020). 75. Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images (2009). 76. Van Horn, G., Cole, E., Beery, S., Wilber, K., Belongie, S. & Mac Aodha, O. Bench- marking representation learning for natural world image collections in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021), 12884–12893. 77. Nilsback, M.-E. & Zisserman, A. Automated flower classification over a large number of classes in 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing (2008), 722–729. 78. Quattoni, A. & Torralba, A. Recognizing indoor scenes in 2009 IEEE Conference on Com- puter Vision and Pattern Recognition (2009), 413–420. 79. Wah, C., Branson, S., Welinder, P., Perona, P. & Belongie, S. The caltech-ucsd birds-200- 2011 dataset (2011). 80. Krause, J., Stark, M., Deng, J. & Fei-Fei, L. 3d object representations for fine-grained cate- gorization in Proceedings of the IEEE International Conference on Computer Vision Work- shops (2013), 554–561. 81. Maji, S., Rahtu, E., Kannala, J., Blaschko, M. & Vedaldi, A. Fine-grained visual classifica- tion of aircraft. arXiv preprint arXiv:1306.5151 (2013). 82. Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J. & Zisserman, A. The pascal visual object classes challenge: A retrospective. International journal of computer vision 111, 98–136 (2015). 83. De Campos, T. E., Babu, B. R., Varma, M., et al. Character recognition in natural images. VISAPP (2) 7 (2009). 84. Hsu, Y .-C., Shen, Y ., Jin, H. & Kira, Z. Generalized odin: Detecting out-of-distribution image without learning from out-of-distribution data in Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (2020), 10951–10960. 85. Rolnick, D., Ahuja, A., Schwarz, J., Lillicrap, T. & Wayne, G. Experience replay for con- tinual learning. Advances in Neural Information Processing Systems 32 (2019). 86. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition in Pro- ceedings of the IEEE conference on computer vision and pattern recognition (2016), 770– 778. 87. Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P. & Joulin, A. Unsupervised learn- ing of visual features by contrasting cluster assignments. 
Advances in Neural Information Processing Systems 33, 9912–9924 (2020). 88. Gallardo, J., Hayes, T. L. & Kanan, C. Self-supervised training enhances online continual learning. arXiv preprint arXiv:2103.14010 (2021). 89. Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014). 90. V oulodimos, A., Doulamis, N., Doulamis, A. & Protopapadakis, E. Deep learning for com- puter vision: A brief review. Computational intelligence and neuroscience 2018 (2018). 91. Sinha, R. K., Pandey, R. & Pattnaik, R. Deep learning for computer vision tasks: a review. arXiv preprint arXiv:1804.03928 (2018). 99 92. Ma, X., Niu, Y ., Gu, L., Wang, Y ., Zhao, Y ., Bailey, J., et al. Understanding adversarial attacks on deep learning based medical image analysis systems. Pattern Recognition 110, 107332 (2021). 93. Biederman, I. Recognition-by-components: a theory of human image understanding. Psy- chological review 94, 115 (1987). 94. Biederman, I. & Ju, G. Surface versus edge-based determinants of visual recognition. Cog- nitive psychology 20, 38–64 (1988). 95. Kourtzi, Z. & Kanwisher, N. Representation of perceived object shape by the human lateral occipital complex. Science 293, 1506–1509 (2001). 96. Snodgrass, J. G. & Vanderwart, M. A standardized set of 260 pictures: norms for name agreement, image agreement, familiarity, and visual complexity. Journal of experimental psychology: Human learning and memory 6, 174 (1980). 97. Brendel, W. & Bethge, M. Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. arXiv preprint arXiv:1904.00760 (2019). 98. Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A. & Brendel, W. ImageNet- trained CNNs are biased towards texture; increasing shape bias improves accuracy and ro- bustness. arXiv preprint arXiv:1811.12231 (2018). 99. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. Imagenet large scale visual recognition challenge. International journal of computer vision 115, 211–252 (2015). 100. Goodale, M. A. & Milner, A. D. Separate visual pathways for perception and action. Trends in neurosciences 15, 20–25 (1992). 101. Haushofer, J., Livingstone, M. S. & Kanwisher, N. Multivariate patterns in object-selective cortex dissociate perceptual and physical shape similarity. PLoS biology 6, e187 (2008). 102. Riesenhuber, M. & Poggio, T. Models of object recognition. Nature neuroscience 3, 1199– 1204 (2000). 103. Kravitz, D. J., Saleem, K. S., Baker, C. I., Ungerleider, L. G. & Mishkin, M. The ventral visual pathway: an expanded neural framework for the processing of object quality. Trends in cognitive sciences 17, 26–49 (2013). 104. DiCarlo, J. J. & Cox, D. D. Untangling invariant object recognition. Trends in cognitive sciences 11, 333–341 (2007). 105. Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M. & Boyes-Braem, P. Basic objects in natural categories. Cognitive psychology 8, 382–439 (1976). 106. Bell, J., Gheorghiu, E., Hess, R. F. & Kingdom, F. A. Global shape processing involves a hierarchy of integration stages. Vision Research 51, 1760–1766 (2011). 107. Ayzenberg, V . & Lourenco, S. F. Skeletal descriptions of shape provide unique perceptual information for object recognition. Scientific reports 9, 1–13 (2019). 108. Fukushima, K. Neocognitron: A hierarchical neural network capable of visual pattern recog- nition. Neural networks 1, 119–130 (1988). 109. Ballester, P. & Araujo, R. M. 
On the performance of GoogLeNet and AlexNet applied to sketches in Thirtieth AAAI conference on artificial intelligence (2016). 110. Gatys, L., Ecker, A. S. & Bethge, M. Texture synthesis using convolutional neural networks. Advances in neural information processing systems 28 (2015). 111. Borji, A., Izadi, S. & Itti, L. ilab-20m: A large-scale controlled object dataset to investigate deep learning in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), 2221–2230. 112. Nam, J. W., Rios, A. S. & Mel, B. W. ShapeY: Measuring Shape Recognition Capacity Using Nearest Neighbor Matching. arXiv preprint arXiv:2111.08174 (2021). 113. Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 (2015). 114. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K. & Fei-Fei, L. Imagenet: A large-scale hierarchical image database in 2009 IEEE conference on computer vision and pattern recognition (2009), 248–255. 115. Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C. & Isola, P. What makes for good views for contrastive learning? Advances in Neural Information Processing Systems 33, 6827–6839 (2020). 116. Evci, U., Dumoulin, V., Larochelle, H. & Mozer, M. C. Head2Toe: Utilizing intermediate representations for better transfer learning. arXiv preprint arXiv:2201.03529 (2022).

Appendices

A Supplementary Materials - Closed Loop Memory GAN for Continual Learning

A.1 Per Task Performance for MNIST and FASHION

In Figure 6.1, we show per-task accuracies over time, starting from the moment each task is first learned. Here, CloGAN is shown to produce stable performance throughout consecutive tasks. We also show results for Memory Replay GANs [MeRGAN - Wu et al, 2018] and Elastic Weight Consolidation [EWC - Kirkpatrick et al, 2017].

Figure 6.1: Accuracies per task for MNIST (A) and FASHION (B). Our method, CloGAN, is shown using a memory size of 0.16% (100 images) for MNIST and FASHION. We compare to the performance of MeRGAN, EWC and FGD.

For CloGAN with MNIST, all past tasks maintain high accuracies consistently throughout learning of new classes. With FASHION, which is a notably harder dataset, not all tasks behave equally well. In particular, task 4 displays a somewhat aberrant behaviour, inducing a dip in accuracy for all tasks. Moreover, this phenomenon is reflected in both CloGAN and MeRGAN. In fact, MeRGAN seems to be unable to properly learn this task. Additionally, MeRGAN accuracies on FASHION tasks 1 and 2 clearly underperform CloGAN at the end of training, likely due to image degradation from GAN-to-GAN transfer. Overall, CloGAN results are a significant improvement when compared to the baseline of FGD (Fine-Tuning by Gradient Descent) and even EWC.

A.2 Stochastic Up-Sampling for MNIST and FASHION

We show the effect of stochastic up-sampling in CloGAN for MNIST and FASHION in Figure 6.2. In MNIST, observable improvement only occurs for smaller buffer sizes. We conjecture that this may be due to the relatively low intra-class variability of MNIST, such that a small buffer already samples each class quite well. Additionally, the positive gap between our method and the two baselines increases as more tasks are added.
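For reference, the stochastic up-sampling compared here amounts to padding each replay minibatch of stored buffer images with freshly generated samples of past classes. The sketch below is a minimal illustration; the mixing ratio and the conditional-generator interface (latent_dim attribute, (z, y) call signature) are assumptions and not CloGAN's exact schedule.

import torch

def upsampled_replay_batch(buffer_x, buffer_y, generator, past_classes,
                           batch_size=50, gen_fraction=0.5, device="cpu"):
    """
    Mix stored buffer samples with on-the-fly generated samples of past classes.
    buffer_x, buffer_y: stored images and labels; past_classes: 1-D tensor of past labels.
    """
    n_gen = int(batch_size * gen_fraction)
    n_buf = batch_size - n_gen
    # Stored replay: draw a random subset of the small episodic buffer.
    idx = torch.randint(0, buffer_x.shape[0], (n_buf,))
    x_buf, y_buf = buffer_x[idx].to(device), buffer_y[idx].to(device)
    # Generative up-sampling: condition the generator on randomly drawn past labels.
    y_gen = past_classes[torch.randint(0, len(past_classes), (n_gen,))].to(device)
    z = torch.randn(n_gen, generator.latent_dim, device=device)
    with torch.no_grad():
        x_gen = generator(z, y_gen)
    return torch.cat([x_buf, x_gen]), torch.cat([y_buf, y_gen])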
For FASHION, task 4 has a somewhat aberrant behaviour. Relating this to Figure 2.3 in the main text as well as Figure 6.2 in the supplementary material, we observed that this uncharacteristic behaviour is likely due to this task being more challenging than all others. In more detail, task 4 corresponds to shirts and tennis shoes (sneakers), with the former being particularly difficult to distinguish from the long-sleeved pullovers of task 2, a visually similar class. We speculate that in this particular case the multi-task baseline performs better because, when all classes are trained simultaneously, a network can find an easier parameterization to disambiguate them. In the continual setting this is harder, since the network is already biased to perform well for task 2.

Figure 6.2: Stochastic up-sampling of CloGAN for MNIST (A) and FASHION (B). CloGAN-Frozen corresponds to training the AC-GAN continuously with a memory but with no stochastic generation. We also compare to the MT condition, in which training is re-started at each task, eliminating forward transfer of possibly shared task features.

A.3 EWC Complementary Results and Training Details

In order to contrast with incremental learning of disjoint MNIST, we replicated the Permuted-MNIST results of the EWC authors. Previously we showed how EWC quickly derails for incremental MNIST. However, for Permuted MNIST, we obtain results similar to those stated in the original paper, in which catastrophic forgetting does not occur; see Figure 6.3. This discrepancy between the experiments is likely due to the difference in output mapping. Permuted MNIST has a fixed output mapping: for all tasks there are exactly K nodes corresponding to the K classes. On the other hand, in our scenario of incremental class learning, the Softmax outputs are always null for unseen tasks, making the output mapping grow as tasks accumulate. This can result in an acute weight rearrangement which may be more difficult to regularize.

Table 6.1 displays the training parameters used for EWC. The results shown correspond to training a convolutional neural network classifier with an architecture identical to the combined Discriminator-Classifier of the AC-GAN but with one output node fewer, since a pure classifier does not evaluate Real/Fake attribution. To compute the Fisher matrix we allow a sample size of 1000 images to be saved, but we also tested values ranging from 200 to 1000, obtaining equivalent results. We also tested EWC using a simple multi-layer perceptron and obtained similar values.

Figure 6.3: EWC accuracies per task with Permuted MNIST and Incremental MNIST. For the Permuted experiment, we reproduce the qualitative results of the original paper [Kirkpatrick et al, 2017], whereas our Incremental Single-Headed MNIST experiment causes EWC performance to quickly derail whenever a new task is learned.

Table 6.1: EWC training parameters
Hyperparameters               Convolutional Network                     Multi-Layer Perceptron
Hidden layers (Classifier)    5 convolutional layers                    2 linear
Hidden layer activation       Leaky ReLU                                ReLU
Dropout                       0.5                                       0.2-0.5
Optimizer                     Adam (lr 0.0002, betas=(0.5, 0.999))      SGD (lr 0.001)
Mini-batch size               50                                        50
Fisher matrix sample size     200-1000                                  200-1000

A.4 Image Filtering - Rejection Sampling

In our main model, we made use of the Class-Conditioned Filtering Method (CFM) described in the main text. Here we describe an additional filtering protocol.
A.4 Image Filtering - Rejection Sampling

In our main model, we made use of the Class-Conditioned Filtering Method (CFM) described in the main text. Here we describe an additional filtering protocol. In discriminator-based rejection sampling, one indirectly infers the distribution p_d(x) of real data by using an available approximate proposal distribution. In this case, the proposal distribution is given by a generator p_g(x), which ideally should approximate p_d(x). Using the GAN framework of Goodfellow et al, 2014, and assuming that the discriminator is defined by a sigmoid σ applied to a logit output \hat{D}(x) and trained via cross-entropy, after full training the discriminator output can ideally be written as (A.1, A.2):

\sigma(\hat{D}(x)) = \frac{p_d(x)}{p_d(x) + p_g(x)}    (A.1)

e^{\hat{D}(x)} = \frac{p_d(x)}{p_g(x)}    (A.2)

In this framework, rejection sampling amounts to calculating an estimate of M = \max_x \frac{p_d(x)}{p_g(x)} and accepting each sample with probability \frac{p_d(x)}{M p_g(x)}. For our application, the goal was to sample from the generator at each mini-batch and acquire quality samples to be used in closed-loop replay. Therefore, we drew samples from the generator on-the-fly but rejected a fraction of them using the discriminator, in order to attain a closer approximation to the true data distribution p_d(x).

A.4.1 Soft Rejection Filtering - SRF

Since in practice we cannot obtain a perfect discriminator \hat{D}(x), we cannot guarantee (A.1, A.2). In the naive implementation of rejection sampling, we pick some threshold T and reject samples with \hat{D}(x) < T. However, this does not guarantee that we recover p_d(x): in this scenario, we typically reject too many samples and under-populate the tails of p_d(x). In our implementation, we choose a dynamic threshold T as a percentile γ of \hat{D}(x) at each mini-batch.

A.4.2 Discriminator Rejection Sampling - DRS

Azadi et al, 2018 propose an alternative to SRF which may better approximate p_d(x). In this formulation, because it is impractical to compute M exactly (it presumes infinite sampling from p_d(x)), they approximate this quantity by computing the logit responses over a large sample of the real distribution instead, \hat{D}_M(x). Note that under the previous formulation, if \hat{D}_M(x) is too high, the acceptance probability will be close to zero and sampling will become impractical. To avoid this, they compute the logit of a different function F, given by (A.3, A.4):

\sigma(F(x)) = e^{\hat{D}(x) - \hat{D}_M(x)}    (A.3)

F(x) = \hat{D}(x) - \hat{D}_M(x) - \log(1 - e^{\hat{D}(x) - \hat{D}_M(x)})    (A.4)

To further guarantee good sampling over p_d(x), they add a parameter γ modulating the overall acceptance probability (A.5):

F(x) = \hat{D}(x) - \hat{D}_M(x) - \log(1 - e^{\hat{D}(x) - \hat{D}_M(x) - \varepsilon}) - \gamma    (A.5)

In our implementation, we use a dynamic parameter γ given by a percentile of the distribution of F(x). The parameter ε is added for numerical stability.

A.4.3 Filtering Results

Overall, we found that CFM, DRS and SRF all perform equivalently well; see Table 6.2 for comparisons of CloGAN with a memory parameter of 0.1% using MNIST. Hence, since class-conditional filtering (CFM) has a much faster running time, we opted to carry out only CFM in our final model.

Table 6.2: % Correct as a function of closed-loop filtering method
Filtering Method | CloGAN
CFM | 90.70
DRS | 89.97
SRF | 89.25
DRS + CFM | 89.97
SRF + CFM | 88.76
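To make the two filtering rules concrete, the sketch below applies the SRF percentile rule and the DRS acceptance function (A.5) to a batch of discriminator logits. It is an illustration under stated assumptions (access to pre-sigmoid logits, a \hat{D}_M estimate taken from a batch of real images, and clipping for stability); it is not the exact CloGAN filtering code.

```python
# Sketch of the SRF and DRS filtering rules above; `logits` are discriminator
# outputs \hat{D}(x) before the sigmoid. Thresholds and names are illustrative.
import numpy as np

def soft_rejection_filter(logits, gamma_percentile=30.0):
    """SRF: keep samples whose logit exceeds a dynamic per-batch percentile threshold."""
    thresh = np.percentile(logits, gamma_percentile)
    return logits >= thresh                           # boolean keep-mask

def drs_filter(logits, d_max, gamma_percentile=80.0, eps=1e-6, rng=None):
    """DRS: accept each sample with probability sigmoid(F(x)), F as in (A.5)."""
    rng = rng or np.random.default_rng()
    diff = np.minimum(logits - d_max, 0.0)            # clip for numerical safety
    f = diff - np.log(1.0 - np.exp(diff - eps))       # log term from (A.4)/(A.5)
    f = f - np.percentile(f, gamma_percentile)        # dynamic gamma from the batch
    accept_prob = 1.0 / (1.0 + np.exp(-f))            # sigmoid(F(x))
    return rng.random(len(logits)) < accept_prob

# usage: d_max can be estimated from the logits of a large batch of real images,
# e.g. keep = drs_filter(fake_logits, d_max=real_logits.max())
```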
A.5 Conditional Convolutional VAE

A.5.1 Closed-Loop Replay

Figure 6.4: A) Average accuracies comparing CloGAN and VAE-Loop. B, C) Generated images for CloGAN and VAE-Loop, respectively. The denomination frozen refers to training without closed-loop replay, only with the memory buffer. Note that while CloGAN exhibits a significant gap between the closed-loop and frozen variants, the same does not occur for the VAE.

A plausible alternative to using a GAN would be to use a variational autoencoder (VAE) instead [Kingma et al, 2014]. To test whether a VAE-based model could exhibit behaviour similar to our CloGAN, we compared against a Closed-Loop VAE implementation in which we substituted the AC-GAN component by a conditional convolutional VAE (CCVAE-Loop). We observed that, besides CCVAE-Loop underperforming CloGAN, it also does not offer any advantage over its frozen variant (CCVAE-Frozen is trained without closed-loop image replay, only through a dynamic buffer); see Figure 6.4. For these reasons, in our final model and main text we restricted the analysis to CloGAN.

A.5.2 Implementation

CCVAE-Loop was implemented using a convolutional Encoder and Decoder architecture, see Table 6.3. In order to adapt a convolutional architecture to conditional image generation, we add one layer of fully connected (fc) weights before the two fc parameterization layers of the encoder and, in the decoder, two fc layers before the convolutional section. To apply conditioning on labels, we feed a one-hot label representation as input to the first fully connected layer of the encoder as well as to the first fully connected layer of the decoder. This procedure approximates the conventional non-convolutional conditional VAE implementation [Sohn et al, 2015] and can generate good quality images, as shown in Figure 6.4.C.

Table 6.3: CCVAE Architecture
Hyperparameters | Values
Convolutional layers (Encoder) | 4
Convolutional layers (Decoder) | 4
Fully Connected layers (Encoder) | 3
Fully Connected layers (Decoder) | 2
Optimizer | Adam
Mini-Batch Size | 10-50
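The sketch below illustrates how one-hot label conditioning can be wired into the fully connected layers of a convolutional VAE, as described above. Layer counts and sizes are illustrative (a 28x28 single-channel input is assumed) and do not reproduce the exact CCVAE used in our experiments.

```python
# Sketch of one-hot label conditioning in a conditional convolutional VAE.
# The label is concatenated to the first fc layer of both encoder and decoder.
import torch
import torch.nn as nn

class CCVAE(nn.Module):
    def __init__(self, n_classes=10, feat_dim=256, z_dim=32):
        super().__init__()
        self.conv_enc = nn.Sequential(                 # convolutional encoder trunk
            nn.Conv2d(1, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(), nn.Flatten())
        self.enc_fc = nn.Linear(64 * 7 * 7 + n_classes, feat_dim)  # image feats + one-hot
        self.fc_mu = nn.Linear(feat_dim, z_dim)
        self.fc_logvar = nn.Linear(feat_dim, z_dim)
        self.dec_fc = nn.Sequential(                   # fc layers fed z + one-hot label
            nn.Linear(z_dim + n_classes, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 64 * 7 * 7), nn.ReLU())
        self.conv_dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, 2, 1), nn.Sigmoid())

    def forward(self, x, y_onehot):
        h = torch.relu(self.enc_fc(torch.cat([self.conv_enc(x), y_onehot], dim=1)))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        d = self.dec_fc(torch.cat([z, y_onehot], dim=1)).view(-1, 64, 7, 7)
        return self.conv_dec(d), mu, logvar
```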
A.6 CloGAN Network and Data Parameters

The detailed architecture of the AC-GAN used in CloGAN is given in Table 6.4; we experimented with various batch sizes and learning rates. Additionally, we list the details of the datasets used in Table 6.5.

Table 6.4: AC-GAN Architecture
Hyperparameters | Values
Hidden layers (Generator) | 4 convolutional layers
Hidden layers (Discriminator) | 5 convolutional layers
Hidden Layer Activation | Leaky ReLU
Dropout | p=0.5 (Discriminator)
Optimizer | Adam, lr: 0.001 - 0.0002, betas=(0.5, 0.999)
Mini-Batch Size | 10-50

Table 6.5: Datasets Parameters
 | MNIST | FASHION | SVHN | EMNIST
Classes | 10 | 10 | 10 | 26
Objects | Digits | Clothes | Real house numbers | Letters
Training Data | 60,000 | 60,000 | 49,840 (balanced) | 124,800
Balanced | Yes | Yes | Yes | Yes

B Supplementary Materials - Lifelong Learning Without a Task Oracle

B.1 Permanent Memory Usage Across Models

For all task mappers, we define memory usage as the bytes required to store parameters and any appended data which will be used by the mapper at all times (permanent storage). This is different from transient RAM usage during training. As such, we list here the memory computation formulas for the task mappers: Nearest Means Classifier (NMC), Gaussian Mixtures Model Classifier (GMMC), Unsupervised Fuzzy ART Classifier (ART), Supervised Fuzzy ART Classifier (ARTMAP), Perceptron with Coreset Replay (PCR), Perceptron with Replay of Task-Specific Embeddings (PCR-E), KM-heads [49] and AE-gates [51]. We do not include the Entropy task mapper [50] because its permanent memory usage originating from task-mapping itself is zero.

Table 6.6: Permanent Memory Usage for Task Mappers
Task Mapper | Memory Formula
NMC (ours) | \sum_{task=1}^{T} N_p(task) \times M_p, with M_p = X_{mapper}
GMMC (ours), ART (ours), ARTMAP (ours) | \sum_{task=1}^{T} N_p(task) \times M_p, with M_p = 2 X_{mapper}
PCR (ours) | (N_i \times X_{mapper}) + (X_{mapper} \times H) + (H \times T)
PCR-E (ours) | (N_{coreset} \times X_{main}) + ((\sum_{task=1}^{T} E_f(task)) \times H) + (H \times T)
KM-heads (ours and baseline) | \sum_{task=1}^{T} (N_p(task) \times M_p(task)), with M_p(task) = E_f(task)
AE-gates (baseline) | 2 \times X_{mapper} \times H \times T

Here, T stands for the number of tasks, N_p(task) for the number of prototypes of each task, and M_p for the size of a prototype. For PCR and PCR-E, N_coreset stands for the number of images inside the coreset, H for the size of the hidden layer in the perceptron or in a single-layer autoencoder (AE-gates), and E_f(task) for the size of the final, logit embedding per task. Additionally, X_mapper refers to the size of the feature embedding used as input to the task mapper, whereas X_main is the feature embedding size used for the main classifier (PSP-BD). In most cases, X_mapper = X_main, since the feature embedding network is shared between the main classifier and the task mapper. The exception is the 8-dsets experiment, which employs an Alexnet backbone. In this case, for task mapping we further pool the final convolutional features over space, resulting in X_mapper = 256, but keep X_main = 256 x 6 x 6. We found that for task mappers the reduction in accuracy due to pooling over space was negligible, but the same did not hold when pooling was applied to the input used for fine-grained classification.

Table 6.7: Permanent Memory Usage for Other Model Components
Other Models | Memory Usage
PSP backbone | T \times PSP_{cost}, where PSP_{cost} = 15,968 bytes
BD backbone | T \times BD_{cost}, where BD_{cost} = 4,808 bytes
PSP+BD backbone | T \times (PSP_{cost} + BD_{cost}), where (PSP_{cost} + BD_{cost}) = 20,776 bytes
Vanilla-Replay, GEM | N_{coreset} \times X_{main}
EWC | 0*
*Method does not require permanent storage, only transient storage (Fisher Matrices).

All task mappers in Table 6.6, with the exception of KM-heads (baseline), use a PSP-BD classifier for fine-grained categorization. Thus, the full pipeline models add a total of 20,776 bytes per task coming from PSP-BD parameters, according to Table 6.7. For 8-dsets, this PSP-BD cost is equivalent to 166 kilobytes, for Cifar-100 to 207 kilobytes, and for Permuted-MNIST-25 to 399 kilobytes. Table 3.1 in the main paper shows memory usage taking into account PSP-BD parameters as well as task-mapper-specific parameters.
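To make Table 6.6 concrete, the snippet below evaluates a few of the formulas numerically. It assumes float32 (4-byte) storage and uses illustrative prototype counts, coreset sizes and embedding dimensions, not measured values from our experiments.

```python
# Worked sketch of a few memory formulas from Table 6.6 (float32 storage assumed).
BYTES = 4

def nmc_bytes(n_prototypes_per_task, x_mapper):
    # NMC: sum_t N_p(t) * M_p, with M_p = X_mapper (one stored vector per prototype)
    return sum(n_p * x_mapper for n_p in n_prototypes_per_task) * BYTES

def gmmc_bytes(n_prototypes_per_task, x_mapper):
    # GMMC/ART/ARTMAP: M_p = 2 * X_mapper per prototype
    return sum(n_p * 2 * x_mapper for n_p in n_prototypes_per_task) * BYTES

def pcr_bytes(n_coreset, x_mapper, hidden, n_tasks):
    # PCR: coreset entries plus a 1-hidden-layer perceptron (input-hidden, hidden-task weights)
    return (n_coreset * x_mapper + x_mapper * hidden + hidden * n_tasks) * BYTES

# Example with a 512-d embedding, 10 tasks, 10 prototypes per task, coreset of 180:
print(nmc_bytes([10] * 10, 512) / 1024, "KB for NMC")
print(pcr_bytes(180, 512, 64, 10) / 1024, "KB for PCR")
```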
B.2 Choice of Feature Extraction Embedding

We experimented with different network embeddings for the 8-dsets and Cifar-100 experiments (Table 6.8). All embeddings are pretrained on Imagenet and then frozen before being tested on our experiments' data. One of the considerations was to choose an embedding whose final X_mapper size was small enough but which would still yield good performance. For the Permuted-MNIST experiment we did not use a feature extraction embedding, because permuted images deviate too strongly from natural image statistics.

Table 6.8: Embedding Network
Experiment | Alexnet | Resnet-18 | Resnet-34
8-dsets | 58.0 | 54.3 | 55.4
Cifar-100 | 75.1 | 75.6 | 76.9

B.3 Coreset Replay Task Mapper Parameter Variations

We experimented with the number of hidden layers in the perceptron classifier of PCR and PCR-E. We found that 1-layer perceptrons outperformed deeper perceptrons. We show layer variations applied to the PCR algorithm in Table 6.9; similar results were obtained for PCR-E. We hypothesize that this performance gap may be due to less overfitting in shallower perceptrons, since here we use small memory coresets (memory size of 140 for 8-dsets, 180 for Cifar-100 and 120 for Permuted-MNIST).

Table 6.9: PCR Hidden Layers*
Layers | 8-dsets | Permuted-MNIST-25 | Cifar100
64 - 1 layer | 90.2 | 98.8 | 57.3
64 - 2 layers | 88.6 | 91.3 | 52.3
128 - 1 layer | 90.1 | 98.9 | 58.6
128 - 2 layers | 87.8 | 93.6 | 53.9
*Results shown for PCR with a coreset of fixed size: 140 for 8-dsets, 180 for Cifar-100 and 120 for Permuted-MNIST.

We also experimented with other coreset building techniques, such as homogeneous coreset sampling according to cluster means and ART prototypes, but they did not perform as well as the simpler homogeneous sampling across task labels (used in the main paper). We hypothesize that, since the coresets used were very small, it becomes more important to guarantee homogeneous task sampling than feature-level prototype diversity. Results in Table 6.10 show the performance of PCR task mapping when applied to 8-dsets. Here, since we are only analyzing task classification performance, we employ an embedding from Resnet-18. Note that for 8-dsets in the main paper we use Alexnet for the embedding, since it worked best for fine-grained classification; see supplementary section B.2.

Table 6.10: Coreset Building Techniques - Task Classification*
Homogeneous Sampling Across Tasks | 94.7
K-Means - Homogeneous Sampling Across Clusters
Number of Clusters | 5 | 20 | 40 | 80 | 160
Accuracy | 94.7 | 94.0 | 94.5 | 94.7 | 94.1
Fuzzy ART - Homogeneous Sampling Across Prototypes
Vigilance | 0.9 | 0.895 | 0.89 | 0.88 | 0.87
Accuracy | 90.1 | 89.5 | 90.6 | 89.6 | 89.4
*Results shown for an embedding using Resnet-18 on 8-dsets.

B.4 AE-gates Implementation

We adapted the Aljundi et al [51] algorithm to our experimental benchmarks. We used undercomplete single-layer autoencoders (AEs) preceded by standardization and a sigmoid nonlinearity, as per the official implementation. The input to each AE comes from an Imagenet-pretrained feature extractor backbone (except for Permuted-MNIST), the same used for all our task mappers. More specifically, for 8-dsets: average-pooled Alexnet final layer of size 256; for Cifar-100: Resnet-34 penultimate layer of size 512. Each task-specific AE was trained for 30 epochs using ADAM and a slowly decaying learning rate starting at 0.01. For each task, the best model (on a validation set) was saved to be used as a task gate at test time.

Figure 6.5: AE-gates task mapper performance for 8-dsets, Permuted-MNIST and Cifar-100. Task classification accuracy is plotted as a function of the autoencoder latent dimension and the resulting mapper memory size in kilobytes (KB).

In order to choose the appropriate latent dimension size (H), we computed results varying H for each benchmark; see Figure 6.5. For 8-dsets and Cifar-100, we select the best H according to the memory-performance tradeoff measured by score = A - αM, similarly to the procedure used for our own proposed task mappers. Here, A stands for task classification accuracy, M for memory storage usage, and α is the weight of memory usage, which was set to 10^{-7}. With this heuristic, for 8-dsets we use H = 100 and for Cifar-100 H is set to 50. For Permuted-MNIST we choose H = 5, since at this point the mapper already achieves 100% accuracy.
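The sketch below illustrates the AE-gates selection rule described above (one undercomplete single-layer autoencoder per task, with the task chosen by lowest reconstruction error), together with the score = A - αM tradeoff used to pick the latent dimension H. It is an illustration of the idea, not the Aljundi et al implementation; layer shapes and standardization details are assumptions.

```python
# Sketch of AE-gates task selection and the memory-performance tradeoff score.
import torch
import torch.nn as nn

class TaskAE(nn.Module):
    """Undercomplete single-layer autoencoder used as a per-task gate."""
    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.enc = nn.Linear(in_dim, latent_dim)
        self.dec = nn.Linear(latent_dim, in_dim)

    def forward(self, x):
        return self.dec(torch.relu(self.enc(x)))

def select_task(feature, aes, mean, std):
    """feature: 1D backbone embedding; aes: list of trained per-task autoencoders."""
    x = torch.sigmoid((feature - mean) / (std + 1e-8))       # standardize + sigmoid
    errs = [torch.mean((ae(x) - x) ** 2).item() for ae in aes]
    return int(min(range(len(errs)), key=errs.__getitem__))  # lowest reconstruction error

def tradeoff_score(accuracy, memory_bytes, alpha=1e-7):
    # score = A - alpha * M, used above to pick the latent dimension H
    return accuracy - alpha * memory_bytes
```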
B.5 Training and Optimization

For all fine-grained classification, training parameters were set according to Table 6.11.

Table 6.11: Full Pipeline Training Hyperparameters
 | 8-dsets | Permuted-MNIST-25 | Cifar100
Feature Extractor | Alexnet | None | Resnet-34
X_mapper | 256 | 784 | 512
X_main | 9,216 (256x6x6) | 784 | 512
Epochs per Task | 100 | 10 | 35
Optimizer | SGD with lr: 0.0004 - 0.0001, momentum=0.5 (all experiments)
Mini-Batch Size | 64 (all experiments)

C Supplementary Materials - Incremental Deep Feature Modeling for Continual Novelty Detection

C.1 Intra-dataset Novelty Detection Results

Here we provide a clarification about the novelty scores reported in the results. In the main paper Fig 4a, we show AUROC scores at each task computed from the input train data D_t = OOD_t ∪ ID_t. incDFM uses this unlabeled train data to incrementally recruit novel samples and estimate novelty. Purely for the sake of evaluation, we can also compute AUROC or AUPR scores using the test sets for each subset of the data corresponding to ID and OOD samples, D_t^{test} = OOD_t^{test} ∪ ID_t^{test}. These are the most unbiased scores, since the OOD detector will never have been exposed to these samples in previous tasks as holdout ID data (ID_t). Also, for fairness, in the case of incDFM we compute the test-set AUROC and AUPR scores after completing all iterations of novelty estimation on the train set. That is, we use the test set only for testing after incDFM has completed all of its training, and we do not use it to compute incDFM's intermediate parameters T_i^{new}. We reported the average over all tasks of the AUROC and AUPR test scores in the main paper Fig 4b table. All other experiments (except Figure 4.4a in the main paper) report results using the test sets.

Figure 6.6: Intra-dataset Novelty Detection - AUROC scores per task using the test set, for (A) Cifar10, (B) Cifar100, (C) iNaturalist and (D) EMNIST. The test set is equivalent, in proportions (ratio old:new), to the unlabeled train data used for fitting and novelty detection. In the case of incDFM, this is evaluated after all iterations are performed on the unlabeled train data.

Here, in Figure 6.6, we show the full per-task result curves using test evaluation data, and we can observe that incDFM outperforms the baselines through all tasks. The trend observed using test sets is similar to that observed using train sets.
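For clarity, the snippet below shows how the per-task test AUROC and AUPR described above can be computed from novelty scores over the held-out ID and OOD test splits, treating OOD as the positive class. The detector call in the usage line is a placeholder; the scoring itself depends on the method being evaluated.

```python
# Sketch of per-task test evaluation: scores for a mixed ID/OOD test split are
# compared against binary labels (1 = new/OOD, 0 = old/ID).
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate_task(scores_id_test, scores_ood_test):
    scores = np.concatenate([scores_id_test, scores_ood_test])
    labels = np.concatenate([np.zeros(len(scores_id_test)),
                             np.ones(len(scores_ood_test))])
    return roc_auc_score(labels, scores), average_precision_score(labels, scores)

# usage (placeholder detector): auroc, aupr = evaluate_task(detector(id_feats), detector(ood_feats))
```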
C.2 Inter-dataset Novelty Detection Results

Figure 6.7: Inter-dataset Novelty Detection (8 dataset sequence) - AUROC scores per task using a test set equivalent, in proportions (ratio old:new), to the unlabeled train data. In the case of incDFM, this is evaluated after all iterations are performed on the unlabeled train data.

In Figure 6.7 we show AUROC scores per task for the 8 dataset sequence using test evaluation data. For the Odin and Softmax baselines, we report results for the task-independent implementation.

C.3 Estimating the stopping point for incremental novelty recruitment in incDFM

To estimate a stopping point for incremental recruitment, we use a validation set that contains only in-distribution (ID/old) samples and is updated by the algorithm at every task t, u_t^{val} = F({V_k^{val}}, k < t), where F is the feature extractor. In practice, at each task we reserve a small percentage of detected novel samples for validation and do not use them for fitting any parameters. The validation set is used to estimate, at each iteration i, the total number of OOD samples left in the unlabeled pool, N_{i,left}^{new}, by a principle of exclusion: we set a validation threshold P_{val} in the high percentile range and estimate

N_{i,left}^{new} = Count(S_i > Percentile(S_i^{val}, P_{val}))    (C.1)

indices_i^{new} = argsort(S_i)[:R]  if N_{i,left}^{new} > 0    (C.2)

Incremental recruitment cannot exceed N_{i,left}^{new} at each iteration. R is the recruitment percentage per iteration, S_i are the composite scores for the unlabeled train data, and S_i^{val} are the composite scores for the validation data. Note that both S_i and S_i^{val} are computed equivalently, using the consolidated parameters {T_k, k < t} and incDFM's new-task parameters from the previous iteration, T_{t,i-1}. For S_i^{val} this means:

S_{i,old}^{val} = \min_k FRE(u_t^{val}, T_k),  k < t    (C.3)

S_{i,new}^{val} = FRE(u_t^{val}, T_{t,i-1})    (C.4)

S_i^{val} = \frac{S_{i,old}^{val}}{\lambda S_{i,new}^{val}}    (C.5)

Thus, the validation scores are also affected by the incremental estimation of \widehat{OOD}_t, since the validation samples, which are all ID, will tend to have increasingly higher S_{i-1}^{new,val} values as the estimation of \widehat{OOD}_t improves.

We show a hyperparameter sweep over a few percentile values of the validation set in Table 6.12. Overall, setting a high percentile value (e.g., the 95th or 85th percentile) tended to yield the best results across datasets, even though the difference between F1 scores in the 95-75 percentile range was subtle. Good results with high percentile values align with the assumption that OOD and ID data tend to separate over the course of iterations: at high val-ID thresholds, one still obtains high precision and recall for novel (OOD) data.

Table 6.12: incDFM F1 scores averaged across all tasks in the intra-dataset class-incremental experiments, when varying P_val
P_val | 95 | 85 | 75
cifar10 | 94.6 | 94.0 | 93.4
cifar100 | 87.4 | 91.2 | 90.0
emnist | 95.6 | 95.2 | 94.3
iNaturalist | 89.4 | 89.7 | 88.8
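The sketch below puts the stopping rule (C.1, C.2) into pseudocode-style Python: at each iteration, the number of remaining novel samples is estimated by exclusion against a high percentile of the ID-only validation scores, and the top-scoring unlabeled samples are recruited, stopping when the estimate reaches zero. Names and the recruitment fraction are illustrative, not the released incDFM implementation.

```python
# Sketch of one incremental recruitment iteration with the validation-based stopping
# rule. `scores` / `scores_val`: composite novelty scores for the unlabeled pool and
# the ID-only validation set (higher = more likely novel).
import numpy as np

def recruit_iteration(scores, scores_val, already_recruited, p_val=95.0, recruit_frac=0.05):
    thresh = np.percentile(scores_val, p_val)             # high-percentile ID threshold
    candidates = np.setdiff1d(np.arange(len(scores)), already_recruited)
    n_left = int(np.sum(scores[candidates] > thresh))     # estimated OOD samples left (C.1)
    if n_left == 0:
        return np.array([], dtype=int)                    # stopping point reached
    n_recruit = min(max(1, int(recruit_frac * len(scores))), n_left)
    order = candidates[np.argsort(-scores[candidates])]   # highest composite scores first
    return order[:n_recruit]                              # (C.2): top recruits this iteration

# usage: start with already_recruited = np.array([], dtype=int) and accumulate the
# returned indices across iterations until an empty array is returned.
```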
C.4 Thresholding in Baselines - Hyperparameter Sweep

Figure 6.8: Average F1 scores for the baselines across tasks when varying the validation threshold used during the \widehat{OOD}_t estimate, for (a) Cifar10, (b) Cifar100, (c) emnist and (d) iNaturalist. The threshold is set as a percentile P_val of the validation set, the latter containing only ID data.

(a) Cifar10
P_val | 95 | 75 | 55 | 35
DFM | 40.5 | 71.0 | 84.5 | 79.6
Mahal | 42.8 | 67.7 | 74.8 | 73.8
Softmax | 23.1 | 56.1 | 70.1 | 68.1
Odin | 27.7 | 59.4 | 66.4 | 62.0

(b) Cifar100
P_val | 95 | 75 | 55 | 35
DFM | 13.6 | 54.2 | 72.8 | 81.7
Mahal | 24.7 | 49.3 | 60.4 | 65.9
Softmax | 11.6 | 43.5 | 64.4 | 54.7
Odin | 11.4 | 31.7 | 56.7 | 50.6

(c) emnist
P_val | 95 | 75 | 55 | 35
DFM | 16.9 | 70.8 | 84.8 | 88.4
Mahal | 27.0 | 57.8 | 66.8 | 70.0
Softmax | 14.3 | 49.3 | 57.8 | 61.9
Odin | 18.8 | 48.1 | 64.7 | 66.2

(d) iNaturalist
P_val | 95 | 75 | 55 | 35
DFM | 22.0 | 57.6 | 73.71 | 70.0
Mahal | 20.4 | 47.8 | 66.3 | 59.4
Softmax | 36.3 | 68.2 | 66.3 | 65.0
Odin | 28.7 | 68.4 | 71.77 | 67.7

For all four baselines, we select \widehat{OOD}_t for task t by applying a single threshold on the corresponding generated uncertainty scores (Scores_i), as intended in the original implementation of these baselines. In our case, the threshold is chosen based on a validation set containing only in-distribution samples, and is set to a given percentile value P_{val} of that validation set. As such, \widehat{OOD}_t is estimated by:

Scores_i, Scores_i^{val} = OODMethod(u_t, u_t^{val})    (C.7)

\widehat{OOD}_t = indices^{new} = \{ i \mid Scores_i > Percentile(Scores_i^{val}, P_{val}) \}    (C.8)

For fairness, we employ the same validation set u_t^{val} = F({V_k^{val}}, k < t) used by incDFM (the same as described in the main paper), where F is the feature extractor. For all baselines we performed a hyperparameter sweep over thresholds; results are shown in Figure 6.8. In the main paper we report the best results for each baseline. The single-threshold baseline novelty detectors, in general, tend to perform better with a low threshold. This is likely because the ID_t and OOD_t scores are strongly enmeshed, and a high threshold results in a very low recall value, insufficient for novelty characterization going forward.

Table 6.13: AUROC scores for offline OOD estimation (Cifar10 → SVHN)
 | incDFM | Mahal | Softmax
Frozen-Resnet-50-SWAV | 99.9 | 93.1 | 88.2
Finetune-Resnet-50-SWAV | 99.3 | 95.03 | 71.4
Frozen-Resnet-50 | 99.8 | 60.0 | 87.3
Finetune-Resnet-50 | 99.9 | 87.7 | 76.0

C.5 Feature Extraction Network

We experimented with different feature extraction networks. Overall, incDFM and the baselines on average performed best with a frozen Resnet-50 backbone pretrained using contrastive learning [SWAV - [87]], in comparison to fine-tuning, for both continual and offline experiments. Table 6.13 compares feature extraction approaches (plastic/fine-tuned vs frozen) for offline OOD detection (see main paper section 5.1).

The results in Table 6.13 are aligned with recent advances in the transfer/adaptive learning literature, which suggest that most features needed for natural-image datasets can be found in rich pretrained-on-Imagenet backbones [116]. Moreover, in a task-independent CL setting, using a coreset to estimate past data can lead to overfitting. Thus, freezing the backbone has become common practice in the CL literature [18, 46]. For OOD methods relying on classification (Odin, Softmax), we use a plastic/trainable 1-hidden-layer MLP (hidden dimension of 4096 units) to learn the class mapping. We do the same for the end-to-end unsupervised class-incremental classification pipeline.

C.6 End-to-end Unsupervised Class-Incremental Classification Pipeline

C.6.1 Memory Coreset

We employ a memory coreset building scheme similar to [17]. We keep a small memory coreset with embeddings and their pseudolabels (each detected novelty is assigned the next class index, C+1) corresponding to past tasks' novelty detections \widehat{OOD}_k, k < t (see main paper section 3.2). At each task, a selection method is employed to choose which samples detected as novel will go into the coreset, with the aim of maximizing sample heterogeneity, since the coreset has a fixed size. We use K-means clustering per pseudolabel to select samples for storage and for removal. At each task, when \widehat{OOD}_t is detected, we run K-means clustering, super-labeling the embeddings of \widehat{OOD}_t as one of K clusters. At the time of insertion into the memory coreset, we select equal numbers of samples from each cluster. Additionally, if the coreset is full, we compute the space needed for the new samples and remove an equivalent number of old embeddings. We do this by assessing their stored super-cluster labels and removing equal amounts of samples per novelty pseudolabel and per cluster, thereby preserving heterogeneity. By storing the per-novelty cluster-assignment superlabels we also avoid repeating the clustering operation.
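A minimal sketch of the heterogeneity-preserving coreset insertion described above: embeddings detected as novel at the current task are clustered with K-means and an equal number of samples is drawn from each cluster, with the cluster super-labels returned for later balanced removal. The value of K, the selection sizes, and the sklearn-based clustering are illustrative assumptions, not the exact pipeline code.

```python
# Sketch of K-means based coreset selection for one detected novelty (C.6.1).
import numpy as np
from sklearn.cluster import KMeans

def select_for_coreset(novel_embs, n_store, n_clusters=5, seed=0):
    """novel_embs: (N, D) array of embeddings predicted as novel for the current task."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(novel_embs)
    per_cluster = max(1, n_store // n_clusters)
    rng = np.random.default_rng(seed)
    chosen = []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        take = min(per_cluster, len(members))
        chosen.extend(rng.choice(members, size=take, replace=False))
    # return selected indices plus cluster super-labels (stored to guide balanced removal)
    return np.array(chosen[:n_store]), km.labels_
```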
C.6.2 Experience Replay

At each task, our model is trained using the current task's predicted \widehat{OOD}_t and selected memory embeddings of past tasks' \widehat{OOD}_k, k < t, present in the memory coreset. This forms an extended training set S_t that is used to minimize the cross-entropy loss for classification (equations C.9, C.10). Note that we use the novelty pseudolabels as targets for the classifier (a short illustrative sketch of this replay step is given at the end of this appendix).

S_t = \widehat{OOD}_t \cup \lambda_{mem} OOD_{t-1}^{memory}    (C.9)

\theta_t^* = \arg\min_{\theta_t} L(\theta_t, S_t)    (C.10)

The memory component can be given a weighted importance, \lambda_{mem}. We typically set \lambda_{mem} to reflect the proportion of classes present in the coreset.

Observation - OOD baselines that rely on classification: we employ the same procedure described in sections 5.1 and 5.2 to train the Odin and Softmax novelty detectors in the intra-dataset class-incremental experiments.

C.7 Inter-Dataset Novelty Detection Using the 8 Datasets

In this experiment, we analyze the ability of incDFM and the baselines to detect novelty continually in a setting where each novelty is an entire new dataset. This proposed CL scenario is closer to traditional offline OOD/ID detection, which typically also considers an entire novel dataset as OOD data. In the main paper we compare this inter-dataset experiment with our intra-dataset continual learning experiments. For the inter-dataset experiment we consider a sequence of eight tasks, each being one of 8 object recognition datasets, as in [58]. The 8 datasets, in the order presented, are: 1. Oxford Flowers [77] for fine-grained flower classification, with 102 classes; 2. MIT Scenes [78] for indoor scene classification, with 67 classes; 3. Caltech-UCSD Birds [79] for fine-grained bird classification, with 200 classes; 4. Stanford Cars [80] for fine-grained car classification, with 196 classes; 5. FGVC-Aircraft [81] for fine-grained aircraft classification, with 70 classes; 6. VOC Actions [82], the human action classification subset of the VOC 2012 challenge, with 10 classes; 7. Letters, the Chars74K dataset [83] for character recognition in natural images, with 62 classes; and 8. the Google Street View House Numbers (SVHN) dataset [42], with 10 classes. The full 8-dataset sequence contains 227,597 images across 717 classes.
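As referenced above, the following is a minimal sketch of the experience-replay update in equations (C.9, C.10) of section C.6.2: the classifier is trained on the current task's detected-novel embeddings plus replayed coreset embeddings, with the memory term weighted by λ_mem. The function names, the per-sample weighting interpretation of λ_mem, and the optimizer interface are illustrative assumptions, not the released incDFM pipeline code.

```python
# Sketch of one replay training step over S_t (C.9, C.10). Pseudolabels are the
# incrementally assigned novelty class indices, stored as long tensors.
import torch
import torch.nn.functional as F

def replay_step(classifier, optimizer, new_embs, new_pseudolabels,
                mem_embs, mem_pseudolabels, lam_mem=1.0):
    x = torch.cat([new_embs, mem_embs], dim=0)                 # S_t = OOD_t U memory
    y = torch.cat([new_pseudolabels, mem_pseudolabels], dim=0)
    w = torch.cat([torch.ones(len(new_embs)),                  # weight 1 for current task
                   torch.full((len(mem_embs),), lam_mem)])     # lambda_mem for replayed memory
    logits = classifier(x)
    loss = (F.cross_entropy(logits, y, reduction="none") * w.to(logits.device)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```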
Abstract
In this thesis we present four research projects that draw inspiration from neurobiology to design artificial lifelong learning and machine vision algorithms. The first project, inspired by biological hippocampal replay, develops CloGAN, an algorithm for class-incremental continual learning that employs hybrid generative and stored experience replay to mitigate forgetting. Our second project addresses hierarchical continual learning, where a lifelong agent benefits from having a separate algorithm for coarse-level object classification and specialist partitions of a deep neural network capable of fine-grained class discrimination. In our third project, we develop a continual novelty detector and integrate it into end-to-end unsupervised class-incremental learning. incDFM, our novelty detector model, functions incrementally to gradually build confidence and improve its novelty predictions. Lastly, in our final project, we explore topics complementary to lifelong learning for machine vision, such as shape representation for artificial object recognition. Overall, the work developed in this thesis provides novel contributions to the field of artificial lifelong learning, which is in turn crucial to enabling the next generation of real-world artificial intelligence agents.