LEARNING CONTROLLABLE DATA GENERATION FOR SCALABLE MODEL TRAINING

by

Yunhao Ge

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2023

Copyright 2024 Yunhao Ge

Dedication

To my parents, family, advisors, friends, and everyone who supported me throughout my journey.

Acknowledgements

I would like to express my profound gratitude to my advisor, Prof. Laurent Itti, for his unfailing support, guidance, and help. His mentorship allowed me the freedom to explore various research directions while providing invaluable help during challenging times. Our collaborative journey through interesting and challenging research avenues has been immensely rewarding. More than just a mentor in research, Prof. Itti's exemplary behavior has taught me invaluable life lessons about being a good mentor, father, husband, and kind-hearted individual.

I am also deeply thankful to Prof. Jiajun, my advisor at Stanford during my visiting period. His guidance was crucial in broadening my horizons and teaching me the importance of focus in conducting high-quality, crisp research.

Special thanks go to my collaborators and advisors during my internships: Dr. Vibhav Vineet and Dr. Neel Joshi at Microsoft; Dr. Jie Ren, Dr. Jiaping Zhao, Dr. Balaji Lakshminarayanan, and Prof. Ming-Hsuan Yang at Google Research; Dr. Yin Cui, Dr. Ming-Yu Liu, and Dr. Tsung-Yi Lin at Nvidia Research; Dr. Jinsung Yoon and Dr. Sercan O. Arik at Google Cloud AI; and Dr. Ziyan Wu at UII. I am grateful for the exciting research opportunities and experiences they provided.

I extend my heartfelt appreciation to all my collaborators, with whom I have had the privilege of publishing papers, especially Jiashu Xu, Brian Nlong Zhao, Yao Xiao, Zhi Xu, Xingrui Wang, Hong-Xing "Koven" Yu, Chengshu Li, Cem Gokmen, Ruohan Zhang, Sami Abu-El-Haija, Yuecheng Li, Ao Xu, Shuo Ni, Di Wu, Yuhang Xiao, Yunkui Pang, Gan Xin, and others. I would also like to acknowledge the numerous M.Sc. and undergraduate students at USC who have contributed in various capacities to our work.

My sincere thanks to my committee members, Prof. Greg Ver Steeg, Prof. Yan Liu, Prof. Nicolas Schweighofer, and Prof. Ram Nevatia, for their invaluable feedback and guidance throughout my dissertation process.

I am grateful to Amazon for the support of the Amazon ML Fellowship at the USC-Amazon Center on Secure and Trusted Machine Learning. Special thanks to Prof. Salman Avestimehr and Vincent Ponzo for their assistance. This fellowship significantly contributed to my freedom in research exploration.

I extend my gratitude to all the alumni, staff, and members of the iLab at the University of Southern California for their support during my academic journey.

Lastly, I owe a deep sense of gratitude to my parents for their unwavering love and support. I am also immensely thankful to my girlfriend, Jialin Dong, for her constant support and encouragement.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
    1.1 Large AI models need more and better quality data
    1.2 AI generated data to train models
    1.3 Thesis Outline
        1.3.1 Part 1: Diverse Methods of Controllable Data Generation
        1.3.2 Part 2: On-demanded Data Generation
        1.3.3 Part 3: Enhancing Model Performance and Generalization through Model Explainability
        1.3.4 Part 4: Model Parameter as Special "Data" for Lifelong Learning

Chapter 2: Pose Augmentation: Class-agnostic Object Pose Transformation for Object Recognition
    2.1 Introduction and related work
    2.2 Object Pose Transforming Network
        2.2.1 Eliminate-add structure of the generator
        2.2.2 Pose-eliminate module
        2.2.3 Continuous pose transforming training
        2.2.4 Loss Function
    2.3 Experimental Methods
        2.3.1 Datasets
        2.3.2 Network Implementation
    2.4 Experiments and Results
        2.4.1 Object Pose Transformation Experiments
        2.4.2 Object Recognition Experiment
        2.4.3 Class-agnostic Object Transformation Experiment
        2.4.4 Object Pose Significance on Different Object Recognition Tasks
        2.4.5 Generalization to Imagenet
    2.5 Conclusions

Chapter 3: Zero-shot Synthesis with Group-Supervised Learning
    3.1 Introduction
    3.2 Related Work
    3.3 Group-Supervised Learning
        3.3.1 Datasets admissible by GSL
        3.3.2 Auxiliary tasks via Multigraphs
    3.4 Group-Supervised Zero-Shot Synthesis Network (GZS-Net)
        3.4.1 Auto-Encoding along relations in M
        3.4.2 Disentanglement by swap Operation
        3.4.3 Training and Optimization
    3.5 Qualitative Experiments
        3.5.1 Fonts Dataset & Zero-shot synthesis Performance
        3.5.2 Zero-shot synthesis on ilab-20M and RaFD
    3.6 Quantitative Experiments
        3.6.1 Quantifying Disentanglement through attribute co-prediction
        3.6.2 Distance of synthesized image to ground truth
        3.6.3 GZS-Net Boost Object Recognition
    3.7 Conclusion

Chapter 4: Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation
    4.1 Introduction
    4.2 Related works
    4.3 Method
        4.3.1 Zero-shot Foreground Synthesis
        4.3.2 Language-driven Context Synthesis
            4.3.2.1 Zero-shot scenario
            4.3.2.2 Few-shot scenario
        4.3.3 CLIP Sanity Check
        4.3.4 Cut-paste Composition
    4.4 Experiments
        4.4.1 Object Detection
        4.4.2 More Baselines and Ablation Studies
        4.4.3 Synthetic data distribution complements the real data distribution
        4.4.4 Generalization to more tasks
        4.4.5 Compositionality in Synthetic Dataset
    4.5 Conclusion

Chapter 5: EM-Paste: EM-guided Cut-Paste for Image-level Weakly Supervised Instance Segmentation
    5.1 Introduction
    5.2 Related works
    5.3 Method
        5.3.1 EM-guided Foreground Extraction
        5.3.2 Background (Context) Augmentation
        5.3.3 Compositional Paste
    5.4 Experiments
        5.4.1 Experiment Setup
        5.4.2 Weakly-supervised Instance Segmentation
        5.4.3 Weakly-supervised Object Detection
        5.4.4 Weakly-supervised Instance Segmentation on Long-tail Dataset
    5.5 Conclusion

Chapter 6: 3D Copy-Paste: Physically Plausible Object Insertion for Monocular 3D Detection
    6.1 Introduction
    6.2 Related Works
        6.2.1 Monocular 3D Object Detection
        6.2.2 3D Data Augmentation
        6.2.3 Illumination Estimation
    6.3 Methods
        6.3.1 Where and how: Physically Plausible Position, Pose, and Size Estimation
            6.3.1.1 Ground Plane Selection
            6.3.1.2 Constrained Insertion Parameter Search
        6.3.2 What Illumination is on the object
            6.3.2.1 Spatial-varying Illumination Estimation and Retrieval
            6.3.2.2 Environment Map Refinement
        6.3.3 Dataset Augmentation with Insertion and Downstream Model Training
    6.4 Experiments
        6.4.1 Dataset and Model Setting
        6.4.2 Physically-plausible position, pose, size, and illumination leads to better monocular detection performance
        6.4.3 Ablation study on the influence of insertion illumination and position on monocular 3D object detection
        6.4.4 Qualitative Analysis
    6.5 Conclusion and Discussion

Chapter 7: DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models
    7.1 Introduction
    7.2 Related Works
        7.2.1 Text-to-image Diffusion Models
        7.2.2 Personalized text-to-image Generation
        7.2.3 Prompt Learning
    7.3 Method
        7.3.1 Text-to-Image Diffusion
        7.3.2 Prompt Tuning
        7.3.3 Learning Prompt Distribution
    7.4 Experiments
        7.4.1 Diverse Personalized Generation
        7.4.2 Generation Quality and Diversity Evaluation
        7.4.3 Controllability of Prompt Distribution
        7.4.4 Applying to Text-to-3D Generation
        7.4.5 Applying to Synthetic Dataset Generation
    7.5 Limitations
    7.6 Conclusion

Chapter 8: Neural-Sim: Learning to Generate Training Data with NeRF
    8.1 Introduction
    8.2 Related work
    8.3 Neural-Sim
        8.3.1 Backprop through data generation from NeRF
            8.3.1.1 Tool 1: Reparametrization of pose sampling
            8.3.1.2 Tool 2: Twice-forward-once-backward
            8.3.1.3 Tool 3: Patch-wise gradient computation
        8.3.2 Nerf-in-the-wild
    8.4 Experiments
        8.4.1 NeRF to generate data for downstream tasks
        8.4.2 YCB-synthetic dataset
        8.4.3 YCB-in-the-wild dataset
        8.4.4 YCB Video dataset
        8.4.5 Interpretability of Neural-Sim
    8.5 Discussion and Future Work

Chapter 9: BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation
    9.1 Introduction
    9.2 Related works
        9.2.1 Real Indoor Scene RGB-D Datasets
        9.2.2 3D Reconstruction Datasets
        9.2.3 Synthetic Datasets
        9.2.4 3D Simulators
    9.3 BEHAVIOR Vision Suite
        9.3.1 Extended BEHAVIOR-1K Assets
        9.3.2 Customizable Dataset Generator
            9.3.2.1 Capabilities
            9.3.2.2 Dataset Generation Process
    9.4 Applications and Experiments
        9.4.1 Parametric Model Evaluation
        9.4.2 Holistic Scene Understanding
        9.4.3 Object States and Relations Prediction
    9.5 Conclusion

Chapter 10: A Peek Into the Reasoning of Neural Networks: Interpreting with Structural Visual Concepts
    10.1 Introduction
    10.2 Related Work
    10.3 Visual Reasoning Explanation Framework
        10.3.1 Visual Concept Extractor
        10.3.2 Graph Reasoning Network
            10.3.2.1 Representing Images as SCGs
            10.3.2.2 Imitate the Reasoning Process of NN
        10.3.3 Visual Decision Interpreter
    10.4 Experiments and results
        10.4.1 Visual Reasoning Explanation Experiment
        10.4.2 Logic Consistency between VRX and NN
        10.4.3 Interpretation Sensitive of Visual and Structure
        10.4.4 Model Diagnosis with VRX
    10.5 Conclusion

Chapter 11: How to interpret, teach, and interact with neural networks
    11.1 Introduction
    11.2 Results
        11.2.1 Neural Network shows reasoning logic to human
        11.2.2 Human improves network's performance with HNI
            11.2.2.1 Experimental results on six image classification tasks
            11.2.2.2 Experimental results on ImageNet classification tasks
        11.2.3 Zero-shot learning: Human teach network to learn new object through HNI
    11.3 Discussion
    11.4 Methods: Human-network Interface
        11.4.1 Network-to-Human
        11.4.2 Human-to-network
            11.4.2.1 Human modifies c-SCG
            11.4.2.2 Training Graph Reasoning Network (GRN) with Human's Logic
            11.4.2.3 Transfer Reasoning Logic to Network with Partial Knowledge Distillation

Chapter 12: Contributions of Shape, Texture, and Color in Visual Recognition
    12.1 Introduction
    12.2 Related Works
    12.3 Humanoid Vision Engine
        12.3.1 Humanoid Image Preprocessing and Feature Extraction
        12.3.2 Humanoid Neural Network
    12.4 Experiments
        12.4.1 Effectiveness of Feature Encoders
        12.4.2 Effectiveness of Humanoid Neural Network
        12.4.3 Human Experiments
        12.4.4 Contributions Attribution in Different Tasks
    12.5 More Humanoid Applications with HVE
        12.5.1 Open-world Zero-shot Learning with HVE
        12.5.2 Cross Feature Imagination with HVE
    12.6 Conclusion

Chapter 13: Improving Zero-shot Generalization and Robustness of Multi-modal Models
    13.1 Introduction
    13.2 Related work
    13.3 Zero-shot inference failure case analysis
    13.4 Proposed Method
        13.4.1 Self-consistent zero-shot confidence estimation
        13.4.2 Top-down and bottom-up label augmentation using WordNet hierarchy
    13.5 Experiments and Results
        13.5.1 Our proposed confidence score is better suited for selective prediction than baselines
        13.5.2 Using hierarchy to help improve zero-shot accuracy on low confidence subset
        13.5.3 Ablation study
    13.6 Conclusion

Chapter 14: Building One-class Detector for Anything: Open-vocabulary Zero-shot OOD Detection Using Text-image Models
    14.1 Introduction
    14.2 Methods
        14.2.1 Background
        14.2.2 Our methods: OOD scores utilizing in-domain and OOD label sets
        14.2.3 Extension to customized in- and out-of-distribution label sets
        14.2.4 One-class OOD detection in mixed in- and out-of-distribution multi-object images
        14.2.5 The connection between S_max_in_prob and S_max_logit_diff
    14.3 Experimental evaluation
        14.3.1 Our scores outperform the baselines on one-class OOD detection tasks
        14.3.2 Customized in-domain and OOD label sets help to improve performance
        14.3.3 OOD detection in mixed in-domain and OOD multi-object images
    14.4 Related work
    14.5 Conclusion and discussion

Chapter 15: Invariant Structure Learning for Better Generalization and Causal Explainability
    15.1 Introduction
    15.2 Related Works
    15.3 Methodology
        15.3.1 Motivations
        15.3.2 Learning framework
        15.3.3 Generalizing to self-supervised setting
    15.4 Experiments
        15.4.1 Supervised learning tasks
            15.4.1.1 Synthetic data
            15.4.1.2 Real-world data
        15.4.2 Self-supervised learning
    15.5 Conclusions
    15.6 Limitations and Future Work

Chapter 16: Lightweight Learner for Shared Knowledge Lifelong Learning
    16.1 Introduction
    16.2 Related Works
        16.2.1 Lifelong Learning
        16.2.2 Multi-task Learning
        16.2.3 Federated Learning
        16.2.4 Other methods that may help solve SKILL
    16.3 Shared knowledge in lifelong learning (SKILL)
    16.4 SKILL-102 dataset
    16.5 Lightweight Lifelong Learner for SKILL
    16.6 Experiments and results
    16.7 Shared Knowledge Accumulation, Reuse and Boost
        16.7.1 Corrective approach to task overlap/synergy
        16.7.2 Learning approach to task overlap/synergy
        16.7.3 Further boost with Head2Toe
    16.8 Discussion and Future Works
    16.9 Conclusions

Chapter 17: CLR: Channel-wise Lightweight Reprogramming for Continual Learning
    17.1 Introduction
    17.2 Related Work
    17.3 Proposed Method
        17.3.1 Channel-wise Lightweight Reprogramming
        17.3.2 CLR for Continual Learning
    17.4 Experiments and results
        17.4.1 Dataset and Baselines
        17.4.2 Accuracy on the first tasks
        17.4.3 Average accuracy after learning all 53 tasks
        17.4.4 Parameter and computation cost
        17.4.5 Influence of different immutable backbone
    17.5 Conclusion
    17.6 Details of our 53-dataset for continual learning and performance
    17.7 Channel-wise linear reprogramming ability
    17.8 Bootstrapping results
    17.9 More experiments to explore the trade-off between parameter and performance
    17.10 Transfer learning with CLR-based model
    17.11 CIFAR-100 Result

Chapter 18: Conclusions

Bibliography

List of Tables

2.1 Average mean squared error (MSE; lower is better) and peak signal-to-noise ratio (PSNR; higher is better) for different methods.
2.2 Poses used in the pose-unbalanced (P-UB) training dataset to train OPT-Net.
2.3 Different training and testing datasets for object recognition.
2.4 Testing object recognition accuracy (%) of each class after training on different training datasets. Comparing S-P-B and SA-P-B with P-UB shows how much classification improves thanks to adding synthesized images for missing poses in the training set, reaching or surpassing the level achieved when all real poses are available (P-B). Our synthesized poses yield better learning than traditional data augmentation (A-P-UB).
2.5 Overall object recognition accuracy for different training datasets in RGB-D.
2.6 Object recognition overall accuracy for different datasets.
3.1 Disentangled representation analysis. Diagonals are bolded.
3.2 Average metrics between the ground-truth test image and the image synthesized by each model, computed over the Fonts dataset. We report MSE (smaller is better) and PSNR (larger is better).
4.1 Desired qualities of a context generation method: images should be high quality and diverse, with less human involvement, generalization to any new environment, scalability, explainability, privacy preservation, and compositionality.
4.2 Six manually designed templates for generating foreground images zero-shot. Here the class placeholder will be replaced by label names such as "bus". The design philosophy is to put objects in a clean background for easy foreground extraction.
4.3 Sixteen handcrafted templates for generating coherent background images zero-shot. The full template is "A real photo of <context>", where <context> is substituted with one of the above 16 places. The design philosophy is to create natural images without any objects of interest (thus "empty"), since we would not have segmentation labels for those objects if they were generated.
4.4 We harness T2I (Stable Diffusion) to generate large-scale, high-quality synthetic foregrounds and backgrounds, and improve VOC object detection. The mAP column is computed as the average over IoU thresholds ranging from 50 to 95 with step size 5.
4.5 Stable Diffusion generated foregrounds and contextual backgrounds enhance object detection on the COCO dataset.
4.6 Detailed statistics of synthetic datasets created for VOC.
4.7 Our approach is robust to foreground extraction methods.
4.8 Instance segmentation for VOC (left) and COCO (above). Our methods generalize to other tasks and are competitive even in 1 shot.
4.9 We highlight that our synthesized data together with 70% of the real data achieves better performance than the full (100%) set of real data only. This highlights the benefit of our approach in reducing total human effort. Syn (ours) means ru-DALLE synthesized 1500 diverse images (using UW as CDI). Top row terms are: CC: Coca Cola, CM: Coffee mate, HB: honey bunches, HS: hunt's sauce, MR: mahatma rice, NV1: nature V1, NV2: nature V2, PO: palmolive orange, PS: pop secret, Pbbq: pringles bbq, RB: red bull.
4.10 Contextual synthetic backgrounds produced by our approach significantly enhance object instance detection accuracy across three datasets.
4.11 Even if the user provides out-of-distribution CDI, our approach is able to produce a synthetic dataset tailored towards the actual test distribution by in-domain intervention.
5.1 Metrics for instance segmentation models on the Pascal VOC 2012 val set. Here F means fully supervised, while B and I mean bounding-box and image-level label based weakly supervised methods, respectively. We highlight the best mAP with image-level labels in green, and with bounding-box labels in blue. Our method outperforms prior SOTA image-level methods. Further, our method achieves better performance than some of the prior bounding-box SOTA, although the bounding-box methods have access to much more information about object instances.
5.2 Weakly supervised instance segmentation on COCO val2017. Models here use image-level labels.
5.3 Ablation study on PASCAL VOC.
5.4 Object detection on Pascal VOC 2012.
6.1 Statistics of external 3D objects from Objaverse [107].
6.2 ImVoxelNet 3D monocular object detection performance on the SUN RGB-D dataset with different object insertion methods. When inserting randomly, the accuracy of the downstream object detector drops, i.e., the detector suffers from random insertions (which may have collisions, occlusions, incorrect lighting, etc.). In contrast, by only applying physically plausible position, size, and pose, performance significantly improves (41.80%). Further, when plausible lighting and shadows are added, our 3D Copy-Paste improves the accuracy of the downstream detector to a new state-of-the-art accuracy (43.79%). We use mAP (%) with a 0.25 IoU threshold.
6.3 Per-class average precision (AP) of ImVoxelNet 3D monocular object detection performance on the SUN RGB-D dataset.
6.4 ImVoxelNet 3D monocular object detection performance on the ScanNet dataset with different object insertion methods.
6.5 ImVoxelNet 3D monocular object detection performance on the SUN RGB-D dataset with different illumination during insertion rendering. All experiments use the same ImVoxelNet model; insertion also uses our proposed physically plausible position, size, and pose.
6.6 Ablation study of global context influence on ImVoxelNet monocular 3D object detection performance on SUN RGB-D.
7.1 Our method achieves the best quality and diversity automatic metrics across 12 scenarios. Mean metrics are reported with standard deviations shown in subscript.
7.2 Classification accuracy on different real test sets after training a classifier on synthetic ImageNet (IN) generated by a given method. When training on images from our method, the resulting classifier performs better on the respective test sets, indicating that the images synthesized by our method allowed the classifier to learn those object categories better.
8.1 Large-scale YCB-synthetic experiments.
8.2 YCB-Video performance. Observe the large improvement of the proposed Neural-Sim approaches, before and after optimization, over the baselines.
9.1 Comparison of real and different types of synthetic datasets to BEHAVIOR Vision Suite. Camera View indicates whether images can be rendered from arbitrary viewing angles. Obj Pose indicates whether object layout can be modified. Obj State indicates whether object physical states (e.g., open/closed, folded) and semantic states (cooked, soaked, etc.) can be modified. CV toolkit indicates whether utility functions are provided to sample camera poses that satisfy certain constraints (those that capture half-open kitchen cabinets filled with grocery items, for instance). Visual Quality indicates how photorealistic the images are.
9.2 We generate up to 200-500 short video clips with diverse scene configurations for parametric evaluation (Section 9.4.1). Each video clip varies along one continuous axis with respect to a single target object. On average, each video has 300 frames.
9.3 A comprehensive evaluation of SOTA models on four vision tasks. Our synthetic dataset can be a faithful proxy for real datasets as the relative performance between different models closely correlates to that of the real datasets.
9.4 Classification results on the real test set. Task-specific training on synthetic data boosts performance on real images.
9.5 Classification results on the held-out synthetic eval set and the real test set for our method adapted from [382].
10.1 VRX helps correction. Out of 119 images initially misclassified by Xception, only 5 remain misclassified after VRX-guided image editing. Over 30% of the samples have missing concepts, and over 95% of them have been correctly explained. In contrast, 117 and 115 images remain misclassified after substituting bad concepts with random image patches, or substituting good concepts with other good concepts from other images of the same class.
10.2 Test set accuracy comparison showing VRX boosts original model performance. All numbers are in %.
11.1 Human improves a network's performance with HNI: experiments on six different image classification tasks (performance is tabulated as percent correct classification).
12.1 The "Original" column gives the accuracy of Resnet18 on the original images as our upper bound. The shape, texture, and color columns give the accuracy of the feature nets. "All" gives the results of our HNN that combines the 3 feature nets. It approaches the upper bound, suggesting that the split into 3 feature nets preserved most information needed for image classification.
12.2 Contributions of features from HVE and humans' recognition accuracy.
12.3 Class-specific bias for each class in iLab-20M.
12.4 Open-world zero-shot accuracy and FID of cross-feature imagination.
13.1 CLIP (ViT-B/16) and LiT (ViT-B/32) zero-shot top-1 accuracy comparison between baseline and ours (w/ hierarchy).
13.2 Generalizability to non-ImageNet datasets (CLIP (ViT-B/16) zero-shot top-1 accuracy).
13.3 Generalizability to different backbones with CLIP.
13.4 CLIP (ViT-B/16) zero-shot top-1 accuracy comparison with prompt ensemble.
13.5 Effect of the confidence-score threshold on zero-shot accuracy.
14.1 Comparison between the proposed scores and the baseline methods.
14.2 One-class OOD detection across datasets for various in-domain cases evaluated using AUC (higher is better). Our scores consistently outperform the baselines for detecting samples from unseen classes and under distribution shift. Note that ImageNet-A does not have person images (N/A in the table).
14.3 One-class OOD detection for the dog sub-type terrier using different C_in and C_out label sets. More fine-grained label sets help to improve the performance for both scores.
14.4 Identifying mixed in-domain and OOD multi-object images using the mixture score g(x) defined based on different OOD scores. None of the single scores can identify mixed images. New scores based on bounding box detection improve the performance, and our scores outperform the baselines.
15.1 Synthetic tabular data experiments in the supervised learning setting. Note that black-box MLP and CASTLE cannot provide DAGs. ISL yields lower MSE for ID and OOD, and lower SHD.
15.2 Synthetic tabular data counterfactual simulation experiments. MSE is shown for various counterfactual outcomes, obtained by modifying the 'counterfactual source' variables.
15.3 Supervised learning experiments on real-world data. Note that MLP and CASTLE cannot provide DAGs (and thus don't have SHD).
15.4 The impact of the number of environments for ISL in the supervised learning setting.
15.5 Self-supervised causal graph discovery on the Sachs and Insurance datasets.
16.1 Analysis of computation expenditures and accuracy for our approach and the baselines, to learn all 102 tasks (with a total of 5,033 classes, 2,041,225 training images) in a single agent. Here we select LLL, no BB, MAHA as reference (1x CPU usage) since it is the fastest approach, yet still has higher accuracy than all baselines. For our approach, MAHA leads to slightly higher accuracy than GMMC, at roughly the same computation cost. All baselines perform worse than our approach, even though they also require more computation than our approaches that do not use BB. BB adds significantly to our computation cost, but also leads to the best accuracy when used with MAHA.
16.2 Analysis of computation and network expenditures for our parallelized LLL approach and our parallelized SUPSUP, to learn all T = 102 tasks. Our approach supports any number of agents N such that 1 ≤ N ≤ T. Maximum speedup is expected when N = T and each agent learns one task, then shares with all others. Here, we report numbers for T = 102, N = 51, and each agent learns 2 tasks in sequence. Note that in our approach, accuracy is not affected by N; only the amount of parallelization speedup increases with N. Note how in this table we still report MACs but take parallelization into account (e.g., teacher CPU for N agents is single-agent CPU divided by N). Teacher CPU: time to learn tasks from their training datasets, plus to possibly prepare data for sharing (e.g., compute GMMC clusters). Communications: our LLL agents communicate either GMMC clusters or Mahalanobis training images, while our modified SUPSUP communicates masks. Here we assume that there is a communication bottleneck at the receiver (student): the shared data from 100 tasks needs to be received serially, over a single networking interface for each student. Hence our communication figures are for all the shared data from all other tasks apart from those an agent learned itself. We convert communication costs to equivalent MACs by assuming 1,000 MACs per byte transmitted. BB adds a small extra communication cost, to transmit the biases. Student CPU: for GMMC, students do not do any extra work (hence, student CPU is 0); for Mahalanobis, students compute a covariance matrix for all 102 tasks. Speedup factor: total MACs for a single agent divided by total MACs for parallel agents and by N. All approaches here achieve near-perfect parallelization (> 0.99N, where 1.0N is perfect). Accuracy: in addition to being faster when BB is not used, our LLL variants still all outperform the parallel SUPSUP in accuracy, by a large margin (> 10%).
16.3 Boosted LLL learning when previously learned weights from similar classes can be used to initialize learning of new classes. We repeat the experiment with either learning from all images in the training set of the new task, or only 10, 5, or 3 images per class. Overall, re-using previously learned weights of similar classes boosts accuracy, usually (but not always) more so when the new task is only learned from a few exemplars (which is much faster than learning from scratch from the whole dataset).
17.1 Comparison of 53-dataset with other benchmark datasets including Cifar-100 [301], F-CelebA [445], Fine-grained 6 tasks [479] [318], [404], [299], [484], [131]. Note that our 53-dataset covers the 8-dataset, F-CelebA, and part of the Fine-grained 6 tasks.
17.2 Extra parameter expenditures and computation cost analysis. We treat the computation cost of SGD as the unit, and the computation costs of other methods are normalized by the cost of SGD. PSP's low computation cost comes from using a Resnet-18 backbone instead of Resnet-50, which is its original form. For EWC, though the final model size does not increase, the performance is poor and N Fisher matrices are needed during training; EWC-online updates the way the Fisher matrix is maintained and only requires one Fisher matrix during training. ER maintains a memory buffer that includes five images per class from the tasks that have already been seen; we spread the size in bytes of the image buffer over the 53 tasks to obtain the amount of extra parameters per task. SUPSUP requires a 3MB mask for each task.
17.3 Influence of different task-agnostic immutable parameters. Both supervised learning and self-supervised learning could contribute a relatively good immutable parameter set for our method, which shows that our method is robust to different backbones.
17.4 We applied our method to the CIFAR-100 dataset with 10 tasks, each containing 10 classes, with comparisons to baselines from CCLL, using ResNet-18 as the backbone.

List of Figures

1.1 Exponential growth of the number of parameters in deep learning models. From https://towardsdatascience.com/the-rise-of-cognitive-ai-a29d2b724ccc
1.2 Pipeline of utilizing generated data for model training.
1.3 Thesis overview on learning controllable data generation for scalable model training.
2.1 Object pose transformation with OPT-Net. The first column shows input images from the test dataset, and the remaining columns show target pose images transformed by OPT-Net. Integer poses (1, 2, 3, 4, 5, 6 in red) are defined in the training dataset, while decimal poses (1.5, 2.5, 3.5, 4.5, 5.5 in green) are new poses, which shows OPT-Net can achieve continuous pose transformation.
2.2 (a) Sample of discrete predefined pose images. (b) Predefined sample poses and pose change along pitch and yaw axes. (c) Given any pose (1st and 8th columns), OPT-Net can transform it along pitch and yaw axes to target poses (remaining columns).
2.3 Flow of OPT-Net, consisting of three modules: eliminate-add structure generator G, discriminator D, and pose-eliminate module. (a) Pose transformation sketch. (b) Origin-to-target pose transformation. In the pose 'eliminate' part, G takes in the original pose image and first uses both implicit regularization and the explicit pose-eliminate module to eliminate pose information of the input, yielding a pose-invariant canonical representation. Then, in the pose 'add' part, the representation features are concatenated with a target pose mask and the target pose image is synthesized. D learns to distinguish between real and fake images and to classify real images to their correct pose. (c) Training OPT-Net: G first maps the original pose image to the target pose and synthesizes a fake image, then G tries to reconstruct the original pose image from the fake image given the original pose information.
2.4 Object pose transform comparison for StarGAN and OPT-Net.
2.5 Generalization results of OPT-Net on the RGB-D dataset, pretrained on iLab-20M.
2.6 Top 8 ImageNet images for each pose predicted by the discriminator in OPT-Net without finetuning.
3.1 Zero-shot synthesis performance of our method. (a), (b), and (c) are from the iLab-20M, RaFD, and Fonts datasets, respectively. Bottom: training images (attributes are known). Top: test image (attributes are a query). Training images go through an encoder; their latent features are combined and passed into a decoder to synthesize the requested image. Section 3.4.2 shows how we disentangle the latent space, with explicit latent feature swaps during training.
3.2 (a) Samples from our proposed Fonts dataset, shown in groups. In each group, we vary one attribute but keep others the same. (b) (Sub-)multigraph of our Fonts dataset. Each edge connects two examples sharing an attribute. Sets S1 and S2 cover sample i.
3.3 Architecture of GZS-Net, consisting of an encoder E, which maps a sample onto a latent vector, and a decoder D, which maps a latent vector onto a sample. The latent space is pre-partitioned among the attribute classes (3 shown: identity, pose, background). (a, left) Considered examples: a center image (x, red border) and 3 images sharing one attribute with it, as well as a no-overlap image sharing no attributes (x̄, black border). (a, right) Standard reconstruction loss, applied to all images. (b) One-overlap attribute swap: two images with identical values for one attribute should be reconstructed into nearly the original images when the latent representations for that attribute are swapped ("no-op" swap; left: identity; middle: pose; right: background). (c) Cycle swap: given any example pair, we randomly pick an attribute class j. We encode both images, swap representations of j, decode, re-encode, swap on j again (to reverse the first swap), and decode to recover the inputs. This unsupervised cycle enforces that a double swap on j does not destroy information for other attributes.
3.4 Zero-shot synthesis performance comparison on Fonts. Columns 7-11 and 18-22 are input group images, whose specific attributes we want to combine to synthesize new images. Columns 1-5 and 12-16 are images synthesized using auto-encoder + Exhaustive Swap (AE+ES), β-VAE + Exhaustive Swap (β-VAE+ES), β-TCVAE + Exhaustive Swap (β-TCVAE+ES), auto-encoder + Direct Supervision (AE+DS), and GZS-Net, respectively. Columns 6 and 17 are ground truth (GT).
3.5 Zero-shot synthesis qualitative performance on ilab-20M. Columns left of the dashed line are output by methods: the first five are baselines, followed by three GZS networks. The baselines are: (1) an auto-encoder with direct supervision (AE+DS); (2, 3, 4) three GAN baselines changing only one attribute; (5) starGAN changing two attributes. Then, the first two GZS-Net columns are ablation experiments trained with part of the objective function, and the third column is output by a GZS-Net trained with all terms of the objective. starGAN of [92] receives one input image and edit information. ELEGANT uses identity and background images. Others use all three inputs.
3.6 GZS-Net zero-shot synthesis performance on RaFD. Columns 1-2 and 6-7 are the synthesized novel images using auto-encoder + Direct Supervision (AE+DS) and GZS-Net, respectively. Remaining columns are training set images with their attributes provided.
3.7 (a) Dataset details for the object recognition training task, where the x-axis represents different identities (1004) and the y-axis represents the backgrounds (111) and poses (6); each purple or brown pixel means our dataset covers that specific combination of attributes. (b) Object recognition accuracy (%) on 37469 test examples, after training on (augmented) datasets.
4.1 (a) Comparison of the DALL-E-for-detection pipeline and the traditional human-centric pipeline. (b) Using pure synthetic data from the text-to-image model (syn) can lead to on-par performance with using all real data (real), while mixing real and synthetic (syn + real) gives strong performance gains (+22.5 mAP).
4.2 (a) Foreground generation: (top row, Section 4.3.1) verbalizes class names into templates understandable by T2I models [440, 468], which synthesize desired foreground images with easy-to-separate backgrounds. Off-the-shelf foreground/background segmentation methods are then used to extract foreground segments from foreground images. (b) Background generation: (bottom row, Section 4.3.2) an image captioning method (e.g., SCST [454]) captions user-provided images (CDIs). Context words (e.g., "grass field") are extracted and the augmented caption is fed into T2I to generate background images. (c) CLIP [434] is used (Section 4.3.3) to maintain the quality of both foregrounds and backgrounds, as well as ensure that the generated images do not contain unwanted classes. (d) Finally, we composite (Section 4.3.4) the foreground segments and background images to obtain synthetic images with accurate labels.
4.3 When CDIs cannot perfectly describe the real test scenario, the compositional property of language can help to correct the context description. For instance, if the initial description contains noisy information such as "man and a woman", we can remove the noisy information to generate a congruent context description. Images with a red frame show generated images without language intervention, and those with a green frame show images after the intervention.
4.4 Our synthetic dataset generation is agnostic to different models and backbones.
4.5 Contextual backgrounds generated by our approach provide valuable cues.
4.6 Our generated synthetic foregrounds (fg) are high-quality and diverse, and adding more helps. On the other hand, CLIP filtering and context extraction are crucial tricks to ensure the quality of synthetic backgrounds (bg).
4.7 Mixing of real and synthetic data further improves downstream models.
4.8 Synthetic data distribution complements the real data distribution. Our foreground generation helps even more on the highly occluded classes.
4.9 Pseudo-labeled synthetic images generated by our pipeline.
4.10 Contextual backgrounds generated from user-provided CDI. We note that even if the user provides as little as 1 CDI, our approach can still generate large-scale coherent images.
5.1 Step 1 of foreground extraction. (a) Entity Segmentation extracts segments from images. (b) Grad-CAM highlights a region based on the given label, and the center of moments (white dot on the image) is calculated for the highlighted region. (c) For all eligible segments, we compute the pixel-wise average distance to the center of the region highlighted by Grad-CAM. (d) We select the n segments that have the shortest distances to the center. (e) All n foreground candidate segments are filtered using the classifier network, and we select the foreground with the highest predicted probability.
5.2 Steps 2 and 3 of foreground extraction. (a) Each extracted foreground is passed to the classifier, and a latent representation of the image is extracted using a bottleneck layer. (b) Using the mean of all latent representations, we keep the k% of representations that are close to the mean and rule out outliers. (c) The mean is updated after ruling out the outliers. (d) For each image, latent representations of all eligible segments are obtained by the classifier network. (e) The segment with the highest cosine similarity to the updated mean is selected as the new foreground of the image. (f) After obtaining a new set of foregrounds, they are used as input of step 2 of the next iteration.
5.3 Illustrative example of the Space Maximize Paste algorithm. In this example, four foreground objects are pasted on the background image that contains an aeroplane. In part (b) the maximum inscribed circle is found from the contour of the region without the aeroplane. We emphasize that the contour is found based only on image-level information, using the process described in Section 5.3.1. Note that the person is scaled to match the size of the circle found in part (b), and a random rotation is performed.
87 5.4 Long-tail instance segmentation setting and results. . . . . . . . . . . . . . . . . . . . . . . 92 6.1 Overall pipeline of physically plausible object insertion for monocular 3D object detection: Our approach copies external 3D objects (e.g., from Objaverse [107]) and pastes them into indoor scene datasets (e.g., SUN RGB-D [522]) in a physically plausible manner. The augmented indoor scene dataset, enriched with inserted 3D objects, is then used to train monocular 3D object detection models, resulting in significant performance improvements. . 97 6.2 3D Copy-Paste method overview: Our method (a) processes the input RGB image and depth data to reconstruct floor planes that can accommodate inserted objects. (b) Using the reconstructed planes and information about objects in the original scene, we estimate a physically plausible position, pose, and size for the inserted objects, ensuring they do not collide with existing objects. (c) We predict the spatially-varying lighting of the scene. (d) By registering the insertion position determined in (b) to spatially-varying lighting, our light estimation module (d) refined an HDR environment map to represent the lighting information for the inserted objects. (e) The insertion rendering module takes the position, pose, size, and lighting as input and inserts a 3D object into the real scene, adjusting the object’s lighting and shadows accordingly to ensure it seamlessly integrates as a natural and coherent part of the scene. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 6.3 Visualization of different illumination on inserted objects. . . . . . . . . . . . . . . . . . . . 111 6.4 Qualitative results on the SUN RGB-D dataset. . . . . . . . . . . . . . . . . . . . . . . . . 111 7.1 DreamDistribution learns a prompt distribution D∗ that represents a distribution of descriptions corresponding to a set of reference images. We can sample new prompts from D∗ or modified D∗ by text-guided editing to generate images of diverse new instance that follows the visual attributes of reference training images (top). We can also apply a learned distribution flexibly to, for example, a pretrained text-to-3D model, and generate diverse new 3D assets following the reference images (bottom). . . . . . . . . . . . . . . . . . . . . . . 115 xxii 7.2 Overview of for learning a prompt distribution. We keep a set of K learnable soft prompts and model a distribution of them at the CLIP text encoder feature space. Only prompts are learnable, CLIP encoder and the T2I diffusion model are all fixed. We use a reparameterization trick to sample from the prompt distribution and update the learnable prompts through backpropagation. The training objective is to make the generated images aligns with the reference image. An additional orthogonal loss is incorporated to promote differentiation among learnable prompts. For inference, we similarly sample from the prompt distribution at text feature space to guide the pretrained T2I generation. . . . . . . . . . . . 115 7.3 Comparison of results with existing methods. Given a set of training images (typically 5-20, we only show 4 here), we compare generation results with other existing methods. We use Stable Diffusion version 2.1 for all methods. As can be seen on the bottom row, our method is able to generate more diverse and coherent images (also quantitatively analyzed by automatic and human evaluation in Section 7.4.2). . . . . . . . . . . . . . . . . . . . . . . . 
125 7.4 Human Evaluation on image diversity (Section 7.4.2) aligns with automatic evaluation (Table 7.1). Our method shows significantly greater diversity, which may explain why it was able to better train image classifiers in Table 7.2. . . . . . . . . . . . . . . . . . . . . . . . . 127 7.5 Effect of scaling the variance of a learned prompt distribution. Image diversity increases as the scaling factor γ increases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 7.6 Composition of prompt distributions using linear interpolation between Chinese painting and Van Gogh. Mixing ratio changes linearly from left to right. The middle columns show mixtures of two styles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 7.7 Results on text-editability of our methods. Left column shows samples of reference images, right columns are generated results with corresponding prompts. . . . . . . . . . . . . . . . 130 7.8 3D generation results by learning a prompt distribution over the reference images and then inference using MVDream [506] (without extra texts). . . . . . . . . . . . . . . . . . . . . 132 7.9 3D generation results by learning a prompt distribution over the reference images and then inference with text-guided editing using MVDream [506]. . . . . . . . . . . . . . . . . . . 133 8.1 (a) On-demand synthetic data generation: Given a target task and a test dataset, our approach “Neural-sim" generates data on-demand using a fully differentiable synthetic data generation pipeline which maximises accuracy for the target task. (b) Train/test domain gap causes significant detection accuracy drop (yellow bar to gray bar). We dynamically optimize the render parameters (pose/zoom/illumination) to generate the best data to fill the gap (blue bar). 137 8.2 Neural-Sim pipeline: Our pipeline finds the optimal parameters for generating views from a trained neural renderer (NeRF) to use as training data for object detection. The objective is to find the optimal NeRF rendering parameters ψ that can generate synthetic training data Dtrain, such that the model (RetinaNet, in our experiments) trained on Dtrain, maximizes accuracy on a downstream task represented by the validation set Dval. . . . . . . . . . . . . 141 xxiii 8.4 A concrete example to one time sample, starting form a particular value of ψ, we can follow reparametrization sampling and obtain a pose. Each sample represents a pose that is input in NeRF to render one image. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 8.3 Bin sampling: We first discretize the pose space into a set of k bins, which we will then sample to generate the view parameters for the NeRF. To backpropagate through the sampling process, we approximate the sample from the categorical (i.e. bin) distribution by using a Gumble-softmax “reparameterization trick”. Within each bin we sample uniformly. . . . . . 145 8.5 Neural-Sim performance on YCB-Synthetic. When there are distribution gap between train and test sets ((a) pose (b) zoom (c) illumination gap), with the gap increase, object detection faces larger accuracy drop (black line). With the help of Neural-Sim (NSO) in blue line, the performance drop are filled. Observe improvement of NSO over LTS [477] (red line) and Auto-Sim [35] (green line). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 8.6 Performance of Neural-Sim on the YCB-in-the-wild dataset. 
8.7 Visualization provides evidence that the proposed Neural-Sim (NSO) approach generates interpretable outputs. In the example shown, test images are sampled with distribution bin 1 as the dominant bin. For Neural-Sim optimization (NSO), the initial training pose distributions are uniform and with bin 4 as the dominant bin. Observe the bin distribution during optimization: the final bin distribution at the end of Neural-Sim training matches the test bin distribution.
9.1 Overview of the BEHAVIOR Vision Suite (BVS), our proposed toolkit for computer vision research. BVS builds upon extended object assets and scene instances from BEHAVIOR-1K [323], and provides a customizable data generator that allows users to generate photorealistic, physically plausible labeled data in a controlled manner. We demonstrate BVS with three representative applications.
9.2 Overview of Extended BEHAVIOR-1K Assets: covering a wide range of object categories and scene types, our 3D assets have high visual and physical fidelity and rich annotations of semantic properties, allowing us to generate 1,000+ realistic scene configurations.
9.3 Parametric evaluation of object detection models on five example video clips. Selected frames from the clips are shown on the left, with the target object highlighted in magenta. Average Precision (AP) for our baseline models in Section 9.4.2 is plotted on the right. Since BVS allows for full customization of scene layout and camera viewpoints, we can systematically evaluate model robustness to changes in object articulation, lighting conditions, visibility, zoom (object proximity), and pitch (object pose). As we can see, current SOTA models are far from robust to these axes of variation, and we encourage researchers who develop new vision models to use BVS for debugging and parametric evaluation.
9.4 Mean performance of open-vocab object detection and segmentation models across five axes. The larger the colored envelope is for a model, the more robust it is. With the help of BVS, new vision models can be systematically tested for their robustness along these five dimensions and beyond: our users can easily add new axes of domain shift with only a few lines of code.
9.5 Holistic Scene Understanding Dataset. We generate 10,000 videos across 1,000 scene instances, each scene instance with 10 different camera trajectories. For each image, BVS generates a wide variety of labels (scene graphs, segmentation masks, depth, etc.), shown on the right. On average, each video is 1 minute long with 3,000+ frames.
9.6 Sample images of each class from our generated synthetic and collected real datasets.
10.1 An example result with the proposed VRX. To explain the prediction (i.e., fire engine and not alternatives like ambulance), VRX provides both visual and structural clues. Colors of visual concepts (numbered circles) and structural relationships (arrows) represent the positive or negative contribution computed by VRX to the final decision (see color scale inset). (a) The four detected concepts (1-engine grill, 2-bumper, 3-wheel, 4-ladder) and their relationships provide a positive contribution (blue) for the fire engine prediction. (b, c) Unlike (a), the top 4 concepts, and their relationships, for ambulance/school bus are not well matched and contribute negatively to the decision (green/yellow/red colors).
10.2 Pipeline of the Visual Reasoning Explanation framework. (a) The Visual Concept Extractor (VCE) discovers the class-specific important visual concepts. (b) In the original NN, the representation of the top N concepts is distributed throughout the network (colored discs and rectangles). (c) Using Visual Concept Graphs that are specific to each image class, our VRX learns the respective contributions from visual concepts and from their spatial relationships, through distillation, to explain the network's decision. (d) In this example, the concept graphs, colored according to contributions from concepts and relations towards each class, explain why the network decides that this input is a Jeep and not the others.
10.3 Concept discovery with and without the Grad-CAM filter.
10.4 Decision comparison between the original NN and the proposed GRN.
10.5 (a) Class-specific importance weights eji highlight the important concept relationships for different classes. (b) eji reveals the information transformation between concepts, which shows the dependency between concepts: concepts 1 and 2 contribute the most information to other concepts, which makes them the two most discriminating concepts for a fire engine.
10.6 Visual Reasoning Explanation and logic consistency experiment example.
10.7 Interpretation from VRX is sensitive to both visual and structural aspects: (a) visual sensitivity, (b) structural sensitivity.
10.8 Model diagnosis and performance improvement.
10.9 Diagnosis and improvement experiment on iLab-20M.
11.1 The network-to-human path in our approach shows the reasoning logic of a network to a human, using a Structural Concept Graph as "language". Four examples of object classes are shown (one per row). In each one, we highlight the four most important visual concepts according to the original network (different colors), in three instance images. These visual concepts are the ones that most influence the decision of the original network, as discovered through an automated analysis of the network (visual concept extractor). Aggregating these instance-level concepts produces a class-level structural concept graph (c-SCG; rightmost column) for each object class, which captures the most discriminative visual concepts or parts for that class, according to the original neural network, as well as their relationships. Sometimes, the most important concepts for the original network are wrong, possibly because of spurious correlations in the training data, or for other reasons detailed below (e.g., the red circle on the top row is a patch of background foliage, which may have often appeared next to school buses during training but is not actually part of a school bus; likewise for a patch of grass with the zebra).
11.2 Pipeline of the proposed Human-Network Interface. The top arrow represents the network-to-human path, which shows the reasoning logic of the original network (a) to a human, using structural concept graphs (SCG) as a language. It consists of a Visual Concept Extractor (b), which discovers the most important visual concepts for the network, and a Graph Reasoning Network (GRN; c), which aggregates and summarizes concepts and their relationships from many training images into a single class-level structural concept graph (c-SCG) for each class. The bottom arrow represents the human-to-network path, which changes the network's decision making through human intervention, made easy and intuitive by allowing humans to interact with the c-SCG. This path consists of three steps: (1) humans (d) inspect and possibly change a given c-SCG, using their common sense, domain knowledge, and understanding of how spurious correlations may cause errors, to fix errors in the c-SCG. For example, at top-right, tree foliage was used by the original network to recognize school buses, but this is likely a spurious correlation in the training set (many school buses were shown in front of trees); conversely, a wheel was used by the original network, but it is not ideal because it is not discriminative against other wheeled vehicles. Humans can choose to substitute these visual concepts with others from the pool extracted by the Visual Concept Extractor, initially ranked less important by the network. Humans can also modify the edges of the c-SCG, to add, remove, or correct relationships between visual concepts. (2) The framework then trains the GRN with human logic (e), and (3) transfers human knowledge to the network by partial knowledge distillation (f). The revised network has exactly the same structure as the original, but its weights have been modified following the human interaction. We show in our results that this pipeline is effective at rapidly (in terms of human effort) and interactively correcting network mistakes.
11.3 Humans can improve a network's performance with HNI. We conduct large-scale experiments on the ImageNet dataset, which contains 1,000 real-world classes. (a) Confusion matrix of a 1,000-class original GoogleNet image classification network trained on ImageNet. Most of the errors are within each of 12 super-classes that correspond to groups of related classes (e.g., mammals, vehicles, birds, etc.). There are two main challenges: (1) how to correct the errors and improve accuracy within a super-class (local logic) with the help of human involvement, and (2) how to maintain the performance of all other classes among the 1,000 classes. We show results of two large-scale experiments, for the super-classes of vehicles (23 classes, total 23,000 training images) and mammals (13 classes, 13,000 images), to show how one can use HNI to improve performance within a super-class without degrading performance of other classes. We first consider the super-class of vehicles (b1), with an original accuracy over these classes of 68.78% (c1). For each class, the network-to-human pass was used to show the reasoning logic of the original network as a c-SCG to a human operator (d1). The operator spotted and corrected any reasoning errors of the network. The human-to-network pass then distilled the human-modified logic back to the original network with the help of the graph neural network and partial knowledge distillation. Performance was improved on the vehicle classes, without degradation of non-vehicle classes (e1), demonstrating how humans could use their own knowledge to correct reasoning errors of the network and improve network accuracy. The same process is also shown for the super-class of mammals (b2, c2, d2, e2).
11.4 Zero-shot learning: human users teach the network to encode new objects with HNI. Section 11.2.3 provides an explanation of each step. The bottom results show the performance of zero-shot learning with HNI. The original ResNet-18 network (pretrained on ImageNet) trained with images of objects A-H cannot identify new objects I, J, K in the test set. Humans can teach the ResNet-18 to encode and recognize new objects I, J, K with HNI.
11.5 Pipeline of training the Graph Reasoning Network with a human-modified c-SCG. Given input I, we conduct multi-resolution segmentation and concept matching based on the human-modified c-SCG. In the concept matching step, we attempt to match the c-SCG of each class of interest to the concepts extracted from the current input image. Colored circles represent the matched concepts for each class of interest. Black dummy nodes denote undetected concepts. For example, for the input image shown, all concepts for the Fire Engine class were matched, but only 2 concepts of Ambulance could be found in the image, and only 1 concept of School Bus. Subsequently, the GRN aggregates all matched concepts and uses them to support its predictions (Section 11.4.2.2).
11.6 The pipeline of partial knowledge distillation. Different from traditional knowledge distillation, partial knowledge distillation adopts two teachers with different expertise: the GRN (teacher 1) focuses on the classes of interest (6 classes in this example), and the fixed original network (teacher 2) focuses on the remaining classes (the classes we do not want to change, 14 classes in this example). After distillation with different temperatures and concatenation, we can use both soft labels and hard labels to train the student model.
12.1 (a) Contributions of shape, texture, and color may differ among scenarios/tasks. Here, texture is most important to distinguish a zebra from a horse, but shape is most important for zebra vs. zebra car. (b) The Humanoid Vision Engine takes a dataset as input and summarizes how shape, texture, and color contribute to the given recognition task in a pure learning manner (e.g., in ImageNet classification, shape is the most discriminative feature and contributes most to visual recognition).
12.2 Pipeline of the humanoid vision engine (HVE). (a) shows how the human vision system deals with an image: after the eyes perceive the object, different parts of the brain are activated, and the brain organizes and summarizes that information to reach a conclusion. (b) shows how we design HVE to correspond to each part of the human vision system.
12.3 Pipeline for extracting the texture feature: (a) crop images and compute the overlap ratio between the 2D mask and patches; patches with overlap > 0.99 are shown in a green shade. (b) Add the valid patches to a patch pool. (c) Randomly choose 4 patches from the pool and concatenate them to obtain a texture image It.
12.4 t-SNE results of feature encoders on their corresponding biased datasets.
12.5 Sample question for the human experiment. (a) A test image (left) is first converted into shape, color, and texture images using our feature extractors. (b) On a given trial, human participants are presented with one shape, color, or texture image, along with 2 reference images for each class in the corresponding dataset. Participants are asked to guess the correct object class from the feature image.
12.6 Processed CUB and iLab-20M dataset examples.
12.7 The zero-shot learning method with HVE. We first describe the novel image from the perspective of shape, texture, and color. Then we use ConceptNet as common knowledge to reason and predict the label.
12.8 (a) The structure and training process of the cross-feature retrieval model. Es, Et, and Ec are the same encoders as in Section 12.3.2. The feature-agnostic net then projects them to a shared feature space for retrieval. (b) The process of cross-feature imagination. After retrieval, we design a cross-feature pixel2pixel GAN model to generate the final image.
12.9 Imagination with shape, texture, and color feature input (columns I, II, III). Line (a): input feature. Line (b): retrieved features given (a). Line (c): imagination results with HVE and our GAN model. Line (d): results of the baseline of 3 pix2pix GANs. Line (e): original images to which the input features belong. Our model can reasonably “imagine” the object given a single feature.
13.1 Typical failure modes in cases where the top-5 prediction was correct but the top-1 was wrong.
13.2 Our zero-shot classification pipeline consists of 2 steps: confidence estimation via self-consistency (left block) and top-down and bottom-up label augmentation using the WordNet hierarchy (right block). See Algorithms 4 and 5 for pseudocode.
13.3 ROC plots (left column) show that our proposed confidence score is better at distinguishing correct and incorrect predictions and results in higher AUC scores than baselines for both CLIP (ViT-B/16) (a) and LiT (ViT-B/32) (c). Selective prediction curves (right column) show that our proposed confidence score is better at abstaining from incorrect predictions, and as a result the accuracy of the remaining set is higher than the baselines for both CLIP (ViT-B/16) (b) and LiT (ViT-B/32) (d).
14.1 We study the one-class OOD detection problem where OOD can be anything not in-domain. Example of building a dog detector with some known dogs Cin = {Husky, Papillon, Dobermann} and known non-dogs Cout = {Cat, Bird, Person}. When deploying such a one-class detector in the real world, it is important for it to be robust to several different types of shift: (1) unknown in-domain classes such as new species of dogs, and unknown OOD such as wolves, bunnies, violins; (2) multi-object cases (cats along with dogs, persons along with dogs); (3) covariate shift (drawings of dogs, painting of a bird, cartoon car, etc.).
14.2 Our methods utilize in-domain and OOD label sets. When the in-domain classes Cin comprise only dog breeds, a butterfly may be mistaken for a Papillon dog, possibly due to its similar shape and color to the dog's ears. However, when an OOD set Cout is included, consisting of the class name “butterfly”, a more precise butterfly type “red admiral butterfly”, or a text description “a photograph of a red admiral butterfly”, the image embedding's similarity with this label pushes down the probabilities of the in-domain dog breeds, correctly identifying the image as OOD. Thus our method S-max_in_prob has better separation between in-domain and OOD than the baseline S-max_prob, as shown in the 2D histograms on the left.
14.3 Our methods detect OOD at the bounding-box level. Images having a mixture of in-domain and OOD objects are identified.
15.1 A motivational example. (a) For the image label Y (1 means the label is "cow" and 0 otherwise), X1 and X2 represent the causal parents describing the image details (here, shape and texture), and X3 (background type, where 1 indicates the presence of grass and 0 otherwise) represents a factor that is not causal to Y. S(·) is the sigmoid function. In this example, texture (X2) is twice as causal to Y as shape (X1). (b) The relationship between Y and X3 varies across environments; since the conditional dependence is not consistent across environments, X3 may not be treated as a major causal factor for Y. (c) We use the Mean Squared Error (MSE) as a metric to assess the prediction error for Y, computed by feeding the projected causal parents of Y as features into a two-layer neural network. A smaller MSE implies that the causal parent variables used for prediction are more precise. Our proposed method ISL yields a more accurate discovery of the underlying causal relation: here, it correctly identifies X1 and X2, but not X3, as the causal factors of Y, improving the explanation quality and prediction accuracy.
15.2 Top: the proposed Invariant Structure Learning (ISL) framework. Given raw data, we build different environments using unsupervised clustering, unless the data source information is provided. For different environments, each ISL module outputs a summarized DAG to represent the learned invariant structure. An aggregation mechanism then selects the optimal predictor based on a graph structure that reflects the causal mechanisms in the data more accurately. During training, the constraint on the Y prediction across environments helps learn an invariant structure. Consequently, the learned DAG leads to a superior predictor. Bottom: details of the ISL module. θ^Y_1 is the invariant structure of Pa(Y) shared across all modules.
15.3 ISL in the self-supervised setting.
15.4 Visualization of the discovered causal structure in (a) supervised and (b) self-supervised settings. Blue solid arrows are edges shared between our results and the ground truth (GT), red solid arrows denote edges that we identify but with the wrong direction, green solid arrows denote edges we propose that the GT does not contain, and yellow dashed arrows denote edges in the GT that we miss.
16.1 SKILL vs. related learning paradigms. (a) Multi-task learning [70]: one agent learns all tasks at the same time in the same physical location. (b) Sequential Lifelong Learning (S-LL) [338]: one agent learns all tasks sequentially in one location, deploying LL-specific machinery to avoid task interference. (c) Federated learning [380]: multiple agents learn the same task in different physical locations, then share learned knowledge (parameters) with a central agent. (d) Our SKILL: different S-LL agents in different physical regions each learn tasks, and learned knowledge is shared among all agents, such that finally all agents can solve all tasks. Bottom-right table: strengths and weaknesses of each approach.
16.2 (a) SKILL-102 dataset visualization. Task difficulty (y-axis) was estimated as the error rate of a ResNet-18 trained from scratch on each task for a fixed number of epochs. Circle size reflects dataset size (number of images). (b) Comparison with other benchmark datasets, including Visual Domain Decathlon [445], CIFAR-100 [301], F-CelebA [283], and Fine-grained 6 tasks [479], [575], [404], [299], [484], [131]. (c) Qualitative visualization of other datasets, using the same legend and format as in (a).
16.3 Algorithm design. Top: overall pipeline, where agents are deployed in different regions to learn their own tasks; subsequently, learned knowledge is shared among all agents. Bottom: zoom into the details of each agent, with 4 main roles. 1) Training: agents use a common pre-trained and frozen backbone, stored in ROM at manufacturing time (gray trapezoid with lock symbol). The backbone allows the agent to extract compact representations from inputs (e.g., with an xception backbone, the representation is a latent vector of 2048 dimensions, and inputs are 299 × 299 RGB images). Each agent learns a task-specific head (red triangle) for each new task. A head consists of the last fully-connected layer of the network plus our proposed LL beneficial biasing units (BB) that provide task-dependent tuning biases to all neurons in the network (one float number per neuron). During training, each agent also learns a GMMC or Mahalanobis task anchor which will form a task mapper. 2) Share knowledge with other agents: each agent shares the learned task-specific head, beneficial bias (BB), and GMMC module (or training images for Mahalanobis) with all other agents. 3) Receive knowledge from other agents: each agent receives different heads and GMMC/Mahalanobis task mapper anchors from other agents. All heads are stored in a head bank and all task anchors are consolidated to form a task mapper. 4) Testing: at test time, an input is first processed through the task mapper. This outputs a task ID, used to load the corresponding head (last layer + beneficial biases) from the bank. The network is then equipped with the correct head and is run on the input to produce an output.
16.4 Accuracy on task 1 (learning to classify 102 types of flowers) as a function of the number of tasks learned. (a) Comparison between our methods. (b) Comparison between our best method and other baselines. Our approach is able to maintain accuracy on task 1 much better than the baselines as more and more tasks are learned: while our approach does suffer some interference, task 1 accuracy remains within 90% of its initial best even after learning 101 new tasks (for the 4 LLL variants, BB = beneficial biases, MAHA = Mahalanobis distance task mapper, GMMC = GMMC task mapper). In contrast, the accuracy of EWC, PSP, and several other baselines on task 1 catastrophically degrades to nearly zero after learning just 10 new tasks, even though we granted these methods a perfect task oracle. The best performing baseline, ER, is of the episodic buffer type (a fraction of the training set of each task is retained for later rehearsing while learning new tasks), with an unbounded buffer that grows by 10 images/class. This method does incur higher (and increasing) training costs because of the rehearsing. Note how SUPSUP does not experience any degradation on task 1, which is a desirable feature of this approach. However, a drawback is that SUPSUP is not able, even from the beginning, to learn task 1 as well as other methods (50.64% accuracy vs. over 90% for most other approaches). We attribute this to SUPSUP's limited expressivity and capacity to learn using masks over a random backbone, especially for tasks with many classes. Indeed, SUPSUP can perform very well on some other tasks, usually with a smaller number of classes (e.g., 91.93% correct on SVHN, 93.18% on Brazillian Coins, 99.11% on UMNIST Face Dataset).
16.5 Normalized accuracy on the first 10 tasks (one per curve color) as up to 20 additional tasks are learned. Our LLL approach is able to maintain high normalized accuracy on the first 10 tasks, while all other baselines except SUPSUP suffer much stronger catastrophic interference. SUPSUP is a special case, as there is no interference among successive tasks when a perfect task oracle is available; hence its normalized accuracy for all tasks remains at 100%. However, we will see below that the absolute accuracy of SUPSUP is not as good.
16.6 Task mapper accuracy on all tasks learned so far, as a function of the number of tasks learned, when using Mahalanobis (left) or GMMC (right) task mappers. Our approach is able to maintain good task mapping accuracy as the number of tasks increases.
16.7 Absolute accuracy per task after learning 102 tasks. (Top) Absolute accuracy of the GMMC and Mahalanobis task mappers alone shows quite a bit of variability, indicating various degrees of overlap among tasks. (Bottom) Absolute accuracy of the main xception+head network alone (with or without BB, assuming a perfect task mapper) also shows significant variability, indicating various degrees of difficulty per task. The accuracy with BB is overall slightly higher than without BB (orange bars higher than corresponding blue bars in the bottom panel), as further explored in the next figure.
16.8 Average absolute accuracy on all tasks learned so far, as a function of the number of tasks learned. Our LLL approach is able to maintain higher average accuracy than all baselines. BB provides a small but reliable performance boost (LLL w/ BB vs. LLL w/o BB). The sharp decrease in early tasks carries no special meaning except for the fact that tasks 4, 8, and 10 are significantly harder than the other tasks in the 0-10 range, given the particular numbering of tasks in SKILL-102. Note how again SUPSUP has a low accuracy for the very first task. This is because of the nature of its design; indeed, SUPSUP is able to learn some other tasks in our sequence with high accuracy.
16.9 Left: similar classes with a cosine similarity in the CLIP embedding greater than 0.90. Right: similar classes with a cosine similarity greater than 0.95. This can help correct spurious errors where, for example, a test image from class "bike" from the Stanford_Online_Products dataset could also be considered correctly classified if the system output were "bicycle" from the Sketches dataset.
16.10 Correcting spurious errors by realizing when two distinct classes from two tasks are actually the same thing. The approach provides a small but consistent improvement in accuracy over the baseline (which declares failure as soon as task mapping fails), here shown on 15 datasets that have some overlap.
16.11 Learning speed for a given object class when the corresponding weights are initialized randomly (orange) vs. from previously learned weights of a similar object class found in one of the previously learned tasks (blue), averaged over 190 combinations of two previously learned tasks. In this example, best accuracy is already reached after just 1 to 2 training epochs in the blue curve. In contrast, it takes up to 30 epochs to train from random initialization, and the final accuracy of the orange curve is still lower than that of the blue curve. This approach hence leads to a significant learning speedup when tasks contain some similar classes.
16.12 Two experiments where the weights from previously learned similar but not identical classes are successful in boosting learning of new classes. Left: pairs of similar classes (according to CLIP). Right: accuracy achieved with weight transfer vs. random initialization.
17.1 Equipped with a task-agnostic immutable CNN model, our approach “reprograms” the CNN layers for each new task with lightweight task-specific parameters (less than 0.6% of the original model) to learn sequences of disjoint tasks, assuming data from previous tasks is no longer available while learning new tasks.
17.2 Proposed continual learning model with channel-wise lightweight reprogramming (CLR) layers. All gray blocks are fixed parameters. (Top) General network architecture. (Bottom) Details of the CLR reprogramming layer: for each channel k ∈ [1..c] of an original w × h × c feature map (blue), a 3×3 kernel is learned to reprogram the feature towards the new task (green), without modifying the original conv parameters (gray).
17.3 CLR-reprogrammed CNNs for continual learning. (a) At learning time, a CNN model can be reprogrammed by Channel-wise Lightweight Reprogramming parameters to solve new tasks continually. Only the CLR layers need to be trained in each reprogramming. (b) At test time, a task oracle selects which task-specific CLR parameters to use and makes the final decision.
17.4 Accuracy on task 1 as a function of the number of tasks learned. Our approach maintains the highest accuracy on task 1 over time, and importantly, it totally avoids catastrophic forgetting and maintains the same accuracy as the original training, no matter how many new tasks are learned. As discussed in the approach section, this is because we explicitly isolate the task-specific parameters for all tasks and avoid parameter interference. This is also the case for the baselines SUPSUP [604], CCLL [516], and EFT [567]. Other baseline methods suffer different degrees of catastrophic forgetting. EWC [497], PSP [89], LwF [338], SI [634], and SGD suffer severe catastrophic forgetting on this challenging dataset. The rehearsal-based method ER performs relatively well because it has an unlimited replay buffer and saves 10 images/class from previous tasks. Yet the overall accuracy of ER is still lower than our CLR-reprogrammed model. Rehearsal methods also incur higher (and increasing) training costs because of the rehearsing. We noticed similar performance on the second task.
17.5 Average accuracy on all tasks learned so far, as a function of the number of tasks learned. Our CLR-reprogrammed approach is able to maintain higher average accuracy than all baselines. The average accuracy increases because some of the later tasks are easier than earlier tasks (i.e., later tasks have higher accuracy).
17.6 Absolute accuracy per task after learning 53 tasks with our CLR-reprogrammed CNN.
17.7 Statistics of the datasets and per-task accuracy of our method and baselines after learning all 53 tasks in the continual learning setting. Ablation columns indicate our method with different initialization weights.
17.8 Results of the CLR transformation ability on the CLEVR and Kannada-MNIST datasets. We visualize the feature maps in the first residual group of ResNet-50, which initially has a large gap between pre-train and FINETUNE (or SCRATCH). The results show that, after the channel-wise linear transformation, the pre-trained feature can be reprogrammed towards the goal feature (FINETUNE or SCRATCH). Pretrained indicates the frozen ImageNet-pretrained ResNet-50 backbone. Finetune is a ResNet-50 backbone finetuned from ImageNet-pretrained initialization, while Scratch is a ResNet-50 backbone trained from random initialization.
17.9 Bootstrapping statistics. The x-axis represents the number of tasks t in a specific continual learning sequence. The y-axis shows the mean accuracy (solid blue line) on the tasks sampled with replacement, with the standard deviation as the shaded light blue range.
17.10 Bootstrapping statistics with a detailed accuracy log. The x-axis represents the number of tasks t in a specific continual learning sequence. The y-axis shows the mean accuracy (solid blue line) on the tasks sampled with replacement. The shaded light blue range shows the min-max range for the given number of tasks t. The solid red line represents our reported results in the main paper Fig. 5, which falls within the shaded light blue range.
17.11 Per-task accuracy of our main method and other versions of our method after learning all 53 tasks in the continual learning setting.
17.12 Bar plot of transfer learning performance on the 53-dataset.
17.13 Transfer learning results on the 53-dataset for our method and other baselines (LINEAR, SCRATCH, FINETUNE, and Head2Toe).

Abstract

Artificial Intelligence (AI) models have witnessed a substantial evolution in their capabilities and complexity. This is exemplified by tools like ChatGPT, a robust AI assistant that I utilize daily, characterized by a parameter count running into the trillions. Concurrently, there has been a corresponding increase in the data demands for training such models, in terms of both volume and quality. Training models with real data presents several notable challenges. A primary concern is the high cost and low efficiency of human labeling, particularly for complex tasks such as 3D scene understanding and robotics, where acquiring accurate labels is not only challenging but also difficult to scale. Additionally, real data is often fraught with issues related to bias, fairness, copyright, and privacy. In recent times, data has increasingly become a valuable asset in technology, leading to a reluctance among major companies to share their datasets due to the commercial advantage they offer. This thesis explores controllable data generation as a solution to these data-related challenges. The sources of generated data are diverse, ranging from graphics engines typically employed in video game development to interactive simulators and, more recently, advanced generative models, including models that convert text to 2D images, 3D models, and videos. Such sources offer significant advantages for model training, including minimal or no requirement for manual labeling and the potential to generate an almost infinite amount of data. In this thesis, we explore advanced learning techniques for controlling various properties of data generation; the generated high-quality data can be used to train or evaluate downstream models, helping them achieve better performance. In the first part, we propose different learning methods to manipulate the data generator across a spectrum of attributes, compositionality, distributions, and physical properties. This enables the creation of valuable data for training both 2D and 3D AI models, applicable in tasks such as recognition, detection, segmentation, and scene understanding. The second part details the transition of control from humans to downstream models. This shift facilitates on-demand data generation, creating a synergistic loop between the data generator and the downstream models, enhancing their effectiveness and efficiency. In the third part of this thesis, we show how explainability can be used to understand model reasoning logic and provide feedback for data generation to improve model performance. Additionally, in the fourth part, we delve into a broader conceptualization of “Data”: model parameters are viewed as a form of efficient “Data” that can be shared across different models, facilitating more scalable and efficient model learning.
Overall, this thesis posits that AI-generated data, with its controllability, serves as an effective complement and alternative to real data, and is increasingly vital for scalable model training. We chart the cutting edge of controllable data generation, investigating its expansive potential in revolutionizing scalable model training.

Chapter 1
Introduction

1.1 Large AI models need more and better quality data

Artificial Intelligence (AI) models have become increasingly powerful, achieving remarkable success across various domains including Computer Vision, Natural Language Processing (NLP), and Robotics. This advancement is characterized by a significant increase in both their capabilities and the number of parameters they use. For instance, ChatGPT [409], a widely-used AI assistant, encompasses trillions of parameters. Figure 1.1 shows the exponential growth of the number of parameters in Deep Learning-based language models. As AI models evolve, the demand for training data has escalated in both size and quality. In the NLP domain, for example, Llama-2 [554], an open-source large language model similar to GPT, requires pretraining on 2 trillion tokens and over one million new human-annotated examples for fine-tuning. While specific details about ChatGPT's training are not public, it is known that its development involved a significant investment, exceeding 100 million dollars. In computer vision, extensive datasets and benchmarks have been the cornerstone of research over the past decade [112, 348, 138, 300, 179, 610, 62, 209, 202]. The recent Segment Anything [296] initiative has made significant strides in 2D segmentation, utilizing over 1 billion masks and 11 million images. These examples underscore the increasing importance of data in developing powerful AI models.

Figure 1.1: Exponential growth of the number of parameters in Deep Learning models. From https://towardsdatascience.com/the-rise-of-cognitive-ai-a29d2b724ccc

However, training models with real data presents several challenges. Traditionally, data is collected and labeled by humans for model training. The first major challenge is the high cost and time-consuming nature of human labeling. In complex tasks such as 3D scene understanding [522, 100, 179, 629] and robotics [140], obtaining accurate labels is extremely difficult, let alone scaling the process. Additionally, real data often comes with issues related to bias, fairness [59, 381], copyright [112], and privacy [588]. In recent times, data has become a highly guarded asset, particularly for large companies, as it holds significant commercial value.

1.2 AI-generated data to train models

To address the challenges posed by real data, a pivotal question arises: can generated data be effectively used to train models? Generated data can originate from various sources: graphics engines such as Unity [273] and Blender (including BlenderProc [114, 115]), which are widely used in video game development; interactive 3D simulators [323, 109, 159], developed for the Embodied AI and robotics communities, which provide physically realistic environments; and generative AI models, including Generative Adversarial Networks and diffusion-based models, which have shown remarkable success in generating high-quality data, especially for computer vision. Furthermore, the advent of Large Language Models has led to the development of text-guided generative models.
These models offer a versatile and powerful means to create high-quality data, capable of converting text into 2D, 3D, and video formats. All these sources are proving to be invaluable in providing data for training models.

Figure 1.2: Pipeline of utilizing generated data for model training.

Figure 1.2 illustrates a pipeline for utilizing generated data in model training. The process begins with constructing a Data Generator, composed of the various tools and AI models mentioned above. This Data Generator is used to generate data, which is then employed to train models. A key aspect of this pipeline is the automation of data generation, reducing human effort and facilitating scalability. Moreover, controllability is crucial in this process: random data is not the goal; instead, we aim for data that meets specific requirements and is particularly useful for model training. Controllability brings two major benefits to generated data: (1) it minimizes or even eliminates the need for manual labeling, as labels can be defined during the generation process; (2) it allows for the creation of almost unlimited amounts of data, as the generator can continually generate new datasets with a simple function call.
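As a minimal illustration of benefit (1), the sketch below (not the exact pipeline used in later chapters) shows how an off-the-shelf text-to-image model can produce classification training data whose labels are known by construction, because the class name is part of the prompt. It assumes the Hugging Face diffusers library and a publicly available Stable Diffusion checkpoint; the class list, image count, and output paths are hypothetical placeholders.

from pathlib import Path

import torch
from diffusers import StableDiffusionPipeline  # assumed dependency; any text-to-image backend would work

CLASSES = ["school bus", "fire engine", "ambulance"]   # hypothetical label set
OUT_DIR = Path("synthetic_data")

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

for class_name in CLASSES:
    class_dir = OUT_DIR / class_name.replace(" ", "_")
    class_dir.mkdir(parents=True, exist_ok=True)
    for i in range(100):                                # generate as much data as the downstream task needs
        prompt = f"a photo of a {class_name}, plain background"
        image = pipe(prompt).images[0]                  # one labeled image per call
        image.save(class_dir / f"{i:04d}.png")          # the folder name doubles as the label

Because the label is fixed before the image even exists, no human annotation is required, and the loop can be rerun at any scale simply by calling the generator again.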
1.3 Thesis Outline

This thesis presents my research conducted over the past few years, focusing on Learning Controllable Data Generation for Scalable Model Training. The primary objective is to demonstrate that controllable AI-generated data serves as an effective complement and substitute for real data, playing an increasingly vital role in scalable model training. By the conclusion of this thesis, I aim to convincingly illustrate the significance and potential of this approach.

This thesis consists of four main parts. In the first part, we show the different methods that I explored to learn control over attributes, data compositions, distributions, and physical properties during data generation, and how the generated data can be used to help downstream model training and evaluation. In the second part, we show the transition of controllability from humans to downstream models: downstream models take control of the data they need and provide signals to the data generator. Importantly, we show how this paves the way for on-demand data generation, forging a symbiotic loop between the data generator and the downstream models. In the third part, we explore the use of explainability to understand the reasoning logic behind AI models. This understanding enables us to provide targeted feedback for data generation, thereby enhancing model performance. In the fourth part, we broaden our understanding of “Data”: we view model parameters as a unique and valuable form of “data” that can be shared and applied across various models. Adopting this view promotes a more scalable and efficient method for learning in diverse model environments. Figure 1.3 shows the overview structure of this thesis.

Figure 1.3: Thesis overview on learning controllable data generation for scalable model training.

1.3.1 Part 1: Diverse Methods of Controllable Data Generation

This part of the thesis presents the various methods of controllable data generation that we have explored. These methods are designed to control the data generator to generate data that is beneficial for training or evaluating AI models in tasks like object recognition, detection, segmentation, and scene understanding. In Chapter 2, our initial efforts focus on controlling a single attribute, such as an object's pose. We demonstrate how this controlled data can enhance object recognition models. Specifically, we developed a generation model that takes one view of an object as input and generates multiple images of the object in different poses. We present findings showing how this generated data improves object recognition in 2D images. Moving beyond single-attribute control, Chapter 3 addresses the simultaneous control of multiple attributes. Here, we introduce the concept of controllable disentangled representation learning for multi-attribute manipulation during data generation. The attribute control discussed so far is mainly fine-grained. Chapter 4 explores higher-level control aspects like object category and the composition of foreground and background. This approach is used to generate high-quality vision data for object detection and segmentation tasks. We demonstrate that models trained solely on synthetic data can achieve performance comparable to those trained on real data. Furthermore, combining synthetic with real data can significantly enhance model performance. This finding underscores the potential of generated data as a valuable complement to real data in model training. In Chapter 5, we build upon the Copy-Paste concept, applying it to the task of weakly supervised instance segmentation. Rather than employing a text-to-image generative model to create the foreground, we introduce an approach based on Expectation Maximization (EM). This method iteratively optimizes the distribution of foreground objects to automatically extract them from raw images. Subsequently, these extracted objects are used in a copy-paste process to generate training data with accurate ground truth labels for instance segmentation model training. In Chapter 6, we expand the scope of compositional control from 2D to 3D environments, focusing on controlling physical properties in the generation of complex 3D vision datasets. This aids in achieving state-of-the-art (SOTA) performance in monocular 3D detection tasks. We introduce the concept of “3D Copy-Paste”, a novel technique that automatically integrates virtual objects into real scenes, ensuring realistic physical positioning and appearance. The 3D bounding boxes generated through this method enable the training of a monocular 3D detection model, which attains SOTA results. Previously, our controllable data generation primarily focused on training models for common classes, where the class names and definitions are already known. However, there is a growing need for more personalized data generation, tailored to specific distributions. This requires a generator capable of autonomously learning a distribution and then generating new data within that distribution. Chapter 7 delves into this aspect, presenting a method for generating personalized data with unique distributions. We introduce “DreamDistribution,” a novel approach for personalized distribution generation. This technique, using just a few reference images as a starting point, is designed to generate new samples that align with the desired distribution.
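To make the cut-and-paste idea behind Chapters 4 and 5 concrete, the sketch below composites one extracted foreground onto a background and derives a pixel-accurate mask label at the same time. It is only an illustration of the principle, not the implementation used in those chapters: the file names are hypothetical, the foreground is assumed to be an RGBA cut-out smaller than the background, and the placement is simply uniform-random.

import numpy as np
from PIL import Image

fg = np.array(Image.open("foreground.png").convert("RGBA"), dtype=np.float32)  # hypothetical extracted cut-out
bg = np.array(Image.open("background.jpg").convert("RGB"), dtype=np.float32)   # hypothetical background image

rng = np.random.default_rng(0)
h, w = fg.shape[:2]
y0 = int(rng.integers(0, bg.shape[0] - h))   # random top-left corner for the paste
x0 = int(rng.integers(0, bg.shape[1] - w))

alpha = fg[..., 3:4] / 255.0                 # soft matte of the extracted object
region = bg[y0:y0 + h, x0:x0 + w]
bg[y0:y0 + h, x0:x0 + w] = alpha * fg[..., :3] + (1.0 - alpha) * region

mask = np.zeros(bg.shape[:2], dtype=np.uint8)                           # the label comes for free:
mask[y0:y0 + h, x0:x0 + w] = (alpha[..., 0] > 0.5).astype(np.uint8)     # the pasted region is the instance mask

Image.fromarray(bg.astype(np.uint8)).save("composite.jpg")
Image.fromarray(mask * 255).save("composite_mask.png")

Scaling, rotation, and collision checks (as in the Space Maximize Paste algorithm of Chapter 5 and the physically plausible 3D placement of Chapter 6) refine where and how the object is pasted, but the labeling principle stays the same.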
1.3.2 Part 2: On-Demand Data Generation

In all the methods described in Section 1.3.1, the control of data generation is executed by humans, who determine what data is useful for downstream tasks. From the perspective of the downstream models, this can be referred to as passive data generation. However, ensuring that the generated data's distribution aligns well with the target domain of various downstream scenarios remains a challenge. This part marks a significant shift from passive to active data generation. Here, we transition from human-led to model-driven control over data generation: the downstream models themselves dictate the type of data to be generated, essentially taking over the role traditionally played by humans. In Chapter 8, we introduce a method for automatic on-demand data generation, facilitating direct communication between models. This approach, termed “Neural-Sim,” utilizes Neural Radiance Fields (NeRFs) as a dataset generator. Depending on the requirements of various downstream tasks, Neural-Sim can automatically generate the necessary data. This process ensures that models trained on this tailored data exhibit enhanced performance in their target scenarios. In Chapter 9, we explore the “Real-to-Sim-to-Real” data generation pipeline. This method is particularly useful for tasks with limited or no available data for model training, such as spatial relationship prediction. Our approach involves creating a digital twin within a simulation environment, allowing for controllable data generation for both model training and evaluation. The data generated in this manner can significantly improve model performance on real-world test scenarios. We introduce the BEHAVIOR Vision Suite (BVS), a comprehensive set of tools and assets developed for creating custom synthetic data within the BEHAVIOR-1K embodied AI environment. BVS offers extensive customization options at various levels: the scene level (e.g., lighting, object placement), the object level (e.g., joint configuration, attributes like “filled” and “folded”), and the camera level (e.g., field of view, focal length). Researchers can manipulate these parameters as needed during data generation, enabling them to conduct controlled experiments and systematically evaluate computer vision models.
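The feedback loop at the heart of this part can be summarized in a few lines. The toy sketch below is not Neural-Sim itself: render_views() and train_and_validate() are stand-ins for the NeRF renderer and the downstream detector, and the update is a simple REINFORCE-style rule rather than the Gumbel-softmax reparameterization used in Chapter 8. It only illustrates how validation feedback from the downstream model steers the generator's parameters.

import numpy as np

rng = np.random.default_rng(0)
NUM_BINS, NUM_IMAGES, TARGET_BIN = 4, 64, 2   # TARGET_BIN mimics the (unknown) dominant test pose

def render_views(bin_probs, n):
    """Stand-in for rendering n training images whose poses follow bin_probs."""
    return rng.choice(len(bin_probs), size=n, p=bin_probs)

def train_and_validate(poses):
    """Stand-in for training the downstream model and measuring validation accuracy."""
    return float(np.mean(poses == TARGET_BIN))  # highest when training poses match the test poses

logits = np.zeros(NUM_BINS)                       # generator parameters (pose-bin distribution)
for step in range(1000):
    probs = np.exp(logits) / np.exp(logits).sum()
    poses = render_views(probs, NUM_IMAGES)       # generate a candidate training set
    reward = train_and_validate(poses)            # downstream feedback
    empirical = np.eye(NUM_BINS)[poses].mean(axis=0)
    logits += 2.0 * reward * (empirical - probs)  # push the generator toward data that helped

print("learned pose distribution:", np.round(np.exp(logits) / np.exp(logits).sum(), 2))

In Chapter 8 the same idea is made fully differentiable, so that gradients from the downstream model's validation performance flow back into the rendering parameters directly.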
This interface is designed to enable humans to easily and intuitively understand the reasoning logic of a neural network through a graphical representation. It provides a means for humans to apply their broader contextual understanding, common sense, and causal reasoning skills to adjust the network's logic by modifying that graphical representation. Following these human-led modifications, we have developed a method to seamlessly integrate the refined knowledge back into the original network. The result is an enhanced network that not only replaces its predecessor but also performs better, benefiting from human-guided improvements.

In Chapter 12, we investigate the contributions of three important features of the human visual system (HVS), namely shape, texture, and color, to object classification. We build a humanoid vision engine (HVE) that explicitly and separately computes shape, texture, and color features from images. We use human experiments to confirm that both HVE and humans predominantly rely on specific features to support the classification of specific classes.

In Chapter 13, we conduct an analysis of model failure cases and assess model confidence during decision-making. We introduce a novel approach of top-down and bottom-up label augmentation, leveraging the hierarchical information from WordNet, aimed at enhancing the zero-shot generalization and robustness of multi-modal models. By utilizing the structured relationships within WordNet, our approach systematically strengthens the model's ability to generalize and perform reliably in zero-shot scenarios.

In Chapter 14, we propose a novel one-class, open-set OOD detector that leverages text-image pre-trained models in a zero-shot fashion and incorporates various descriptions of the in-domain and OOD data. Our approach is designed to detect anything not in-domain and offers the flexibility to detect a wide variety of OOD, defined via fine- or coarse-grained labels, or even in natural language.

In the subsequent part of the thesis, we delve into the application of causal explainability to enhance model generalization. Chapter 15 introduces a framework named Invariant Structure Learning (ISL). ISL is designed to advance causal structure discovery, using generalization as a key indicator. It operates by dividing data into distinct environments and learning a structure that remains consistent across these environments, enforced by a consistency constraint that ensures invariance to the target. Our experiments on both synthetic and real-world datasets show that ISL not only accurately identifies causal structures but also surpasses competing methods in performance. Moreover, ISL demonstrates exceptional generalization capabilities, particularly on datasets with significant distribution shifts.

1.3.4 Part 4: Model Parameters as Special "Data" for Lifelong Learning

Prior research has primarily concentrated on generating data in raw formats, such as images, videos, 3D scenes, or tabular data. However, when knowledge transfer to downstream models is the goal of data generation, these raw formats may not be the most efficient means of transfer. In this part, we investigate an alternative approach: treating model parameters as a more efficient form of "data" for knowledge representation and transfer.
This perspective considers model parameters as encapsulations of learned knowledge that can be effectively shared and utilized across different AI models, potentially enhancing lifelong learning.

In Chapter 16, we propose the Shared Knowledge Lifelong Learning (SKILL) challenge. This challenge involves a decentralized network of lifelong learning (LL) agents, each of which independently and simultaneously learns a different task. Once they have learned their respective tasks, these agents share and integrate their acquired knowledge through a decentralized communication network. The ultimate goal is for every agent in the network to be proficient in all tasks. We propose a solution to the SKILL challenge by employing Lightweight Lifelong Learning (LLL) agents. The key objective of these agents is to enable efficient knowledge sharing by minimizing the portion of the agent's architecture that is task-specific, thereby facilitating easier transfer and integration of knowledge across tasks.

In Chapter 17, we introduce the Channel-wise Lightweight Reprogramming (CLR) approach, designed to help Convolutional Neural Networks (CNNs) mitigate catastrophic forgetting during lifelong learning. Our research demonstrates that a CNN model, initially trained on an old task or a self-supervised proxy task, can be effectively "reprogrammed" for a new task through our proposed lightweight parameters, which are both cost-effective and efficient. The CLR approach underscores the potential for flexible adaptation of CNN models to evolving learning tasks without the need for extensive retraining.

In summary, this thesis explores the potential of controllable AI-generated data in scalable model training, presenting a range of methods across four key areas. It begins with techniques for data generation that enhance AI model training in tasks like recognition and segmentation. The focus then shifts to on-demand, model-driven data generation, emphasizing automation over human input. In the third part, we utilize model explainability to improve performance and generalization, introducing tools like the Structural Concept Graph. The final part redefines model parameters as efficient data for lifelong learning, proposing novel methods for knowledge transfer. Overall, this thesis presents AI-generated data as a viable alternative to real data, showcasing its significant role in advancing scalable model training.

Chapter 2

Pose Augmentation: Class-agnostic Object Pose Transformation for Object Recognition

Object pose increases intra-class variance, which makes object recognition from 2D images harder. To render a classifier robust to pose variations, most deep neural networks try to eliminate the influence of pose by using large datasets with many poses for each class. Here, we propose a different approach: a class-agnostic object pose transformation network (OPT-Net) that can transform an image along the 3D yaw and pitch axes to synthesize additional poses continuously. The synthesized images lead to better training of an object classifier. We design a novel eliminate-add structure to explicitly disentangle pose from object identity: first "eliminate" the pose information of the input image, then "add" target pose information (regularized as continuous variables) to synthesize any target pose. We trained OPT-Net on images of toy vehicles shot on a turntable from the iLab-20M dataset.
After training on unbalanced discrete poses (5 classes with 6 poses per object instance, plus 5 classes with only 2 poses), we show that OPT-Net can synthesize balanced, continuous new poses along the yaw and pitch axes with high quality. Training a ResNet-18 classifier with the original plus synthesized poses improves mAP accuracy by 9% over training on the original poses only. Further, the pre-trained OPT-Net generalizes to new object classes, which we demonstrate on both iLab-20M and RGB-D. We also show that the learned features can generalize to ImageNet.

Figure 2.1: Object pose transformation with OPT-Net. The first column shows input images from the test dataset, and the remaining columns show target pose images transformed by OPT-Net. Integer poses (1, 2, 3, 4, 5, 6, in red) are defined in the training dataset, while decimal poses (1.5, 2.5, 3.5, 4.5, 5.5, in green) are new poses, showing that OPT-Net can achieve continuous pose transformation.

2.1 Introduction and related work

In object recognition from 2D images, object pose has a significant influence on performance. An image depends on the geometry (shape), photometry (illumination and material properties of objects) and dynamics (as objects move) of the scene. Thus, every image is a mixture of instance-specific information and nuisance factors [368], such as 3D viewpoint, illumination, occlusions, shadows, etc. Nuisance factors often depend on the task itself. Specifically, in object recognition from 2D images, we care about instance-specific information like shape, while pose is a nuisance that often degrades classification accuracy [368].

Deep convolutional neural networks (CNNs) have achieved great success in object recognition [302, 515, 220, 540, 252] and many other tasks, such as object detection [192, 452, 448, 141] and image segmentation [469, 386, 219]. Most research tries to discount pose, by eliminating pose information or improving the pose robustness of a classifier. Typical CNN architectures, such as LeNet [314], AlexNet [302] and VGG [515], use convolution and pooling layers to make the high-level feature representations invariant to object pose over some limited range [657]. In contrast, recent results have shown that explicitly modeling pose information can help an object recognition task [657, 37, 603, 30]. Some approaches use multi-task learning, where pose information serves as an auxiliary task or regularizer that improves the main object recognition task [654, 637, 254, 532]. These neural networks have the potential to disentangle content from its instantiation attributes [443, 658, 200]. Training on multiple views of the object can improve recognition accuracy [531]. A common method is to collect all poses of the object and create a pose-balanced dataset, in the hope that pose variations will average out. However, collecting pose-balanced datasets is hard and expensive. One notable such dataset is iLab-20M, which comprises 22 million images of 704 toy vehicles captured by 11 cameras while rotating on a turntable [48]. Here, we use a subset of this data to learn about pose transformations, then transfer this knowledge to new datasets (RGB-D [306], ImageNet [112]).

2D images can be seen as samples of 3D poses along the yaw and pitch axes (Figure 2.2(a)). We want OPT-Net to imitate the 3D pose transformation along these two axes, so that given any single pose image, we can "rotate" the object along the yaw and pitch axes to any target pose.
Instead of directly training a transformation model to continuously "rotate" images, we start with a discrete transform, which is easier to constrain; we then make the pose representation continuous and regularize the continuous transformation process. Here, we use sampled discrete poses along yaw and pitch as our predefined poses (Figure 2.2(b): 6 poses along the yaw axis and 3 poses along the pitch axis). We treat different object poses as different domains, so that discrete pose transformation can be cast as an image-to-image translation task, where a generative model synthesizes any target pose given any input pose.

Recently, Generative Adversarial Networks (GANs) [196] have shown a significant advantage in transforming images from one modality into another [388, 261, 485, 672, 291]. GANs show great performance in various tasks, such as style transfer [75, 282] and domain adaptation [244, 559]. However, this approach is costly for our task, because a specific GAN would need to be trained for every pair of poses [40]. StarGAN [92] and CollaGAN [316] proposed multi-domain mapping with a single generator and showed great results on appearance changes such as hair color, age, and emotion transforms. However, pose transformation creates large, nonlinear spatial changes between the input and output images. Traditional generator structures (U-Net [469], V-Net [386]) have few shared components that can satisfy all randomly paired pose transformations, which makes StarGAN training hard to converge (see Section 4.1).

Learning a better representation could also reduce variance due to pose. [676] learns representation features that disentangle identity, rotation, and view; InfoGAN [83] learns disentangled representations in an unsupervised manner; [279] seeks a view-invariant representation shared across views. To combine the ideas of better representation and multi-domain image transformation, we propose a class-agnostic object pose transformation network (OPT-Net), which first transforms the input image into a canonical, pose-invariant representation and then transforms it to the target domain. We design a novel eliminate-add structure for OPT-Net that explicitly disentangles pose from object identity: OPT-Net first "eliminates" the pose information of the input image and then "adds" target pose information to synthesize any target pose. Convolutional regularization is first used to implicitly constrain the representation to keep only the key identity information that may be useful for any target pose. Then, our proposed pose-eliminate module explicitly removes the pose information contained in the canonical representation through adversarial learning. We also add a discriminator that combines pose classification and image quality classification to supervise the optimization of the transformation.

Overall, our contributions are multifold: (1) we developed OPT-Net, a novel class-agnostic object pose transformation network with an eliminate-add structure generator that learns class-agnostic transformations among object poses by turning the input into a pose-invariant canonical representation; (2) we designed a continuous representation of 3D object pose and achieve continuous pose transformation in 3D, learned from limited discrete sampled poses with adversarial regularization; (3) we demonstrated that the generative OPT-Net significantly boosts the performance of discriminative object recognition models;
and (4) we showed that OPT-Net learns class-agnostic pose transformations, generalizes to out-of-class categories, and transfers well to other datasets such as RGB-D and ImageNet.

Figure 2.2: (a) Samples of the discrete predefined pose images. (b) Predefined sample poses and pose changes along the pitch and yaw axes. (c) Given any pose (1st and 8th columns), OPT-Net can transform it along the pitch and yaw axes to target poses (remaining columns).

2.2 Object Pose Transforming Network

As shown in Figure 2.3, the proposed OPT-Net has an eliminate-add structure generator, a discriminator, and a pose-eliminate module.

2.2.1 Eliminate-add structure of the generator

Figure 2.3: Flow of OPT-Net, consisting of three modules: the eliminate-add structure generator G, the discriminator D, and the pose-eliminate module. (a) Pose transformation sketch. (b) Origin-to-target pose transformation. In the pose "eliminate" part, G takes in the original pose image and uses both implicit regularization and the explicit pose-eliminate module to eliminate pose information from the input, yielding a pose-invariant canonical representation. Then, in the pose "add" part, the representation features are concatenated with a target pose mask and the target pose image is synthesized. D learns to distinguish between real and fake images and to classify real images by their correct pose. (c) Training OPT-Net: G first maps the original pose image to the target pose and synthesizes a fake image, then G tries to reconstruct the original pose image from the fake image given the original pose information.

The generator (G) of OPT-Net transforms an input object pose image x into a target object pose image y conditioned on the target pose label c, G(x, c) → y. Unlike hair color, gender, or age transforms, which mostly change appearance with only small shape changes, object pose transformation creates large shape differences. Our eliminate-add structure generator (Figure 2.3(a)) first turns the input pose image into a pose-invariant canonical representation by "eliminating" pose information, and then "adds" target pose information to turn the representation into the target pose.

As shown in Figure 2.3(b), given an input image, we randomly select the target pose domain. We do not feed the target pose together with the input image. Instead, in the "eliminate" part, the first several convolution layers (with stride 2) implicitly regularize the preserved representation features. This implicit regularization makes the representation features retain only the key information needed for the transformation (appearance, color, shape) and discard information that may hinder the transformation (pose). At the same time (Figure 2.3(b)), the pose-eliminate module (Pelim) explicitly forces the representation to contain as little pose information as possible, by pushing it toward equal predicted probability for every pose. After both implicit and explicit elimination of pose information, the input image is mapped to a pose-invariant canonical representation space. We then "add" the target pose information by concatenating it with the representation feature map. The remaining layers of the generator transform the concatenated features into the target pose image. This eliminate-add structure is shared across all pose transformations, which makes the generator easier to converge.
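To make the eliminate-add data flow concrete, below is a minimal PyTorch-style sketch of such a generator, written under my own assumptions: the module names (`eliminate`, `add`), channel counts, and the broadcasting of the pose code to spatial maps are illustrative and only loosely follow the layer description given later in Section 2.3.2; this is not the thesis implementation.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))
    def forward(self, x):
        return x + self.body(x)

class OPTGenerator(nn.Module):
    """Eliminate-add generator sketch: image -> pose-invariant features -> target-pose image."""
    def __init__(self, pose_dim=6, ch=64):
        super().__init__()
        # 'Eliminate' part (G_elim): downsample and implicitly strip pose information.
        self.eliminate = nn.Sequential(
            nn.Conv2d(3, ch, 7, stride=1, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 2 * ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(2 * ch, 4 * ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            ResBlock(4 * ch), ResBlock(4 * ch), ResBlock(4 * ch))
        # 'Add' part (G_add): merge the target-pose code and decode to an image.
        self.add = nn.Sequential(
            nn.Conv2d(4 * ch + pose_dim, 4 * ch, 3, padding=1), nn.ReLU(inplace=True),
            ResBlock(4 * ch), ResBlock(4 * ch), ResBlock(4 * ch),
            nn.ConvTranspose2d(4 * ch, 2 * ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(2 * ch, ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 3, 7, padding=3), nn.Tanh())

    def forward(self, x, pose_code):
        feat = self.eliminate(x)                       # pose-invariant canonical representation
        b, _, h, w = feat.shape
        pose_map = pose_code.view(b, -1, 1, 1).expand(-1, -1, h, w)  # broadcast target pose
        return self.add(torch.cat([feat, pose_map], dim=1)), feat
```

Returning the canonical feature alongside the image makes it easy to feed the same tensor to the pose-eliminate module during training.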
To control the translation direction, as shown in Figure 2.3(b), we use an auxiliary-classifier discriminator D to supervise both image quality and the pose transform. Given one image, the discriminator has two outputs: the probability that the input image is real, which reflects the quality of the synthesized image, and the predicted pose, which should match the desired target pose: D : x → {D_src(x), D_cls(x)}.

2.2.2 Pose-eliminate module

The pose-eliminate module (Pelim) takes the preserved representation feature x_r as input and outputs a pose classification Pelim(x_r). Pelim can be treated as a discriminator that forms an adversarial learning framework with the "eliminate" part of the generator (Gelim). The canonical representation features of real images with pose labels are used to train Pelim, with a cross-entropy loss that makes Pelim predict the correct pose from the pose-invariant feature produced by Gelim. Different from traditional adversarial training, when using Pelim to train Gelim, we want the generator to eliminate all pose information from the pose-invariant feature, which should make Pelim output equal probability for every pose. We therefore use the uniform probability (1/N) as the ground-truth label for the pose-eliminate loss that optimizes Gelim.

2.2.3 Continuous pose transforming training

We design a 2-dimensional linear space to represent pitch and yaw values, in which we can interpolate to achieve a continuous pose representation (Figure 2.1). The yaw and pitch values are replicated into a tensor with the same height and width as the canonical representation features and with N channels (6 in total: 3 for yaw and 3 for pitch), which is easy to concatenate with the canonical features and can be adjusted to match their channel dimension. We start training on the discrete sampled poses (represented as integers in this linear space). After the network has converged, we randomly sample decimal poses as target poses and use a style consistency loss to regularize the synthesized images, which keeps the pose representation consistent along the yaw and pitch axes.

2.2.4 Loss Function

Our goal is to train a generator G that learns object pose transformations along the yaw and pitch axes. The overall loss is formed by an adversarial loss, a pose classification loss, a reconstruction loss, a pose-eliminate loss, and a style consistency loss.

Adversarial Loss. The adversarial loss makes the synthesized image indistinguishable from real images:

L_{adv} = \mathbb{E}_x[\log D_{src}(x)] + \mathbb{E}_{x,c}[\log(1 - D_{src}(G(x, c)))]    (2.1)

where D_{src}(x) is the probability that input x is a real image, as estimated by D. The generator G tries to minimize this loss, while the discriminator D tries to maximize it.

Pose Classification Loss. The pose classification loss guides the pose transformation so that the synthesized image y belongs to the target pose c. It is used to optimize both D and G. The pose classification loss for D is defined as

L^{r}_{cls} = \mathbb{E}_{x,c'}[-\log D_{cls}(c' \mid x)]    (2.2)

This is a standard cross-entropy classification loss, where D_{cls}(c' | x) is the predicted probability that the real image x belongs to its ground-truth pose label c'. The pose classification loss for G is defined as

L^{f}_{cls} = \mathbb{E}_{x,c}[-\log D_{cls}(c \mid G(x, c))]    (2.3)

G tries to minimize this loss so that the synthesized image G(x, c) is classified as the target pose c.
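As a rough illustration of Eqs. (2.1)-(2.3), the snippet below computes the adversarial and pose-classification terms for one batch, assuming a discriminator that returns a realism score and pose logits (the names `d_src`, `d_cls` are placeholders, not the thesis code). It uses the saturating log form of Eq. (2.1); in practice the thesis later states that a Wasserstein objective with gradient penalty is used instead.

```python
import torch
import torch.nn.functional as F

def d_loss_terms(discriminator, real_x, real_pose, fake_x):
    """Discriminator side of Eqs. (2.1) and (2.2)."""
    d_src_real, d_cls_real = discriminator(real_x)
    d_src_fake, _ = discriminator(fake_x.detach())
    adv = -(torch.log(torch.sigmoid(d_src_real) + 1e-8).mean()
            + torch.log(1 - torch.sigmoid(d_src_fake) + 1e-8).mean())
    cls_real = F.cross_entropy(d_cls_real, real_pose)   # L^r_cls, Eq. (2.2)
    return adv, cls_real

def g_loss_terms(discriminator, fake_x, target_pose):
    """Generator side of Eqs. (2.1) and (2.3)."""
    d_src_fake, d_cls_fake = discriminator(fake_x)
    adv = -torch.log(torch.sigmoid(d_src_fake) + 1e-8).mean()   # fool the discriminator
    cls_fake = F.cross_entropy(d_cls_fake, target_pose)         # L^f_cls, Eq. (2.3)
    return adv, cls_fake
```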
Reconstruction Loss. To make the synthesized image preserve the content information and change only the object pose, as shown in Figure 2.3(c), we use the cycle consistency loss [672] to optimize G:

L_{rec} = \mathbb{E}_{x,c,c'}[\lVert x - G(G(x, c), c') \rVert_1]    (2.4)

where G reconstructs the original image x by transforming the synthesized target-pose image G(x, c) back to the original pose c'. The L1 norm is used as the reconstruction loss.

Pose-eliminate Loss. In the eliminate-add structure of G, to eliminate the pose information in the preserved canonical representation features, we design a pose-eliminate loss to optimize the pose-eliminate module Pelim and the eliminate part of G, Gelim. The pose-eliminate loss for Pelim is

L^{P}_{pose} = \mathbb{E}_{x,c'}[-\log P_{elim}(c' \mid G_{elim}(x))]    (2.5)

where P_{elim}(c' | G_{elim}(x)) is the predicted probability that the canonical representation features of a real image belong to its ground-truth pose label c'. The pose-eliminate loss for Gelim is defined as

L^{G}_{pose} = -\mathbb{E}_x\left[\sum_{i=1}^{N} \frac{1}{N} \log P_{elim}(c_i \mid G_{elim}(x))\right]    (2.6)

where N is the number of predefined pose classes, c_i denotes the i-th pose label, and P_{elim}(c_i | G_{elim}(x)) is the probability that the canonical representation belongs to pose c_i. Ideally, Pelim can hardly predict the correct pose from the canonical representation features and outputs equal probability for every pose, which means the pose information has been eliminated from the preserved canonical features. We use an equal-probability target for every pose to optimize Gelim, rather than simply minimizing the pose classification accuracy of Pelim, to avoid a degenerate solution in which Pelim maps every input to one fixed pose class.

Style consistency Loss. After the above losses converge, we randomly sample decimal target poses instead of integer ones to achieve continuous pose transformation, and the style consistency loss regularizes the synthesized images. It has the same form as the adversarial loss above, but the target pose is a randomly sampled decimal value along the yaw and pitch axes.

Full Loss Function. Finally, we optimize:

L_G = L_{adv} + \lambda_{cls} L^{f}_{cls} + \lambda_{rec} L_{rec} + \lambda_{pose} L^{G}_{pose}    (2.7)

L_D = -L_{adv} + \lambda_{cls} L^{r}_{cls}    (2.8)

L_{P_{elim}} = L^{P}_{pose}    (2.9)

where \lambda_{cls}, \lambda_{rec} and \lambda_{pose} are hyper-parameters that control the relative importance of the classification, reconstruction, and pose-eliminate losses.
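A minimal sketch of the remaining terms follows: the cycle reconstruction of Eq. (2.4), the pose-eliminate losses with a uniform target (Eqs. 2.5-2.6), and the combination into the total objectives (Eqs. 2.7-2.9). Function names are illustrative, and the generator is assumed to return the synthesized image together with its canonical feature, as in the earlier sketch.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(G, x, orig_pose, target_pose):
    """Eq. (2.4): map x to the target pose, then back to its original pose (L1)."""
    fake, _ = G(x, target_pose)
    recon, _ = G(fake, orig_pose)
    return (recon - x).abs().mean()

def pose_eliminate_losses(P_elim, canonical_feat, true_pose):
    """Eqs. (2.5)-(2.6): P_elim learns to read the pose from the canonical features
    (features detached), while G_elim is pushed toward a uniform pose prediction."""
    loss_P = F.cross_entropy(P_elim(canonical_feat.detach()), true_pose)   # Eq. (2.5)
    loss_G = -F.log_softmax(P_elim(canonical_feat), dim=1).mean()          # Eq. (2.6), 1/N target
    return loss_P, loss_G

# Total objectives, Eqs. (2.7)-(2.9), with lam_* as hyper-parameters:
#   L_G     = adv_g + lam_cls * cls_fake + lam_rec * rec + lam_pose * loss_G
#   L_D     = -adv_d + lam_cls * cls_real
#   L_Pelim = loss_P
```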
2.3 Experimental Methods

2.3.1 Datasets

iLab-20M dataset [48]. The iLab-20M dataset is a controlled, parametric dataset collected by shooting images of toy vehicles placed on a turntable using 11 cameras at different viewing points. There are in total 15 object categories, each with 25-160 instances. Each object instance was shot on more than 14 backgrounds (printed satellite images), in a relevant context (e.g., cars on roads, trains on rail tracks, boats on water). In total, 1,320 images were captured for each instance-background combination: 11 azimuth angles (from the 11 cameras), 8 turntable rotation angles, 5 lighting conditions, and 3 focus values (-3, 0, and +3 from the default focus value of each camera). The complete dataset consists of 704 object instances, with 1,320 images per object-instance/background combination, for almost 22M images (about 18 times the size of ImageNet).

RGB-D dataset. The RGB-D Object Dataset consists of 300 common household objects organized into 51 categories. This dataset was recorded using a Kinect-style 3D camera. Each object was placed on a turntable and video sequences were captured for one whole rotation. For each object, there are 3 video sequences, each recorded with the camera mounted at a different height, so that the object is shot from different viewpoints.

2.3.2 Network Implementation

OPT-Net consists of two parts: the pose "eliminate" part (including Gelim and Pelim) and the pose "add" part (including Gadd and D). As shown in Figure 2.3(b), Gelim starts with 3 convolution layers, 2 of them with stride 2 to down-sample the input image, followed by 3 residual blocks [220] that form its backbone. The output Gelim(x) is the pose-invariant canonical representation feature. The canonical feature is copied to two streams: one is concatenated with the target pose mask to form the input of Gadd, which synthesizes the target pose image; the other is the input of Pelim, which predicts the pose class. The first layer of Gadd merges the target pose information; Gadd then has 5 residual blocks as a backbone and ends with 3 convolution layers (2 of them performing up-sampling) that transform the canonical representation features into a target pose image, given the target pose mask. For the discriminator D, we adopt the PatchGAN [261] architecture. Pelim has a conventional classification network structure: 3 convolution layers with stride 2 to down-sample the input features, followed by 1 residual block and another 3 down-sampling convolution layers; the output layer maps the features to an N-dimensional vector (N poses), and Softmax yields the pose-class prediction. We use the Wasserstein GAN objective with a gradient penalty [24] to stabilize training. We adjust \lambda_{pose} while training the generator: a larger \lambda_{pose} in the early epochs accelerates generator convergence and encourages the synthesized pose images to have a meaningful spatial structure; we then gradually reduce \lambda_{pose}, so that toward the end of training the optimization concentrates on improving image quality.
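Since the text only names the Wasserstein objective with gradient penalty and a decaying \lambda_{pose}, here is one conventional way these two ingredients are often implemented; the interpolation-based penalty follows the standard WGAN-GP recipe rather than any OPT-Net-specific code, and the linear decay schedule is my own illustrative assumption.

```python
import torch

def gradient_penalty(discriminator, real_x, fake_x):
    """Standard WGAN-GP penalty on images interpolated between real and fake samples."""
    alpha = torch.rand(real_x.size(0), 1, 1, 1, device=real_x.device)
    mixed = (alpha * real_x + (1 - alpha) * fake_x).requires_grad_(True)
    d_src, _ = discriminator(mixed)
    grads = torch.autograd.grad(outputs=d_src.sum(), inputs=mixed,
                                create_graph=True, retain_graph=True)[0]
    return ((grads.view(grads.size(0), -1).norm(2, dim=1) - 1) ** 2).mean()

def lambda_pose_schedule(epoch, total_epochs, lam_start=1.0, lam_end=0.01):
    """Linearly anneal lambda_pose from a large early value to a small final one."""
    t = min(epoch / max(total_epochs - 1, 1), 1.0)
    return lam_start + t * (lam_end - lam_start)
```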
2.4 Experiments and Results

We have five main experiments. In Section 4.1, on the object pose transformation task, we compare OPT-Net with the StarGAN baseline [92] by quantitatively and qualitatively comparing the quality of the synthesized object pose images. In Section 4.2, we use OPT-Net as a generative model to help the training of a discriminative object recognition model, by synthesizing missing poses and balancing a pose bias in the training dataset. In Section 4.3, we further show the class-agnostic transformation property of OPT-Net by generalizing the pretrained OPT-Net to new datasets. In Section 4.4, we study the influence of object pose information for objects that are mainly distinguishable by shape, as opposed to other features like color. Finally, in Section 4.5, we demonstrate how the pose features learned by OPT-Net and the object recognition model on the iLab-20M dataset generalize to other datasets like ImageNet.

Figure 2.4: Object pose transformation comparison between StarGAN and OPT-Net.

2.4.1 Object Pose Transformation Experiments

Because the baseline models can only do discrete pose transforms, we fix the pitch value and use 6 different yaw viewpoints among the 88 different views of iLab-20M as our predefined poses to implement OPT-Net. As shown in Figure 2.2, the selected 6 viewpoints have large spatial variance, which better represents the general object pose transformation task. In the training set, each pose has nearly 26k images from 10 vehicle classes (Table 2.2); each class contains 20-80 different instances. The test set has the same 10 vehicle categories, but different instances than the training set. Both training and test datasets consist of 256x256 RGB images. The training dataset is used to train our OPT-Net and the baseline, StarGAN. OPT-Net has one generator, one discriminator, and one pose-eliminate module; StarGAN has one generator and one discriminator.

Qualitative evaluation. The results are shown in Figure 2.4. Compared with StarGAN, which struggles with large pose variations, the target pose images synthesized by OPT-Net are of high quality with sufficient detail. One possible reason is that the eliminate-add structure reduces the conflicts between different pose transformation directions. Figure 2.1 shows more results of OPT-Net.

Quantitative evaluation. Real target pose images are used as ground truth. To reduce the influence of the background, we segment the foreground vehicle with the graph-based image segmentation method and compute the mean squared error (MSE) and peak signal-to-noise ratio (PSNR) only on the foreground, between the synthesized image and the ground truth (Table 2.1). The reported values are the mean MSE and PSNR computed over 200 different instances, where the MSE and PSNR for each instance are averaged over its 6 synthesized pose images.
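A foreground-restricted MSE and PSNR can be computed roughly as below; this is a sketch under the assumption that a boolean foreground mask is already available (the thesis obtains it with graph-based segmentation, which is not reproduced here).

```python
import numpy as np

def foreground_mse_psnr(synth, target, fg_mask, max_val=255.0):
    """MSE and PSNR restricted to foreground pixels.

    synth, target: HxWx3 arrays; fg_mask: HxW boolean array marking the vehicle."""
    synth = synth.astype(np.float64)[fg_mask]
    target = target.astype(np.float64)[fg_mask]
    mse = np.mean((synth - target) ** 2)
    psnr = 10.0 * np.log10(max_val ** 2 / max(mse, 1e-12))
    return mse, psnr
```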
To show continuous transforms, we also interpolate pose values and synthesize 5 new poses beyond the predefined ones, and form a synthesized-additional-pose-balanced (SA-P-B) training 24 Table 2.2: Poses used in the pose-unbalanced (P-UB) training dataset to train OPT-Net chap-2-table-2 Pose1 Pose2 Pose3 Pose4 Pose5 Pose6 boat ✓ ✓ ✓ ✓ ✓ ✓ bus ✓ ✓ car ✓ ✓ ✓ ✓ ✓ ✓ mil ✓ ✓ monster ✓ ✓ pickup ✓ ✓ semi ✓ ✓ ✓ ✓ ✓ ✓ tank ✓ ✓ ✓ ✓ ✓ ✓ train ✓ ✓ van ✓ ✓ ✓ ✓ ✓ ✓ dataset. S-P-B and SA-P-B were used to train the same resnet-18 classification model from scratch and to calculate test accuracy of each class of vehicles in the test dataset. We also use common data augmentation methods (random crop, horizontal flip, scale resize, etc) to augment the P-UB dataset to the same number of images as P-B, called A-P-UB (Table 2.3). The test accuracy of each class is shown in Table 2.4. From P-UB to S-P-B, the overall accuracy improved from 52.26% to 59.15%, which shows the synthesized missing pose images by OPT-Net can improve the performance of object recognition. It is also shown that OPT-Net, as a generative model, can help the discriminative model. Specifically, the vacant pose categories show significant improvement in accuracy: military improved by 11.68%, monster improved by 14.97%, pickup and train improved by 8.74% and 16.12% respectively. The comparison of S-P-B and A-P-UB shows that synthesized images by OPT-Net are better than traditional augmented images in helping object recognition. Because of the continuous pose transformation ability, our OPT-Net can synthesize additional poses different from the 6 poses in P-B. With these additional poses, SA-P-B (61.23%) performs even better than the P-B (59.20%), achieve 9% improvement compared with P-UB. 25 Table 2.3: Different training and testing datasets for object recognition chap-2-table-3 Dataset P-UB P-B S-P-B SA-P-B A-P-UB Test Source real real synthesized synthesized augmented real Size 25166 37423 37423 66041 37423 4137 Table 2.4: Testing object recognition accuracy (%) of each class after trained on different training dataset. Comparing S-P-B and SA-P-B with P-UB shows how much classification improves thanks to adding synthesized images for missing poses in the training set, reaching or surpassing the level of when all real poses are available (P-B). Our synthesized poses yield better learning than traditional data augmentation (A-P-UB) chap-2-table-4 Category P-UB P-B S-P-B SA-P-B A-P-UB boat 54.0 61.6 65.4 57.7 51.3 bus 35.2 42.5 38.1 47.8 37.2 car 85.1 76.3 79.8 64.0 78.9 mil 73.8 84.2 85.4 86.4 70.7 monster 45.3 67.4 60.2 66.0 52.9 pickup 17.8 26.7 26.6 36.5 18.7 semi 83.9 79.8 79.0 83.5 86.1 tank 78.1 69.4 78.6 77.0 72.5 train 41.1 65.1 57.2 58.1 43.1 van 23.6 18.6 24.2 20.7 21.0 overall 52.3 59.2 59.2 61.2 52.3 2.4.3 Class-agnostic Object Transformation Experiment Our proposed OPT-Net can simultaneously make pose transformation on different classes of vehicles, which demonstrate that the learned object pose transformation has not fixed with object classes, it is a class-agnostic object pose transformation. To further explore the class-agnostic property of OPT-Net, we design experiments that generalize OPT-Net’s ability for object pose transformation from one dataset to other datasets. 15 categories of objects from RGB-D are used. They are both common household objects with big spatial variance between different object poses. Similar poses of objects in RGB-D are selected and defined as the same pose as iLab-20M. 
For each pose, RGB-D contains only about 100 images, which is not enough to train OPT-Net from scratch; we therefore use RGB-D to finetune the OPT-Net pre-trained on iLab-20M. As Figure 2.5 shows, the pre-trained OPT-Net generalizes well to other datasets, which demonstrates that OPT-Net is a class-agnostic object pose transformation framework.

Figure 2.5: Generalization results of OPT-Net on the RGB-D dataset, pretrained on iLab-20M.

To further explore the performance of OPT-Net as a generative model that helps a discriminative object recognition model, we split RGB-D into a pose-unbalanced (P-UB) training dataset, where each category randomly keeps 3 of the 6 poses; a pose-balanced (P-B) dataset; and a test dataset, similar to 4.2. We first use P-UB to finetune the pretrained OPT-Net, and then use the finetuned OPT-Net to synthesize the missing poses of the household objects in RGB-D. The synthesized images together with the original pose-unbalanced images form the synthesized-pose-balanced (S-P-B) training dataset. Similarly, to eliminate the influence of the number of training images, we create A-P-UB using common data augmentation methods. We trained AlexNet [302] on the 4 training datasets separately; the overall test accuracy is shown in Table 2.5.

Table 2.5: Overall object recognition accuracy for the different training datasets in RGB-D.

Dataset       P-UB   P-B    S-P-B  A-P-UB
Accuracy(%)   99.1   99.9   99.7   99.2

The (small) accuracy improvement of S-P-B over P-UB demonstrates that our pretrained OPT-Net can be generalized to different datasets after finetuning and can help the discriminative model in object recognition. While the overall improvement is small, below we show that this is not the case uniformly across all object categories.

2.4.4 Object Pose Significance on Different Object Recognition Tasks

Because the accuracy improvement in RGB-D is smaller than in iLab-20M, we tested whether this was the case across all object categories, or whether categories that look more alike would benefit more from the images synthesized by OPT-Net. Indeed, classifying a black keyboard vs. a blue stapler can easily be achieved by size or color, even without pose-dependent shape analysis. To verify this hypothesis, we use the classifier's confusion matrix to select categories that are more often confused: marker, comb, toothbrush, stapler, lightbulb, and sponge. We then assign different fixed poses to each category to increase the overall pose variance, forming the pose-unbalanced datasets P-UB-1 (randomly fix 1 pose for each category), P-UB-2 (randomly fix 2 poses for each category), and P-UB-3 (randomly fix 3 poses for each category). Similarly, we create the other training datasets using the same method as in 4.2 and 4.3 (S-P-B: use the pretrained OPT-Net to synthesize the missing poses; P-B; and A-P-UB for each unbalanced dataset), and report the object recognition performance on the test dataset in Table 2.6.

Table 2.6: Object recognition overall accuracy for the different datasets.

Dataset       P-UB-1  A-P-UB-1  S-P-B-1  P-UB-2  A-P-UB-2  S-P-B-2
Accuracy(%)    75.1     77.6      83.2    90.4     91.2      94.2

Dataset       P-UB-3  A-P-UB-3  S-P-B-3  P-B
Accuracy(%)    99.3     99.2      99.4    99.8

The results in Table 2.6 demonstrate that object pose information has different degrees of impact on the object recognition task.
Compared with the results in 4.3, where the improvement between P-UB and S-P-B is less than 1%, here, when the inter-class variance is small, OPT-Net improves accuracy more after synthesizing the missing poses in the unbalanced dataset: the accuracy improvement in experiment group 1 (P-UB-1 vs. S-P-B-1) is 8.1%. This verifies our hypothesis that pose balance is more important in object recognition tasks with small inter-class variance. Meanwhile, comparing the accuracy improvements across experimental groups, group 2 (P-UB-2 vs. S-P-B-2) improves by 3.8%, while group 3 (P-UB-3 vs. S-P-B-3) improves by 0.1%. This demonstrates that, for a fixed class variance, the more pose bias there is, the more accuracy improvement we obtain from OPT-Net's pose transformation.

Figure 2.6: Top 8 ImageNet images for each pose, predicted by the OPT-Net discriminator without finetuning.

2.4.5 Generalization to ImageNet

We directly use the OPT-Net pretrained on iLab-20M to synthesize images of different poses on ImageNet. The results are not as good and might be improved using domain adaptation in future work. However, the discriminator of OPT-Net makes decent predictions of image poses: Figure 2.6 shows the top 8 ImageNet images for each of our 6 poses. To test object recognition on ImageNet, we replace the real images in S-P-B (4.2) with OPT-Net-synthesized images, forming an S-P-B (OPT-Net) dataset (all synthesized images). Similarly, we use StarGAN-synthesized images to form S-P-B (StarGAN). We pretrain a ResNet-18 10-class vehicle classifier on these two synthesized datasets and use it to predict 4 classes of vehicles in ImageNet that have similar meanings as in iLab-20M, with good results on some classes such as car. To further explore generalization, we pretrain an AlexNet on the S-P-B datasets whose poses were synthesized by StarGAN and by OPT-Net, respectively, and then finetune it on ImageNet. The results show significantly better accuracy compared to training from scratch when only a small number of images per class is used, demonstrating generalization from iLab-20M to ImageNet.

2.5 Conclusions

We proposed OPT-Net, a class-agnostic object pose transformation network that synthesizes any target pose continuously given a single pose image. The proposed eliminate-add structure generator first eliminates pose information and turns the input into a pose-invariant canonical representation, then adds the target pose information to synthesize the target pose image. OPT-Net also offers a more general framework for continuous transformation problems with large variance. OPT-Net-generated images have higher visual quality compared to existing methods. We also demonstrated that OPT-Net, as a generative model, can help a discriminative model in the object recognition task, achieving a 9% accuracy improvement. We designed experiments demonstrating that pose balance is more important for object recognition tasks with small inter-class variance. Finally, we showed that the pose features learned by OPT-Net on the iLab-20M dataset generalize to other datasets such as ImageNet.

Chapter 3

Zero-shot Synthesis with Group-Supervised Learning

Visual cognition of primates is superior to that of artificial neural networks in its ability to "envision" a visual object, even a newly introduced one, in different attributes including pose, position, color, texture, etc.
To aid neural networks in envisioning objects with different attributes, we propose a family of objective functions, expressed on groups of examples, as a novel learning framework that we term Group-Supervised Learning (GSL). GSL allows us to decompose inputs into a disentangled representation with swappable components that can be recombined to synthesize new samples. For instance, images of red boats and blue cars can be decomposed and recombined to synthesize novel images of red cars. We propose an auto-encoder-based implementation, termed Group-Supervised Zero-Shot Synthesis Network (GZS-Net), trained with our learning framework, that can produce a high-quality red car even if no such example is witnessed during training. We test our model and learning framework on existing benchmarks, in addition to a new dataset that we open-source. We qualitatively and quantitatively demonstrate that GZS-Net trained with GSL outperforms state-of-the-art methods.

3.1 Introduction

Primates perform well at generalization tasks. If presented with a single visual instance of an object, they can often immediately generalize and envision the object in different attributes, e.g., in a different 3D pose [362]. Primates can readily do so because their previous knowledge allows them to be cognizant of attributes. Machines, by contrast, are most commonly trained on sample features (e.g., pixels), without taking into consideration the attributes that gave rise to those features.

To aid machine cognition of visual object attributes, a class of algorithms focuses on learning disentangled representations [294, 232, 60, 290, 82], which map visual samples onto a latent space that separates the information belonging to different attributes. These methods show disentanglement by interpolating between attribute values (e.g., interpolating pose). However, they usually process one sample at a time, rather than contrasting or reasoning about a group of samples. We posit that semantic links across samples could lead to better learning.

We are motivated by the visual generalization of primates. We seek a method that can synthesize realistic images for arbitrary queries (e.g., a particular car, in a given pose, on a given background), which we refer to as controlled synthesis. We design a method that enforces semantic consistency of attributes, facilitating controlled synthesis by leveraging semantic links between samples. Our method maps samples onto a disentangled latent representation space that (i) consists of subspaces, each encoding one attribute (e.g., identity, pose, ...), and (ii) is such that two visual samples that share an attribute value (e.g., both have identity "car") have identical latent values in the shared attribute subspace (identity), even if other attribute values (e.g., pose) differ. To achieve this, we propose a general learning framework: Group-Supervised Learning (GSL, Section 3.3), which provides a learner (e.g., a neural network) with groups of semantically related training examples, represented as a multigraph. Given a query of attributes, GSL proposes groups of training examples with attribute combinations that are useful for synthesizing a test example satisfying the query (Figure 3.1). This endows the network with an envisioning capability. In addition to applications in graphics, controlled synthesis can also augment training sets for better generalization on machine learning tasks (Section 3.6.3).
As an instantiation of GSL, we propose an encoder-decoder network for zero-shot synthesis: the Group-Supervised Zero-Shot Synthesis Network (GZS-Net, Section 3.4). While learning (Sections 3.4.2 & 3.4.3), we repeatedly draw a group of semantically related examples, as informed by a multigraph created by GSL. GZS-Net encodes the group examples to obtain latent vectors, then swaps entries for one or more attributes in the latent space across examples, through multigraph edges, then decodes into an example within the group (Section 3.4.2).

Figure 3.1: Zero-shot synthesis performance of our method. (a), (b), and (c) are from the iLab-20M, RaFD, and Fonts datasets, respectively. Bottom: training images (attributes are known). Top: test image (attributes are a query). Training images go through an encoder, their latent features get combined and passed into a decoder, to synthesize the requested image. Section 3.4.2 shows how we disentangle the latent space, with explicit latent feature swaps during training.

Our contributions are: (i) We propose Group-Supervised Learning (GSL), explain how it casts its admissible datasets into a multigraph, and show how it can be used to express learning from semantically related groups and to synthesize samples with controllable attributes; (ii) We show one instantiation of GSL: the Group-Supervised Zero-Shot Synthesis Network (GZS-Net), trained on groups of examples with reconstruction objectives; (iii) We demonstrate that GZS-Net trained with GSL outperforms state-of-the-art alternatives for controllable image synthesis on existing datasets; (iv) We provide a new dataset, Fonts (http://ilab.usc.edu/datasets/fonts), with its generating code. It contains 1.56 million images and their attributes. Its simplicity allows rapid idea prototyping for learning disentangled representations.

3.2 Related Work

We review research areas that share similarities with our work, to position our contribution.

Self-Supervised Learning (e.g., [188]) admits a dataset containing features of training samples (e.g., upright images) and maps it onto an auxiliary task (e.g., rotated images): dataset examples are drawn and a random transformation (e.g., rotate 90°) is applied to each. The task could be to predict the transformation (e.g., 90°) from the transformed features (e.g., the rotated image). Our approach is similar in that it also creates auxiliary tasks; however, the tasks we create involve semantically related groups of examples, rather than one example at a time.

Disentangled representation learning methods infer latent factors given example visible features, under the generative assumption that each latent factor is responsible for generating one semantic attribute (e.g., color). Following Variational Autoencoders (VAEs) [294], a class of models including [232, 82] achieves disentanglement implicitly by incorporating into the objective a distance measure, e.g., KL-divergence, that encourages the latent factors to be statistically independent. While these methods can disentangle the factors without knowing them beforehand, they are unfortunately unable to generate novel combinations not witnessed during training (e.g., generating images of a red car, without any in training).
On the other hand, our method requires knowing the semantic relationships between samples (e.g., which objects are of the same identity and/or color), but can then synthesize novel combinations (e.g., by stitching latent features of "any car" plus "any red object").

Conditional synthesis methods can synthesize a sample (e.g., an image), and some use information external to the synthesized modality, e.g., a natural language sentence [641, 246] or a class label [388, 555]. Ours differ in that our "external information" takes the form of semantic relationships between samples. There are GAN-based methods [197] that also utilize semantic relationships, including Motion Re-targeting [626], which unfortunately requires domain-specific hand-engineering (detecting and tracking human body parts). In contrast, we design and apply our method on different tasks (including people's faces, vehicles, and fonts; see Figure 3.1). Further, we compare against two recent GAN methods, StarGAN [92] and ELEGANT [611], as they are state-of-the-art GAN methods for amending visual attributes onto images. While they are powerful at local image transformations (within a small patch, e.g., changing skin tone or hair texture), our method better maintains global information: when rotating the main object, the scene also rotates with it, in a semantically coherent manner. Importantly, our learning framework allows expressing simpler network architectures, such as feed-forward auto-encoders trained with only reconstruction objectives, as opposed to GANs, which come with potential difficulties such as a lack of convergence guarantees.

Zero-shot learning also consumes side information. For instance, the models of [309, 27] learn from object attributes, like our method. However, (i) these models are supervised to accurately predict attributes, (ii) they train and infer one example at a time, and (iii) they are concerned with classifying unseen objects. We differ in that (i) no learning gradients (supervision signal) are derived from the attributes, since (ii) these attributes are used to group the examples (based on shared attribute values), and (iii) we are concerned with generation rather than classification: we want to synthesize an object in previously unseen attribute combinations.

Graph Neural Networks (GNNs) [488] are a class of models defined on graph-structured data. This is applicable to our method, as we propose to create a multigraph connecting training samples. In fact, our method can be described as a GNN with message passing functions [190] that are aware of the latent space partitioning per attribute (explained in Section 3.4). Nonetheless, for self-containment, we introduce our method without the GNN framework.

3.3 Group-Supervised Learning

3.3.1 Datasets admissible by GSL

Formally, a dataset admissible by GSL contains n samples D = \{x^{(i)}\}_{i=1}^{n}, where each example is accompanied by m attributes D_a = \{(a^{(i)}_1, a^{(i)}_2, ..., a^{(i)}_m)\}_{i=1}^{n}. Each attribute value is a member of a countable set: a_j \in A_j. For instance, pertaining to visual scenes, A_1 can denote foreground colors, A_1 = {red, yellow, ...},
A_2 could denote background colors, A_3 could correspond to foreground identity, and A_4 to (quantized) orientation. Such datasets have appeared in the literature, e.g., in [48, 379, 311, 306].

Figure 3.2: (a) Samples from our proposed Fonts dataset, shown in groups. In each group, we vary one attribute but keep the others the same. (b) (Sub-)multigraph of our Fonts dataset. Each edge connects two examples sharing an attribute. Sets S1 and S2 cover sample i.

3.3.2 Auxiliary tasks via Multigraphs

Given a dataset of n samples and their attributes, we define a multigraph M with node set [1..n]. Two nodes i, k \in [1..n] with i \neq k are connected with edge labels M(i, k) \subseteq [1..m], defined as:

M(i, k) = \{ j \mid a^{(i)}_j = a^{(k)}_j,\; j \in [1..m] \}.

In particular, M defines a multigraph, with |M(i, k)| denoting the number of edges connecting nodes i and k, which equals the number of their shared attributes. Figure 3.2 depicts a (sub-)multigraph for the Fonts dataset (Section 3.5.1).

Definition 1 (COVER(S, i)): Given a node set S \subseteq [1..n] and a node i \in [1..n], we say that S covers i if every attribute value of i appears in at least one member of S. Formally:

COVER(S, i) \iff [1..m] = \bigcup_{k \in S} M(i, k).    (3.1)

When COVER(S, i) holds, there are two mutually exclusive cases: either i \in S or i \notin S, respectively shaded green and blue in Figure 3.2(b). The first case holds trivially even for small S, e.g., COVER({i}, i) holds for all i. However, we are interested in non-trivial sets where |S| > 1, as sets with |S| = 1 would reduce our proposed network (Section 3.4) to a standard auto-encoder. The second case is crucial for zero-shot synthesis: suppose the (image) features of node i (in Figure 3.2(b)) are not given; we can search for S1 under the assumption that, if COVER(S1, i) holds, then S1 contains sufficient information to synthesize i's features even though they are not given (i \notin S1). Up to this point, we have made no assumptions about how the pairs (S, i) satisfying COVER(S, i) are extracted (mined) from the multigraph. In the sequel, we train with |S| = 2 and i \in S. We find that this particular specialization of GSL is easy to program, and we leave analyzing the impact of mining different kinds of cover sets for future work.
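The multigraph edges M(i, k) and the COVER predicate of Eq. (3.1) can be computed directly from the attribute table; the small sketch below illustrates this, assuming attributes are stored as per-sample tuples of hashable values (the variable names are illustrative, not from the thesis code).

```python
def shared_attributes(attrs_i, attrs_k):
    """M(i, k): indices of attributes on which samples i and k agree."""
    return {j for j, (a, b) in enumerate(zip(attrs_i, attrs_k)) if a == b}

def covers(attr_table, S, i):
    """COVER(S, i), Eq. (3.1): every attribute of sample i is shared with some member of S."""
    m = len(attr_table[i])
    covered = set()
    for k in S:
        covered |= shared_attributes(attr_table[i], attr_table[k])
    return covered == set(range(m))

# Example layout for the Fonts dataset:
# attr_table[i] = (letter, size, font_color, back_color, font)
```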
3.4 Group-Supervised Zero-Shot Synthesis Network (GZS-Net)

We now describe the ingredients toward our goal: synthesizing holistically-semantic novel images.

3.4.1 Auto-Encoding along relations in M

Auto-encoders (D \circ E) : X \to X are composed of an encoder network E : X \to \mathbb{R}^d and a decoder network D : \mathbb{R}^d \to X. Our networks further utilize the multigraph emitted by GSL. GZS-Net consists of an encoder and a decoder

E : X \times \mathcal{M} \to \mathbb{R}^d \times \mathcal{M} \quad \text{and} \quad D : \mathbb{R}^d \times \mathcal{M} \to X,    (3.2)

where \mathcal{M} denotes the space of sample pairwise relationships. GSL realizes such (X, M) \subset X \times \mathcal{M}, where X contains (a batch of) training samples and M the (sub)graph of their pairwise relations. Rather than passing the output of E as-is into D, one can modify it using an algorithm A by chaining D \circ A \circ E. For notational brevity, we fold A into the encoder E by designing a swap operation, described next.

Figure 3.3: Architecture of GZS-Net, consisting of an encoder E, which maps a sample onto a latent vector, and a decoder D, which maps a latent vector onto a sample. The latent space is pre-partitioned among the attribute classes (3 shown: identity, pose, background). (a, left) Considered examples: a center image (x, red border) and 3 images each sharing one attribute with it, as well as a no-overlap image sharing no attributes (x̄, black border). (a, right) Standard reconstruction loss, applied to all images. (b) One-overlap attribute swap: two images with identical values for one attribute should be reconstructed into nearly the original images when the latent representations for that attribute are swapped (a "no-op" swap; left: identity; middle: pose; right: background). (c) Cycle swap: given any example pair, we randomly pick an attribute class j. We encode both images, swap the representations of j, decode, re-encode, swap on j again (to reverse the first swap), and decode to recover the inputs. This unsupervised cycle enforces that a double swap on j does not destroy information for other attributes.

3.4.2 Disentanglement by the swap Operation

While training our auto-encoder D(E(X, M)), we wish to disentangle the latents output by E, so that D can decode samples that were never given to E. D (respectively E) outputs (respectively inputs) one or more images onto (from) the image space; both networks can access feature and relationship information. At a high level, GZS-Net aims to swap attributes across images by swapping the corresponding entries of their latent representations. Before any training, we fix a partitioning of the latent space Z = E(X, M). Let the row vector z^{(1)} = [g^{(1)}_1, g^{(1)}_2, ..., g^{(1)}_m] be the concatenation of m row vectors \{g^{(1)}_j \in \mathbb{R}^{d_j}\}_{j=1}^{m}, where d = \sum_{j=1}^{m} d_j and the values \{d_j\}_{j=1}^{m} are hyperparameters. To simplify the notation that follows, we define an operation swap : \mathbb{R}^d \times \mathbb{R}^d \times [1..m] \to \mathbb{R}^d \times \mathbb{R}^d, which accepts two latent vectors (e.g., z^{(1)} and z^{(2)}) and an attribute index (e.g., 2) and returns the input vectors with the latent features corresponding to that attribute exchanged. For example,

swap(z^{(1)}, z^{(2)}, 2) = swap([g^{(1)}_1, g^{(1)}_2, g^{(1)}_3, ..., g^{(1)}_m], [g^{(2)}_1, g^{(2)}_2, g^{(2)}_3, ..., g^{(2)}_m], 2)
                        = ([g^{(1)}_1, g^{(2)}_2, g^{(1)}_3, ..., g^{(1)}_m], [g^{(2)}_1, g^{(1)}_2, g^{(2)}_3, ..., g^{(2)}_m]).

One-Overlap Attribute Swap. To encourage disentanglement in the latent representation of attributes, we consider a group S and an example x such that COVER(S, x) holds and, for every x^{(o)} \in S with x \neq x^{(o)}, the pair (x^{(o)}, x) shares exactly one attribute value (|M(x^{(o)}, x)| = 1). Encoding such pairs, swapping the latent representation of the shared attribute, and decoding should then be a no-op if the swap did not affect other attributes (Figure 3.3(b)). Specifically, for a pair of examples x (red border in Figure 3.3(b)) and x^{(o)} (blue border) sharing only attribute j (e.g., identity; note that COVER({x, x^{(o)}}, x) and COVER({x, x^{(o)}}, x^{(o)}) both hold), with z = E(x) and z^{(o)} = E(x^{(o)}), we would like

D(z_s) \approx x \ \text{and} \ D(z^{(o)}_s) \approx x^{(o)}, \quad \text{with } (z_s, z^{(o)}_s) = swap(z, z^{(o)}, j).    (3.3)

If, for each attribute, sufficiently many sample pairs share only that attribute, and Eq. 3.3 holds for all of them with zero residual loss, then disentanglement is achieved for that attribute (on the training set).
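The swap operation acts on a fixed partition of the latent vector, so it amounts to slicing and re-concatenating chunks. Below is a minimal sketch of swap together with the "no-op" swap reconstruction of Eq. (3.3); the partition sizes `dims` and the encoder/decoder handles are assumptions made for illustration.

```python
import torch

def split(z, dims):
    """Split latent row-vectors (B x d) into per-attribute chunks g_1, ..., g_m."""
    return list(torch.split(z, dims, dim=1))

def swap(z1, z2, j, dims):
    """Exchange the latent chunk of attribute j between two latent vectors."""
    g1, g2 = split(z1, dims), split(z2, dims)
    g1[j], g2[j] = g2[j], g1[j]
    return torch.cat(g1, dim=1), torch.cat(g2, dim=1)

def one_overlap_swap_loss(E, D, x, x_o, j, dims):
    """Eq. (3.3): swapping the single shared attribute j should reconstruct both inputs."""
    z, z_o = E(x), E(x_o)
    zs, zs_o = swap(z, z_o, j, dims)
    return (D(zs) - x).abs().mean() + (D(zs_o) - x_o).abs().mean()
```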
Cycle Attribute Swap. This operates on all example pairs, regardless of whether they share an attribute. Given two examples and their corresponding latent vectors, if we swap the latent information corresponding to any attribute, we should still end up with a sensible decoding. However, we may not have ground-truth supervision samples for swapping all attributes of all pairs. For instance, when swapping the color attribute between an orange truck and a white airplane, we would like to learn from this pair even without any orange airplanes in the dataset. To train on any pair, we follow a recipe similar to CycleGAN [671]. As shown in Figure 3.3 (c), given two examples x and x̄: (i) sample an attribute j ∼ U[1..m]; (ii) encode both examples, z = E(x) and z̄ = E(x̄); (iii) swap the features corresponding to attribute j, (z_s, z̄_s) = swap(z, z̄, j); (iv) decode, x′ = D(z_s) and x̄′ = D(z̄_s); (v) in a second round (hence, cycle), encode again, z′ = E(x′) and z̄′ = E(x̄′); (vi) swap again on j, which should reverse the first swap, (z′_s, z̄′_s) = swap(z′, z̄′, j); (vii) finally, decode one last time, which should approximately recover the original input pair:

D(z′_s) ≈ x and D(z̄′_s) ≈ x̄.   (3.4)

If, after the two encode-swap-decode rounds, we are able to recover the input images regardless of which attribute was sampled, then swapping one attribute does not destroy the latent information of the other attributes. As shown in Section 3.5, this can be viewed as data augmentation, growing the effective training set by adding all attribute combinations not already present in the training set.

3.4.3 Training and Optimization

Algorithm 1 lists our sampling strategy and the computation of the loss terms, which we combine into a total loss

L(E, D; D, M) = L_r + λ_sr L_sr + λ_csr L_csr,   (3.5)

where L_r, L_sr, and L_csr are the reconstruction, swap-reconstruction, and cycle-swap-reconstruction losses, respectively. Scalar coefficients λ_sr, λ_csr > 0 control the relative importance of the loss terms. The total loss L can be minimized with respect to the parameters of the encoder E and decoder D via gradient descent.

Algorithm 1: Training regime; sampling data and calculating loss terms
Input: dataset D and multigraph M. Output: L_r, L_sr, L_csr.
1: Sample x ∈ D and S ⊂ D such that COVER(S, x), |S| = m, and |M(x, k)| = 1 for all k ∈ S
2: for x^(o) ∈ S do
3:   z ← E(x); z^(o) ← E(x^(o)); (z_s, z_s^(o)) ← swap(z, z^(o), j), where {j} = M(x, x^(o))
4:   L_sr ← L_sr + ||D(z_s) − x||_1 + ||D(z_s^(o)) − x^(o)||_1   # swap-reconstruction loss
5: Sample x̄ ∼ D and j ∼ U[1..m]   # sample for the cycle swap
6: z ← E(x); z̄ ← E(x̄); (z_s, z̄_s) ← swap(z, z̄, j); x′ ← D(z_s); x̄′ ← D(z̄_s)
7: z′ ← E(x′); z̄′ ← E(x̄′); (z′_s, z̄′_s) ← swap(z′, z̄′, j)
8: L_csr ← ||D(z′_s) − x||_1 + ||D(z̄′_s) − x̄||_1   # cycle-swap reconstruction loss
9: L_r ← ||D(E(x)) − x||_1   # standard reconstruction loss

3.5 Qualitative Experiments

We qualitatively evaluate our method on zero-shot synthesis tasks and on its ability to learn disentangled representations, on existing datasets (Section 3.5.2) and on a dataset we contribute (Section 3.5.1).

GZS-Net architecture. For all experiments, the encoder E is composed of two convolutional layers with stride 2, followed by 3 residual blocks and a convolutional layer with stride 2; the response map is then reshaped into a vector and passed through two fully-connected layers to output a 100-dimensional latent feature. The decoder D mirrors the encoder: two fully-connected layers, a reshape into a cuboid, a deconvolution layer with stride 2, 3 residual blocks, and finally two deconvolution layers with stride 2 to output a synthesized image.
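Putting the pieces together, the sketch below shows one training step following Algorithm 1, reusing the swap and one_overlap_swap_loss helpers from the earlier sketch; the sampling of the cover group is assumed to be handled by the data loader, and the function and argument names are illustrative rather than the released code.

```python
import torch
import torch.nn.functional as F

def training_step(E, D, x, cover_group, x_bar, dims, lambda_sr=1.0, lambda_csr=1.0):
    """One GZS-Net training step following Algorithm 1.
    cover_group: list of (x_o, j) pairs, each sharing exactly attribute j with x.
    x_bar: an arbitrary second sample used for the cycle swap."""
    # Standard reconstruction loss.
    L_r = F.l1_loss(D(E(x)), x)

    # Swap-reconstruction loss over the one-overlap cover group (Eq. 3.3).
    L_sr = sum(one_overlap_swap_loss(E, D, x, x_o, j, dims) for x_o, j in cover_group)

    # Cycle swap-reconstruction loss on an arbitrary pair (x, x_bar) (Eq. 3.4).
    j = torch.randint(len(dims), (1,)).item()
    z_s, z_bar_s = swap(E(x), E(x_bar), j, dims)
    x_p, x_bar_p = D(z_s), D(z_bar_s)                   # first decode (synthesized pair)
    zp_s, zp_bar_s = swap(E(x_p), E(x_bar_p), j, dims)  # re-encode and reverse the swap
    L_csr = F.l1_loss(D(zp_s), x) + F.l1_loss(D(zp_bar_s), x_bar)

    return L_r + lambda_sr * L_sr + lambda_csr * L_csr  # total loss of Eq. 3.5
```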
3.5.1 Fonts Dataset & Zero-shot synthesis Performance sec:exp_fonts Design Choices. Fonts is a computer-generated image datasets. Each image is of an alphabet letter and is accompanied with its generating attributes: Letters (52 choices, of lower- and upper-case English alphabet); size (3 choices); font colors (10); background colors (10); fonts (100); giving a total of 1.56 million images, each with size (128×128) pixels. We propose this dataset to allow fast testing and idea iteration on zero-shot 4 synthesis and disentangled representation learning. Samples from the dataset are shown in Figure 3.2. We partition the 100-d latents equally among the 5 attributes. We use a train:test split of 75:25. Figure 3.4: Zero-shot synthesis performance compare on Fonts. 7-11 and 18-22 columns are input group images and we want to combine the specific attribute of them to synthesize an new images. 1-5 and 12-16 columns are synthesized images use auto-encoder + Exhaustive Swap (AE+ES), β-VAE + Exhaustive Swap (β-VAE+ES), β-TCVAE + Exhaustive Swap (β-TCVAE+ES), auto-encoder + Directly Supervision (AE+DS) and GZS-Net respectively. 6 and 17 columns are ground truth (GT) chap-3-fig:baseline Baselines. We train four baselines: • The first three are a standard Autoencoder, a β-VAE [232], and β-TCVAE [82]. β-VAE and β-TCVAE show reasonable disentanglement on the dSprites dataset [379]. Yet, they do not make explicit the assignment between latent variables and attributes, which would have been useful for precisely controlling the attributes (e.g. color, orientation) of synthesized images. Therefore, for these methods, we designed a best-effort approach by exhaustively searching for the assignments. Once assignments are known, swapping attributes between images might become possible with these VAEs, and hopefully enabling for controllable-synthesis. We denote these three baselines with this Exhaustive Search, using suffix +ES. • The fourth baseline, AE+DS, is an auto-encoder where its latent space is partitioned and each partition receives direct supervision from one attribute. As shown in Figure 3.4, our method outperforms baselines, with second-runner being AE+DS: With discriminative supervision, the model focus on the most discriminative information, e.g., can distinguish e.g. across size, identity, etc, but can hardly synthesize photo-realistic letters. 42 3.5.2 Zero-shot synthesis on ilab-20M and RaFD sec:exp-ilabrafd iLab-20M [48]: is an attributed dataset containing images of toy vehicles placed on a turntable using 11 cameras at different viewing points. There are 3 attribute classes: vehicle identity: 15 categories, each having 25-160 instances; pose; and backgrounds: over 14 for each identity: projecting vehicles in relevant contexts. We partition the 100-d latent space among attributes as: 60 for identity, 20 for pose, and 20 for background. iLab-20M has limited attribute combinations (identity shows only in relevant background; e.g., cars on roads but not in deserts), GZS-Net can disentangle these three attributes and reconstruct novel combinations (e.g., cars on desert backgrounds) Figure 3.5 shows qualitative generation results. We compare against (AE+DS), confirming that maintains discriminative information, and against two state-of-the-art GAN baselines: starGAN [92] and ELEGANT [611]. 
GAN baselines are strong in knowing what to change but not necessarily how to change it: Where change is required, pixels are locally perturbed (within a patch) but the perturbations often lack global correctness (on the image). RaFD [311]: contains pictures of 67 models displaying 8 emotional expressions taken by 5 different camera angles simultaneously. There are 3 attributes: identity, camera position (pose), and expression. We partition the 100-d latent space among the attributes as 60 for identity, 20 for pose, and 20 for expression. We use a 80:20 split for train:test, and use GZS-Net to synthesize images with novel combination of attributes (Figure 3.6). The synthesized images can capture the corresponding attributes well, especially for pose and identity. 3.6 Quantitative Experiments 3.6.1 Quantifying Disentanglement through attribute co-prediction sec:disentanglement Can latent features of one attribute predict the attribute value? Can it also predict values for other attributes? Under perfect disentanglement, we should answer always for the first and never for the second. Our network 43 Figure 3.5: Zero-shot synthesis qualitative performance on ilab-20M. Columns left of the dashed line are output by methods: the first five are baselines, followed by three GZS networks. The baselines are: (1) is an auto-encoder with direct supervision (AE+DS); (2, 3, 4) are three GAN baselines changing only one attribute; (5) is starGAN changing two attributes. Then, first two columns by GZS-Net are ablation experiments: trained with part of the objective function, and the third column is output by a GZS-Net trained with all terms of the objective. starGAN of [92] receives one input image and edit information. ELEGANT uses identity and background images. Others use all three inputs. chap-3-fig:ilab did not receive attribute information through supervision, but rather, through swapping. We quantitatively assess disentanglement by calculating a model-based confusion matrix between attributes: We analyze models trained on the Fonts dataset. We take the Test examples from Font, and split them 80:20 for trainDR:testDR. For each attribute pair j, r ∈ [1..m] × [1..m], we train a classifier (3 layer MLP) from gj of trainDR to the attribute values of r, then obtain the accuracy of each attribute by testing with gj of testDR. Table 3.1 compares how well features of each attribute (row) can predict an attribute value (column): perfect should be as close as possible to Identity matrix, with off-diagonal entries close to random (i.e., 1 / |Ar|). GZS-Net outperforms other methods, except for (AE + DS) as its latent space was Directly Supervised for this particular task, though it shows inferior synthesis performance. 44 Figure 3.6: GZS-Net zero-shot synthesis performance on RaFD. 1-2 and 6-7 columns are the synthesized novel images using auto-encoder + Directly Supervision (AE+DS) and GZS-Net respectively. Remaining columns are training set images with their attributes provide. fig:rafd Table 3.1: Disentangled representation analysis. Diagonals are bolded. 
chap-3-table-1 GZS-Net Auto-encoder AE + DS β-VAE + ES β-TCVAE + ES A (|A|) C S FC BC St C S FC BC St C S FC BC St C S FC BC St C S FC BC St Content (52) .99 .92 .11 .13 .30 .48 .60 .71 .92 .06 .99 .72 .22 .20 .25 .02 .35 .11 .19 .01 .1 .39 .13 .11 .01 Size (3) .78 1.0 .11 .15 .36 .45 .61 .77 .96 .07 .54 1.0 .19 .23 .25 .02 .38 .29 .11 .01 .02 .47 .18 .19 .01 FontColor (10) .70 .88 1.0 .16 .23 .48 .60 .67 .95 .06 .19 .64 1.0 .66 .20 .02 .33 .42 .11 .01 .02 .35 .21 .13 .01 BackColor (10) .53 .78 .21 1.0 .15 .53 .63 .64 .93 .08 .32 .65 .29 1.0 .25 .02 .34 .11 .86 .01 .03 .40 .24 .75 .01 Style (100) .70 .93 .12 .12 .63 .49 .60 .70 .94 .06 .38 .29 .20 .20 .65 .02 .33 .10 .11 .02 .02 .33 .10 .08 .01 3.6.2 Distance of synthesized image to ground truth sec:imagedistance The construction of the Fonts dataset allows programmatic calculating ground-truth images corresponding to synthesized images (recall, Figure 3.4). We measure how well do our generated images compare to the ground-truth test images. Table 3.2 shows image similarity metrics, averaged over the test set, comparing our method against baselines. Our method significantly outperforms baselines. 3.6.3 GZS-Net Boost Object Recognition sec:exp-gzs-boost We show-case that our zero-shot synthesised images by GZS-Net can augment and boost training of a visual object recognition classifier [177]. Two different training datasets (Figure 3.7a) are tailored from iLab-20M, pose and background unbalanced datasets (DUB) (half classes with 6 poses per object instance, other half with only 2 poses; as we cut poses, some backgrounds are also eliminated), as well as pose and background balanced dataset (DB) (all classes with all 6 poses per object instance). 45 Table 3.2: Average metrics between ground-truth test image and image synthesized by models, conducted over the Fonts dataset. We report MSE (smaller is better) and PSNR (larger is better). chap-3-table-2 GZS-Net AE+DS β-TCVAE +ES β-vae + ES AE +ES Average Mean Squared Error (MSE) 0.0014 0.0254 0.2366 0.1719 0.1877 Average Peak Signal-to-Noise Ratio (PSNR) 29.45 16.44 6.70 9.08 7.9441 We use GZS-Net to synthesize the missing images of DUB and synthesize a new (augmented) balanced dataset DB-s. We alternatively use common data augmentation methods (random crop, horizontal flip, scale resize, etc) to augment the DUB dataset to the same number of images as DB-s, called DUB-a. We show object recognition performance on the test set using these four datasets respectively. Comparing DB-s with DUB shows ≈ 7% points improvements on classification performance, due to augmentation with synthesized images for missing poses in the training set, reaching the level of when all real poses are available (DB). Our synthesized poses outperform traditional data augmentation (DUB-a) Figure 3.7: (a) Dataset details for training object recognition task, where the x-axis represents different identities (1004) and the y-axis represents the backgrounds (111) and poses (6) each purple and brown pixel means our dataset covers the specific combination of attributes. (b) object recognition accuracy (%) on 37469 test examples, after training on (augmented) datasets. fig:cla 3.7 Conclusion We propose a new learning framework, Group Supervised Learning (GSL), which admits datasets of examples and their semantic relationships. It provides a learner groups of semantically-related samples, which we show is powerful for zero-shot synthesis. 
In particular, our Group-supervised Zero-Shot synthesis network (GZS-Net) is capable of training on groups of examples and learns disentangled representations by explicitly swapping latent features across training examples, along edges suggested by GSL. We show that, to synthesize samples given a query with custom attributes, it is sufficient to find one example per requested attribute and to combine them in the latent space. We hope that researchers find our learning framework useful and extend it for their applications.

Chapter 4
Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation

We propose a new paradigm to automatically generate training data with accurate labels at scale using text-to-image synthesis frameworks (e.g., DALL-E, Stable Diffusion, etc.). The proposed approach* decouples training data generation into foreground object generation and contextually coherent background generation. To generate foreground objects, we employ straightforward textual templates that incorporate the object class name as input prompts. These are fed into a text-to-image synthesis framework, producing varied foreground images set against isolated backgrounds. A foreground-background segmentation algorithm is then used to generate foreground object masks. To generate context images, we begin by creating language descriptions of the context. This is achieved by applying an image captioning method to a small set of images representing the desired context. These textual descriptions are then transformed into a diverse array of context images via a text-to-image synthesis framework. Subsequently, we composite these with the foreground object masks produced in the initial step, using a cut-and-paste method, to formulate the training data. We demonstrate the advantages of our approach on five object detection and segmentation datasets, including Pascal VOC and COCO. We found that detectors trained solely on synthetic data produced by our method achieve performance comparable to those trained on real data (Fig. 4.1). Moreover, a combination of real and synthetic data yields even better results. Further analysis indicates that the synthetic data distribution complements the real data distribution effectively. Additionally, we emphasize the compositional nature of our data generation approach in out-of-distribution and zero-shot data generation scenarios. We open-source our code at https://github.com/gyhandy/Text2Image-for-Detection.

*This is an extension of DALL-E for detection [176].

Figure 4.1: (a) Comparison of the DALL-E-for-detection pipeline with the traditional human-centric pipeline. (b) Using purely synthetic data from the text-to-image model (syn) leads to on-par performance with using all real data (real); mixing real and synthetic data (syn + real) gives strong performance gains (+22.5 mAP).

4.1 Introduction

Training modern deep learning models necessitates large labeled datasets [451, 220, 499]. Yet, acquiring such datasets poses significant challenges due to the high costs and extensive human effort involved, making
This leads us to a pivotal question: Is it possible to efficiently generate large-scale labeled data that also delivers high accuracy for subsequent tasks? We believe that any such approach should satisfy these qualities (Table 4.1): minimal human involvement, automatic generalization of the images for any new classes and environments, scalable, generation of high quality and diverse set of images, explainable, compositional, and privacy-preserving. To this end, synthetic techniques could be used as promising avenues for generating labeled data for training computer vision models. One popular approach is to use computer graphics to generate data [457, 524, 556]. Despite their potential, these strategies predominantly hinge on acquiring 3D models of both objects and scenes. This often necessitates specialized skills, such as 3D modeling expertise, which inherently curtails the scalability of such methods. Another viable approach is using NeRF based rendering [383, 167]. These strategies typically entail retraining models to accommodate new object classes. As such, they lack the flexibility to effortlessly scale across a vast array of object classifications. Lastly, a third approach is object cut and paste [129], which pastes foregrounds onto the backgrounds. However, such an approach demands a diverse and reliable source of foreground object masks as well as coherent backgrounds, which can be challenging to procure. Recently, there has been a revolutionary advancement in text-to-image (T2I) synthesis models such as DALL-E [440], Stable Diffusion [468], RU-DALLE [487], CogView [121], Imagen [483], MUSE [76], and eDiff-I [31]. Not only are these models capable of crafting high-quality images, but they also excel in illustrating intricate scenes, understanding semantics, and capturing the compositional nuances of the real world. Therefore, they might act as a natural bridge between humans and image synthesis. Nevertheless, despite their capability to function as synthetic data creators, these models lack the ability to produce regionlevel bounding boxes or pixel-level segmentation. Consequently, they remain unsuitable for downstream tasks such as object detection or instance segmentation. 50 Method Quality Less Human Adapt Scalable Explain. Privacy Comp. Human capture ✓ ✗ ✗ ✗ ✓ ✗ ✓ Web image ✓ ✓ ✗ ✓ ✗ ✗ ✗ Public dataset ✓ ✗ ✗ ✓ ✗ ✗ ✗ Generative models ✓ ✓ ✗ ✓ ✗ ✓ ✗ Ours ✓ ✓ ✓ ✓ ✓ ✓ ✓ Table 4.1: Desired quality of context generation method: images should be high quality and diverse, less human involvement, generalization of the images for any new environment, scalable, explainable, privacypreserving, and compositional. table:advantage In this work, we explore if these T2I models can be harnessed to generate expansive training datasets equipped with accurate labels, specifically tailored for tasks such as object detection and instance segmentation. Moreover, we primarily focus on low-resource regime wherein downstream users have access to a limited training dataset but demand a high-performing, robust, and broadly applicable object detector or instance segmentation. We opt for this context because acquiring on-demand human annotations tailored to the specific requirements of each individual downstream user can be prohibitively expensive, making the role of synthetic datasets even more crucial. We propose a novel data generation paradigm that use T2I to produce large-scale high-quality and contextually-coherent pseudo-labeled datasets with minimal human involvement. 
Essentially, we retrieve pertinent information to curate a specific dataset aimed at enhancing downstream discriminative object detectors and instance segmentation, from the general knowledge base of generative T2I. Our pipeline is bifurcated into two components: foreground object mask generation and contextual background generation (Figure 4.2). For zero-shot scenarios where only class names are available, we employ a set of simple templates for each of the interested objects, such as “A photo of dog in a white background” to elicit dog-related images from the T2I. Given that our templates are devised to position objects against easilysegmentable backdrops, straightforward background-foreground segmentation techniques can be employed to extract precise foreground object masks. For generating backgrounds, we employ analogous templates to craft clear backgrounds devoid of any target objects. After CLIP filtering, foreground object masks are pasted 51 onto the backgrounds via cut and paste [129], resulting in the final pseudo-labeled dataset. For few-shot scenarios where additional real training images are available, we specifically produce contextually-aligned coherent backgrounds from these shots as background context plays a pivotal role in learning strong object recognition models [122, 128]. For example, placing airplanes and boats in their natural context helped to improve accuracy, e.g. airplanes are generally found in the sky and boats are on the water. We caption each shot, extract contextual keywords (e.g. “grass field” from “A dog lying on grass field”) from these captions, contextually augment those captions (e.g. “A real photo of forest”), and produce coherent images by feeding T2I those augmented captions. The proposed pipeline satisfies all desired properties of data generation (Table 4.1). It minimizes the requirement for human intervention in both the initial provision of training data and the subsequent steps of synthetic data generation. T2I enables privacy-preserving and scalable generation. We obtain explainable and compositional data generation by operating within the language domain before feeding to T2I. Adding or removing objects or settings can be easily done. For example, a description as “an environment with a table” can be easily modified to a kitchen environment by utilizing the compositional properties of language as “a kitchen environment with a table.” Furthermore, even when out-of-distribution training instances are given, by simple language edit, we can easily force generated synthetic dataset to be better matched with the test distribution (e.g. by substituting “cartoon kitchen” with “real kitchen” to bridge the gap between cartoon training set and real test set). Our main contributions are three-fold: (1) We propose a language-driven compositional synthetic dataset generation that automatically generate large-scale high-quality foreground objects and contextually coherent backgrounds. (2) We demonstrate strong performance gains across three downstream tasks and five widelyused benchmarks, under various low-resource regmines including 0, 1, and 10-shot settings. (3) We highlight that the compositionality in our synthetic data generation results in a closer alignment with the test distribution, 52 even when dealing with out-of-distribution training instances. To the best of our knowledge, this is the first work to use vision and language models for generating object detection and segmentation datasets. 
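To make the foreground branch of this pipeline concrete, the following sketch generates candidate foreground images from simple class-name templates (such as "A photo of dog in a white background") using the Hugging Face diffusers implementation of Stable Diffusion; the checkpoint name, template wording, and per-template counts are illustrative assumptions, and extracting the object masks is delegated to any off-the-shelf foreground/background segmenter.

```python
# pip install torch diffusers transformers
import torch
from diffusers import StableDiffusionPipeline

TEMPLATES = [                 # illustrative templates in the spirit of Table 4.2
    "A photo of {c}",
    "A realistic photo of {c}",
    "A photo of {c} in a white background",
    "A photo of {c} isolated on white background",
]

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def generate_foregrounds(class_name, per_template=500):
    """Synthesize candidate foreground images for one class (Section 4.3.1)."""
    images = []
    for template in TEMPLATES:
        prompt = template.format(c=class_name)
        for _ in range(per_template):
            images.append(pipe(prompt).images[0])   # one PIL image per call
    return images

dog_images = generate_foregrounds("dog", per_template=2)  # tiny run for illustration
# A foreground/background segmentation or matting model is then applied to each
# image to extract the object mask, since the prompts place objects on clean backdrops.
```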
4.2 Related works chap-4-sec:related_works Text-to-Image Synthesis Models. T2I approaches like DALL-E [440], RU-DALLE [487], Stable Diffusion [468], CogView [121] have revolutionized high-quality image generation. These approaches leverage benefits of using large transformer models trained using large vision and text data. Though they can generate high quality images of real world scenes, they are not able to generate ground truth labels for objects. For example, they are not able to provide bounding box and per-pixel annotation for objects. In our work, we propose an automatic approach to generate high quality images with ground truth bounding box and per-pixel labels from T2I models. Synthetic Data Generation. A series of works on using synthetic data for training computer vision problems have been proposed. Some of them include using graphics pipeline or computer games to generate high quality labelled data [457, 458, 470, 241, 557, 215]. Generally using graphics pipeline requires having 3D models of both objects and environment, that may limit their scalability. Some of them use generative models (e.g., GAN) [52, 177] or zero-shot synthesis [166] to augment datasets and remove bias. However, they need a relatively large initial dataset to train the model, and not easy to generalize to new domains. The idea of pasting foreground objects on background images has emerged as a easy and scalable approach for large scale data generation. The idea has been used to solve vision problems like object instance segmentation tasks [129], object detection and pose estimation [533, 234, 433, 544, 558], and more [127, 632, 184]. These approaches generally require accurate foreground object masks. This limits their scalability. While we utilize a cut-and-paste approach, in contrast to previous works in this space, our work can generate foreground object masks for any new object class. 53 Language for object recognition. Vision-language based models have been developed for image captioning [569, 454, 18, 210], visual question answering tasks [20, 564, 9, 102] and others [365, 624, 326]. In recent years, vision-language based models have been developed for self-supervised training. The recent CLIP appraoch [434] showed how training a model on large image-text pairs can generalize to several image classification datasets where current image based models performed very poorly. Such vision-language models havec been used to solve other tasks like object detection [206, 278, 330] and semantic segmentation [321]. These works demonstrate benefits of using language in solving vision tasks. However, these methods can not be used generate new data for any new tasks or environment settings. Large T2I synthesis models can be used to generate new data for a new task. A recent concurrent work X-Paste [656] has used Stable Diffusion to solve object detection. We are motivated by generation quality of these text to image generation methods in this work. 4.3 Method chap-4-sec:method We layout our method in Figure 4.2. This work aims to enhance object detection and instance segmentation models by efficiently generating a diverse and extensive collection of pseudo-labeled synthetic data using textto-image models. One key observation is that each image can be divided into backgrounds and foregrounds. In this work, we propose to generate synthetic foreground objects (with mask extraction) (Section 4.3.1) and contextually coherent backgrounds (Section 4.3.2) separately. 
Subsequently, after CLIP filtering on both (Section 4.3.3), the synthesized foregrounds are then composited onto the synthesized backgrounds via cut-paste [129] (Section 4.3.4). Examples of synthetic datasets can be viewed in Figure 4.8 and Figure 4.9. We utilize off-the-shelf text-to-image generative models (T2I) [238, 468, 239] to generate both the foreground masks and contextual backgrounds. Firstly, T2I efficiently compresses web-scale image-text data, ensuring both portability and scalability. The text-to-image generation model can produce an endless array of high-quality images, with our method offering a controllable generation process. Our experiments 54 CDI of Interest Class (dog): Interest Class (“dog”): Foreground Extraction Foreground Paste foreground to background Generated background CLIP DALL-E / Stable Diffusion Prompts A photo of a dog A realistic photo of dog … … Caption A dog lying on grass field Context description A real image of grass field A real photo of forest Context extraction and augmentation DALL-E / Stable Diffusion (a) Foreground Generation CLIP (b) Background Generation … Figure 4.2: (a) Foreground generation: (top row, Section 4.3.1) verbalizes class name into templates understandable by T2I models [440, 468], which synthesize desired foreground images with easy-to-separate backgrounds. Off-the-shelf foreground/background segmentation methods are then used to extract foreground segments from foreground images. (b) Background generation: (bottom row, Section 4.3.2) an image captioning method (e.g. SCST [454]) captions user-provided images (CDIs). Context words (e.g. “grass field”) are extracted and the augmented caption is feed into T2I to generate background images. (c) CLIP [434] is used (Section 4.3.3) to maintain the quality of both foregrounds and backgrounds, as well as ensure that the generated images do not have unwanted class. (d) Finally, we composite (Section 4.3.4) the foreground segments and background images to obtain synthetic images with accurate labels. chap-4-fig:capgen 55 A man and a woman are preparing food in a kitchen A kitchen and food a kitchen of white microwave with a pizza inside of it a white microwave with a pizza inside of it A cartoon kitchen with a microwave and other kitchen accessories A real kitchen with a microwave and other kitchen accessories Compositional & Explainable Remove Add Style change Figure 4.3: When CDIs can not perfectly describe the real test scenario, the compositional property of language can help to correct the context description. For instance, if the initial description contains noisy information “man and a woman”, we can remove the noisy information to generate a congruent context description. Images with a red frame show the generated image without language intervention and the green frame shows the images after the intervention. fig:compositionality and subsequent analysis confirm that the synthetic data distribution effectively complements the real data distribution, as detailed in Section 4.4.3. Also, the generative model can create novel scenarios that were not present in the training data (Section 4.4.4, Section 4.4.5). Furthermore, language-based generation facilitates compositionality (Section 4.4.5). Lastly, the synthetic nature of the data generation procedure ensures privacy preservation. We underscore that our method, thanks to the capabilities of T2I in generating diverse images, particularly excels in low-resource regimes where ground truth data is limited. 
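As a sketch of the context-guided background path of Figure 4.2 (b), the snippet below turns captions of the CDIs into augmented background prompts; here `captioner` stands in for SCST [454] or any off-the-shelf captioning model, and the interest-class list and the heuristic context parser are simplified assumptions rather than the exact pipeline.

```python
import re

INTEREST_CLASSES = {"dog", "cat", "person", "car"}   # downstream label set (illustrative)

def extract_context(caption, interest_classes=INTEREST_CLASSES):
    """Drop interest-class words and keep the scene description,
    e.g. 'A dog lying on grass field' -> 'grass field'."""
    words = [w for w in re.findall(r"[a-zA-Z]+", caption.lower())
             if w not in interest_classes]
    text = " ".join(words)
    # A tiny heuristic parser: keep what follows a location preposition.
    for prep in (" on ", " in ", " at ", " near "):
        if prep in f" {text} ":
            return f" {text} ".split(prep, 1)[1].strip()
    return text

def augmented_context_prompts(cdi_images, captioner, prompts_per_caption=2):
    """captioner(image) -> str is a stand-in for any image captioning model."""
    prompts = []
    for img in cdi_images:
        context = extract_context(captioner(img))
        prompts += [f"A real photo of {context}",
                    f"A real image of {context}"][:prompts_per_caption]
    return prompts

# Each prompt is then fed to the same text-to-image model used for foregrounds,
# and CLIP filtering (Section 4.3.3) keeps only top-ranked, object-free backgrounds.
```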
Following the terminology of [434, 320], in this work, we focus mainly on zero-shot (where no ground-truth data is provided but a list of desired objects) and few-shot (where one or ten images per class are given). We also included experiments to show our methods improve using full-set as well (Table 4.4). 56 A photo of A realistic photo of A photo of in pure background in a white background without background isolated on white background Table 4.2: Six manually designed templates for generating foreground images zero-shot. Here will be replaced by label names such as “bus”. The design philosophy is to put objects in a clean background for easy foreground extraction. tab:fg_template empty living room empty kitch blue sky empty city street, color empty city road, color empty lake empty sea railway without train empty railway, color trees forest empty street, colored farms nature empty farm stable Table 4.3: Sixteen handcraft templates for generating coherent background images zero-shot. The full template is “A real photo of <context>” where <context> is substituted with one of the above 16 places. The design philosophy is to create natural images but without any interested objects (thus “empty”) since we would not have segmentation labels for those objects if they are generated. tab:bg_template 4.3.1 Zero-shot Foreground Synthesis sub:fg_gen In object detection or segmentation, the quality of foreground objects significantly influences the performance of downstream tasks [184]. This poses a significant challenge in low-resource scenarios, as training a model that can effectively generalize with limited exposure to seen foregrounds is inherently difficult. To address this, we aim to harness the available prior knowledge to generate a vast and diverse collection of high-quality foregrounds. Samples of the synthesized images can be found in Figure 4.2 and Figure 4.8. Specifically, we manually design six fixed prompt templates (Table 4.2), where is substituted with the class name like “dog.” We then feed those verbalized templates into the T2I, which generates high-quality iconic object images. The templates are designed to elicit images where interested objects are centered on a simple isolated background, enabling straightforward extraction of the foreground object masks using an unsupervised segmentation method EM-Paste [173] built on top of EntSeg [429], as analyzed in Section 4.4.3. We emphasize that our method is robust to the selection of such segmentation method (Section 4.4.2). Examples of generated foregrounds and extracted masks can be viewed in Figure 4.8. 57 4.3.2 Language-driven Context Synthesis sub:bg_gen Merely having high-quality foregrounds is insufficient for training a robust downstream model. It is equally crucial to position those foregrounds within suitable in-context backgrounds [122, 394, 128, 646, 315, 632]. In our experiments (Section 4.4.2) we discovered that the choice of backgrounds can significantly impact performance, with different backgrounds having the potential to degrade results dramatically. Thus, we propose to utilize T2I to generate coherent backgrounds as relevant “context” for downstream models. Other than creating contextually coherent backgrounds, one additional benefit is that, as contexts are described in natural language, compositional generation is possibly by editing the descriptions e.g. adding or removing a keyword. 
For example, in Figure 4.3 the word “kitchen” can be added or “people” can be removed to align more closely with the test distribution. 4.3.2.1 Zero-shot scenario sub:bg_gen:zero-shot Similar to Section 4.3.1, for the zero-shot scenario, we design sixteen background templates (Table 4.3) intended to generate backgrounds without interested objects, avoiding “misplaced” objects with unknown object masks. The selection of background templates is primarily based on a cursory examination of various images from the training set, Pascal VOC [138] and MS COCO [348], in our case. However, it is important to highlight that this process is minimal, and only high-level background descriptions are required. 4.3.2.2 Few-shot scenario sub:bg_gen:few-shot In a more relaxed few-shot setting, a few exemplars, dubbed Context Description Images (CDI), are given, all sampled from a distribution that is contextually close to the test distribution. CDIs provide valuable contextual cues. We term this setting “few-shot” but note that provided CDIs do not have to be sampled from the training distribution (Figure 4.3). E.g. if the test distribution includes a kitchen environment, the small set of kitchen images can be taken from any public dataset or web images (Section 4.4.4, Section 4.4.5). Our 58 goal is to mine beneficial context to encourage the synthesized background distribution to more closely match the test distribution by imposing the context constraint. Lastly, we note that our method significantly reduces human labor, since our method requires a minimal number of CDI. In fact, our method works sufficiently well using as little as 1 CDI (Figure 4.10, Section 4.4.5). Context Description. We first describe the context information in natural language via image captioning. We leverage self-critique sequence training (SCST) [454] to generate a set of diverse textual captions for input CDIs. Yet we note that our method is agnostic to the image captioning method and any other methods e.g. [18, 210, 327] can also be applied. Context Extraction and Augmentation. Our focus is primarily on the background rather than the objects themselves, as the generated objects do not have labels. Therefore we need to extract background contexts out of the captions that might contain objects. We built a simple parser that extracts context words (e.g., “grass field”) or removes unwanted objects (e.g. detecting nouns such as “dogs”). These cleansed captions are then transformed into a few augmented contexts (e.g. “A real photo of grass field” and “A real photo of forest”) via simple heuristics. While it is possible to automate the entire context extraction process using large language models [57, 409], we leave that as future work. Context-guided Synthesis. We feed augmented contexts into the T2I to produce a diverse set of contextually coherent backgrounds. As these augmented contexts are derived from the CDIs, they are more contextualaligned with the test distribution; also they should contain no objects since these are removed in the extraction process. 4.3.3 CLIP Sanity Check sub:clip sanity check Due to the limitations of T2I, we observed that it occasionally generated irrelevant or nonsensical images that strongly deviated from the input text. 
Moreover, T2I method learns associations between the context and objects, leading to instances where horses would frequently appear in the context of a “stable,” even when 59 the text explicitly states “An empty stable.” Thus, in Section 4.4.2 we find it indispensable to post-process generated foregrounds and backgrounds via CLIP [434] filtering. Specifically, we use CLIP to rank images via two rules: images are semantically faithful to the input text and semantically dissimilar to any interest classes. This step ensures the generations align more closely with desired semantics. 4.3.4 Cut-paste Composition sub:cut paste To create the final pseudo-labeled training data, we composite foreground object masks (Section 4.3.1) onto the backgrounds (Section 4.3.2) using cut-paste [129]. At each step, a set of four foreground object masks is selected and pasted into a sampled background image, and such procedure is repeated until all foreground masks are pasted. The foreground mask, after random 2D augmentation such as rotation and scaling, is pasted on a random location in the image. In addition, following [129, 184], we apply a Gaussian blur on the object boundary with blur kernel σ to blend pasted foregrounds. 4.4 Experiments chap-4-sec:experiment #CDI Method Foreground Background mAP@50 mAP 1,464 (Fullset) Pure Real [219] Real Real 45.50 17.00 0 (0 shot) Pure Syn Syn Syn 43.24 19.78 20 · 1 (1 shot) Pure Real [219] Real Real 0.14 0.04 + cut paste [129] Real Real 6.03 2.07 Syn Fg Syn + Real Real 37.97 17.53 Pure Syn Syn Syn 44.24 20.63 Syn + real Syn + Real Syn + Real 45.62 (+39.59) 21.45 (+19.38) 20 · 10 (10 shot) Pure Real [219] Real Real 9.12 2.35 + cut paste [129] Real Real 29.60 10.82 Syn Fg Syn + Real Real 48.14 21.62 Pure Syn Syn Syn 45.12 22.38 Syn + real Syn + Real Syn + Real 51.82 (+22.22) 25.87 (+15.05) Syn + 1,464 real Syn + 1,464 Real Syn + 1,464 Real 68.38 (+22.88) 35.96 (+18.96) Table 4.4: We harness T2I (Stable Diffusion) to generate a large-scaled high-quality synthetic foregrounds and backgrounds, and improve VOC object detection. Column mAP is computed as the average of IoU ranging from 50 to 95 with step size 5. tab:voc_object_detection 60 We now present experiments to demonstrate the effectiveness of the large-scale synthetic data generated using our proposed approach. In Section 4.4.1, we first provide detailed results on object detection task on the Pascal VOC [138] and COCO [348] datasets in low-resource regimes including zero-shot, 1-shot and 10-shot settings. We are interested in low-resource settings, which are more common in practice yet highly challenging due to the limited information provided. In this case, CDIs are the shots given. We emphasize that our method particularly shines in such settings due to the capability to create a large diverse set of high-quality coherent synthetic training datasets. In addition, we present ablation studies (Section 4.4.2) to investigate the impact of different design choices. Importantly, further analysis (Section 4.4.3) demonstrates that the synthetic data distribution effectively complements the real data distribution. Next, in Section 4.4.4, we show that our method can generalize to more tasks, including instance segmentation tasks on VOC and COCO, and instance detection tasks on GMU-Kitchen [182], Active Vision [17], and YCB-video datasets [610]. Finally, in Section 4.4.5, we also provide results highlighting the compositional nature of our data generation process. Model, training, and evaluation criterion. 
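For reference, before turning to the results, here is a minimal PIL-based sketch of the cut-paste composition step of Section 4.3.4; the augmentation ranges, helper names, and the way pseudo-labels are recorded are illustrative assumptions rather than the exact released implementation.

```python
import random
import numpy as np
from PIL import Image, ImageFilter

def paste_foreground(background, fg_image, fg_mask, sigma=2.0):
    """Paste one foreground (RGB image + binary 'L' mask) onto a background at a
    random location with random scaling/rotation and a Gaussian-blurred boundary.
    Returns the composite and a tight bounding box (x0, y0, x1, y1)."""
    bg = background.copy()
    scale, angle = random.uniform(0.3, 0.8), random.uniform(-30, 30)
    size = (max(1, int(fg_image.width * scale)), max(1, int(fg_image.height * scale)))
    fg = fg_image.resize(size).rotate(angle, expand=True)
    mask = fg_mask.resize(size).rotate(angle, expand=True).convert("L")
    x0 = random.randint(0, max(0, bg.width - fg.width))
    y0 = random.randint(0, max(0, bg.height - fg.height))
    # Blurring the mask boundary blends the pasted object into the background.
    bg.paste(fg, (x0, y0), mask.filter(ImageFilter.GaussianBlur(sigma)))
    ys, xs = np.nonzero(np.array(mask) > 0)          # tight box from the binary mask
    return bg, (x0 + int(xs.min()), y0 + int(ys.min()),
                x0 + int(xs.max()), y0 + int(ys.max()))

def compose_image(background, foregrounds, n_objects=4):
    """Paste n_objects foregrounds sampled from (image, mask, label) triples and
    collect their pseudo-labels; assumes len(foregrounds) >= n_objects."""
    image, annotations = background.copy(), []
    for fg_image, fg_mask, label in random.sample(foregrounds, n_objects):
        image, box = paste_foreground(image, fg_image, fg_mask)
        annotations.append({"label": label, "bbox": box})
    return image, annotations
```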
We use MaskRCNN [451] with a ResNet-50 [220] backbone for compatibility among object detection, instance segmentation, and object instance detection tasks. We set the learning rate as 0.001 with a weight decay 0.0005 and train the models to convergence for both baselines and our approaches. We report mean average precision (mAP) for the results. In Section 4.4.2 we additionally experiment with MaskRCNN with DINO-pretrained ResNet-50 [69] as well as the recent transformer-based method EVA [142]. Variants of synthetic dataset. We evaluate three variants of our methods: • Pure Syn uses purely synthetic foregrounds (Section 4.3.1) and zero-shot backgrounds (Section 4.3.2.1) for zero-shot. For few-shot settings where CDIs are available, contextual backgrounds from CDIs (Section 4.3.2.2) are used on top of template backgrounds. 61 • Syn Fg pastes synthetic foregrounds (Section 4.3.1) on real backgrounds. Note that foregrounds from original real backgrounds are retained. • Syn + Real further incorporates synthetic backgrounds. In other words, synthetic datasets are blended with the entire real dataset. We mainly compare our methods with Pure Real, which trains MaskRCNN fully-supervised using the available real training set, and Pure Real + cut paste, which utilizes cut-paste [129, 184] to generate a relevant synthetic dataset from real images. We note that the latter is an upper bound of what can be achieved with provided real images without leveraging external sources, such as T2I, as in our work. 4.4.1 Object Detection chap-4-sec:det #CDI Method Foreground Background mAP@50 mAP 0 (0 shot) Pure Syn Syn Syn 16.30 8.40 80 · 1 (1 shot) Pure Real [219] Real Real 1.47 0.92 + cut paste [129] Real Real 2.89 1.23 Syn Fg Syn + Real Real 17.87 8.64 Pure Syn Syn Syn 16.80 8.59 Syn + real Syn + Real Syn + Real 20.82 (+17.93) 10.63 (+9.40) Table 4.5: Stable Diffusion generated foregrounds and contextual backgrounds enhance object detection on COCO dataset. tab:coco_object_detection For object detection, we present zero-/one-/ten-shot results on PASCAL VOC 2012 [138] in Table 4.4 and zero-/one-shot results on MS COCO [348] in Table 4.5. VOC encompasses 20 target objects, with 1,464 images for training and 1,449 for validation. While there exists an augmented training set comprising 10,582 images, it lacks instance segmentation masks, which are required by cut-and-paste augmentation. Given our emphasis on low-resource scenarios, we prioritize smaller training sets. MS COCO features 80 target objects with a total of 120k training instances. 62 Implementation details of synthetic datasets. We use the instance segmentation mask labels from the training set as our real foreground masks, and CDI is the number of training images in this case. Take 0-/10- shot VOC as example,† detailed statistics is provided in Table 4.6. For all three variants of synthetic datasets, we leverage fixed foreground templates to generate high-quality foregrounds (Table 4.2, Section 4.3.1). For each of the 20 objects, we generate 500 images for each of the six templates, then use CLIP (Section 4.3.3) to downselect 200 best images, thus aggregating to a total of 24k synthetic foregrounds. In the 0 shot Pure Syn scenario, we lack access to any training instances and must depend on fixed background templates to generate coherent backgrounds, as detailed in Table 4.3 and Section 4.3.2.1. For each of the 16 templates, we produce 600 images. Notably, the generated images are of high caliber and typically exclude target objects. 
As a result, we filter out 5% of the images using only CLIP, as described in Section 4.3.3, leaving us with 9,120 synthetic backgrounds. Onto each of these backgrounds, we superimpose four synthetic foregrounds, as outlined in Section 4.3.4. This results in a synthetic dataset complete with segmentation labels for the foregrounds. We repeat this procedure until the count of final pseudo-labeled images reaches 60k. In the 10 shot Pure Syn, we have an additional 200 CDIs at our disposal. Instead of directly training on these real images, they serve to facilitate the creation of more contextually coherent backgrounds through context-guided synthesis, as detailed in Section 4.3.2.2. Specifically, for each CDI, two captions are generated, from which 80 images are produced. Using CLIP, we then narrow this down to 30 images per CDI, as described in Section 4.3.3. This results in a total of 2×200×30 = 12k contextually relevant backgrounds. We then apply the cut-paste method, as outlined in Section 4.3.4, using the same zero-shot synthetic foregrounds on these expanded backgrounds, culminating in a final dataset of 60k. † 1-shot VOC and MS COCO is similar and omit for briefity. 63 For the 10 shot Syn+Real, our training encompasses not just the synthetic datasets from Pure Syn, but also the actual 200 images. Within those 20 × 10 real images, we also obtain multiple real foregrounds: in our selection of 10 shot data, there are 541 unique foregrounds. Experiment # Real images # Foreground # Background # Training set size Fullset (Pure Real) 1464 - - 1,464 0-shot (Pure Syn) 0 24k 9120 60k 10-shot (Pure Real) 200 - - 200 10-shot (Pure Real + cut paste) 200 200 200 60k 10-shot (Syn Fg) 200 24k 200 60k 10-shot (Pure Syn) 200 24k 9120+12k 60k 10-shot (Syn + Real) 200 24k+541 9120+12k+200 60k 10-shot (Syn + Real) + 1464 1464 24k+541 9120+12k+200 60k + 1464 Table 4.6: Detail statsitics of synthetic datasets created for VOC.tab:details_for_voc Figure 4.9 show more example training images generated by our pipeline on PASCAL VOC dataset: both foreground object and background context images are generated by our method with Stable Diffusion. We use Stable Diffusion [468] as the T2I, which takes 20 P40 GPUs 50h to generate 10-shot synthetic dataset. We emphasize that generation is a one-time process that can train various downstream tasks and models, and that our method is automatic and requires little human involvement to collect synthetic data. Collecting COCO or VOC manually takes much longer time with extra privacy or scalability issues. Synthetic foregrounds and backgrounds benefit downstream task. On VOC, we first notice that a model trained solely on synthetic data in the absence of any real images (0 shot Pure Syn) achieves comparable performance to a model trained on 1.4k real images (Pure Real). This suggests that synthetic datasets can effectively improve downstream performance when only prior knowledge of interested objects is known. We then observe that Pure Syn performance improves as adding CDIs, reinforcing our assumption that contextual information encoded in the backgrounds provides valuable cues in learning. Further, we note that Syn Fg in few-shot setting outperforms Pure Real + cut paste significantly, implying that inclusion of large-scale synthesized foregrounds enables the detection model to learn a more comprehensive understanding of objects due to the diversity and coherency of the synthetic data. 
Lastly, further performance gains by adding synthetic 64 backgrounds (Syn + real) show that blending both real and synthetic leads to the best performance, e.g. +22.22 net improvement over Pure Real + cut paste in 10 shot regime. Combining just 10-shot synthesized data with the 1,464 real training images achieves 68.38 mAP@50, a substantial +22.88 net improvement over using the 1,464 real training images alone. On COCO, we observe a similar trend as VOC, i.e. synthetic datasets provide a strong learning signal, while mixing both synthetic and real gives the most performance boost. 4.4.2 More Baselines and Ablation Studies chap-4-sec:ablation Figure 4.4: Our synthetic dataset generation is agnostic to different models and backbones. fig:ablation_on_backbone Agnostic to backbones. We first show that our synthetic data generation pipeline is model-agnostic. In Figure 4.4, we present performance of two additional models: transformer-based EVA [142] and DINO self-supervised pretrained ResNet-50 [69] on VOC. We observe a similar trend across the model choice: With only 10 shot exemplars, our approach can outperform fullest (pure real), which is 7x larger. On the other 65 hand, the model trained with our 0-shot-generated dataset can significantly surpass the best model available for training on 10 shot with cut-and-paste augmentation. Contextual backgrounds are crucial. We already demonstrated that in-context coherent backgrounds are beneficial in Section 4.4.1. We further investigate what are the best contextual backgrounds. As shown in Figure 4.5, on VOC we compare our synthetic contextual background from context-guided synthesis (Section 4.3.2) with three other contexts: (1) Search Engine: substitute T2I with a search engine. Specifically, we directly collect backgrounds from Google search by using the same prompts as described in Section 4.3.2; (2) Other real datasets: use MS COCO dataset as background. We randomly sample COCO images that contain only the remaining 60 classes so that they are disjoint from VOC 20 objects; (3) Black background: replace each of the contextual backgrounds with pure black backgrounds. Contextual backgrounds consistently outperform other baselines, suggesting that contextual cues in the backgrounds are important to learn a detection model. Figure 4.5: Contextual backgrounds generated by our approach provide valuable cues. fig:baselines on syn bg 66 sub:CLIP_is_beneficial CLIP and context extraction controls semantic quality and cleanness. We use CLIP [434] to filter and rank the synthesized context backgrounds (Section 4.3.2). In Figure 4.6, we train VOC object detection model without CLIP. We observe at most -21.65 net decrease in performance, which implies that using CLIP as a variance reduction step is essential in pruning noisy or nonsensical backgrounds that might be generated by T2I. We additionally ablate the effect of context extraction, as captions might contain interested objects. We found without extraction, interested objects contained in the captions will often be reflected in the synthetic images, and thus mislead the model during training as no annotation is provided in the pseudo-labeled dataset (Section 4.3.4). Figure 4.6: Our generated synthetic foregrounds (fg) are high-quality and diverse, and adding more helps. On the other hand, CLIP filtering and context extraction are crucial tricks to ensure the quality of synthetic backgrounds (bg). fig:ablation_on_syn_fg Extracting information from the generative model to enhance a discriminative model. 
Our foreground generation only requires class labels, so we can potentially generate as many foreground objects and their 67 corresponding masks per class as we want. In Figure 4.6, we observe performance improvement as the number of foreground objects increases. Robustness to foreground extraction methods. We next empirically demonstrate that foreground images consist of easy-to-separate background images. To this end, we use two off-the-shelf image segmentation methods, Entity Segmentation [429] and PP-Matting [80] to segment out the foreground objects. We use the segmented foreground to generate training data and show their results on the Pascal VOC object detection task in Table 4.7. We observe that we achieve similar mAP scores in both these settings, which demonstrates that our approach is generally robust to the selection of the image segmentation method. #CDI Method EntSeg [429] PP-Matting [80] 0 shot Pure Syn 43.24 46.11 20 · 1 (1 shot) Syn Fg 37.97 42.82 Pure Syn 44.24 44.84 Syn + real 45.62 46.71 20 · 10 (10 shot) Syn Fg 48.14 47.78 Pure Syn 45.12 43.01 Syn + real 51.82 52.39 Table 4.7: Our approach is robust to foreground extraction methods. tab:ablation_of_entseg Figure 4.7: Mixing of real and synthetic data further improves downstream models. fig:real+syn 68 Mixing real data with synthetic data. We show the effect of incorporating different percentages of realworld training images together with our synthesized images on Pascal VOC object detection. In Figure 4.7, we experiment with adding additional 5, 25, 50, 75 and 100 percent of real images on top of the synthetic dataset generated in Section 4.4.1. We observe strong performance gains compared to relying only on the same amount of real images or after applying cut-paste [129], e.g. improving from 51.82 mAP to 68.38. In Section 4.4.4, we show that such behavior is not limited to VOC dataset only. Pascal VOC training images Pascal VOC training masks Our generated Foreground images Our generated Foreground masks (a) Foreground Chair Dining table Sofa comparison (b) Object detection comparison Trained with pure real data Trained with real + synthetic data (c) Per-class Accuracy on Pascal VOC Figure 4.8: Synthetic data distribution complements the real data distribution. Our foreground generation helps even more on the highly occluded classes. fig:analysis 69 4.4.3 Synthetic data distribution complements the real data distribution. chap-4-sec:analysis Our results demonstrate that large T2I synthesis models can be used to generate high-quality training data for several large-scale problems. In order to analyze the behavior of our approach, we look at the object detection scores on VOC before and after adding synthetic data to full real training data. Further, in Figure 4.8 (c), we calculate the per-class mAP and look at the relative improvement and overall scores. We made two crucial observations. Firstly, there’s a substantial relative increase in accuracy across all classes, with mAP50 values ranging from 30% to 100%. This suggests that the synthesized images significantly enhance the performance of downstream detection tasks. The marked improvement underscores how effectively the synthetic data distribution complements the real data distribution. Secondly, to delve deeper into the synergy between synthetic and real data distributions, we examined the performance on a per-class basis. Notably, we witnessed marked enhancements in specific classes, such as the sofa, dining table, and chair. 
We hypothesize that the significant improvement observed in indoor classes can be attributed to the generation of clean and unobstructed foreground objects using our approach. In real-world scenarios, these objects are typically occluded due to the presence of humans or other objects, as depicted in the first two rows of Figure 4.8 (a). Consequently, training images may exhibit similar characteristics, with a high degree of occlusion. However, our approach enables the generation of a diverse set of clean objects, which supplements the quality of the training data from the real world, as illustrated in the remaining two rows of Figure 4.8 (a). This allows the training models to learn from both clean and occluded examples. Qualitative examples and the final object detection results on the test set are presented in Figure 4.8 (b), demonstrating that the model trained with a combination of synthetic and real data outperforms the model trained solely on real data, particularly for highly occluded categories. 70 Figure 4.9: Pseudo-labeled synthetic images generated by our pipeline. fig:final-composition-voc-smaller #CDI Method mAP@50 mAP 0 Pure Syn 42.42 22.38 20 · 1 (1 shot) Pure Real [219] 0.00 0.00 + cut paste [129] 1.27 0.88 Syn Fg 42.23 21.80 Syn + real 46.74 (+45.47) 24.81 (+23.93) 20 · 10 (10 shot) Pure Real [219] 5.21 2.09 + cut paste [129] 24.50 9.24 Syn Fg 51.29 29.11 Syn + real 55.19 (+30.69) 30.77 (+21.53) #CDI Method mAP@50 mAP 0 Pure Syn 15.04 8.40 80 · 1 (1 shot) Pure Real [219] 0.03 0.00 + cut paste [129] 3.83 1.87 Syn Fg 15.10 8.23 Syn + real 17.56 (+13.73) 9.12 (+7.25) Table 4.8: Instance Segmentation for VOC (left) and COCO (above). Our methods generalize to other tasks and are competitive even in 1 shot. tab:instance_seg 4.4.4 Generalization to more tasks chap-4-sec:instance_segmentation chap-4-sec:generalization Instance Segmentation. Our approach can generalize well to low-resource instance segmentation task on both Pascal VOC and COCO. Following same settings as Section 4.4.1, in Table 4.8, we observe similar patterns across two datasets: 0 shot pure synthetic dataset yields strong performance while mixing real images further boosts the performance. chap-4-sec:kitchen_instance Object Instance Detection. We evaluate our method on object instance detection tasks using three benchmarks: GMU-Kitchen [182], Active Vision [17], and YCB-video datasets [64]. For a fair comparison with [129], we instead use the object instance masks provided with the datasets and only synthesize backgrounds. In Table 4.10 we compare our synthetic contextual backgrounds, generated from Ru-DALLE [487], with 71 context images from UW dataset [182] following the setup of prior work [129]. Significant performance boosts indicate that our method is able to create congruent context images compared to real-world relevant images from public datasets. Furthermore, similar to experiments done in Section 4.4.2, we investigate if the behavior of improved performance via blending real and synthetic can be transferred to this task. In Table 4.9 on the GMU kitchen, we use all synthetic backgrounds, but with various percentages of the real training images. We observe that using only a subset of real-world data (70%) with our synthesized images achieves better performance than full (100%) real-world data only. This suggests the advantages of our data generation approach saving the amount of human efforts required in labeling the real-world data significantly. 
Further, we also observe that accuracy gradually improves from 78.3% to 91.4% as we increase the amount of real-world data. a counter top in a kitchen next to a table a kitchen area with a counter a kitchen with a wooden counter a wooden table sitting in a kitchen next to an oven DALL·E a close up of a sink in a kitchen a metal sink in a kitchen next to a red counter a metal sink in a kitchen next to red tiles a stainless steel sink in a kitchen DALL·E Figure 4.10: Contextual backgrounds generated from user-provided CDI. We note that even if the user provides as little as 1 CDI, our approach can still generate large-scale coherent images. fig:evidence 4.4.5 Compositionality in Synthetic Dataset chap-4-sec:compositionality As mentioned in Section 4.3.2, the compositional nature of our language-based context image generation allows us to remove noisy information, add relevant but missing information from the original textual description of the CDIs, or change the style of the received CDI to a more desired one. For instance, the language description of a kitchen with people present in it may contain “people” as a distractor that may hamper the quality of the generated images and negatively affect the accuracy. Using our pipeline, we can 72 Dataset CC CM HB HS MR NV1 NV2 PO PS Pbbq RB mAP Syn (ours) 79.0 92.9 90.4 44.9 77.0 92.1 88.0 77.5 64.1 75.7 80.2 78.3 100% Real 81.9 95.3 92.0 87.3 86.5 96.8 88.9 80.5 92.3 88.9 58.6 86.3 Syn (ours) + 10% Real 90.5 96.9 93.2 74.0 60.4 90.7 86.5 48.7 97.7 86.4 72.1 81.6 Syn (ours) + 40% Real 91.8 97.4 94.5 84.9 75.1 90.7 78.6 52.1 96.9 87.6 77.9 84.3 Syn (ours) + 70% Real 92.7 98.2 95.2 90.9 88.0 93.1 89.7 50.3 97.6 92.2 78.3 87.9 Syn (ours) + 100% Real 94.4 98.2 95.2 90.7 92.5 94.1 93.0 72.8 98.3 98.7 79.8 91.4 Table 4.9: We highlight that our synthesized data together with 70 % amount of real data achieves better performance than full (100 %) set of real data only. This highlights the benefit of our approach in reducing total human efforts. Syn (ours) means ru-DALLE synthesized 1500 diverse images (use UW as CDI). Top row terms are: CC: Coca Cola, CM: Coffee mate, HB: honey bunches, HS: hunt’s sauce, MR: mahatma rice, NV1: nature V1, NV2: nature V2, PO: palmolive orange, PS: pop secret, Pbbq: pringles bbg, RB: red bull. table:vary_real remove the distractor by detecting and removing it (“people”) from the caption before feeding them into T2I (Figure 4.3). Contextual backgrounds from only one CDI. To demonstrate the compositionality of our approach, we extend the experimental setting from Object Instance Detection in Section 4.4.2: focusing on contextual background generation from only one CDI on which GMU-Kitchen ground-truth objects are pasted. In this case, CDI means the input image, but no longer in-domain. We consider 4 scenarios (Cartoon kitchen, Skeleton kitchen, objects in Kitchen and Kitchen with human), where provided CDIs are out of distribution from the target domain, which is a real-world kitchen without humans. There are two main challenges to (1) the conventional method struggles to learn effectively with as few as one training image (2) the provided training image is out-of-domain. In Table 4.11, we demonstrate that our synthetic data generation pipeline can address both of these challenges. 
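To illustrate the language-level intervention just described, the sketch below drops distractor phrases (e.g., "people") from a caption and optionally appends in-domain context words before the caption is passed to the T2I model; the distractor list, the regular expression, and the commented-out synthesis call are illustrative assumptions.

```python
import re

# Phrases that often hamper background synthesis in our use case (illustrative list).
DISTRACTOR_PATTERN = re.compile(
    r"\b(with |and )?(a |an |some |a group of )?(people|person|man|woman|men|women)\b",
    flags=re.IGNORECASE,
)

def intervene_caption(caption: str, append: str = "") -> str:
    """Drop distractor phrases and optionally append in-domain context words."""
    edited = DISTRACTOR_PATTERN.sub("", caption)
    edited = re.sub(r"\s+", " ", edited).strip(" ,")
    return f"{edited} {append}".strip() if append else edited

caption = "a kitchen with people next to a wooden counter"
print(intervene_caption(caption, append="photo-realistic"))
# -> "a kitchen next to a wooden counter photo-realistic"
# backgrounds = text_to_image(edited_caption, n_images=5)  # T2I call, e.g., Ru-DALLE
```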
Specifically, when training solely on a synthetic dataset constructed by pasting objects onto a single CDI (Only CDI), performance across Dataset GMU Active Vision YCB-video UW-Kitchen 76.1 22.6 38.3 DALL-E (ours) 80.1+4.0 25.8+3.2 45.5+7.2 Table 4.10: Contextual synthetic backgrounds produced by our approach significantly enhance object instance detection accuracy across three datasets. table:active-vision-ycb-main 73 Dataset Only CDI No Intervention After Intervention Cartoon Kitchen 11.2 70.0 76.7+6.7 Skeleton Kitchen 10.3 64.6 74.8+10.2 Objects in Kitchen 9.4 71.8 77.0+5.2 Kitchens with Human 10.2 70.9 76.9+6.0 Table 4.11: Even if the user provides out-of-distribution CDI, our approach is able to produce a synthetic dataset tailored towards actual test distribution by in-domain intervention. table:compositional-mian various settings is suboptimal. This underscores the inherent difficulties the model faces when learning from limited images and a narrow diversity. However, our methodology can produce a vast collection of highquality backgrounds that are contextually relevant. The superior performance of the No Intervention results compared to the Only CDI results substantiates our hypothesis. Qualitative results are presented in Figure 4.10 as additional empirical support that T2I is able to generate many relevant images from a single CDI using our approach. Mitigate domain gap via language compositionality. While directly applying our approach leads to a significant performance improvement across all four scenarios, there remains room for enhancement due to the existing domain gap. Given that the contextual backgrounds are generated using augmented captions, i.e. operating in the language space, in-domain intervention becomes feasible. Specifically, we have the flexibility to add, remove, or alter contextual words, thereby influencing the style of the generated images. In Table 4.11 we observe up to 10.2 performance gain of After Intervention over No Intervention, which demonstrates the effectiveness of intervention in bridging the domain gap. Such interventions ensure a closer alignment between the synthesis distribution and the desired test distribution. 4.5 Conclusion chap-4-sec:conclusion We have proposed a new paradigm to generate large-scale labeled data for object detection and segmentation tasks using large vision and language-based text-to-image synthesis frameworks. We demonstrate effortless labeled data generation on popular benchmarks for object detection tasks. Computer vision models trained 74 using these data improve the performance over the models trained with large real data. Thus reducing the need for expensive human labeling process. We also highlight the compositional nature of our data generation approach on out-of-distribution and zero-shot data generation scenarios. We believe our approach opens door to democratizing computer vision models. Limitations. Since we rely on T2I synthesis models for data generation, we are limited by two issues. First, our approach does not provide control for illumination, viewpoints, object pose and other such data generation properties. Second, our current approach can not generate labelled data for 3D geometry tasks like 3D object pose estimation tasks. We leave these problems as interesting future works. 
75 Chapter 5 EM-Paste: EM-guided Cut-Paste for Image-level Weakly Supervised Instance Segmentation chapter-5 We propose EM-PASTE: an Expectation Maximization (EM) guided Cut-Paste compositional dataset augmentation approach for weakly-supervised instance segmentation using only image-level supervision. The proposed method consists of three main components. The first component generates high-quality foreground object masks. To this end, an EM-like approach is proposed that iteratively refines an initial set of object mask proposals generated by a generic region proposal method. Next, in the second component, high-quality context-aware background images are generated using a text-to-image compositional synthesis method like DALL-E. Finally, the third component creates a large-scale pseudo-labeled instance segmentation training dataset by compositing the foreground object masks onto the original and generated background images. The proposed approach achieves state-of-the-art weakly-supervised instance segmentation results on both the PASCAL VOC 2012 and MS COCO datasets by using only image-level, weak label information. In particular, it outperforms the best baseline by +7.4 and +2.8 mAP0.50 on PASCAL and COCO, respectively. Further, the method provides a new solution to the long-tail weakly-supervised instance segmentation problem (when many classes may only have few training samples), by selectively augmenting under-represented classes. 76 5.1 Introduction The instance segmentation task aims to assign an instance label to every pixel in an image. It has been found in many applications on many real-world domains [211], e.g., self-driving cars, AR/VR, robotics, etc. Standard approaches to solving this problem involve framing it as a per-pixel labeling problem in deep learning framework [219, 211]. Training of instance segmentation methods requires a vast amount of labeled data [219]. Getting a large labeled dataset with per-pixel instance labels is very expensive, requires significant human effort, and is also a time-consuming process. In order to tackle these issues, alternative approaches have been proposed. One direction involves utilizing synthetic data to train instance segmentation methods [456, 250, 168]. However, they generally suffer from the sim2real domain gap, and expert knowledge is required to create synthetic environments [242]. A few other works have used object cut-and-paste [129, 185, 172] to augment training data for instance segmentation tasks. However, these methods require the availability of accurate foreground object masks, so that objects can accurately be cut before they are pasted. Acquiring these foreground masks may require extensive human efforts, which can make this line of work difficult to scale. Weakly-supervised learning approaches have evolved as important alternatives to solving the problem. A few of these methods [286, 343, 536, 26, 249] involve using bounding boxes as a source of weak supervision. Bounding boxes contain important cues about object sizes and their instance labels. However, even bounding boxes are taxing to label. Another line of works [668, 675, 93, 164, 256, 312, 289, 10, 26, 358] explores using only image-level labels for learning instance segmentation. Due to the lack of segmentation annotations, those works generally need to introduce object priors from region proposals [668, 675, 26, 312]. 
One approach involves utilizing signals from class activation maps [668, 675], yet those maps do not provide strong instance-level information but only semantic-level, and can be noisy and/or not very accurate. Another procedure involves generating pseudo-label from proposals and training a supervised model with pseudo-label 77 as ground truth. Those methods can not generate high-quality pseudo-labels which hinders the supervised model performance. In this work, we propose EM-PASTE, a new weakly-supervised instance segmentation approach using only image-level labels. It consists of: First, "EM-guided Cut": we extract high-quality foreground object masks using an Expectation Maximization (EM)-like method to iteratively optimize the foreground mask distribution of each interested class and refine object mask proposals from generic region segmentation methods [374, 21, 429]. Then, we generate high-quality background images, by first captioning the source image, then passing them to text-to-image synthesis method (similar to [172]), e.g., DALL-E [440, 487, 121] and stable diffusion [468]. Finally, "Paste": we create a large labeled training dataset by pasting the foreground masks onto the original and generated context images (Figure 5.3). We achieve state-of-the-art (SOTA) performance on weakly-supervised instance segmentation on the PASCAL VOC [138] and COCO dataset [347] using only image-level weak label. We outperform the best baselines by +7.3 and +2.8 mAP0.50 on Pascal VOC and COCO datasets respectively. EM-PASTE also provides a new solution to long-tail weakly-supervised instance segmentation problem on Pascal VOC dataset. Additionally, we also show that EM-PASTE is generalizable to object detection task. 5.2 Related works Weakly Supervised Instance Segmentation Since acquiring per-pixel segmentation annotations is timeconsuming and expensive, many weakly supervised methods have been proposed to utilize cheaper labels. Existing weakly supervised instance segmentation methods can be largely grouped in two categories, characterized by labels that the algorithms can access during the training phase. The first line of works explores the use of bounding boxes as weak labels for instance segmentation tasks [286, 343, 536, 26, 249]. Notably, [286] generate pseudo-instance mask by GrabCut+ and MCG [21], and [249] restraint the bounding box by tightness. Another series of works have also started using image-level labels as weak labels for instance 78 segmentation tasks [668, 675, 93, 164, 256, 312, 289, 10, 26, 358]. Notably, [668] utilizes class peak response, [164] refines segmentation seed by a multi-task approach, [26] improve the generated pseudo-labels by viewing them as conditional probabilities, and [289] transfer semantic knowledge from semantic segmentation to obtain pseudo instance label. However, aggregating pseudo-label across multiple images remains largely unexplored. Data Augmentations for Instance Segmentation In recent years, data augmentation has been an indispensable component in solving instance segmentation tasks [211, 219]. [185] found that large-scale jittering plays an important role in learning a strong instance segmentation model, especially in a weakly-supervised setting. [129] proposed a new paradigm of data augmentation which augments the instances by rotation, scaling, and then pastes the augmented instances to images. Entitled Cut-Paste augmentation strategy can diversify training data to a very large scale. 
Empirical experiments [129, 185] have found that cut paste augmentation can lead to a major boost in instance segmentation datasets. These approaches require the presence of foreground object masks. So they can not be applied for weakly-supervised instance segmentation problems using only image-level labels. In contrast, our approach is designed to work with only image-level label information. Long-Tail Visual Recognition Instance segmentation models usually fail to perform well in real-world scenarios due to the long-tail nature of object categories in natural images [208, 561, 218, 360]. A long-tail dataset consists of mostly objects from head classes, while objects from tail classes comprise relatively few instances. Existing instance segmentation methods [219] often yield poor performance on tail classes, and sometimes predict head class all the time [582]. Existing methods to alleviate this include supervising models using a new loss that favors tail classes [582, 248] and dataset balancing techniques [218, 370] that redistribute classes so that model can see more tail instances. However, few works evaluate weakly-supervised methods in long-tail setting. 79 ... (c) (d) (e) (a) (b) Image Classifier ... Figure 5.1: Step 1 of foreground extraction. (a) Entity Segmentation extracts segments from images. (b) Grad-CAM highlights a region based on the given label, and the center of moments (white dot on the image) is calculated for the highlighted region. (c) For all eligible segments, we compute the pixel-wise average distance to the center of the region highlighted by Grad-CAM. (d) We select n segments that have the shortest distances to the center. (e) All n foreground candidate segments are filtered using the classifier network, and we select the foreground with highest predicted probability. chap-5-fig:1 5.3 Method chap-5-sec:method Our goal is to learn an instance segmentation model in a weakly supervised framework using only image-level labels. To this end, we propose EM-PASTE: EM-guided Cut-Paste with DALL-E augmentation approach that consists of three main components: foreground extraction (Section 5.3.1), background augmentation (Section 5.3.2), and compositional paste (Section 5.3.3). EM-PASTE produces an augmented dataset with pseudo-labels, and we train a supervised model using pseudo-labels as ground-truth. 5.3.1 EM-guided Foreground Extraction chap-5-sec:Foreground Extraction We propose an Expectation Maximization (EM) guided foreground extraction (F-EM) algorithm. Given only image-level labels for a dataset, F-EM extracts as many high-quality foreground object masks as possible by iteratively optimizing the foreground mask distribution of each interested class and refining object mask proposals. There are three steps: 1) region proposal, 2) Maximization step to estimate object foreground distribution statistics of each interested object class, 3) Expectation step to refine the collection of matching region proposals given the approximated object foreground distribution statistics. Steps 2 and 3 are performed in an interactive manner. Figure 5.2 demonstrates different steps for extraction of foreground object masks. 80 Figure 5.2: Step 2 and 3 of foreground extraction. (a) Each extracted foreground is passed to the classifier, and a latent representation of the image is extracted using a bottleneck layer. (b) Using the mean of all latent representations, we keep k% representations that are close to the mean and rule out outliers. 
(c) The mean is updated after ruling out the outliers. (d) For each image, latent representations of all eligible segments are obtained by the classifier network. (e) The segment with the highest cosine similarity to the updated mean is selected as the new foreground of the image. (f) After obtaining a new set of foregrounds, they are used as input of step 2 of the next iteration. fig:EM Step 1: Region Proposal. In this step, the goal is to generate candidate foreground object segments corresponding to a given image label for each image. Suppose we are given a dataset D = {(Ii , yi)} N i=1 where each image Ii may contain one or more objects of different classes, therefore yi is a binary vector that corresponds to image-level object labels for a multilabeled image Ii . We train an image classifier f(·) which takes image I as input and predicts image label: y = f(I). Given a ground truth class a, for each image Ii where y a i = 1, meaning that an object of class a is present in image Ii , we generate the Grad-CAM [495] activation map through the classifier f(·) (Figure 5.1 (b)). Then we threshold the activation map to convert it to a binary mask Ga i that is associated with class a and calculate x-y coordinate of the center of gravity of the 81 object mask as: c a i = (c a ix, ca iy) = (P x,y Ga i (x, y)x/G, P x,y Ga i (x, y)y/G), where G = P x,y Ga i (x, y). c a i will be used as anchor to select foreground segments of class a object for image Ii . Next, for each input image Ii , we use an off-the-shelf generic region proposal method to propose candidate objects. The generic region proposal methods include super-pixel methods (SLIC [5], GCa10 [565, 143]) and hierarchical entity segment methods (MCG [21], COB[374], entity segmentation [429]). These approaches only propose general class-agnostic segmentation masks with no class labels for the segments. In this work we show the results of using the entity segmentation method [429] and COB[374] to obtain a set of segments of the image Si = {s 1 i , s2 i , ..., sm i }, but we note that our method is compatible to other methods as well. Then we use the above computed Grad-CAM location anchor c a i for interest class a in image Ii to find the correct foreground segment Oa i with label a from Si . For each segment s j i , we calculate a pixel-wise average distance to the anchor c a i . We have the location assumption that the correct foreground segments should have a large overlap with the Grad-CAM mask Ga i . In other words, foreground object mask Oa i for object class a should have short average euclidean distance to the Grad-CAM center (dist(Oa i , ca i )). We observe that this is generally true when there is only one object present in an image. But, in many images, more than one object from the same class can be present. In this case, foreground object may not perfectly overlap with the Grad-CAM activation map, because c a i is the mean position of multiple foreground objects. To resolve this problem, we keep top-n segments with the shortest distances to the center c a i . Typically n ≤ 3. Then, we use the image classifier as an additional semantic metric to select the correct foreground. Specifically, we pass the top-n segments through the same image classifier f(·) used in Grad-CAM. The segment with the highest predicted probability is our initial selection of the foreground of the input image Oi . 
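A minimal sketch of this initial selection step is given below: it computes the Grad-CAM center of gravity, ranks the proposed segments by their pixel-wise average distance to that anchor, and lets the image classifier pick among the top-n candidates. The `classifier_prob` interface and the masking convention are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def center_of_mass(binary_mask):
    """Center of gravity (x, y) of a thresholded Grad-CAM mask."""
    ys, xs = np.nonzero(binary_mask)
    return xs.mean(), ys.mean()

def mean_distance_to_anchor(segment_mask, anchor):
    """Pixel-wise average Euclidean distance from a segment to the anchor."""
    ys, xs = np.nonzero(segment_mask)
    ax, ay = anchor
    return np.sqrt((xs - ax) ** 2 + (ys - ay) ** 2).mean()

def select_initial_foreground(image, segment_masks, cam_mask, classifier_prob, n=3):
    """Step 1: keep the n segments closest to the Grad-CAM anchor, then choose the
    one the image classifier is most confident about.

    classifier_prob(crop) is assumed to return the probability of the target class
    for a masked-out copy of the image (illustrative interface).
    """
    anchor = center_of_mass(cam_mask)
    ranked = sorted(segment_masks, key=lambda m: mean_distance_to_anchor(m, anchor))
    candidates = ranked[:n]
    scores = [classifier_prob(image * m[..., None]) for m in candidates]
    return candidates[int(np.argmax(scores))]
```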
While, the initial extraction is far from perfect, because grad-cam localize the most discriminative location depends only on high-level classification information, which may have mismatch to the correct object location given complex scene. 82 Algorithm 2 F-EM alg:extract Input: Set of images I = {Ii} N i=1, set of labels Y = {yi} N i=1, image classifier f(·) with feature extractor ϕ(·), class of interest a Output: Set of foregrounds O = {Oi} N i=1 of class a objects. 10 O ← ∅ ▷ Step 1 for image Ii ∈ I where y a i = 1 do 11 Si ← EntitySeg(Ii) 12 c a i ← center of Grad-CAM(Ii , a) 13 ps ← f(s) for n of s ∈ Si with smallest dist(s, ca i ) 14 O ← O ∪ {arg maxs{ps}} for j iterations do 15 µ, ˆ Σˆ ← mean and covar of {ϕ(Oi)} N i=1 ▷ Step 2 (Maximization) 16 O′ ← k% of Oi ∈ O with smallest m-dist(ϕ(Oi),(ˆµ, Σ)) ˆ 17 µˆ ′ , Σˆ′ ← mean and covar of {ϕ(O′ i )} N×k% i=1 18 O ← ∅ ▷ Step 3 (Expectation) 19 for image Ii ∈ I where y a i = 1 do 20 Si ← EntitySeg(Ii) 21 O ← O ∪ {arg mins∈Si m-dist(ϕ(s),(ˆµ ′ , Σˆ′ ))} To resolve the above issues and to further improve foreground object masks, we propose an iterative approach whereby we select a subset of segments O from the larger set of original segments S generated by the region proposal method. These selected segments are considered as foreground object segments. We frame the iterative segment selection within an Expectation-Maximization like steps. The following EM steps assume that the latent representation (through a feature extractor ϕ(·) of the image classifier f(·)) of all foreground objects from the same class a follow a distribution pψa , here we assume pψa ∼ N (µ, Σ) follows Multivariate Gaussian distribution because in the latent space of image classifier f(·), representations of images from same class should be a single cluster. Our goal is to find the optimal parameters of the distribution. This corresponds to generating right object masks. We follow an expectation maximization (EM-) like approach to find optimal parameters. This involves iteratively optimizing the distribution parameter ψa which includes µ ∈ R d and Σ ∈ R d×d for each interest class a (Maximization step). Then use ψa to find accurate foreground segments in the latent space of ϕ(·) (Expectation step). Figure 5.2 shows the whole process. 83 Step 2 (M-step) Maximization. In M-step, for each interest class a, our goal is to find the optimal parameters µa, Σa given the candidate foreground proposals O. Here O are extracted foreground objects (from step 1, or step 3 at the previous iteration) for a specific class a. For each segment Oi , we generate its latent space representation hi by passing it through the image classifier hi = ϕ(Oi). In particular, hi is the feature after the last convolution layer of the classifier. We compute µˆ = E(ψ) = 1 N PN i=1 hi and Σ = ˆ E((h − µˆ)(h − µˆ) T ) as initial mean vector and covariance matrix of latent space representations of all selected foreground segments. Because not all foreground masks Oi are correct foregrounds, some of them may be background objects or objects with a different label. To remove the outlier and update the mean vector, we rule out outliers by keeping only k% of the segments that are closest to µˆ based on the Mahalanobis distance m-dist(hi ,(ˆµ, Σ)) = ˆ q (hi − µˆ) T Σˆ −1 (hi − µˆ) of their latent representations. 
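A sketch of this inlier filtering is shown below, assuming `features` stacks the pooled classifier features of the currently selected foreground segments; in practice, the covariance may require stronger regularization (or a diagonal approximation) when the feature dimension is large.

```python
import numpy as np

def m_step(features, keep_ratio=0.8, eps=1e-6):
    """Maximization step: fit a Gaussian to foreground features and drop outliers.

    features: N x d array of classifier features phi(O_i) for the current masks.
    keep_ratio: fraction k% of segments retained as inliers.
    Returns the refined mean and covariance computed from the inliers only.
    """
    d = features.shape[1]
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + eps * np.eye(d)
    cov_inv = np.linalg.inv(cov)
    diff = features - mu
    # Mahalanobis distance of each segment to the current class mean.
    m_dist = np.sqrt(np.einsum("nd,dk,nk->n", diff, cov_inv, diff))
    n_keep = max(1, int(keep_ratio * len(features)))
    inliers = features[np.argsort(m_dist)[:n_keep]]
    mu_new = inliers.mean(axis=0)
    cov_new = np.cov(inliers, rowvar=False) + eps * np.eye(d)
    return mu_new, cov_new
```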
Using only the remaining foreground (inlier) segments, we compute a new mean µˆ ′ and covariance matrix Σˆ′ of the foreground object latent representations of the given class, which can be used to match more accurate foreground mask in E-step. Step 3 (E-step) Expectation. In E-step, we regenerate the set of foreground segments O of class a by matching segment candidates with the updated µˆ ′ and Σˆ′ in M-step. In other words, we compute the “expectation” of the foreground mask for each image: E(Oi |µˆ ′ , Σˆ′ , Ii). For each image Ii , we start with the set of all eligible segments Si (computed in step 1) again and generate the corresponding latent space representations Φi = {h 1 i , h2 i , ..., hm i } as described earlier. We then compute a Mahalanobis distance (m-dist) between each latent representation of the segment h j i and the new mean µˆ ′ obtained from step 2. The segment with the smallest m-dist is selected as the new foreground of the image. With a new set of foreground segments O, we can perform step 2 followed by step 3 again for, typically 2 or 3, iterations. Algorithm 2 shows the details of the EM-guided Foreground Extraction algorithm. 84 5.3.2 Background (Context) Augmentation chap-5-sec:aug Next step involves generating a large set of high-quality context images that could be used as background images for pasting foreground masks. One possible approach would be to use randomly selected web images as background images. However, prior works [128, 122, 632] have shown that context affects model’s capacity for object recognition. Thus, selecting appropriate context images is important for learning good object representation, and thus beneficial for instance segmentation as well. To this end, we use a similar pipeline as DALL-E for Detection [172] to use image captioning followed by text-to-image generation methods to automatically generate background images that could provide good contextual information. Image Captioning Given an training set image, we leverage an off-the-shelf self-critique sequence training image captioning method [454] to describe the image, but we note that our method is agnostic to any specific image captioning method. These descriptions can capture the important context information. We further design a simple rule to substitute the object words, that has overlap with target interest class (in VOC or COCO) with other object words, in captions, to decrease the possibility of generating images that contains interest object, since they come without labels. Image Synthesis We use the captions as inputs to text-to-image synthesis pipeline DALL-E [440] *, to synthesize a large set of high-quality images that capture all relevant contextual information for performing paste operation (Section 5.3.3). For each caption, we generate five synthesized images. Note that with our caption pruning rule described above, we assume that synthesized images do not contain foreground objects. 5.3.3 Compositional Paste chap-5-sec:paste After foreground extraction (Section 5.3.1), we have a pool of extracted foregrounds where each class has a set of corresponding foreground objects. After background augmentation (Section 5.3.2) we have both the original background images and contextual augmented background images by DALL-E. We can * In implementation, we use Ru-DALL-E [487]. 85 create a synthetic dataset with pseudo instance segmentation labels by pasting the foreground masks onto the background images. 
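As a preview of the paste operation detailed in the remainder of this section, the OpenCV-based sketch below implements one pass of the Random Paste variant; the foreground-pool layout, the selection distribution `class_probs`, and the guard against oversized foregrounds are illustrative choices rather than the exact implementation used in our experiments.

```python
import numpy as np
import cv2

def random_paste(background, foreground_pool, class_probs, n_paste=4, rng=None):
    """Paste n_paste foregrounds (RGB crop + binary mask pairs, grouped by class)
    onto one background with random scale, rotation, and Gaussian-blurred blending.
    Returns the composite image and a list of (class_id, instance_mask) pseudo-labels.
    """
    rng = rng or np.random.default_rng()
    H, W = background.shape[:2]
    out = background.astype(np.float32).copy()
    labels = []
    classes = list(foreground_pool)
    for _ in range(n_paste):
        cls = rng.choice(classes, p=[class_probs[c] for c in classes])
        rgb, mask = foreground_pool[cls][rng.integers(len(foreground_pool[cls]))]
        s = rng.uniform(0.3, 1.0)                         # random scale factor
        rgb = cv2.resize(rgb, None, fx=s, fy=s)
        mask = cv2.resize(mask.astype(np.uint8), None, fx=s, fy=s)
        h, w = mask.shape
        M = cv2.getRotationMatrix2D((w / 2, h / 2), rng.uniform(-180, 180), 1.0)
        rgb = cv2.warpAffine(rgb, M, (w, h))              # random 2D rotation
        mask = cv2.warpAffine(mask, M, (w, h))
        if h >= H or w >= W:
            continue  # sketch-level guard: skip foregrounds larger than the background
        y = rng.integers(0, H - h)
        x = rng.integers(0, W - w)
        alpha = cv2.GaussianBlur(mask.astype(np.float32), (5, 5), 0)[..., None]
        roi = out[y:y + h, x:x + w]
        out[y:y + h, x:x + w] = alpha * rgb + (1 - alpha) * roi
        full_mask = np.zeros((H, W), np.uint8)
        full_mask[y:y + h, x:x + w] = mask
        labels.append((cls, full_mask))
    return out.astype(np.uint8), labels
```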
For each background image, we select np foregrounds based on a pre-defined distribution p, discussed later, and the goal is to paste those extracted foregrounds with the appropriate size. The appropriate choice of np depends on the dataset. To force the model to learn a more robust understanding, each pasted foreground undergoes a random 2D rotation and a scaling augmentation. In addition, we note that direct object pasting might lead to unwanted artifacts, also shown in the findings of [129] and [185]. To mitigate this issue, we apply a variety of blendings on each pasted image, including Poisson blurring, Gaussian blurring, no blending at all, or any combination of those. In practice, we find Gaussian blurring alone can yield sufficiently strong performance. Now we present two methods, each with their edges, of how to find the paste location. We leave the end-user to decide which method to use. Random Paste In this simple method, we iteratively scale the foreground object by a random factor ∼ Uniform(0.3, 1.0), and paste in a random location on the image. We find a factor > 1 generally creates objects too large and a small factor enhances model learning capacity for small objects. Space Maximize Paste This dynamic pasting algorithm tries to iteratively utilize the remaining available background regions to paste foreground objects. Our intuition is to force the pasted foregrounds to occupy as many spaces of the pasted background as possible, while remaining non-overlapping with the new to-bepasted foreground and original background plus already pasted foregrounds. We give an illustrative example in Figure 5.3. Firstly, we find background regions where no object lies by computing the maximum inscribing circle from the contour of background images without existing foregrounds, original or pasted, as shown in the red circle in Figure 5.3b. The maximum inscribing circle gives a maximum region not occupied by any objects, thus providing the largest empty space. Next, we scale the pasted foreground to largely match the size of the radius of the maximum inscribing circle, rotate by a random degree, and paste to the location of the center of the maximum inscribing circle, shown in Figure 5.3c. We iteratively repeat the above steps to paste all np foregrounds (Figure 5.3d). We note that since this method finds the background space with the 86 (a) The original image to be pasted. fig:paste1 (b) The red circle is the max inscribing circle found based on contour, denoted by the blue line. fig:paste2 (c) The first object, a person, is pasted on this image. fig:paste3 (d) After repeatedly applying above steps, four objects are pasted on the image. fig:paste4 Figure 5.3: Illustrative example of Space Maximize Paste algorithm. In this example, four foreground objects are pasted on the background image that contains an aeroplane. In part (b) the max inscribing circle is found from contour based on region without aeroplane. We emphasize that the contour is found only based on image level, using process described in Section 5.3.1. Note that the person is scaled to match the size of the circle found in part (b), and a random rotation is performed. fig:paste decreasing area, thus able to synthesize images pasted with objects of various sizes. Selection Probability The pre-defined selection distribution p is crucial in that it imposes the class distribution of synthetic dataset produced by the paste method. We investigate and provide two types of probability to end users. 
The simplest type is a uniform distribution, i.e., selecting each image from foreground pool with the same chance. With this choice, the synthetic data approximately follows the class distribution of foreground pool. The second type is a balanced sampling, i.e. giving the classes with more instances a smaller weight to be selected while giving the classes with less instances a larger weight. This type enforces each class to appear in synthetic data in approximately the same quantity. In Section 5.4.4 we show that this setting is beneficial for long-tail problem. 5.4 Experiments We demonstrate the effectiveness of EM-PASTE in weakly-supervised instance segmentation from image-level labels on Pascal VOC and MS COCO datasets. Additionally, we also show that EM-PASTE is generalizable to object detection task and highlight benefits of EM-PASTE in handling long-tail class distribution with only image-level label information. 87 5.4.1 Experiment Setup chap-5-sec:Experimental Setup Dataset and Metrics We evaluate EM-PASTE on Pascal VOC [138] and MS COCO [348] datasets. Pascal VOC consists of 20 foreground classes. Further, following common practice of prior works [10, 26, 536], we use the augmented version [216] with 10,582 training images, and 1,449 val images for Pascal VOC dataset. MS COCO dataset consists of 80 foreground classes with 118,287 training and 5,000 test images. Per the standard instance segmentation and object detection evaluation protocol, we report mean average precision (mAP) [217] on two different intersection-over-union (IoU) thresholds, namely, 0.5 and 0.75. We denote these two mAPs as mAP0.50 and mAP0.75, respectively. Synthesized Training Dataset We do not touch on the segmentation label but instead generate pseudolabeled synthesized training dataset using methods described in Section 5.3. Pascal VOC training dataset of 10,582 images consists of 29,723 objects in total, we extract 10,113 masks of foreground segments (34.0%). Similarly, MS COCO training set of 118,287 images consists of 860,001 objects in total, we extract 192,731 masks of foreground segments (22.4%). We observe that such masks are not perfect and contain noise, but overall have sufficient quality. To further ensure the quality of foregrounds, we filter the final results using 0.1 classifier score threshold. Additionally, we leverage image captioning and DALL-E (Section 5.3.2) to further contextually augment backgrounds. We generate 2 captions per image† , synthesize 10 contextual backgrounds per caption, and utilize CLIP [434] to select top 5 backgrounds among the 10 synthesized images, together producing 10k and 118k augmented backgrounds for VOC and COCO respectively. To make the best use of both original backgrounds and contextually augmented backgrounds, we blend them together as our background pool, on which we apply methods from Section 5.3.3 to paste these extracted foregrounds. For simplicity, we use Random Paste method. We duplicate the original backgrounds twice to make the distribution between real and synthetic backgrounds more balanced. †We augment each of 10,582 VOC image, and a random 10% sample of 118k COCO images. 88 Table 5.1: Metrics for instance segmentation models on Pascal VOC 2012 val set. Here F means fully supervised, B and I mean bounding box and image level label based weakly supervised methods respectively. We highlight the best mAP with image level label in green , and bounding box label in blue . Our method outperforms prior SOTA image level methods. 
Further our method achieves better performance than some of the prior bounding box SOTA, although bounding box method has access to a lot more information about object instances. Method Supervision Backbone mAP0.50 mAP0.75 Mask R-CNN [219] F R-101 67.9 44.9 SDI [286] B R-101 44.8 46.7 Liao et al.[343] B R-50 51.3 22.4 Sun et al.[536] B R-50 56.9 21.4 ACI [26] B R-101 58.2 32.1 BBTP [249] B R-101 58.9 21.6 PRM [668] I R-50 26.8 9.0 IAM [675] I R-50 28.8 11.9 OCIS [93] I R-50 30.2 14.4 Label-PEnet [164] I R-50 30.2 12.9 CL [256] I R-50 38.1 12.3 WISE [312] I R-50 41.7 23.7 BESTIE [289] I R-50 41.8 24.2 JTSM [502] I R-18 44.2 12.0 IRN [10] I R-50 46.7 23.5 LLID [358] I R-50 48.4 24.9 PDSL [503] I R-101 49.7 13.1 ACI [26] I R-50 50.9 28.5 BESTIE + Refinement [289] I R-50 51.0 26.6 EM-PASTE (Ours) I R-50 56.2 35.5 EM-PASTE (Ours) I R-101 58.4 37.2 tab:VOC Model Architecture and Training Details We train Mask R-CNN [219] with Resnet 50 (R-50) or Resnet 101 (R-101) as backbone [221]. We initialize Resnet from ImageNet [112] pretrained weights released by detectron2 [606]. We deploy large-scale jittering [185], and additionally augment training data with random brightness and contrast with probability 0.5. We run our experiments on one 32GB Tesla V100 GPU with learning rate 0.1 and batch size 128. 5.4.2 Weakly-supervised Instance Segmentation chap-5-sec:mainexp In our setting, we follow the details in Section 5.4.1 and assume access to only image-level labels. That is, we do not use any segmentation annotation from the training set. We report VOC performance in Table 5.1. Baselines We compare against previous weakly-supervised SOTA. Notably, [668] utilize peak response maps from image-level multi-label object classification model to infer instance mask from pre-computed proposal 89 Table 5.2: Weakly supervised instance segmentation on COCO val2017. Models here use image-level label. Method Backbone mAP0.50 mAP0.75 WS-JDS [504] VGG16 11.7 5.5 JTSM [502] R-18 12.1 5.0 PDSL [503] R-18 13.1 5.0 IISI [139] R-101 25.5 13.5 LLID [358] R-50 27.1 16.5 BESTIE [289] R-50 28.0 13.2 EM-PASTE (Ours) R-50 30.8 20.7 tab:COCO gallery; [312] generate pseudo-label training set from MCG [21] and train a supervised Mask-RCNN [219]; and [286] generate pseudo-label using GradCut+ and MCG [21]. Results Quantitative results on Pascal VOC dataset have been shown in Table 5.1. Firstly, we experiment with the choice of R-50 or R-101. With more capacity brought by a deeper model, we find that R-101 works better compared to R-50, leading to +2.2 mAP0.50 and +1.7 mAP0.75 improvement. This validates that EM-PASTE is suitable for instance segmentation task. Secondly, our method can significantly outperform previous image-level SOTA by +7.4 mAP0.50 (from 51.0 to 58.4), and +8.7 mAP0.75 improvements (from 28.5 to 37.2). This suggests that pseudo-labels generated by EM-PASTE give a strong learning signal for the model to develop object awareness. Lastly, although bounding box is a more insightful cure for instance segmentation, we find our results comparable with the previous SOTAs that use bounding box. Indeed, we are only 0.5 mAP0.50 lower compared to the best bounding box SOTA, which requires hand-drawn ground-truth boxes. Next we demonstrate effectiveness of the proposed EM-PASTE method on MS COCO dataset [348]. It is much more challenging than the Pascal VOC dataset as it consists of 80 object classes and each image may contain multiple instances of different classes. Quantitative results are shown in Table 5.2. 
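Returning to the training setup described above, the following is a hedged sketch of how a pseudo-labeled synthetic dataset can be registered and used to train the Mask R-CNN baseline with detectron2; the dataset name, file paths, and solver values are placeholders, and the exact schedule used in our experiments follows Section 5.4.1.

```python
import os
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register the pseudo-labeled synthetic dataset (COCO format); paths are placeholders.
register_coco_instances("empaste_voc_train", {},
                        "empaste/annotations.json", "empaste/images")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = "detectron2://ImageNetPretrained/MSRA/R-50.pkl"  # ImageNet init
cfg.DATASETS.TRAIN = ("empaste_voc_train",)
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 20       # Pascal VOC foreground classes
cfg.SOLVER.IMS_PER_BATCH = 16              # adjust to available GPU memory
cfg.SOLVER.BASE_LR = 0.02                  # illustrative; see Section 5.4.1 for ours
cfg.OUTPUT_DIR = "./output_empaste"
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```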
We observe that the proposed method can achieve an improvement of +2.8 mAP0.50, and +7.5 mAP0.75 improvements over previous image-level SOTA. Interestingly, our method with smaller architecture (R-50) outperform prior method IISI [139] that works with larger network (R-101). These results provide evidence that our method can scale to large data with large number of object classes. 90 Ablation Study We present the performance of EM-PASTE on PASCAL VOC 2012 val set with different choice of parameters in Table 5.3. All experiments use R-101 as backbone, training with synthetic data generated by Section 5.3 and following the details in Section 5.4.1. We first note that DALL-E is indispensable for best performance, and training only with 20,226 backgrounds from original background pool gives 1.9 lower mAP0.50, validating our hypothesis that additional contextual background makes model learn more thorough object representation. Further, it is crucial to choose appropriate backgrounds. A purely black background or a random background‡ does not bring benefit but in turn harm model learning (0.8 and 0.2 mAP0.50 lower than not using additional augmented images). Additionally, we quantify the effect of Figure 2 by training a model on foreground extracted without F-EM, and observe that iterative foreground refinement is essential for a quality foreground, as F-EM provides 5.0 mAP0.50 improvement. Moreover, for the original PASCAL VOC dataset, balanced selection might not work well overall, giving 1.6 mAP0.50 lower. Given 30k training set, a balanced selection makes each class approximately 1.5k, and classes with a smaller set of extracted foregrounds will be reused more often, and the potential noise from extraction in Section 5.3.1 might be amplified. This result suggests a more sophisticated selection method is needed, which we leave for future work. Lastly, the number of paste objects np is important in that a value too low results in sparse foregrounds, while a value too large results in crowded foregrounds, each of those hurts the model learning. We empirically show that for PASCAL VOC dataset, 4 seems to be a more appropriate value to use. Surprisingly a fixed 4 gives slightly higher mAP0.50 compared to assigning random np ∼ Unif[1, 4] dynamically. 5.4.3 Weakly-supervised Object Detection We argue that EM-PASTE is effective not only in instance segmentation task, but on other tasks as well. We reuse synthesized dataset described in Section 5.4.1 to conduct object detection on Pascal VOC. We compare ‡ For simplicity we use MS COCO images that does not contain any of 20 VOC objects as random background. 91 Table 5.3: Ablation study on PASCAL VOC. DALL-E # Paste Objects Foreground mAP0.50 mAP0.75 ✗ 4 - 56.5 35.8 Black 4 - 55.7 34.3 Random 4 - 56.3 36.7 ✓ 4 w/o Algorithm 2 53.4 35.7 ✓ 2 Balanced Selection 56.8 36.3 ✓ 2 - 56.9 36.6 ✓ 1 ∼ 4 - 58.0 38.1 ✓ 6 - 57.2 37.5 ✓ 4 - 58.4 37.2 tab:VOCablation our method against two popular baselines, CASD [255] and Wetectron [453]. In Table 5.4, we observe that our method achieves almost +4.0 and +5.0 mAP0.50 compared to CASD and Wetectron respectively. Table 5.4: Object detection on Pascal VOC 2012. Method Backbone mAP0.50 mAP0.75 Wetectron [453] VGG16 52.1 - CASD [255] VGG16 53.6 - EM-PASTE (Ours) R-50 57.2 30.7 tab:object_detection (a) Long-tail distribution of generated data [605]. The number of instances for each class shown on the top. fig:longtail_dist (b) We report mAP@50 for each class. 
Gray values are from Mask RCNN trained directly on data with extracted mask (Section 5.3.1); values in red are the value after EM-PASTE. The classes are ordered the same as (a). fig:longtail_map Figure 5.4: Long-tail instance segmentation setting and results. 5.4.4 Weakly-supervised Instance Segmentation on Long-tail Dataset chap-5-sec:longtail We now discuss how EM-PASTE can alleviate the long tail problem. In long-tail [218] dataset, the head class objects contain much more instances than tail classes, so that simple learning method might learn the bias of the dataset, that is, become the majority voter which predicts the head class all the time [360, 582]. Due to 92 the ability to generate synthetic data based on selection distribution, even given a highly imbalanced dataset, we can create a synthetic dataset with a balanced class distribution. Implementation Detail We conduct our experiments on a long-tailed version of PASCAL VOC [138] dataset. Our long-tailed dataset, generated based on method proposed in [605], forces the distribution of each class to follow Pareto distribution [104]. It contains 2,415 images in total, with a maximum of 836 and a minimum of 4 masks in a class. Statistics of our generated dataset shown in Figure 5.4a. The person class contains the most instances, while there are 5 classes with less than 10 instances. To the best of our knowledge, we are the first to conduct weakly-supervised instance segmentation task using [605]. Results We now show that our weakly-supervised instance segmentation can largely mitigate the long tail problem. Our results on PASCAL VOC val set are shown in Figure 5.4b, with mAP0.50 values for each class. We compare with Mask RCNN [219] with details described in Section 5.4.1, using the long tail dataset itself, i.e. only train with pseudo-labels inferred by Section 5.3.1. As shown in Figure 5.4a, training on such an imbalanced data deteriorates the model. Out of 20 classes, there are 10 classes that have mAP0.50 ≤ 12.42, 6 classes that have mAP0.50 < 1 and 4 classes that are not being recognized by model at all. The overall mAP0.50 is 20.26. However, after EM-PASTE with balanced setting (the p in Section 5.3.3), the model increasing overall mAP0.50 to 40.28. All classes show an improvement compared to vanilla training, with an average improvement of 20.0 mAP0.50. 5.5 Conclusion We propose EM-PASTE: an Expectation Maximization guided Cut-Paste compositional dataset augmentation approach for weakly supervised instance segmentation method using only image-level supervision. The core of our approach involves proposing an EM-like iterative method for foreground object mask generation and then compositing them on context-aware background images. We demonstrate the effectiveness of our 93 approach on the Pascal VOC 2012 instance segmentation task by using only image-level labels. Our method significantly outperforms the best baselines. Further, the method also achieves state-of-the-art accuracy on the long-tail weakly-supervised instance segmentation problem. 94 Chapter 6 3D Copy-Paste: Physically Plausible Object Insertion for Monocular 3D Detection chapter-6 A major challenge in monocular 3D object detection is the limited diversity and quantity of objects in real datasets. While augmenting real scenes with virtual objects holds promise to improve both the diversity and quantity of the objects, it remains elusive due to the lack of an effective 3D object insertion method in complex real captured scenes. 
In this work, we study augmenting complex real indoor scenes with virtual objects for monocular 3D object detection. The main challenge is to automatically identify plausible physical properties for virtual assets (e.g., locations, appearances, sizes, etc.) in cluttered real scenes. To address this challenge, we propose a physically plausible indoor 3D object insertion approach to automatically copy virtual objects and paste them into real scenes. The resulting objects in scenes have 3D bounding boxes with plausible physical locations and appearances. In particular, our method first identifies physically feasible locations and poses for the inserted objects to prevent collisions with the existing room layout. Subsequently, it estimates spatially-varying illumination for the insertion location, enabling the immersive blending of the virtual objects into the original scene with plausible appearances and cast shadows. We show that our augmentation method significantly improves existing monocular 3D object models and achieves state-of-the-art performance. For the first time, we demonstrate that a physically plausible 3D object insertion, serving as a generative data 95 augmentation technique, can lead to significant improvements for discriminative downstream tasks such as monocular 3D object detection. Project website: https://gyhandy.github.io/3D-Copy-Paste/. 6.1 Introduction Monocular indoor 3D object detection methods have shown promising results in various applications such as robotics and augmented reality [623, 85]. However, the deployment of these methods is potentially constrained by the limited diversity and quantity of objects in existing real datasets. For example, in SUN RGB-D dataset [522], the bathtub category has only less than 500 annotations compared to chair which has over 19,000 annotations. This may be due to the difficulty in acquiring and labeling substantial indoor scene datasets with diverse 3D object annotations [512, 522, 100]. Data augmentation techniques have been widely utilized in 2D detection and segmentation tasks to improve the diversity and quantity of the available training data [129, 168, 184, 172, 174]. However, it is non-trivial to scale 2D augmentation methods to 3D scenes due to physical constraints in real 3D scenes. In particular, technical challenges emerge especially in how to maintain physical plausibility for: (1) Collision and Occlusion Handling: In 3D data augmentation, handling collisions between objects is more challenging than in 2D data. Properly managing collisions is essential to prevent artifacts and ensure that objects appear as natural and coherent parts of the scene. (2) Illumination and Shading: For 3D data, augmenting objects requires careful consideration of the lighting conditions in the scene to create realistic shading and reflections. This involves estimating the spatially-varying illumination and adapting the appearance of the inserted objects to maintain visual coherence. (3) Geometric Consistency: In 3D data augmentation, maintaining geometric consistency is crucial to ensure that the augmented objects fit naturally within the scene. Unlike 2D augmentation, which deals with flat images, 3D augmentation must consider spatial relationships, object orientations, and their interaction with the surrounding environment. 
96 External 3D Objects (e.g., Objaverse) Indoor Scene Dataset (e.g., SUN RGB-D) 40.96 43.79 ImVoxelNet ImVoxelNet + 3D Copy-paste 3D Detection mAP (%) 3D Copy-Paste Lighting Position, Pose, Size Physically Plausible Monocular 3D Object Detector Training Figure 6.1: Overall pipeline of physically plausible object insertion for monocular 3D object detection: Our approach copies external 3D objects (e.g., from Objaverse [107]) and pastes them into indoor scene datasets (e.g., SUN RGB-D [522]) in a physically plausible manner. The augmented indoor scene dataset, enriched with inserted 3D objects, is then used to train monocular 3D object detection models, resulting in significant performance improvements. chap-6-fig:overall In this paper, we explore a novel approach, 3D Copy-Paste, to achieve 3D data augmentation in indoor scenes. We employ physically plausible indoor 3D object insertion to automatically generate large-scale annotated 3D objects with both plausible physical location and illumination. Unlike outdoor scenarios, indoor environments present unique challenges: (1) complex spatial layouts, notably cluttered backgrounds and limited space for object placement, which require a meticulously crafted method for automated object positioning (ensuring realistic position, size, and pose), and (2) intricate lighting effects, such as soft shadows, inter-reflections, and long-range light source dependency, which necessitate sophisticated lighting considerations for harmonious object insertion. Figure 6.1 shows our overall pipeline. In our approach, we take advantage of existing large-scale 3D object datasets, from which we copy simulated 3D objects and paste them into real scenes. To address the challenges associated with creating physically plausible insertions, we employ a three-step process. First, we analyze the scene by identifying all suitable planes for 3D object insertion. Next, we estimate the object’s pose and size, taking into account the insertion site to prevent collisions. Lastly, we estimate the spatially-varying illumination to render realistic shading and shadows for the inserted object, ensuring that it is seamlessly blended into the scene. 97 Our proposed method augment existing indoor scene datasets, such as SUN RGB-D [522], by incorporating large-scale 3D object datasets like Objaverse [107] using our 3D Copy-Paste approach. Our method is an offline augmentation method that creates a new augmented dataset. The monocular 3D object detection model, ImvoxelNet [478], trained on this augmented dataset, achieves new state-of-the-art performance on the challenging SUN RGB-D dataset. We systematically evaluate the influence of the inserted objects’ physical position and illumination on the downstream performance of the final monocular 3D object detection model. Our results suggest that physically plausible 3D object insertion can serve as an effective generative data augmentation technique, leading to state-of-the-art performances in discriminative downstream tasks such as monocular 3D object detection. We make three main contributions: (1) We introduce 3D Copy-Paste, a novel physically plausible indoor object insertion technique for automatically generating large-scale annotated 3D objects. This approach ensures the plausibility of the objects’ physical location, size, pose, and illumination within the scene. (2) We demonstrate that training a monocular 3D object detection model on a dataset augmented using our 3D Copy-Paste technique results in state-of-the-art performance. 
Our results show that a physically plausible 3D object insertion method can serve as an effective generative data augmentation technique, leading to significant improvements in discriminative downstream monocular 3D object detection tasks. (3) We conduct a systematic evaluation on the effect of location and illumination of the inserted objects on the performance of the downstream monocular 3D object detection model. This analysis provides valuable insights into the role of these factors in the overall effectiveness of our proposed approach. 98 6.2 Related Works 6.2.1 Monocular 3D Object Detection Monocular 3D Object Detection estimates the 3D location, orientation, and dimensions (3D bounding box) of objects from a single 2D image. It has garnered significant attention in recent years due to its potential applications in autonomous driving, robotics, and augmented reality. There are many works of monocular 3D detection in driving scenarios, such as 3DOP[84], MLFusion[614], M3D-RPN[53], MonoDIS[513], Pseudo-LiDAR[590], FCOS3D[586], SMOKE[359], RTM3D[331], PGD[585], CaDDN[444]. Specifically, Geometry-based Approaches: MV3D [85] utilized both LiDAR-based point clouds and geometric cues from images for 3D object detection. [395] introduced a method that regresses object properties such as dimensions, orientation, and location from 2D bounding boxes using geometric constraints. In the context of indoor scenes, multi-task learning has gained traction. Recent studies, including PointFusion by [615], have amalgamated 3D object detection with tasks like depth estimation or semantic segmentation to improve performance. Total3D [402] and Implicit3D [638] use end-to-end solutions to jointly reconstruct room layout, object bounding boxes and meshes from a single image. ImvoxelNet [478] achieves state-of-the-art performance by using the image-voxels projection for monocular 3d object detection. 6.2.2 3D Data Augmentation Data augmentation in 3D has become increasingly vital for enhancing performance across various 3D perception tasks. Most of work focuses on outdoor scenes [649, 341, 4, 87, 553]. Geometric Transformations: [607] applied rotations, translations, and scaling to augment the ModelNet dataset, improving classification and retrieval tasks. Point Cloud Augmentation: [134] proposed techniques such as random point removal, Gaussian noise, and point cloud interpolation for augmenting LiDAR datasets, enhancing object detection and segmentation performance. Generative Model-based Augmentation: [518] used a conditional GAN to 99 generate diverse and realistic 3D objects. Similarly, [6] employed a VAE for learning a generative model of 3D shapes for shape completion and exploration tasks. However, while 3D generative models can achieve object-level augmentation, they are not scalable to scene-level augmentation. 2D generative models can produce highly realistic images, but they do not provide physically plausible 3D labels. 3D Common corruptions [281] use 3D information to generate real-world corruptions for 2D dataset, which can evaluate the model robustness and be used as a data augmentation for model training, but does not support 3D detection because it does not introduce new 3D object content. 6.2.3 Illumination Estimation Illumination estimation is a critical focus within computer vision research, given its crucial role in various applications. [339] addressed the inverse rendering problem for complex indoor scenes, estimating spatiallyvarying lighting, SVBRDF, and shape from a single image. 
Meanwhile, a differentiable ray tracing method combined with deep learning was proposed for the learning-based inverse rendering of indoor scenes [670]. Additionally, research has been conducted on using deep learning for indoor lighting estimation, with methods like Deep Parametric Indoor Lighting Estimation offering enhanced accuracy and efficiency [161]. Furthermore, [592] introduced Neural Light Field Estimation, a method that effectively models complex lighting conditions for virtual object insertion in street scenes. These studies underscore the potential of machine learning in improving illumination estimation capabilities in rendering and computer vision tasks. 6.3 Methods This section presents our proposed physically plausible indoor 3D object insertion approach. Figure 6.2 shows our 3D Copy-Paste method overview. Section 6.3.1 addresses the question of "where and how to place the object", detailing the process of estimating suitable insertion positions, poses, and sizes for the objects while avoiding collisions with existing objects. Section 6.3.2 explains "what illumination should we add to 100 Scene Image Depth WHERE and HOW to put the object WHAT illumination is on the object (c) Lighting Estimation & Registration (d) Environment Map Refinement Lighting Position, Pose, Size Insertion Render 3D Object (a) Plane Reconstruction & Selection (b) Insertion Parameter Search Compositional Image 3D Bounding Box Label Figure 6.2: 3D Copy-Paste method overview: Our method (a) processes the input RGB image and depth data to reconstruct floor planes that can accommodate inserted objects. (b) Using the reconstructed planes and information about objects in the original scene, we estimate a physically plausible position, pose, and size for the inserted objects, ensuring they do not collide with existing objects. (c) We predict the spatially-varying lighting of the scene. (d) By registering the insertion position determined in (b) to spatially-varying lighting, our light estimation module (d) refined an HDR environment map to represent the lighting information for the inserted objects. (e) The insertion rendering module takes the position, pose, size, and lighting as input and inserts a 3D object into the real scene, adjusting the object’s lighting and shadows accordingly to ensure it seamlessly integrates as a natural and coherent part of the scene. chap-6-fig:method the object": estimate the scene’s spatially-varying illumination and render the inserted objects with realistic lighting and shadows. Section 6.3.3 describes how we create an augmented dataset using the inserted objects and train monocular 3D object detection models. 6.3.1 Where and how: Physically Plausible Position, Pose, and Size Estimation This section describes handling the first challenge of avoiding collisions during insertion by estimating physically plausible position, pose, and size parameters. chap-6-sec:3.1 6.3.1.1 Ground Plane Selection chap-6-sec:3.1.1 Given a scene and a 3D object to insert, the initial question is where to place the object. To accommodate a new object, we must identify and understand the available regions where the object can be situated. We 101 perform plane reconstruction to comprehend the scene’s layout and subsequently, we estimate physically plausible key parameters such as position, size, and pose. 
Figure 6.2(a) presents an overview of our plane reconstruction and selection module, which takes an RGB image and depth data as input and predicts all potential planes, then narrows down to the ground plane. To get a rough plane reconstruction, we followed the plane extraction method using Agglomerative Hierarchical Clustering (AHC) described in [144]. There are three main steps: (1) we construct a graph with nodes and edges representing groups of points, obtained by dividing the point cloud (merging RGB with depth) into non-overlapping groups. (2) We then perform AHC on the organized graph to identify potential planes by merging nodes that belong to the same plane, continuing until the mean squared error of plane fitting surpasses a threshold. (3) We use a pixel-wise region-growing method to refine the detected planes. To further refine the extracted planes while preserving clear face textures and sharp features without losing geometric details, we utilize a back-end indoor plane optimization and reconstruction method described in [577]. Specifically, we first partition the entire dense mesh into different planar clusters based on the planes extracted with AHC, treating them as plane primitives. We then create a texture patch for each plane and sample points on it, followed by executing a global optimization process to maximize the photometric consistency of sampled points across frames by optimizing camera poses, plane parameters, and texture colors. Further, we optimize the mesh geometry by maximizing consistency between geometry and plane primitives, further preserving the original scene’s sharp features, such as edges and corners of plane intersections. Finally, we get the reconstructed plane with the geometry parameters (e.g., surface normal). To select a proper plane for insertion, we first identify all horizontal planes based on surface direction and the standard deviation along the Z-axis. Specifically, there are two constraints for considering a plane as horizontal: (1) The plane must have a surface normal aligned with the positive direction of the Z-axis (opposite of the gravity vector), and (2) the standard deviation along the Z-axis should be smaller than a predefined threshold. In our scenario, we aim to insert furniture into the scene, such as the ten interest classes 102 in the SUN RGB-D dataset [522]: sofa, bed, chair, desk, table, nightstand, dresser, bookshelf, toilet, and bathtub. Consequently, we must identify the floor plane by selecting the horizontal plane with the lowest average Z value among all detected horizontal planes. 6.3.1.2 Constrained Insertion Parameter Search chap-6-sec:3.1.2 To address the question of where and how to place the object, we estimate specific insertion parameters: position (p), size (s), and pose (o). We propose an efficient constrained insertion parameter searching algorithm to calculate plausible insertion parameters while avoiding collisions with existing objects in the scene (Algorithm 3). Given the reconstructed floor plane, we first determine the search space for each parameter. For position, we want the inserted object to touch the floor, so we find the 3D bounding box of the object and calculate the center of the bottom surface (p) as the optimization parameter of position. To prevent potential collisions between the inserted object and existing assets in the original scene, we search for a suitable position around the center of the reconstructed floor. 
As shown in Figure 6.2(b), we first calculate the floor's center c ← (cx, cy, cz) and set a search square that uses twice the floor's standard deviation along the X axis, σx, and the Y axis, σy, as its width and length. The insertion position is sampled from a uniform distribution inside the search square: px ∼ U[cx − σx, cx + σx] and py ∼ U[cy − σy, cy + σy], with p ← (px, py, cz). For size (s), we use the height of the object's 3D bounding box as the optimization parameter. For each object category, we first calculate the mean mh and standard deviation σh of the heights of objects belonging to the same category in the original scene dataset. We then assume the height follows a normal distribution and sample from it: s ∼ N(mh, σh). For the pose (o), we only allow the object to rotate along the Z-axis to maintain its stability. The optimization parameter is the rotation angle along the Z-axis, which follows a uniform distribution: o ∼ U[−π, π].

Algorithm 3 details the Constrained Insertion Parameter Search. We first set a search budget of k iterations. For each iteration, we randomly sample each parameter (position, size, and pose) from its corresponding search space and calculate the inserted object's bounding box based on the sampled parameters. We then check for collisions with existing objects and quantitatively evaluate the degree of collision.

Algorithm 3: Constrained Insertion Parameter Search
Input: An RGBD image of the scene, a reconstructed floor, and a 3D object belonging to the class of interest j.
Output: Position (pˆ: 3D bounding box bottom center), size (sˆ: 3D bounding box height), and pose (oˆ: orientation along the Z-axis).
1. Compute position search constraints: floor center c ← (cx, cy, cz), standard deviations σx and σy.
2. Initialize search parameters: k ← 1000, degree of collision ˆl ← ∞.
3. For i ∈ {1, 2, . . . , k}:
4.   Sample position: px ∼ U[cx − σx, cx + σx], py ∼ U[cy − σy, cy + σy], p ← (px, py, cz).
5.   Sample size: s ∼ N(mh, σh), resize factor r ∼ U[1, rmax], s ← s/r, where mh and σh are the mean and standard deviation of object heights in class j in the raw dataset.
6.   Sample pose: o ∼ U[−π, π].
7.   Calculate the 3D bounding box x3D based on the sampled insertion parameters (p, s, and o).
8.   Project the 3D bounding box to a 2D bounding box x2D in the top view.
9.   Calculate the collision score l = F(x2D) with existing objects in the scene. If l = 0, return p, s, o. If l < ˆl, then pˆ ← p, sˆ ← s, oˆ ← o, and ˆl ← l.
10. Return pˆ, sˆ, oˆ.

A direct approach for collision checking is to convert the inserted object into a point cloud and calculate its overlap with the existing objects' point clouds. However, this is time-consuming due to the large number of points involved. We instead convert the original 3D collision check into a 2D one to speed it up. Since the inserted objects rest on the floor, if two objects collide, the top-view projections of their 3D bounding boxes will usually also collide (though not always, e.g., when an object could be placed under a table; we ignore such candidate placements). In other words, we disregard the absolute 3D overlap volume and use the 2D projected overlap as a relative collision score. This efficient collision check allows us to use a relatively large number of search iterations, such as k = 1000, while keeping the search time limited (less than 0.5 seconds).
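To make the floor selection of Section 6.3.1.1 and the Algorithm 3 loop concrete, here is a small Python sketch, including the resize factor r discussed immediately after this sketch. It approximates each rotated footprint by its axis-aligned bounds and scores collisions by summed overlap area; the actual implementation may differ (e.g., oriented boxes), and all names, thresholds, and defaults are illustrative placeholders.

```python
import numpy as np

def select_floor_plane(planes, normal_thresh=0.95, z_std_thresh=0.05):
    """Pick the floor plane: normal along +Z, small Z spread, lowest mean Z.

    planes: list of dicts {"points": (N, 3) array, "normal": (3,) array}.
    """
    horizontal = []
    for plane in planes:
        n = plane["normal"] / np.linalg.norm(plane["normal"])
        z_std = plane["points"][:, 2].std()
        if n[2] > normal_thresh and z_std < z_std_thresh:   # constraints (1) and (2)
            horizontal.append(plane)
    if not horizontal:
        return None  # no plausible floor found in this scene
    return min(horizontal, key=lambda p: p["points"][:, 2].mean())

def footprint(cx, cy, w, l, yaw):
    """Axis-aligned bounds (xmin, ymin, xmax, ymax) of a w-by-l box rotated by yaw about Z."""
    corners = np.array([[w/2, l/2], [w/2, -l/2], [-w/2, l/2], [-w/2, -l/2]])
    rot = np.array([[np.cos(yaw), -np.sin(yaw)], [np.sin(yaw), np.cos(yaw)]])
    pts = corners @ rot.T + np.array([cx, cy])
    return pts[:, 0].min(), pts[:, 1].min(), pts[:, 0].max(), pts[:, 1].max()

def overlap_area(a, b):
    """Overlap area of two axis-aligned rectangles; zero if they do not intersect."""
    dx = min(a[2], b[2]) - max(a[0], b[0])
    dy = min(a[3], b[3]) - max(a[1], b[1])
    return max(dx, 0.0) * max(dy, 0.0)

def search_insertion_params(floor_center, sigma_x, sigma_y, obj_dims, m_h, s_h,
                            scene_footprints, k=1000, r_max=2.0, rng=np.random):
    """Constrained search for position, height, and yaw (sketch of Algorithm 3)."""
    cx, cy, cz = floor_center
    w0, l0, h0 = obj_dims                      # raw bounding-box dimensions of the 3D asset
    best, best_score = None, np.inf
    for _ in range(k):
        px = rng.uniform(cx - sigma_x, cx + sigma_x)         # position inside the search square
        py = rng.uniform(cy - sigma_y, cy + sigma_y)
        h = rng.normal(m_h, s_h) / rng.uniform(1.0, r_max)   # class height stats, shrunk by r
        h = max(h, 1e-3)                                     # guard against non-positive samples
        yaw = rng.uniform(-np.pi, np.pi)                     # rotation about Z only
        scale = h / h0
        cand = footprint(px, py, w0 * scale, l0 * scale, yaw)
        score = sum(overlap_area(cand, fp) for fp in scene_footprints)
        if score == 0:
            return (px, py, cz), h, yaw        # collision-free insertion found, stop early
        if score < best_score:
            best, best_score = ((px, py, cz), h, yaw), score
    return best                                # lowest-collision candidate after k iterations
```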
We also consider a resize factor r to shrink the size of the inserted object to handle inserting a large object in a small empty floor scenario. During the search, we terminate the process if we find an insertion with a collision score of 0; otherwise, we continue to track the best insertion with the lowest collision score and return it after completing k search iterations. 104 6.3.2 What Illumination is on the object chap-6-sec:3.2 6.3.2.1 Spatial-varying Illumination Estimation and Retrieval To answer the question of what kind of illumination should be cast on the object, we first need to estimate the spatially-varying illumination of the scene. This process involves encapsulating intricate global interactions at each spatial location. To achieve this, we utilize the deep inverse rendering framework proposed by [339]. Initially, we estimate intermediate geometric features such as albedo, normal, depth, and roughness. Subsequently, a LightNet structure, consisting of an encoder-decoder setup, ingests the raw image and the predicted intermediate features. This, in turn, enables the estimation of spatially-varying lighting across the scene. As depicted in Figure 6.2(c), the estimated spatially-varying illumination is represented as environment maps. Specifically, each 4x4 pixel region in the raw image is associated with an environment map, which captures the appearance of the surrounding environment and is used for reflection, refraction, or global illumination. These maps are spherical (equirectangular), representing the environment on a single 2D texture. The X-axis corresponds to longitude, and the Y-axis corresponds to latitude. Each point on the texture corresponds to a specific latitude and longitude on a sphere. To obtain the environment map associated with the position of the inserted object, we register and retrieve the corresponding environment map based on the estimated position after performing the constrained insertion parameter search. 6.3.2.2 Environment Map Refinement Coordinate transformation. The environment map, estimated for the inserted object, is based on the local coordinates of the insertion position. In particular, it establishes a coordinate system where the surface normal is designated as the Z-axis. In order to apply this map for relighting the inserted object using a rendering 105 method (such as Blender), it becomes necessary to transform the environment map to align with Blender’s coordinate system. Latitude completion. The estimated environment map only contains latitudes in the range (0, π/2) because the inverse rendering method cannot estimate the illumination beneath the surface. As shown in Figure 6.2(d), we complete the entire environment map by filling in artificial values in the second half. Intensity refinement. The estimated environment map is in Low Dynamic Range (LDR) format, lacking High Dynamic Range (HDR) details and high contrast. If we use the predicted value directly, the rendered shadow appears relatively fuzzy. We refine the value by adjusting the scale in log space to estimate the HDR value: IHDR = I γ LDR, where γ is a hyperparameter . Finally, we input the HDR environment map after transformation and refinement, along with the position, size, and pose, into an insertion renderer (e.g., Blender). This allows us to obtain the inserted image with 3D bounding boxes serving as ground truth. 
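The latitude completion and intensity refinement steps reduce to a few array operations. The sketch below assumes the retrieved environment map is an equirectangular LDR array covering only the upper hemisphere; the fill value for the unobserved lower hemisphere and the exponent γ are placeholder choices, and the coordinate transformation into the renderer's convention is omitted.

```python
import numpy as np

def refine_env_map(env_ldr, gamma=2.0, fill_value=0.1):
    """Refine a retrieved LDR environment map before insertion rendering.

    env_ldr: (H, W, 3) equirectangular map in [0, 1] covering latitudes above
             the surface (the inverse renderer cannot see below it).
    Returns a full-sphere map of shape (2H, W, 3) with HDR-like contrast.
    """
    env_ldr = np.clip(env_ldr, 1e-6, 1.0)
    env_hdr_top = env_ldr ** gamma                    # intensity refinement: I_HDR = I_LDR ** gamma
    bottom = np.full_like(env_hdr_top, fill_value)    # latitude completion with artificial values
    return np.concatenate([env_hdr_top, bottom], axis=0)
```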
6.3.3 Dataset Augmentation with Insertion and Downstream Model Training chap-6-sec:3.3 Given an indoor scene dataset and a set of interest classes C for potential insertion, we can identify external 3D objects set E that fall within these classes of interest. Before any insertion, we calculate the statistical parameters for each class of interest that we aim to augment. For every class j ∈ C, we assume the size parameter (for instance, the height) fits a Gaussian distribution. We then calculate the mean and standard deviation of this size parameter to guide the insertion of external objects. Here are the detailed steps for insertion: For each scene within the indoor scene dataset, we randomly select a category j from the class of interest set C. Next, we randomly choose an instance from the external 3D objects set E that belongs to the selected class j. We then utilize our physically plausible insertion method (Algorithm 3) to integrate this external 3D object into the scene. We could train any downstream monocular 3D object detection model with the augmented dataset because we automatically obtain the 3D annotations of the inserted objects. 106 Table 6.1: Statistics of external 3D objects from Objaverse [107]. chap-6-tab:1 Category Bed Table Sofa Chair Desk Dresser Nightstand Bookshelf Toilet Bathtub Number 190 854 361 934 317 52 13 99 142 24 6.4 Experiments This section presents experiments to assess the effectiveness of our proposed physically-plausible 3D object insertion method and evaluate how different insertion parameters affect the final performance of monocular 3D object detection. 6.4.1 Dataset and Model Setting chap-6-sec:4.1 Indoor scene dataset. We utilize the SUN RGB-D dataset [522] as our primary resource for indoor scenes. It is one of the most challenging benchmarks in indoor scene understanding. SUN RGB-D comprises 10,335 RGB-D images captured using four distinct sensors. The dataset is divided into 5,285 training scenes and 5,050 test scenes. Furthermore, it includes 146,617 2D polygons and 58,657 3D bounding boxes, providing a comprehensive dataset for our research. We also use ScanNet dataset [100]. ScanNet v2 is a large-scale RGB-D video dataset, which contains 1,201 videos/scenes in the training set and 312 scenes in the validation set. Adapting it for monocular 3D object detection, we utilized one RGB-D image per video, amounting to 1,201 RGB-D images for training and 312 for validation. We compute the ground truth 3D bounding box label for each of our used views from their provided scene level label, as some objects in the scene may not be visible in our monocular viewpoint. External 3D object assets. The quality of 3D objects is crucial for effective insertion. Hence, we use Objaverse [107], a robust dataset with over 800,000 annotated 3D objects. Using word parsing, we extract objects that align with the classes of interest for monocular 3D object detection within SUN RGB-D. Table 6.1 shows the selected Objaverse data for each SUN RGB-D class. 107 Table 6.2: ImVoxelNet 3D monocular object detection performance on the SUN RGB-D dataset with different object insertion methods. When inserting randomly, the accuracy of the downstream object detector drops, i.e., the detector suffers from random insertions (which may have collisions, occlusions, incorrect lighting, etc.). In contrast, by only applying physically plausible position, size, and pose, performance significantly improved (41.80%). 
Further, when plausible lighting and shadows are added, our 3D copy-paste improves the accuracy of the downstream detector to a new state-of-the-art accuracy (43.79%). We use mAP (%) with 0.25 IOU threshold. chap-6-tab:2 Setting Insertion Position, Pose, Size Insertion Illumination mAP@0.25 ImVoxelNet N/A N/A 40.96 ImVoxelNet + random insert Random Camera point light 37.02 ImVoxelNet + 3D Copy-Paste (w/o light) Plausible position, size, pose Camera point light 41.80 ImVoxelNet + 3D Copy-Paste Plausible position, size, pose Plausible dynamic light 43.79 Monocular 3D object detection model. We focus on the challenging task of monocular 3D object detection that relies solely on a single RGB image as input. We employ ImVoxelNet, which achieves state-ofthe-art performance on the raw SUN RGB-D dataset using only a single RGB image as input. Other existing methods either resort to using additional modalities and multiple datasets for extra supervision or exhibit underwhelming performance. For the purpose of monocular 3D object detection, we train the same ImVoxelNet model on the original SUN RGB-D dataset and its various versions, each augmented via different insertion methods. All mAP results are mAP@0.25. 6.4.2 Physically-plausible position, pose, size, and illumination leads to better monocular detection performance chap-6-sec:4.2 Our 3D Copy-Paste focuses on solving two challenges: (1) Where and how to put the object: we estimate the object’s position, orientation, and size for insertion while ensuring no collisions. (2) What illumination is on the object: we estimate the spatially-varying illumination and apply realistic lighting and shadows to the object rendering. The following experiments evaluate the model performance. Table 6.2 presents the results of monocular 3D object detection on the SUN RGB-D dataset, utilizing various object insertion augmentation techniques. The first row is the performance of ImVoxelNet trained on the raw SUN RGB-D dataset without any insertion. The “ImVoxelNet + random insert” row displays results 108 Table 6.3: Per class average precision (AP) of ImVoxelNet 3D monocular object detection performance on SUN RGB-D dataset. chap-6-tab:3 Setting mAP@0.25 bed chair sofa table bkshf desk bathtub toilet dresser nightstand ImVoxelNet 40.96 72.0 55.6 53.0 41.1 7.6 21.5 29.6 76.7 19.0 33.4 ImVoxelNet + 3D Copy-Paste 43.79 72.6 57.1 55.1 41.8 7.1 24.1 40.2 80.7 22.3 36.9 Table 6.4: ImVoxelNet 3D monocular object detection performance on the ScanNet dataset with different object insertion methods. chap-6-tab:4 Setting mAP@0.25 bed chair sofa table bkshf desk bathtub toilet ImVoxelNet 14.1 25.7 7.9 13.2 7.8 4.2 20.5 22.1 11.5 ImVoxelNet + 3D Copy-Paste 16.9 27.7 12.7 10.0 10.8 9.2 26.2 29.2 9.0 achieved through a naive 3D object insertion without applying physically plausible constraints (random location and Camera point light). This approach led to a drop in accuracy from 40.96% to 37.02%, likely due to the lack of physical plausibility causing severe collisions and occlusions in the final image. The “ImVoxelNet + 3D Copy-Paste (w/o light)” row showcases the performance after implementing our method for only estimating physically plausible insertion position, pose, and size. Despite using a rudimentary camera point light, this approach outperforms “ImVoxelNet” without any insertion, and also outperforms the naive “ImVoxelNet + random insert” (+4.78 % improvement). 
This result shows that applying plausible geometry is essential for downstream tasks and makes 3D data augmentation useful over a naive, random augmentation. After further applying physically plausible dynamic light, our proposed “ImVoxelNet + 3D Copy-Paste” further improved the performance and achieved new state-of-the-art, surpassing ImVoxelNet without insertion (+2.83 %) on monocular 3D object detection task. This performance improvement suggests that our 3D Copy-Paste insertion can serve as an efficient data augmentation method to positively benefit downstream 3D object detection tasks. Table 6.3 shows detailed SUN RGB-D monocular 3D object detection results with ImVoxelNet on each individual object category. Table 6.4 presents the results of monocular 3D object detection on the ScanNet dataset. We utilized one RGB-D image per video: 1,201 for training and 312 for validation. We compute the ground truth 3D bounding box label for each of our used views from their provided scene-level label. For the baseline, we 109 Table 6.5: ImVoxelNet 3D monocular object detection performance on SUN RGB-D dataset with different illumination during insertion rendering. All experiments use the same ImVoxelNet model, insertion also uses our proposed physically plausible position, size, and pose. chap-6-tab:5 Setting Light source type Intensity Direction With shadow? mAP@0.25 Point Light 1 Point 100W Camera position Yes 41.80 Point Light 2 Point 100W Side (left) Yes 42.38 Area Light 1 Area 100W Camera position Yes 42.67 Area Light 2 Area 100W Side (left) Yes 42.02 Spot Light 1 Spot 100W Camera position Yes 40.92 Spot Light 2 Spot 100W Side (left) Yes 42.10 Sun Light 1 Sun 5 Camera position Yes 42.11 Sun Light 2 Sun 5 Side (left) Yes 41.21 Ours (Dynamic Light) Estimated Plausible light Dynamic Dynamic No 41.83 Ours (Dynamic Light) Estimated Plausible light Dynamic Dynamic Yes 43.79 train an ImVoxelNet monocular 3D object detection model on the training set and test on the validation set. For our method, there are 8 overlapping categories (sofa, bookshelf, chair, table, bed, desk, toilet, bathtub) in the 18 classes of ScanNet with our collected Objaverse data. We use our 3D Copy-Paste to augment the training set and train an ImVoxelNet model. All the training parameters are the same as the training on SUN RGB-D dataset. We show the results on the average accuracy of the 8 overlapping classes (mAP@0.25) in the Table 6.4. Our 3D Copy-Paste improves ImVoxelNet by 2.8% mAP. 6.4.3 Ablation study on the influence of insertion illumination and position on monocular 3D object detection chap-6-sec:4.3 We first explore the influence of illumination of inserted objects on downstream monocular 3D object detection tasks. Table 6.5 shows the ImVoxelNet performance on SUN RGB-D with different illumination settings during 3D Copy-Paste. To eliminate the influence of other insertion parameters, we fix the estimated position, pose, and size for each scene among all experiments in Table 6.5. Figure 6.3 provides a visualization of the effects of various light sources and light parameters during the insertion rendering process. The corresponding monocular 3D object detection results are presented in Table 6.5. These illustrate how lighting not only impacts the visual perception of the inserted object from a 110 Camera position (w/o shadow) Camera position (w/ shadow) Left (w/ shadow) Point light Sun light Spot light Area light Physically plausible light Figure 6.3: Visualization of different illumination on inserted objects. 
fig:vis-diff-illu Table 6.6: Ablation study of global context influence on ImVoxelNet monocular 3D object detection performance on SUN RGB-D. chap-6-tab:6 Method Follow global context? Select class based on empty size? mAP@0.25 ImVoxelNet + 3D Copy-Paste Yes No 43.75 ImVoxelNet + 3D Copy-Paste Yes Yes 43.74 ImVoxelNet + 3D Copy-Paste No Yes 42.50 ImVoxelNet + 3D Copy-Paste No No 43.79 human observer’s standpoint but also considerably affects the performance of downstream detection tasks. Thus, an accurate and physically plausible lighting estimation is crucial for both understanding the scene and for the practical application of downstream detection tasks. ImVoxelNet ImVoxelNet + 3D Copy-Paste Figure 6.4: Qualitative results on the SUN RGB-D dataset. fig:vis-qual Table 6.2 shows the importance of physical position, pose, and size (local context) on monocular 3D object detection tasks. We also explored the importance of the global context to the detection performance. 111 The global context here means the semantic relationship of the inserted object to the whole scene. For instance, inserting a toilet into a living room may not satisfy the global context. We propose a plausible global context insertion method where the inserted object class considers the global scene information. Also, we could select an inserted class based on the floor size: insert larger size objects (e.g., bed, bookshelf) on only a large size floor. Table 6.6 shows results on different settings. We find considering the global context during the insertion is on par with the random category selecting setting, and the following downstream detection model may not be sensitive to that. 6.4.4 Qualitative Analysis chap-6-sec:4.4 Figure 6.4 shows the qualitative results of monocular 3D object detection on SUN RGB-D dataset. Our method demonstrates enhanced capabilities in detecting objects with significant occlusion, provides improved pose estimation, and effectively suppresses false positives. 6.5 Conclusion and Discussion Our work addresses the challenge of scarce large-scale annotated datasets for monocular 3D object detection by proposing a physically plausible indoor 3D object insertion approach. This technique allows us to effectively augment existing indoor scene datasets, such as SUN RGB-D, with large-scale annotated 3D objects that have both plausible physical location and illumination. The resulting augmented dataset enables training a monocular 3D object model that achieves new state-of-the-art performance. Our approach carefully considers physically feasible locations, sizes, and poses for inserted objects, avoiding collisions with the existing room layout, and estimates spatially-varying illumination to seamlessly integrate the objects into the original scene. We also systematically evaluate the impact of the physical position and illumination of the inserted objects on the performance of the final monocular 3D object detection model. This paper is the first to demonstrate that physically plausible 3D object insertion can serve as an effective generative data 112 augmentation technique, leading to state-of-the-art performance in discriminative downstream tasks like monocular 3D object detection. Our findings highlight the potential of 3D data augmentation in improving the performance of 3D perception tasks, opening up new avenues for research and practical applications. 
113 Chapter 7 DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models chapter-7 The popularization of Text-to-Image (T2I) diffusion models enables the generation of high-quality images from text descriptions. However, generating diverse customized images with reference visual attributes remains challenging. This work focuses on personalizing T2I diffusion models at a more abstract concept or category level, adapting commonalities from a set of reference images while creating new instances with sufficient variations. We introduce a solution that allows a pretrained T2I diffusion model to learn a set of soft prompts, enabling the generation of novel images by sampling prompts from the learned distribution. These prompts offer text-guided editing capabilities and additional flexibility in controlling variation and mixing between multiple distributions. We also show the adaptability of the learned prompt distribution to other tasks, such as text-to-3D. Finally we demonstrate effectiveness of our approach through quantitative analysis including automatic evaluation and human assessment. Project website https://briannlongzhao.github.io/DreamDistribution 7.1 Introduction chap-7-sec:introduction Dreams have long been a source of inspiration and novel insights for many individuals [152, 130, 570]. These mysterious subconscious experiences often reflect our daily work and life [152]. However, these 114 Figure 7.1: DreamDistribution learns a prompt distribution D∗ that represents a distribution of descriptions corresponding to a set of reference images. We can sample new prompts from D∗ or modified D∗ by text-guided editing to generate images of diverse new instance that follows the visual attributes of reference training images (top). We can also apply a learned distribution flexibly to, for example, a pretrained text-to-3D model, and generate diverse new 3D assets following the reference images (bottom). chap-7-fig:teaser Figure 7.2: Overview of for learning a prompt distribution. We keep a set of K learnable soft prompts and model a distribution of them at the CLIP text encoder feature space. Only prompts are learnable, CLIP encoder and the T2I diffusion model are all fixed. We use a reparameterization trick to sample from the prompt distribution and update the learnable prompts through backpropagation. The training objective is to make the generated images aligns with the reference image. An additional orthogonal loss is incorporated to promote differentiation among learnable prompts. For inference, we similarly sample from the prompt distribution at text feature space to guide the pretrained T2I generation. chap-7-fig:method 115 reflections are not mere replicas; they often recombine elements of our reality in innovative ways, leading to fresh perspectives and ideas. We aim to emulate this fascinating mechanism in the realm of text-to-image generation. Text-to-image (T2I) generation has recently been popularized due to the astonishing performance of state-of-the-art diffusion models such as Stable Diffusion [467] and DALL·E 2 [439]. Variations of the T2I models have enabled several fascinating applications that allow user to control the generation, such as conditioned generation based on other input modalities [647, 337, 625], inpainting [367, 613], image editing [390, 56]. One such interesting application is personalization of T2I models, where user provides some reference images of the same instance (e.g. 
their pet dog), and the personalized model can generate images based on the references, with the flexibility of text-guided editing for new context. This is generally achieved by associating a token with the personalized concept through fine-tuning the model parameters [476, 303] or newly added learnable token embeddings [157, 571]. In many cases, however, user may want to personalize T2I generation over a more abstract visual attribute instead of a specific instance-level personalization. For example, a designer may seek inspiration by generating a variety of novel cartoon characters or scenery images following similar visual attributes presented in their previous works. In this case, trying over text prompts is not scalable and hard to get desired result that follows the desired visual attributes. On the other hand, using the existing personalization methods aforementioned is likely to fail when training images when the training images do not represent the same instance, but rather encompass a distribution sharing certain, yet challenging-to-articulate, commonalities. Additionally, existing personalization methods often result in limited diversity and variation during generation (Figure 7.3). Since the associated token is fixed, these methods will typically learn a token that is either overfitted to a combination of visual features, or learn a token that is overly generalized, which introduces more randomness into the uncontrollable diffusion process, thereby failing to follow desired visual attributes in generated images. 116 In this work, we propose , a prompt distribution learning approach on T2I diffusion model for various downstream tasks (Figure 7.1). Our proposed solution has three key components (Figure 7.2). First, to adapt a pretrained fixed T2I model, instead of fine-tuning diffusion model parameters, our method builds on prompt tuning [666, 667], where we use soft learnable prompt embeddings with the flexibility to concatenate with text, to associate with the training image set. This design have several advantages: (1) It prevents catastrophic forgetting of the pretrained model, enabling it to learn an almost infinite variety of target prompt distributions using the same T2I diffusion model. (2) It is highly efficient in terms of parameters, requiring only the prompt itself as the learnable element. (3) The learned prompts remain within the semantic space of natural language, offering text-guided editing capabilities and generalizing to other pre-trained diffusion models, such as text-to-3D. (4) The learned distribution increased flexibility in managing variations. Second, we introduce a distribution of prompts to model various attributes described by reference images at a broader level. The prompt distribution is modeled by a set of learnable prompt embeddings to associate with the training image set as a whole. The learned prompt distribution can be treated as a distribution of learned “descriptions” of the reference images and should be able to model the commonalities and variations of visual attributes, e.g., foreground, style, background, texture, pose. During inference, we sample from the prompt distribution, which should have a similar semantic meaning, understood by the downstream denoising network, to produce in-distribution outputs with appropriate variations. 
Lastly, to effectively optimize the set of soft prompts that models the distribution, we apply a simple reparameterization trick [293] and an orthogonal loss to update the prompts at token embedding space simultaneously and orthogonally. We first demonstrate the effectiveness of our approach in customizing image generation tasks (Section 7.4). By taking a small set of images of interest as training images, we demonstrate that our approach can generate diverse in-distribution images where baseline methods fail to generate desired output. The diversity and the quality of our synthetic images are verified via automatic and human evaluation (Section 7.4.2). We show that the learned distribution holds the capability of text-guided editing, as well as further controllability such 117 as scaling the variance and composition of distributions (Section 7.4.3). Next we highlight that the learned prompt distribution can be easily applied to other text-guided generation tasks such as pretrained text-to3D models (Section 7.4.4). Lastly we show the effectiveness of our method on personalized distribution generation through classification task with synthetic training data as a proxy (Section 7.4.5). In summary, our contributions are: • We propose a distribution based prompt tuning methods for personalized distribution generation by learning soft prompt distribution using T2I diffusion model. • Using a public available pretrained T2I diffusion model, we experiment our approach on customization T2I generation tasks and show that our approach can capture visual attributes into prompt distribution and can generate diverse in-distribution images that follows text-guided editing. • Further experiments show that our learned distribution is controllable and flexible and easy to be adapted to other generation tasks that requires text as input. • We further quantitatively demonstrate the effectiveness of our approach using synthetic image dataset generation tasks as a proxy and also through automatic evaluation metrics and human evaluation. 7.2 Related Works 7.2.1 Text-to-image Diffusion Models Diffusion models [238, 119, 521] have achieved great success in various image generation tasks. State-of-theart T2I models such as Imagen [483] and DALL·E 2 [439] trained on large scale data demonstrate remarkable synthesis quality and controllability. Latent Diffusion Models [467] and its open-source implementation, Stable Diffusion [467], have also become a prevailing family of generative models. In these T2I diffusion models, text is encoded into latent vectors by pretrained language encoders such as CLIP [434], and the 118 denoising process is conditioned on latent vectors to achieve text-to-image synthesis. However, such models trained on large scale text-image pairs are not designed to generate personalized images such as images of one’s pet dog, therefore only the text conditioning cannot provide fine-grained control over the generated images. 7.2.2 Personalized text-to-image Generation Various approaches are proposed to better control the text-guided diffusion models and achieve personalization. Textual Inversion [157] proposed to search for a new token in the embedding space representing a visual concept via optimizing a word embedding vector. DreamBooth [476] fine-tunes all parameters of the model to associate a personalized subject into an rarely used token. 
Custom Diffusion [303] builds on this fine-tuning approach but updates only the cross-attention layers to reduce training time, and it can learn multiple concepts jointly. Subsequent works [571, 597, 505] mainly borrow ideas from these methods and focus on addressing their drawbacks.

7.2.3 Prompt Learning

Prompt learning is a popular method in natural language processing (NLP). The main idea is to recast various downstream NLP tasks as masked language modeling problems by adopting suitable prompt templates [57, 319, 334, 435] instead of fine-tuning the pretrained language model. Finding appropriate prompts is the key to this approach. Prompt engineering [57, 435] relies on carefully designed discrete (hard) prompts crafted by humans, while prompt tuning [319, 334] automatically searches for the desired prompts in the embedding space by learning continuous (soft) prompts. The success of these ideas in NLP inspired computer vision researchers, and prompting has been explored for pretrained vision-language models such as CLIP [434] and ALIGN [267]. CoOp [667] applies prompt tuning to vision-language tasks, learning a continuous prompt by minimizing the classification loss of the downstream task. ProDA [366] learns a distribution of diverse prompts to capture various representations of a visual concept rather than the single prompt of CoOp [667], which achieves better generalization. Most relevant to our work are Textual Inversion [157] and ProDA [366]. Textual Inversion learns a fixed token embedding associated with a pseudo-word. Ours learns a distribution of prompts in the CLIP feature space, like ProDA [366], allowing it to capture a visual concept with diverse visual representations and enough detail for faithful reconstruction and plausible synthesis.

7.3 Method

Given a set of images with some common visual attributes (e.g., same category, similar style), our goal is to capture the visual commonalities and variations and model them with a prompt distribution in the text feature space, which remains compatible with natural language. The commonalities among reference images may be difficult to articulate with natural language prompts. We can then sample prompts from the distribution to guide a T2I diffusion model to generate diverse unseen images that still follow the distribution of common traits. Because the learned prompts live in the text feature space, they are compatible with natural language instructions and with other pretrained text-guided generation models.

7.3.1 Text-to-Image Diffusion

Text-to-image diffusion models are a class of generative models that learn the distribution of images or image latents by gradually denoising a sample drawn from a Gaussian distribution. Specifically, given a natural language text prompt, a tokenizer followed by a text embedding layer maps the input text to a sequence of embedding vectors p. A text encoder converts the text embeddings into text features c = E(p) used for conditioning the generation process. An initial noise ϵ is sampled from N(0, I), and the denoising model ϵ_θ predicts the noise added to a noisy version of the image or image latent x. The denoising model ϵ_θ is optimized using the objective:

L = E_{x, c, ϵ, t} [ ∥ϵ − ϵ_θ(x_t, c, t)∥²₂ ]    (7.1)

where x is the ground-truth image or image latent obtained from a learned autoencoder, x_t is the noisy version of x at time-step t, and ϵ ∼ N(0, I).
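For completeness, a PyTorch-style sketch of how the objective in Equation 7.1 is typically computed for one batch of a latent diffusion model is shown below. The denoising network, text encoder, and noise schedule are passed in as arguments, so the function names and shapes are assumptions rather than the exact Stable Diffusion implementation.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(eps_model, text_encoder, latents, prompt_embeddings, alphas_cumprod):
    """Denoising objective of Eq. (7.1) for one training batch.

    eps_model: callable (x_t, c, t) -> predicted noise with the same shape as latents.
    text_encoder: callable mapping token embeddings p to conditioning features c = E(p).
    latents: (B, C, H, W) clean image latents x from the autoencoder.
    prompt_embeddings: (B, L, d) token embeddings p of the text prompt.
    alphas_cumprod: (T,) cumulative noise-schedule products (alpha_bar_t).
    """
    b = latents.shape[0]
    alphas_cumprod = alphas_cumprod.to(latents.device)
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=latents.device)
    eps = torch.randn_like(latents)                          # epsilon ~ N(0, I)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * latents + (1 - a_bar).sqrt() * eps  # noisy latent at time-step t
    c = text_encoder(prompt_embeddings)                      # conditioning features
    return F.mse_loss(eps_model(x_t, c, t), eps)             # || eps - eps_theta(x_t, c, t) ||^2
```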
7.3.2 Prompt Tuning

Our proposed method is grounded in prompt tuning, which aims to learn a soft, continuous prompt for a target task and is widely used for fine-tuning NLP models [319, 357, 334, 212, 356]. Specifically, for a pretrained model that takes a natural language prompt as input, we can formulate a prompt with continuous learnable token embeddings P = [PREFIX] V [SUFFIX] ∈ ℝ^{L×d}, where [PREFIX] and [SUFFIX] are word embeddings of a natural language prefix and suffix if needed, L is the prompt length (the total number of tokens), and d is the word embedding dimension. V = [v]_1 . . . [v]_M ∈ ℝ^{M×d} is a sequence of M learnable token embedding vectors with the same dimension as the word embeddings. During fine-tuning, the parameters of the pretrained generation model remain fixed, and only the learnable token embeddings V are updated, by backpropagating the corresponding loss through the generator ϵ_θ and the text encoder E. Formally, prompt tuning seeks the optimized embedding vectors V* = arg max_V P(Y | P, X), where X and Y are the input data and output label, respectively. Prior works have shown the efficacy of prompt tuning on vision-language models for image classification [667, 268, 666]. Gal et al. [157] adopt a similar prompt tuning method to enable personalized generation. However, this approach is limited to personalizing one particular concept, such as a specific dog, because it encodes the concept with a single fixed token embedding.

7.3.3 Learning Prompt Distribution

We aim to model the more general commonalities and variations present in the reference image set and to generate diverse images of new instances that visually align with it; we therefore model a learnable distribution of prompts for the reference images. Inspired by Lu et al. [366], who estimate a distribution of prompts for image classification, we model a distribution of learnable prompts over a sequence of M token embeddings to capture the distribution of visual attributes for T2I generation with a diffusion model. Our method builds on Stable Diffusion [467], where a pretrained CLIP [434] text encoder extracts the text features of the prompt. Due to CLIP's contrastive training objective, features of texts with similar semantic meaning have high cosine similarity and are therefore close to each other in CLIP feature space [434]. Lu et al. [366] have also shown that, for text prompts describing images of the same category, the CLIP text features c output by the pretrained encoder cluster together. It is therefore natural to model c with a Gaussian distribution for images of the same category or with shared attributes. To do so, instead of keeping one learnable soft prompt to optimize during training, we maintain a set of K learnable prompts P^K = {P_k = [PREFIX] V_k [SUFFIX]}_{k=1}^{K} corresponding to a set of similar reference images. Our goal is to optimize the set of learnable token embeddings {V_k}_{k=1}^{K}. With K learnable prompts, we can estimate the mean µ_c = µ(E(P^K)) ∈ ℝ^{L×d_E} and standard deviation σ_c = σ(E(P^K)) ∈ ℝ^{L×d_E} in the text encoder feature space E, where d_E is the feature dimension of that space.
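A minimal sketch of keeping K learnable prompts and estimating the Gaussian over their text features is given below; the reparameterized sample and the orthogonality penalty it exposes correspond to Equations 7.3 and 7.5 introduced next. It assumes the frozen text encoder accepts token embeddings directly, omits the prefix/suffix tokens, and uses an illustrative embedding dimension.

```python
import torch
import torch.nn as nn

class PromptDistribution(nn.Module):
    """K learnable soft prompts modeling a Gaussian in the text-encoder feature space."""

    def __init__(self, num_prompts=32, num_tokens=8, token_dim=1024):
        super().__init__()
        # V_k, k = 1..K: only these token embeddings are trainable; the T2I model stays frozen.
        self.prompts = nn.Parameter(0.02 * torch.randn(num_prompts, num_tokens, token_dim))

    def feature_stats(self, text_encoder):
        feats = text_encoder(self.prompts)                   # (K, L, d_E) text features E(P_k)
        return feats.mean(dim=0), feats.std(dim=0), feats    # mu_c, sigma_c, per-prompt features

    def sample(self, text_encoder):
        # Reparameterized draw c~ = mu_c + omega * sigma_c with omega ~ N(0, I) (Eq. 7.3).
        mu, sigma, _ = self.feature_stats(text_encoder)
        return mu + torch.randn_like(mu) * sigma

    def orthogonal_loss(self, text_encoder):
        # Mean |cosine similarity| over prompt pairs, computed on flattened features (Eq. 7.5).
        _, _, feats = self.feature_stats(text_encoder)
        f = nn.functional.normalize(feats.flatten(1), dim=-1)   # (K, L * d_E)
        sim = f @ f.t()
        k = sim.shape[0]
        iu = torch.triu_indices(k, k, offset=1)
        return sim[iu[0], iu[1]].abs().sum() / (k * (k - 1))
```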
Applied to the training objective of the T2I diffusion model, Equation 7.1 becomes:

L(P^K) = E_{x, c̃, ϵ, t} [ ∥ϵ − ϵ_θ(x_t, c̃, t)∥²₂ ]    (7.2)

where c̃ ∼ N(µ_c, σ²_c) and ϵ ∼ N(0, I) is the sampled Gaussian noise added to the image or image latent. However, sampling c̃ from a distribution is not differentiable, so we apply a reparameterization trick similar to that used in VAEs [293]. Formally, since c̃ ∼ N(µ_c, σ²_c), we can rewrite the objective in Equation 7.2 as:

L(P^K) = E_{x, ω, ϵ, t} [ ∥ϵ − ϵ_θ(x_t, µ_c + ωσ_c, t)∥²₂ ]    (7.3)

where ω ∼ N(0, I) has the same dimensions as µ_c and σ_c. Since the exact computation of L(P^K) is intractable, we use a Monte Carlo approach, sampling ω S times to approximate the expectation:

L(P^K) = (1/S) Σ_{s=1}^{S} ∥ϵ − ϵ_θ(x_t, µ_c + ω_s σ_c, t)∥²₂    (7.4)

To avoid the scenario in which multiple prompt features converge to the same vector, which would result in a non-representative, low-variance distribution, we apply an orthogonal loss similar to that proposed in [366] to penalize the cosine similarity and encourage orthogonality between each pair of prompts:

L_ortho = (1 / (K(K − 1))) Σ_{i=1}^{K} Σ_{j=i+1}^{K} |⟨E(P_i), E(P_j)⟩|    (7.5)

where ⟨·, ·⟩ is the cosine similarity between a pair of vectors. The total loss is therefore:

L = L(P^K) + λ L_ortho    (7.6)

where λ is a hyperparameter.

Implementation Details. In all experiments, we use Stable Diffusion 2.1 [467] and keep all default hyperparameters. We use S = 4 and λ = 5 × 10⁻³. We use K = 32 prompts in all personalized generation experiments and K = 10 prompts to reduce computation in the synthetic dataset experiments. We train for 1,500 steps with a constant learning rate of 10⁻³.

7.4 Experiments

In this section, we present several experiments and applications of our approach and show visual results of generated images. We demonstrate the ability of our approach to capture a distribution of reference images and generate novel in-distribution images in Section 7.4.1. We present additional quantitative results, including automatic evaluation and user studies, in Section 7.4.2. We also show the flexibility and effects of manipulating and text-guided editing of learned prompt distributions in Section 7.4.3. We further highlight the easy application of our learned prompt distribution to other text-based generation tasks, using text-to-3D as an example, in Section 7.4.4. Finally, in Section 7.4.5, we present experiments that show the effectiveness of our approach in generating synthetic training datasets.

7.4.1 Diverse Personalized Generation

We first demonstrate the ability of our approach to generate images that preserve the general visual features shown in the training set while at the same time exhibiting high diversity. Given a diverse set of a few training images (typically 5-20) that are not easily describable in text yet share some similar visual attributes, we can generate diverse in-distribution images by simply sampling from the learned distribution and using the sample as the prompt text embedding for the T2I diffusion model. Our learned prompt distribution can therefore be treated as a distribution of descriptions corresponding to the set of training images.

Baselines. We compare with popular instance-level personalization methods, including Textual Inversion [157], DreamBooth [476], and Custom Diffusion [303].
We also evaluate against Short Caption that uses a short description as text prompt, and Long Caption that uses a longer text caption with detailed descriptions. 124 Figure 7.3: Comparison of results with existing methods. Given a set of training images (typically 5-20, we only show 4 here), we compare generation results with other existing methods. We use Stable Diffusion version 2.1 for all methods. As can be seen on the bottom row, our method is able to generate more diverse and coherent images (also quantitatively analyzed by automatic and human evaluation in Section 7.4.2). fig:comparison These comparisons emphasize our method’s ability to take care of both similarity and diversity referencing the training images. We use the same pretrained Stable Diffusion version 2.1 with default hyperparameters provided in baseline works. We use M = 8 context vectors without adding any prefix or suffix texts in either training or inference process for DreamDistribution. Results Figure 7.3 shows visualized comparison with baselines. In general, both short and long text prompting methods fail to generate results that visually follow the reference images since there is no training involved and the image details are hard to describe in language. Images generated using baseline methods generally show limited variation or inconsistent visual attributes in all examples. All these methods try to associate different visual concepts with a fixed token, which does not provide any semantic variations itself. Although the denoising process enables some randomness, the training objective of associating various concepts with a fixed token will either fail to capture a distribution due to non-convergence, leading to underfitting to generic image category information, or overfits to a visual combination of the training images. 125 By modeling multiple concepts using multiple prompts and optimizing the prompt distribution, our proposed method is able to produce substantial variations of style and view points, for example, following the reference images in the cathedral example (first column). Ours method can also model the texture and background information and generate new instance with significant variations in color and pose following the reference images of the Gundam example (second column), as well as patterns, lines, style as a whole and generate novel artistic creations as shown in the Basquiat’s painting example (third column). In all, DreamDistribution is able to produce substantial variations on style, viewpoints, pose, layout, etc., with appropriate visual attributes following the reference images. 7.4.2 Generation Quality and Diversity Evaluation sec:quantitative Model FID↓ CLIP-I↑ DINO↑ Density↑ Coverage↑ DreamBooth [476] 234.9071.87 0.790.06 0.460.10 0.910.52 0.740.32 Textual Inversion [157] 224.2375.49 0.830.04 0.480.10 1.280.44 0.820.17 Custom Diffusion [303] 236.6172.76 0.800.05 0.460.07 1.450.79 0.870.18 Ours 215.1572.65 0.840.03 0.500.09 1.590.47 0.930.09 Table 7.1: Our method achieves the best quality and diversity automatic metrics across 12 scenarios. Mean metrics are reported with standard deviations shown in subscript. tab:auto_eval_quality We quantitatively assess our methods in terms of diversity and quality, and further use synthetic ImageNet classification performance as a proxy in Section 7.4.5. 
We train DreamBooth, Textual Inversion, Custom Diffusion and DreamDistribution on 12 diverse image scenarios including photos of real objects in large and small scales, works of famous artists, as well as illustrations of cartoon characters and scenery images with prominent styles, sourced from illustrators from online communities. For our approach we use M = 4 learnable context with no prefix and suffix in both training and generating stages. Automatic Metrics We evaluate the generative images on established automatic evaluation metrics that measure the diversity of synthetic images and the similarity between real and synthetic images. Following prior works [231, 648, 396, 476], in Table 7.1 we evaluate image quality using FID [231] that measures the 126 distance between the distribution of generated images and the distribution of real images via InceptionV3 [542]; CLIP-I and DINO [476] that measures average pairwise cosine similarity between CLIP [434] and DINOv1 [69] embeddings. Our method achieves the best quality across all three quality measurements, suggesting that our method is capable of creating more high-quality images that fulfill the prompt requirement. Additionally, we report Density and Coverage [396] in Table 7.1. Density measures samples in regions where real samples are densely packed, while coverage calculates fraction of real samples whose neighbourhoods contain at least one generated sample. Both metrics are calculated with DINOv2 [411]. Our method achieves the best coverage and diversity across the board. Figure 7.4: Human Evaluation on image diversity (Section 7.4.2) aligns with automatic evaluation (Table 7.1). Our method shows significantly greater diversity, which may explain why it was able to better train image classifiers in Table 7.2. fig:human_eval_diversity Human Evaluation Admittedly, automatic evaluation does not fully capture the richness perceived by human observers. We further investigate if Table 7.1 correlates with human perception via conducting human evaluation based on those 12 sets of reference images. For each reference image set, we generate images using 127 Figure 7.5: Effect of scaling the variance of a learned prompt distribution. Image diversity increases as the scaling factor γ increases. fig:scale_var DreamBooth, Textual Inversion, Custom Diffusion, and our method, with 40 images per method, resulting in a total of 1,920 generated images in the evaluation set. We assign 10 independent annotators. For each of the 12 reference sets, annotators are asked to choose the most preferable set of generated images based on their perceived similarity with the reference set and the diversity within the generated set. The methods are anonymized so annotators are unaware of which generated set corresponds to which method. We collect a total of 120 samples and count the frequency of preferences. Figure 7.4 demonstrates that our generated images exhibit superior diversity compared to three baseline models, reinforcing our intuition that by learning distribution we are able to generate diverse images with coherent content and visual attributes presented in the reference image. 128 Figure 7.6: Composition of prompt distributions using linear interpolation between Chinese painting and Van Gogh. Mixing ratio changes linearly from left to right. The middle columns show mixtures of two styles. 
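For reference, the pairwise-similarity scores in Table 7.1 (CLIP-I, and DINO with a different backbone) boil down to the following computation on precomputed image embeddings; the embedding extraction itself is omitted, and this function is only a sketch of the metric.

```python
import torch
import torch.nn.functional as F

def pairwise_image_similarity(real_feats, gen_feats):
    """Mean pairwise cosine similarity between real and generated image embeddings.

    real_feats: (N_real, d) embeddings of the reference images.
    gen_feats:  (N_gen, d) embeddings of the generated images.
    """
    real = F.normalize(real_feats, dim=-1)
    gen = F.normalize(gen_feats, dim=-1)
    return (gen @ real.t()).mean().item()
```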
fig:composition 7.4.3 Controllability of Prompt Distribution sec:manipulation Since our learned prompt distribution is in the CLIP text feature space, it is natural to manipulate the learned distribution based on the property of CLIP text feature space. We show several interesting distribution manipulation methods, including text-guided editing, scaling the variance for diversity control, interpolation between multiple distributions. Text-guide Editing Similar to existing personalization methods [476, 157, 303], our learned distribution preserves the flexibility of text-guided editing . As shown in Figure 7.1 and Figure 7.7, we are able to generate diverse in-distribution Gundam figures that follows the style of reference images but with different pose, style, context, using user provided text-guidance at inference time. With a set of learned prompt, we concatenate them with the same text prefix and/or suffix to fit a new distribution at the CLIP text feature space to enable text-guided editing of a prompt distribution. Application includes but not limited to, generating 129 Figure 7.7: Results on text-editability of our methods. Left column shows samples of reference images, right columns are generated results with corresponding prompts. fig:text_edit objects of interests in a different background or context, transferring style using text, and controlling the pose, viewpoints, layout, of objects of interests. Scaling Variance for Diversity Control Once a prompt distribution is learned, we can easily control the diversity of generated images by changing the variance or standard deviation of the learned distribution. We show an example of the effect of multiplying different scale factors γ to the variance of a learned prompt distribution in Figure 7.5. When γ = 0, the generated images show very similar patterns following some of the reference images. As γ increases, more different layouts emerge, and when we further scale the variance for γ = 2, the generated images become more diverse with significant randomness. Composition of Distributions Given multiple prompt distributions in CLIP feature space, we can composite distributions by finding a linearly interpolated distribution between them. This distribution in the CLIP feature space should represent a text with semantic meaning that is a weighted mixture of the given prompt 130 distributions, thereby showing a mixture of visual attributes in the generated images. We naively use a weighted sum of the distributions to interpolate between distributions: µ ∗ c = X N i=1 αiµci , σ ∗ c = X N i=1 √ αiσci (7.7) where µ ∗ c and σ ∗ c are mean and standard deviations of the interpolated distribution, and αi is the weight of i-th prompt distribution with mean and standard deviation µci and σci respectively, and PN i=1 αi = 1 are mixing weight parameters. We show an example of mixing distributions of Chinese paintings and Van Gogh style paintings in Figure 7.6. From the left column to right, we adjust the mixing ratio to increase the weight of prompt distribution of Van Gogh and decrease the weight of Chinese painting. 7.4.4 Applying to Text-to-3D Generation sec:text23d Our learned distribution can be flexibly applied to other text-driven tasks, as long as the generation pipeline uses the same pretrained text encoder as the text feature extractor. In this section, we highlight and demonstrate the flexibility of our method by using a prompt distribution trained on T2I diffusion for text-to-3D task. 
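The variance scaling and distribution composition operations of Section 7.4.3 act directly on the distribution parameters (µ_c, σ_c) learned in Section 7.3.3; a minimal sketch is shown below, assuming the variance is scaled by γ (so σ is scaled by √γ) and the mixture follows Equation 7.7.

```python
import torch

def scale_variance(mu, sigma, gamma):
    """Diversity control: multiply the variance of a learned prompt distribution by gamma."""
    return mu, sigma * (gamma ** 0.5)

def compose_distributions(mus, sigmas, alphas):
    """Linear composition of N prompt distributions (Eq. 7.7); alphas must sum to 1."""
    mu_mix = sum(a * m for a, m in zip(alphas, mus))
    sigma_mix = sum((a ** 0.5) * s for a, s in zip(alphas, sigmas))
    return mu_mix, sigma_mix

def sample_prompt_feature(mu, sigma):
    """Draw a conditioning feature c~ ~ N(mu, sigma^2) to feed the T2I (or text-to-3D) model."""
    return mu + torch.randn_like(mu) * sigma
```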
We use MVDream [506], a state-of-the-art text-to-3D model that train a NeRF [384] and render a 3D asset following a text prompt, which in our case is a prompt sampled from prompt distribution. As shown in Figure 7.1 and Figure 7.8, although MVDream incorporates some extra prior in its modified multi-view diffusion model that leads to reduced diversity, our prompt distribution can still generate 3D assets with significant variation in design details. Moreover, as shown in Figure 7.9, the pipeline possesses text-guided editing capabilities akin to those of DreamBooth3D [437], yet it can generate instances that exhibit more diverse appearances. 131 Figure 7.8: 3D generation results by learning a prompt distribution over the reference images and then inference using MVDream [506] (without extra texts). fig:3d Top-1 Top5 Top-1 Top5 Top-1 Top5 Top-1 Top5 Top-1 Top5 Real 88.0 96.7 85.1 94.9 45.1 63.9 66.1 85.2 26.7 65.8 Class Names 45.5 70.0 46.2 72.5 24.1 43.3 53.6 75.8 8.1 38.8 CLIP Prompts [434] 45.6 69.2 46.1 69.6 36.2 60.1 58.8 81.1 12.2 45.7 ImageNet-SD [486] 55.4 77.5 55.8 77.5 29.4 49.0 59.8 80.0 15.9 49.4 DreamDistribution (Ours) 64.3 84.0 61.7 81.6 25.2 45.8 53.0 74.8 15.7 50.4 Training Dataset IN [550] IN-V2 [447] IN-Sketch [579] IN-R [225] IN-A [230] Table 7.2: Classification accuracy on different real test sets after training a classifier on synthetic ImageNet (IN) generated by a given method. When training on images from our method, the resulting classifier performs better on the respective test sets, indicating that the images synthesized by our method allowed the classifier to learn those object categories better. tab-class 132 Figure 7.9: 3D generation results by learning a prompt distribution over the reference images and then inference with text-guided editing using MVDream [506]. fig:3d_text_edit 133 7.4.5 Applying to Synthetic Dataset Generation sec:imagenet Our proposed method can also be effectively used in generating synthetic image classification datasets. By giving several dozens to hundreds of images that correspond to a class in a classification dataset, our method can capture and encode distributions of the dataset images into the learnable prompt distributions, and thereby generate diverse training images with similar distribution as the training set. We generate “synthetic copy” [168, 172, 486, 175] of ImageNet [480] via DreamDistribution using Stable Diffusion version 2.1 with default hyperparameters. Due to the large size of ImageNet-1K, we follow previous works [486] to mainly experiment on ImageNet-100 [550], a 100-class subset. For each class, we generate 2,000 synthetic images and use CLIP [434] to select top 1,300 images with highest cosine similarity to the embedding vector of the corresponding class name, resulting the same total number of images as real ImageNet training set. We also compare with four baselines: Real uses the real ImageNet training set, Class Names and CLIP Prompts generate images by feeding Stable Diffusion class name of each class or 80 diverse text prompts from CLIP. * ImageNet-SD [486] generates images using prompts in the form of “c, hc inside b”, where c represents the class name, hc represents the hypernym (WordNet parent class name) of the class, and b is a random background description from the class names from Places365 dataset [664]. We train a ResNet-50 [220] classifier on synthetic images only for 300 epochs using 0.2 alpha for mixup augmentation [643] and auto augment policy v0 via timm [600]. 
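The CLIP-based filtering used to keep the top 1,300 synthetic images per class can be sketched as follows; it assumes the CLIP image embeddings of the generated images and the text embedding of the class name have already been computed with the same CLIP model.

```python
import torch
import torch.nn.functional as F

def filter_by_clip(image_feats, class_text_feat, top_k=1300):
    """Select the generated images most similar to the class-name text embedding.

    image_feats: (N, d) CLIP image embeddings of synthetic images for one class.
    class_text_feat: (d,) CLIP text embedding of the class name.
    Returns the indices of the top_k selected images.
    """
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(class_text_feat, dim=0)
    sims = img @ txt                                   # cosine similarity per image
    return sims.topk(min(top_k, sims.numel())).indices
```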
To analyze generalizability, we also evaluate the trained model on validation set of ImageNet variants including ImageNetV2 [447], ImageNet-Sketch [579], ImageNet-R [225], and ImageNet-A [230]. Top-1 and top-5 accuracy is reported in Table 7.2. In all settings, the classifier is exclusively exposed to synthetic images, but images generated using our method shows the highest classification accuracy on ImageNet validation set. This is because DreamDistribution can generate a diverse set of high-quality images following training set distribution, while other prompt engineering methods cannot follow the real image distribution and *e.g. “a photo of c”, “a drawing of a c”, where c represents the class name. 134 tend to show limited diversity within classes, therefore resulting in performance degradation. We also achieve the best results on ImageNet-V2 and comparable results on ImageNet-A. For the Sketch and Rendition variant, in contrast to our method, CLIP Prompts and ImageNet-SD offer specific prompts to generate images of other domains, which may account for our comparatively lower performance. 7.5 Limitations Despite the ability of our method to generate diverse novel in-distribution images, it does have certain limitations. Specifically, our method may struggle to capture visual features when the number of training images is limited and very diverse. Moreover, the Gaussian distribution assumption could be overly restrictive depending on the training images and the text encoder’s latent space. In the future, we hope to find a more robust approach to learning distributions from a few, highly diverse images, with more accurate assumptions and resilient distribution forms. 7.6 Conclusion We introduced DreamDistribution, a distribution based prompt tuning method for personalizing T2I diffusion models to generate diverse in-distribution images following a small set of reference images. The key idea of our methods lies in modeling the commonalities and variations of visual attributes using a prompt distribution at text feature space. We show a variety of experiments and application that is enabled by our method. 135 Chapter 8 Neural-Sim: Learning to Generate Training Data with NeRF chapter-8 Training computer vision models usually requires collecting and labeling vast amounts of imagery under a diverse set of scene configurations and properties. This process is incredibly time-consuming, and it is challenging to ensure that the captured data distribution maps well to the target domain of an application scenario. Recently, synthetic data has emerged as a way to address both of these issues. However, existing approaches either require human experts to manually tune each scene property or use automatic methods that provide little to no control; this requires rendering large amounts of random data variations, which is slow and is often suboptimal for the target domain. We present the first fully differentiable synthetic data pipeline that uses Neural Radiance Fields (NeRFs) in a closed-loop with a target application’s loss function. Our approach generates data on-demand, with no human labor, to maximize accuracy for a target task. We illustrate the effectiveness of our method on synthetic and real-world object detection tasks. We also introduce a new “YCBin-the-Wild” dataset and benchmark that provides a test scenario for object detection with varied poses in realworld environments. Code and data could be found at https://github.com/gyhandy/Neural-Sim-NeRF. 
8.1 Introduction chap-8-sec:intro The traditional pipeline for building computer vision models involves collecting and labelling vast amounts of data, training models with different configurations, and deploying it to test environments [219, 451, 136 Test scenario 1 Test scenario 2 Test scenario N … Neural-Sim D-train 1 D-train 2 D-train N Generate as Demand … 40 50 60 70 80 90 100 Pose Optimization 50 60 70 80 90 100 Zoom Optimization 60 70 80 90 100 Illumination Optimization -31 +32 -30 +31 -16 +15 (a) (b) No train/test domain gap Neural-Sim optimize pose/zoom/illumination parameter to fill the gap Exist train/test domain gap caused by pose/zoom/illumination Figure 8.1: (a) On-demand synthetic data generation: Given a target task and a test dataset, our approach “Neural-sim" generates data on-demand using a fully differentiable synthetic data generation pipeline which maximises accuracy for the target task. (b) Train/test domain gap causes significant detection accuracy drop (yellow bar to gray bar). We dynamically optimize the render parameters (pose/zoom/illumination) to generate the best data to fill the gap (blue bar). chap-8-fig1 499]. Key to achieving good performance is collecting training data that mimics the test environment with similar properties relating to the object (pose, geometry, appearance), camera (pose and angle), and scene (illumination, semantic structures)[32]. However, the traditional pipeline does not work very well in many real-world applications as collecting large amounts of training data which captures all variations of objects and environments is quite challenging. Furthermore, in many applications, users may want to learn models for unique objects with novel structures, textures, or other such properties. Such scenarios are very common particularly in business scenarios where there is desire to create object detectors for new products introduced in the market. Recent advances in rendering, such as photo-realistic renderers [114, 183] and generative models (GANs [55], VAEs [120, 233]), have brought the promise of generating high-quality images of complex scenes. This has motivated the field to explore synthetic data as source of training data [123, 258, 129, 215, 241, 458, 470, 557, 166, 651, 172]. However, doing so in an offline fashion has similar issues as the traditional pipeline. While it alleviates certain difficulties, e.g., capturing camera/lighting variations, it create dependency on 3D asset creation, which is time-consuming. 137 Recently, a new image generation technique called the Neural Radiance Field (NeRF) [383] was introduced as a way to replace the traditional rasterization and ray-tracing graphics pipelines with a neural-network based renderer. This approach can generate high-quality novel views of scenes without requiring explicit 3D understanding. More recent advancements in NeRFs allow to control other rendering parameters, like illumination, material, albedo, appearance, etc. [528, 376, 650, 41, 264]. As a result, they have attracted significant attention and have been widely adopted in various graphics and vision tasks [156, 41, 528, 415]. NeRF and their variants possess some alluring properties: (i) differentiable rendering, (ii) control over scene properties unlike GANs and VAEs, and (iii) they are data-driven in contrast to traditional renderers which require carefully crafting 3D models and scenes. These properties make them suitable for generating the optimal data on-demand for a given target task. 
To this end, we propose a bilevel optimization process to jointly optimize neural rendering parameters for data generation and model training. Further, we also propose a reparameterization trick, sample approximation, and patch-wise optimization methods for developing a memory efficient optimization algorithm. To demonstrate the efficacy of the proposed algorithm, we evaluate the algorithm on three settings: controlled settings in simulation, on the YCB-video dataset [610], and in controlled settings on YCB objects captured in the wild. This third setting is with our newly created “YCB-in-the-wild” dataset, which involves capturing YCB objects in real environments with control over object pose and scale. Finally, we also provide results showing the interpretability of the method in achieving high performance on downstream tasks. Our key contributions are as follows: (1) To the best of our knowledge, for the first time, we show that NeRF can substitute the traditional graphics pipeline and synthesize useful images to train downstream tasks (object detection). (2) We propose a novel bilevel optimization algorithm to automatically optimize rendering parameters (pose, zoom, illumination) to generate optimal data for downstream tasks using NeRF and its variants. 138 (3) We demonstrate the performance of our approach on controlled settings in simulation, controlled settings in YCB-in-wild and YCB-video datasets. We release YCB-in-wild dataset for future research. 8.2 Related work chap-8-sec:related_work Traditional Graphics rendering methods can synthesize high-quality images with controllable image properties, such as object pose, geometry, texture, camera parameters, and illumination [458, 114, 183, 241, 457]. Interestingly, NeRF has some important benefits over the traditional graphics pipelines, which make it more suitable for learning to generate synthetic datasets. First, NeRF learns to generate data from new views based only on image data and camera pose information. In contrast, the traditional graphics pipeline requires 3D models of objects as input. Getting accurate 3D models with correct geometry, material, and texture properties generally requires human experts (i.e. an artist or modeler). This, in turn, limits the scalability of the traditional graphics pipeline in large-scale rendering for many new objects or scenes. Second, NeRF is a differentiable renderer, thus allowing backpropagation through the rendering pipeline for learning how to control data generation in a model and scene-centric way. Deep generative models, such as GANs [196, 55], VAEs [120, 233] and normalizing flows [101] are differentiable and require less human involvement. However, most of them do not provide direct control of rendering parameters. While some recent GAN approaches allow some control [612, 11, 177] over parameters, it is not as explicit and can mostly only change the 2D properties of images. Further, most generative models need a relatively large dataset to train. In comparison, NeRF can generate parameter-controllable high-quality images and requires a lesser number of images to train. Moreover, advancements in NeRF now allow the control of illumination, materials, and object shape alongside camera pose and scale [528, 376, 650, 41, 264]. We use NeRF and their variants (NeRF-in-the-wild [376]) to optimize pose, zoom and illumination as representative rendering parameters. 139 Learning simulator parameters. 
Related works in this space focus on learning non-differentiable simulator parameters for e.g., learning-to-simulate (LTS) [477], Meta-Sim [280], Meta-Sim2 [117], Auto-Sim [35], and others [622, 160, 364]. Our work in contrast has two differences: (i) a difference in the renderer used (NeRF vs traditional rendering engines), and (ii) a difference in the optimization approach. We discuss the different renderers and their suitability for this task in the previous subsection. LTS [477] proposed a bilevel optimization algorithm to learn simulator parameters that maximized accuracy on downstream tasks. It assumed both data-generation and model-training as a black-box optimization process and used REINFORCE-based [602] gradient estimation to optimize parameters. This requires many intermediate data generation steps. Meta-sim [280] is also a REINFORCE based approach, which requires a grammar of scene graphs. Our approach does not use scene grammar. Most similar to our work is the work of Auto-Simulate [35] that proposed a local approximation of the bilevel optimization to efficiently solve the problem. However, since they optimized non-differentiable simulators like Blender [114] and Arnold [183], they used REINFORCE-based [602] gradient update. Further, they have not shown optimization of pose parameter whose search space is very large. In comparison, our proposed Neural-Sim approach can learn to optimize over pose parameters as well. chap-8-sec:method 8.3 Neural-Sim The goal of our method is to automatically synthesize optimal training data to maximize accuracy for a target task. In this work, we consider object detection as our target task. Furthermore, in recent times, NeRF and its variants (NeRFs) have been used to synthesize high-resolution photorealistic images for complex scenes [528, 376, 650, 41, 264]. This motivates us to explore NeRFs as potential sources of generating training data for computer vision models. We propose a technique to optimize rendering parameters of NeRFs to generate the optimal set of images for training object detection models. 140 Rendering Parameter (~) Training on Validation on Val score ℒ Neural Renderer (NeRF) Synthetic Dataset Bi-level Optimization Object Detector (RetinaNet) NeRF TV Figure 8.2: Neural-Sim pipeline: Our pipeline finds the optimal parameters for generating views from a trained neural renderer (NeRF) to use as training data for object detection. The objective is to find the optimal NeRF rendering parameters ψ that can generate synthetic training data Dtrain, such that the model (RetinaNet, in our experiments) trained on Dtrain, maximizes accuracy on a downstream task represented by the validation set Dval. chap-8-fig:pipeline NeRF model: NeRF [383, 627] takes as input the viewing direction (or camera pose) denoted as V = (ϕ, ρ), and renders an image x = NeRF(V ) of a scene as viewed along V . Note that our proposed technique is broadly applicable to differentiable renderers in general. In this work, we also optimize NeRFin-the-wild (NeRF-w) [376] as it allows for appearance and illumination variations alongside pose variation. We first discuss our framework for optimizing the original NeRF model and later we discuss optimization of NeRF-w in Section 8.3.2. Synthetic training data generation: Consider a parametric probability distribution pψ over rendering parameters V , where ψ denotes the parameters of the distribution. 
It should be noted that ψ covers all rendering parameters, including pose, zoom, and illumination; for simplicity, we use ψ here to denote the pose variable. To generate the synthetic training data, we first sample rendering parameters V_1, V_2, ..., V_N ∼ p_ψ. We then use NeRF to render synthetic training images x_i = NeRF(V_i) with the respective rendering parameters V_i, and use an off-the-shelf foreground extractor to obtain labels y_1, y_2, ..., y_N. The resulting training dataset is denoted D_train = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}.

Optimizing synthetic data generation. Our goal is to optimize over the rendering distribution p_ψ such that training an object detection model on D_train leads to good performance on D_val. We formulate this as a bi-level optimization problem [96, 150, 35]:

\min_{\psi} L_{val}(\hat{\theta}(\psi)) \quad \text{s.t.} \quad \hat{\theta}(\psi) \in \arg\min_{\theta} L_{train}(\theta, \psi), \qquad (8.1)

where θ denotes the parameters of the object detection model, L_{train}(\theta, \psi) = \mathbb{E}_{V \sim p_\psi}[\ell(x, \theta)] \approx \frac{1}{N}\sum_{i=1}^{N} \ell(x_i, \theta) is the training loss over the synthetic dataset rendered by NeRF,* and L_{val} is the loss on the task-specific validation set D_val.

*For simplicity, we drop the dependence of the loss ℓ on the labels y.

The bi-level problem in Equation 8.1 is challenging to solve: any gradient-based algorithm needs an efficient approximation of ∇_ψ θ̂(ψ), which in turn requires propagating gradients through the entire training trajectory of a neural network. We therefore turn to numerical approximations. Recently, Behl et al. [35] developed a numerical gradient computation based on a local approximation of the bi-level problem. Without repeating their derivation, we borrow the gradient term for the outer update, which at time step t takes the form

\frac{\partial L_{val}(\hat{\theta}(\psi))}{\partial \psi}\bigg|_{\psi=\psi_t} \approx - \underbrace{\frac{\partial}{\partial \psi}\Big[\frac{\partial L_{train}(\hat{\theta}(\psi_t), \psi)}{\partial \theta}\Big]^{T}\bigg|_{\psi=\psi_t}}_{\nabla_{NeRF}} \; \underbrace{H(\hat{\theta}(\psi_t), \psi)^{-1}\, \frac{d L_{val}(\hat{\theta}(\psi_t))}{d\theta}}_{\nabla_{TV}}. \qquad (8.2)

We divide this gradient into two parts: ∇_NeRF corresponds to backpropagation through the dataset generation from NeRF, and ∇_TV corresponds to approximate backpropagation through training and validation (Figure 8.2); ∇_TV is computed with the conjugate gradient method [35]. However, [35] treated data generation as a black box and used REINFORCE [601] to compute an approximate gradient, because they relied on non-differentiable renderers. REINFORCE is a noisy process and is known to yield high-variance gradient estimates. In contrast, NeRF is differentiable, which lets us compute more accurate gradients. We propose an efficient technique for computing ∇_NeRF, discussed in the next section.

8.3.1 Backprop through data generation from NeRF

A good gradient estimator should have the following properties: (i) high accuracy and low noise, (ii) computational efficiency, and (iii) a low memory footprint. We leverage two properties of NeRF, its differentiability and its pixel-wise rendering, to design a customized technique that satisfies all three. To compute ∇_NeRF in Equation 8.2, we approximate L_train(θ, ψ) using the samples in D_train as L_{train}(\theta, \psi) \approx \frac{1}{N}\sum_{i=1}^{N} \ell(x_i, \theta).
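Before unpacking ∇_NeRF, note that the ∇_TV factor in Equation 8.2 is an inverse-Hessian-vector product and is computed with conjugate gradient. The following is a minimal PyTorch sketch of that computation; the damping term, iteration count, and variable names are assumptions for illustration, not the exact Neural-Sim implementation.

import torch

def hvp(train_loss, params, vec):
    # Hessian-vector product of the training loss w.r.t. the detector parameters.
    grads = torch.autograd.grad(train_loss, params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    gv = (flat * vec).sum()                       # vec is a detached vector
    hv = torch.autograd.grad(gv, params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in hv])

def conjugate_gradient(hvp_fn, b, iters=10, damping=1e-3):
    # Approximately solve (H + damping * I) x = b.
    x = torch.zeros_like(b)
    r = b.clone()
    p = r.clone()
    rs = r @ r
    for _ in range(iters):
        Ap = hvp_fn(p) + damping * p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# grad_val: flattened (detached) dL_val/dtheta at the trained detector parameters;
# train_loss: L_train on a synthetic batch at the same parameters.
# nabla_TV = conjugate_gradient(lambda v: hvp(train_loss, params, v), grad_val)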
For ∇_NeRF, applying the chain rule to each per-image term ℓ(x_i, θ) gives

\frac{\partial}{\partial \psi}\, \frac{\partial \ell(x_i, \hat{\theta}(\psi_t))}{\partial \theta} = \frac{\partial \big( \frac{\partial \ell(x_i, \hat{\theta}(\psi_t))}{\partial \theta} \big)}{\partial x_i}\; \frac{\partial x_i}{\partial V_i}\; \frac{d V_i}{d \psi}. \qquad (8.3)

The first term is a second-order derivative through the object detection network and can be computed analytically for each image x_i. The second term is the gradient of the rendered image with respect to the NeRF inputs, which is also well defined and can be obtained by backpropagating through the differentiable NeRF rendering x_i = NeRF(V_i). While both of these terms have exact analytical expressions, naively computing and using them in Equation 8.2 becomes impractical even for small problems (see Tool 2 and Tool 3 below for details and proposed solutions). Finally, the third term dV_i/dψ requires differentiating through the probabilistic sampling V_i ∼ p_ψ. We consider p_ψ over discretized bins of pose parameters, and for such discrete distributions dV_i/dψ is not well defined; instead, we approximate this term with the reparameterization technique described in Tool 1. We summarize our technical tools below.

For distributions p_ψ over discrete bins of pose parameters, we propose a reparametrization of ψ that provides an efficient approximation of dV_i/dψ (Tool 1).

We dramatically reduce the memory and computation overhead of implementing the gradient approximation in Equation 8.2 with a new twice-forward-once-backward approach (Tool 2); without it, the implementation would require manipulating very large matrices and computational graphs.

Even with the above technique, computing the first and second terms in Equation 8.3 incurs a large GPU-memory overhead that depends on image size. We overcome this with the patch-wise gradient computation described in Tool 3.

8.3.1.1 Tool 1: Reparametrization of pose sampling

NeRF renders image x_j from camera pose V_j = (ϕ_j, ρ_j), where ϕ_j ∈ [0, 360] and ρ_j ∈ [0, 360]. For simplicity we describe our method for optimizing over just ϕ, while keeping ρ fixed to be uniform. We discretize the pose space into k equal-sized bins over the range of ϕ, B_1 = [0, 360/k), B_2 = [360/k, 2·360/k), ..., and define the distribution over ϕ as the categorical distribution with p_i the probability of ϕ falling in B_i. This distribution is thus parametrized by ψ ≡ p = [p_1, ..., p_k]. To backpropagate through the sampling process, we approximate the sample from the categorical distribution using the Gumbel-softmax “reparameterization trick” with parameters y ∈ R^k, where

y_i = GS_i(p) = \frac{\exp[(G_i + \log p_i)/\tau]}{\sum_j \exp[(G_j + \log p_j)/\tau]}, \qquad (8.4)

where G_i ∼ Gumbel(0, 1) are i.i.d. samples from the standard Gumbel distribution and τ is a temperature parameter. The random vector y defined this way has the property that the index of its largest element follows the categorical distribution with parameter p.

Figure 8.4: A concrete example of a single sample: starting from a particular value of ψ, we follow the reparametrized sampling to obtain a pose. Each sample is a pose that is fed to NeRF to render one image.

We now describe the approximate sampling from the categorical distribution (see Figure 8.3 and Figure 8.4 for a depiction).
Denote the bin center of B_i as \bar{B}^{ce}_i = 360(i − 0.5)/k and the bin range as \bar{b}^{ra} = 360/k. We generate V_j = (ϕ_j, ρ_j) ∼ p_ψ as follows:

1. Generate y_i for i = 1, 2, ..., k from Equation 8.4.
2. Define b^{ce}_j = Σ_i y_i \bar{B}^{ce}_i as the approximate bin center.
3. Define the bin for the j-th sample, centered at b^{ce}_j, as [b^{st}_j, b^{en}_j] = [b^{ce}_j − \bar{b}^{ra}/2, b^{ce}_j + \bar{b}^{ra}/2].
4. Sample ϕ_j from the uniform distribution over [b^{st}_j, b^{en}_j], which has a standard reparametrization for differentiability: U(b^{st}_j, b^{en}_j) ≡ (1 − ϵ) b^{st}_j + ϵ b^{en}_j with ϵ ∼ U(0, 1).
5. Sample ρ_j ∼ U[0, 360], or follow the same process as for ϕ_j.

Note that in general the approximate bin centers b^{ce}_j need not be aligned with the original categorical distribution; however, we can control the approximation with the temperature parameter τ. As τ → 0, y approaches a one-hot vector and exactly emulates sampling from the categorical distribution.

Figure 8.3: Bin sampling. We first discretize the pose space into a set of k bins, which we then sample to generate the view parameters for NeRF. To backpropagate through the sampling process, we approximate the sample from the categorical (i.e., bin) distribution using the Gumbel-softmax “reparameterization trick”. Within each bin we sample uniformly.

We now have the full expression for the approximate gradient ∇_NeRF using Equation 8.3 and the reparametrization:

\nabla_{NeRF} \approx \frac{1}{N}\sum_{j=1}^{N} \frac{\partial \big( \frac{\partial \ell(x_j, \hat{\theta}(\psi_t))}{\partial \theta} \big)}{\partial x_j}\; \frac{\partial x_j}{\partial V_j}\; \frac{\partial V_j}{\partial (b^{st}_j, b^{en}_j)}\; \frac{\partial (b^{st}_j, b^{en}_j)}{\partial y}\; \frac{\partial y}{\partial p}. \qquad (8.5)

Below we present two tools that drastically improve compute and memory efficiency and are crucial for our pipeline.

8.3.1.2 Tool 2: Twice-forward-once-backward

The full gradient update of our bi-level optimization problem uses the approximation of ∇_NeRF in Equation 8.5 inside Equation 8.2. This computation has three terms with the following dimensions: (1) ∂(∂ℓ(x_j, θ̂(ψ_t))/∂θ)/∂x_j ∈ R^{m×d}, (2) ∂x_j/∂ψ ∈ R^{d×k}, and (3) ∇_TV = H(θ̂(ψ_t), ψ)^{-1} dL_val(θ̂(ψ_t))/dθ ∈ R^{m×1}, where m = |θ| is the number of parameters of the object detection model, d is the number of pixels in x, and k is the number of pose bins. Implementing Equation 8.2 with the naive sequence (1)-(2)-(3) involves computing and multiplying large matrices of sizes m × d and d × k, and also builds a huge computation graph; this leads to prohibitive memory and compute requirements, since m is often in the many millions. If we instead follow the sequence (3)-(1)-(2), we can use the 1 × m output of (3) to perform a weighted autograd pass, which requires computing and storing only vectors rather than matrices. However, computing (3) needs the rendered images, i.e., a forward pass of (2) (more details in the supplementary material). To take advantage of the efficient sequence, we propose a twice-forward-once-backward method with two forward passes over NeRF rendering. In the first forward pass, we do not compute gradients; we only render the images to form D_train and save the random samples y, ϕ_j used for rendering. We then compute (3) with gradients turned on. In the second pass through NeRF, we keep the same samples and this time compute the gradients (1) and (2).
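To ground Tools 1 and 2, the reparameterized pose sampling of Tool 1 is compact enough to sketch directly (the first forward pass of Tool 2 would call it with gradients disabled). Here ψ is parameterized by bin logits (unnormalized log-probabilities), ρ is kept uniform, and the variable names are illustrative.

import torch

def sample_poses(logits, n_samples, k=8, tau=0.1):
    # logits: unnormalized log-probabilities over the k pose bins (psi).
    bin_range = 360.0 / k
    bin_centers = torch.arange(k) * bin_range + bin_range / 2.0

    phis = []
    for _ in range(n_samples):
        # Gumbel-softmax relaxation of categorical bin sampling (Eq. 8.4).
        g = -torch.log(-torch.log(torch.rand(k)))
        y = torch.softmax((logits + g) / tau, dim=0)

        # Soft bin center, then differentiable uniform sampling within the bin.
        center = (y * bin_centers).sum()
        eps = torch.rand(())
        phi = (1 - eps) * (center - bin_range / 2) + eps * (center + bin_range / 2)
        phis.append(phi)
    return torch.stack(phis)  # gradients flow back to `logits`

Because the soft bin center and the within-bin uniform sample are both differentiable in the logits, gradients from the rendered images flow back to ψ along exactly the chain in Equation 8.5.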
8.3.1.3 Tool 3: Patch-wise gradient computation sec:patch_wise Even though we have optimized the computation dependence on m = |θ| with the tool described above, computing (1)-(2) sequence in the above description still scales with the size of images d. This too can lead to large memory footprint for even moderate size images (e.g., even with the twice-forward-once-backward approach, the pipeline over a 32×32 image already does not fit into a 2080T i GPU). To optimize the memory further, we propose patch-wise computation, where we divide the image into S patches x = (x 1 , x2 , . . . , xS )) and compute Equation 8.3 as follows: ∂ ∂ψ ∂l(x, ˆθ(ψt)) ∂θ = X S c=1 ∂( ∂l(x c ,θˆ(ψt)) ∂θ ) ∂xc ∂xc ∂ψ . (8.6) Since NeRF renders an image pixel by pixel, it is easy to compute the gradient of patch with respect to ψ in the memory efficient patch-wise optimization. 8.3.2 Nerf-in-the-wild sec:nerfw NeRF-in-the-wild (NeRF-w) extends the vanilla NeRF model to allow image dependent appearance and illumination variations such that photometric discrepancies between images can be modeled explicitly. NeRFw takes as input an appearance embedding denoted as ℓ alongside the viewing direction V to render an image as x = NeRF(V, ℓ). For NERF-w, the optimization of pose (V) remains the same as discussed above. For efficient optimization of lighting we exploit a noteworthy property of NeRF-w: it allows smooth interpolations between color and lighting. This enables us to optimize lighting as a continuous variable, where the lighting (ℓ) can be written as an affine function of the available lighting embeddings (ℓi) as ℓ = P i ψi ∗ ℓi where P i ψi = 1. To calculate the gradient from Equation 8.3, ∂xi ∂ℓ is computed in the same way as described 147 above utilizing our tools 2 and 3, and the term dℓ dψ is straightforward and is optimized with projected gradient descent. chap-8-sec:experiment 8.4 Experiments We now evaluate the effectiveness of our proposed Neural-Sim approach in generating optimal training data on object detection task. We provide results under two variations of our Neural-Sim method. In the first case, we use Neural-Sim without using bi-level optimization steps. In this case, data from NeRF are always generated from the same initial distribution. The second case involves our complete Neural-Sim pipeline with bi-level optimization updates (Equation 8.2). In the following sections, we use terms NS and NSO for Neural-Sim without and Neural-Sim with bi-level optimization respectively. We first demonstrate that NeRF can successfully generate data for downstream tasks as a substitute for a traditional graphic pipeline (e.g., BlenderProc) (Section 8.4.1) with similar performance. Then we conduct experiments to demonstrate the efficacy of Neural-Sim in three different scenarios: controllable synthetic tasks on YCB-synthetic dataset (Section 8.4.2); controllable real-world tasks on YCB-in-the-wild dataset (Section 8.4.3); general real-world tasks on YCB-Video dataset (Section 8.4.4). We also show the interpretable properties of the Neural-Sim approach (NSO) during training data synthesis (Section 8.4.5). All three datasets are based on the objects from the YCB-video dataset [610, 240, 65]. It contains 21 objects from daily life and provides high-resolution RGBD images with ground truth annotation for object bounding boxes. The dataset consists of both digital and physical objects, which we use to create both real and synthetic datasets. 
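The illumination experiments below rely on the projected gradient descent over lighting weights described in Section 8.3.2 (ℓ = Σ_i ψ_i ℓ_i with Σ_i ψ_i = 1). One common choice is to project onto the probability simplex, which additionally enforces nonnegativity; whether Neural-Sim enforces nonnegativity is not stated, so that part is an assumption. A minimal PyTorch sketch:

import torch

def project_to_simplex(v):
    # Euclidean projection of v onto {w : w_i >= 0, sum_i w_i = 1}.
    u, _ = torch.sort(v, descending=True)
    cssv = torch.cumsum(u, dim=0) - 1.0
    idx = torch.arange(1, v.numel() + 1, dtype=v.dtype)
    rho = torch.nonzero(u > cssv / idx, as_tuple=False).max()
    theta = cssv[rho] / (rho + 1).to(v.dtype)
    return torch.clamp(v - theta, min=0.0)

# Example: project an unconstrained update back onto the simplex.
w = project_to_simplex(torch.tensor([0.7, 0.5, -0.1]))  # nonnegative, sums to 1
# In the pipeline, the lighting embedding would then be the affine combination
# ell = (w.unsqueeze(1) * lighting_embeddings).sum(dim=0), where `lighting_embeddings`
# is a hypothetical (k, dim) tensor of NeRF-w appearance embeddings.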
Implementation details: We train one NeRF-w model for each YCB object using 100 images with different camera pose and zoom factors using BlenderProc. We use RetinaNet [346] as our downstream object detector. To accelerate the optimization, we fix the backbone during training. During bi-level optimization steps, we 148 50 60 70 80 90 Id-1: Pitcher 80 85 90 95 100 Id2: Cheese box 50 60 70 80 90 Id1: Masterchef 40 50 60 70 80 90 Id2: Gelatin 60 70 80 90 Id1: Driller 60 70 80 90 100 Id2: Banana (a) Pose Distribution Gap Influence on Object Detection (c) Illumination Distribution Gap Influence on Object Detection Train/test Pose Train/test zoom Train/test illumination 0 102030405060708090 full overlap no overlap [No-Opt] [Learning-to-Simulate] [Auto-Sim] [Neural-Sim (ours)] Full overlap Partial overlap No overlap (b) Zoom Distribution Gap Influence on Object Detection Full overlap Partial overlap No overlap AP Full overlap No overlap Full overlap No overlap Full overlap No overlap Full overlap No overlap AP AP AP AP AP Figure 8.5: Neural-Sim performance on YCB-Synthetic. When there are distribution gap between train and test sets ((a) pose (b) zoom (c) illumination gap), with the gap increase, object detection faces larger accuracy drop (black line). With the help of Neural-Sim (NSO) in blue line, the performance drop are filled. Observe improvement of NSO over LTS [477] (red line) and Auto-Sim [35] (green line). chap-8-fig:5 use Gumble-softmax temperature τ = 0.1. In each optimization iteration, we render 50 images for each object class and train RetinaNet for two epochs. Baselines: We compare our proposed approach against two popular state-of-the-art approaches that learn simulator parameters. The first baseline is Learning to simulate [477] which proposed a REINFORCE-based approach to optimize simulator parameters. Also note that the meta-sim [280] is a REINFORCE-based approach. Next, we consider Auto-Sim [35] which proposed an efficient optimization method to learn simulator parameters. We implemented our own version of Learning to simulate work and we received code from the authors of Auto-Sim. 8.4.1 NeRF to generate data for downstream tasks chap-8-sec:5.1 First, it is important to show that NeRF is a suitable replacement for a traditional renderer like BlenderProc [114] when generating data for object detection. To test this, we use YCB-video dataset objects and we render images from NeRF and BlenderProc [114] using the same camera pose and zoom parameters. We use these images to conduct object detection tasks under same training and test setting. Both object detectors trained on NeRF synthesized images and BlenderProc images have nearly same accuracy. 149 8.4.2 YCB-synthetic dataset chap-8-sec:5.2 Next, we conduct a series of experiments on a YCB-synthetic dataset to show how NSO helps to solve a drop in performance due to distribution shifts between the training and test data. Dataset setting We select six objects that are easily confused with each other: masterchef and pitcher are both blue cylinders and cheezit, gelatin, mug and driller are all red colored objects. To conduct controlled experiments, we generate data with a gap in the distribution of poses between the training and test sets. For this, we divide the object pose space into k= 8 bins. For each objects oj and pose bin i combination, we use BlenderProc † to synthesize 100 images. These images of the six selected objects with pose bin-labels form YCB-synthetic data. 
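With the implementation details given above (50 rendered images per class and two epochs of RetinaNet training per iteration, backbone frozen), one outer iteration of NSO has roughly the following shape. The helper functions, optimizer choice, and learning rate are placeholders for illustration, not the released Neural-Sim code.

import torch

# Hypothetical helpers standing in for the real pipeline components:
#   render_with_nerf(logits, n_per_class) -> synthetic images + labels sampled via Tool 1
#   train_detector(dataset, epochs)       -> RetinaNet fine-tuned on the synthetic data
#   estimate_psi_gradient(...)            -> the Eq. 8.2 gradient assembled with Tools 1-3

num_outer_steps = 10                                  # assumption
psi_logits = torch.zeros(8, requires_grad=True)       # start from a uniform bin distribution
optimizer = torch.optim.Adam([psi_logits], lr=0.05)   # optimizer and lr are assumptions

for step in range(num_outer_steps):
    d_train = render_with_nerf(psi_logits, n_per_class=50)   # first forward pass, no gradients
    detector = train_detector(d_train, epochs=2)              # inner problem
    grad_psi = estimate_psi_gradient(detector, d_train, d_val, psi_logits)  # d_val: target set

    optimizer.zero_grad()
    psi_logits.grad = grad_psi
    optimizer.step()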
Train/test biasness We create controlled experiments by varying the degree of pose distribution overlap between the training and test sets. For each object (e.g. pitcher) we fix its pose distribution in the test set (e.g. images are generated with pose from bin 1) and change its pose distribution in training set in three ways. First, images are generated with pose with same distribution as test set (bin1 is dominant), uniform distribution (pose values uniformly selected from bin1 to bin 8) and totally different distribution from the test set (other bins are dominant except bin 1). We introduce such pose biasness in two of the six objects, pitcher and driller. For other four objects, test images are generated from an uniform distribution. The test set has 600 images (100 images per object). Results Quantitative results are shown in Figure 8.5. First, we show the performance of our NS based training images rendered using three initial distributions described earlier. We observe that the object detection performance drops by almost 30% and 10% for pitcher and driller objects respectively when there is object pose gap between training and test distributions. Next we show that our Neural-Sim with bi-level optimization (NSO) is able to automatically find the optimal pose distribution of the test set. NeRF then uses the optimal distribution to synthesize training †BlenderProc is a popular code-base to generate photo realistic synthetic data using traditional graphics pipeline. 150 data. The object detection model trained on the optimal data helps improve performance significantly; average precision accuracy for the pticher and driller objects have been improved by almost 30% and 10%, respectively. The blue lines in Figure 8.5 represent the performance of NSO which fill the gap caused by distribution mismatch. Note there is similar significant improvement in experiments where there is gap in camera zoom when using the proposed NSO approach. We compare our NSO with the recent work Learning-to-simulate (LTS) [477] and Auto-Sim [35] that use REINFORCE for non-differentiable simulator optimization (Figure 8.5(a)(b)). We observe that on pose optimization, the proposed NSO achieves almost 34% improvement over LTS and 11% improvement over Auto-Sim on on the pitcher object. On zoom optimization, NSO achieves almost 27% improvement over LTS and 26% improvement over Auto-Sim on Masterchef object. This highlights the gradients from differentiable NSO are more effective and can generate better data than REINFORCE based LTS and Auto-Sim. Experiments on illumination optimization. To verify the effectiveness of Neural-Sim on illumination, we substitute vanilla NeRF model with NeRF-w. We conduct similar experiments as the pose and zoom experiments in Section 8.4.2 on illumination with YCB-synthetic dataset. The results show in Figure 8.5(c). NSO has great performance on illumination optimization with 16% and 15% improvements on driller and banana objects respectively. Large scale YCB-Synthetic dataset experiments Here we highlight the results of our large-scale experiments on the YCB-synthetic dataset. Experiments demonstrate that our proposed NSO approach helps to solve a drop in performance due to distribution shifts between the train and test sets. We use the same setting as previous experiment except we conduct object detection on all 21 objects on the YCB-Synthetic dataset. We create controlled experiments by varying the degree of pose distribution overlap between the training and test sets. 
For each object, we fix its pose distribution in the test set and change its pose distribution in the training set: training images are generated from totally different distributions from the test set. The test set has 2100 images (100 images per object). The experiment results are shown in Table 8.1. We compare the proposed 151 Table 8.1: Large scale YCB-synthetic experiments table:ycb_syn_21 Objects mAP master chef can cracker box sugar box tomato soup can mustard bottle tuna fish can pudding box gelatin box potted meat can banana NS 68.4 93.5 96.6 58.3 83.9 78.4 44.3 78.0 65.2 55.3 89.4 Auto-Sim 69.3 96.0 82.5 92.3 37.4 81.3 52.0 80.6 79.4 74.4 83.4 NSO 82.1 98.5 98.4 98.2 81.8 90.5 64.6 84.1 57.6 92.2 91.6 Objects pitcher base bleach cleanser bowl mug power drill wood block scissor large marker large clamp extra large clamp foam brick NS 29.0 49.9 78.7 46.8 89.3 97.8 67.9 42.9 47.8 72.7 69.6 Auto-Sim 7.7 81.5 78.3 60.0 83.2 95.6 64.1 41.5 46.6 79.0 57.9 NSO 83.5 93.4 98.5 87.9 93.6 98.7 55.3 56.9 50.8 78.6 68.2 NS and NSO approaches with the baseline Auto-Sim [35] method. Note that our proposed NSO achieves improvements of almost 14 % and 13 % points over NS and Auto-Sim baselines respectively. 8.4.3 YCB-in-the-wild dataset chap-8-sec:5.3 To evaluate the performance of the proposed NS and NSO approaches on a real world dataset, we have created a real world YCB-in-the-wild dataset. The dataset has 6 YCB objects in it, which are same as in the YCB-synthetic dataset: masterchef, cheezit, gelatin, pitcher, mug and driller. All images are captured using a smartphone camera in a common indoor environments: living room, kitchen, bedroom and bathroom, under natural pose and illumination. We manually labelled each image with object bounding boxes. Further, to explore the effect of distribution shifts on the object detection task, we manually labelled the object pose in each image using the the same eight bins discussed earlier. The dataset consists of total around 1300 test images with each object having over 200 images. Some of the images from the dataset are shown in the Figure 8.1. We will release the original images, ground truth object detection labels and pose bin-labels. To explore the performance of the NS and NSO under the training and test distribution gap on the YCB-in-the-wild, we use the same experiment setup as in Section 8.4.2. The test images are selected from YCB-in-the-wild and training images are synthesized by NeRF. The training data is generated under two categorical distributions: uniform distribution and a random bin as dominant bin. 
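The two starting distributions just mentioned can be written down directly; the exact mass placed on the dominant bin is not specified in the text, so the 0.8 value below is an illustrative assumption.

import torch

k = 8  # pose bins, matching the bin annotation described above

# Initialization 1: uniform over all bins.
psi_uniform = torch.full((k,), 1.0 / k)

# Initialization 2: one randomly chosen dominant bin.
psi_dominant = torch.full((k,), 0.2 / (k - 1))
psi_dominant[torch.randint(k, (1,)).item()] = 0.8

# Converted to logits, either can seed the Gumbel-softmax sampler sketched earlier.
logits_uniform = psi_uniform.log()
logits_dominant = psi_dominant.log()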
152 Bin-1 Bin-2 Bin-3 Bin-4 Bin-1 Bin-2 Bin-3 Bin-1,2 Bin-1,3 Bin-1,4 Bin-1,2,3 Bin-1,3,4 AP AP AP NS [Ours w/o Opt] Fix as Uniform bins NSO [Ours with Opt] Initialize as Uniform bins NS [Ours w/o Opt ] Fix as Random bin NSO[Ours with Opt] Initialize as Random bin 0 10 20 30 40 50 60 70 80 90 Pitcher 0 5 10 15 20 25 30 35 40 45 Pitcher-multi model 0 10 20 30 40 50 60 70 80 Cheezit box [Auto-Sim] Initialize as Uniform bins [Auto-Sim] Initialize as Random bin 0 10 20 30 40 50 60 70 80 90 Bin-1 Bin-2 Bin-3 Bin-4 Pitcher 0 10 20 30 40 50 60 70 80 Bin-1 Bin-2 Bin-3 Cheezit box 0 5 10 15 20 25 30 35 40 45 mix Bin-1,2 mix Bin-1,3 mix Bin-1,4 mix Bin-1,2,3 mix Bin-1,3,4 Pitcher-multi model NS [Ours w/o Opt] Fix as Uniform bins NSO [Ours with Opt] Initialize as Uniform bins NS [Ours w/o Opt ] Fix as Random bin NSO[Ours with Opt] Initialize as Random bin [Auto-Sim] Initialize as Uniform bins [Auto-Sim] Initialize as Random bin [Learning-to-Sim] Initialize as Uniform bins [Learning-to-sim] Initialize as Random bin Figure 8.6: Performance of Neural-Sim on the YCB-in-the-wild dataset. We observe that the Neural-Sim optimization (NSO) can consistently achieve 20% to 60% improvement in accuracy over our method without optimization (NS) case and large improvements over LTS (up to 58%) and Auto-Sim (up to 60%). Here each bin on x−axis represents bin from which test data is generated. We observe large improvement in both single-modal and multi-modal test data. fig:wild-result Quantitative results are provided in the Figure 8.6. First we highlight the performance achieved by our NS approach to generate data according two different initial pose distributions. We observe that NS generated data helps achieve up to 30% in object detection accuracy on different objects starting from two different initial distributions. Moreover, our NSO approach achieves remarkable improvement in every experimental setup. For example, on pitcher, starting from uniform and random distributions, our optimization improve performance by almost 60%. Compared with other optimization methods LTS and Auto-Sim, we observe large improvement upto 58% improvement over LTS and 60% improvement over Auto-Sim on the pitcher object. We observe a similar behavior on the cheeze box and also on multi-modal experiment setting. This highlights three points. First, NeRF can be used to generate good data to solve object detection task in the wild; far more importantly, our Neural-Sim with bi-level optimization (NSO) approach can automatically find the optimal data that can help achieve remarkable improvements in accuracy on images captured in the wild. Third, the gradients from NSO are more effective and can generate better data than REINFORCE based LTS and Auto-Sim. 153 8.4.4 YCB Video dataset chap-8-sec:5.4 To show the performance of the proposed NS and NSO approaches on a general real world dataset, we also conduct experiments on the YCB-Video dataset [610, 240]. Each image in this dataset consists of multiple YCB objects (usually 3 to 6 different objects) in a real world scene. The YCB-Video training dataset consists of 80 videos from different setups in the real world. Since there are many duplicate frames in each video, we select every 50th frame to form the training set, which results in just over 2200 training images (YCBVtrain). YCB-Video testset contains 900 images. YCB-Video train and test sets have all 21 YCB objects. 
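The construction of YCBVtrain described above (keeping every 50th frame of each training video) amounts to a simple subsampling step; the directory layout and file pattern below are assumptions about how the extracted frames are stored, not the official dataset structure.

from pathlib import Path

def subsample_frames(video_dirs, stride=50):
    # Keep every `stride`-th frame from each YCB-Video training sequence.
    kept = []
    for video_dir in video_dirs:
        frames = sorted(Path(video_dir).glob("*-color.png"))  # assumed naming
        kept.extend(frames[::stride])
    return kept

# e.g. ycbv_train = subsample_frames(sorted(p for p in Path("YCB_Video/data").iterdir() if p.is_dir()))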
In order to show the benefit of synthetic data, we create two different training scenarios (1) Few-shot setting, where we randomly select 10 and 25 images from (YCBVtrain) to form different few shot training sets. (2) Limited dataset setting, where we randomly select 1%, 5%, 10% images from (YCBVtrain) to form limited training sets. Using a similar setting as in Section 8.4.3, we demonstrate performance of the proposed NS and NSO approaches starting from uniform distributions and compare with four baselines. First baseline-1 involves training RetinaNet using few-shot or limited training images from YCBVtrain data, and baseline-2 involves training RetinaNet using the images that were used to train NeRF. Baseline-3 is Learning-to-sim and baseline4 is Auto-Sim. Further, we also combine the real-world few-shot or limited training images from YCBVtrain along with NeRF synthesized images during our Neural-Sim optimization steps for training object detection model. This Combined setting reduces the domain gap between synthetic and real data. All the models have been evaluated on YCB-Video testset. For the normal Few-shot setting (rows 2, 3, 4 in Table 8.2(a)), NS starting from the uniform distribution achieves almost 3.45 and 4.11% improvement over the baseline in 10 and 25 shots settings, respectively. Further, when we optimize the parameters using NSO, we observe improvements of 4.45, 4.41% over the baseline and 1.0, 0.3% improvements over the NS case in 10, 25 shot settings respectively. We also observe almost 1.8% improvement in the zero-shot case. 154 In addition, for the Combined Few-shot setting (rows 5,6,7,8 in Table 8.2(a)), we observe similar large improvements in accuracy. For example, an improvement of 22.51% over the baseline and 2% improvements over the without optimization cases respectively have been observed in the 25 shot settings. Compared with Learning-to-sim and Auto-Sim, NSO shows consistent improvement on both 10 shot and 25 shot. We observe similar large performance improvements in the limited data settings (Table 8.2(b)). For example, in the Combined limited data settings (rows 6, 7, 8, 9 in Table 8.2(b)), we observe that the the proposed NS achieves an improvement of almost 30.93, 34.72, 35.4% over the baseline in the 1, 5, 10% data regime, respectively. Further, after using NSO we observe an improvement of almost 31.63, 36.02, 36.4% over the baseline. Finally, we also find 0.7, 1.3, 1.0% improvements over NS approach, 0.5, 0.8, 0.7% improvements over Learning-to-sim and 0.3, 1.2, 0.6% improvements over Auto-Sim in 1, 5, 10% settings respectively. 8.4.5 Interpretability of Neural-Sim chap-8-sec:5.5 We have observed significant improvement in accuracy even when there exists large distribution gap between training and test sets using the proposed Neural-Sim approach. This raises a question: does the Neural-Sim optimization provide interpretable results? In order to demonstrate this behavior, we conduct experiment on YCB-in-the-wild dataset illustrated in Figure 8.7. As shown, the test set images are sampled from the categorical distribution where bin one is Few-shot setting 0-shot 10-shot 25-shot Only YCBV-train N/A 0.45 0.49 train(pre)+ours (w/o opt) 2.3 3.9 4.6 train(pre)+ours (with opt) 4.5 4.9 4.9 Learning-to-sim (com) N/A 12.4 22.5 Auto-Sim (com) N/A 12.9 22.2 train(com)+ours (w/o opt) N/A 12.2 21.0 train(com)+ours (with opt) N/A 13.1 23.0 (a) Zero and few-shot setting (YCB-Video). 
Percent of YCBVtrain 0.01 0.05 0.1 Only YCBV-train 5.77 8.88 12.5 Only images to train NeRF 3.9 3.9 3.9 train(pre)+ours (w/o opt) 7.9 11.8 14.4 train(pre)+ours (with opt) 8.9 12.4 14.5 Learning-to-sim (com) 36.9 44.1 48.2 Auto-Sim (com) 37.1 43.7 48.3 train(com)+ours (w/o opt) 36.7 43.6 47.9 train(com)+ours (with opt) 37.4 44.9 48.9 (b) limited data setting (YCB-Video) Table 8.2: YCB-Video performance. Observe large improvement of the proposed Neural-Sim approaches before and after optimization over the baselines. chap-8-table:few-shot 155 0 10 20 30 40 50 60 70 80 90 bin1 bin2 bin3 bin4 bin5 bin6 bin7 bin8 Before optimization distribution Test distribution train-1-start distribution train-2-start distribution 0 10 20 30 40 50 60 70 80 90 bin1 bin2 bin3 bin4 bin5 bin6 bin7 bin8 End optimization distribution Test distribution train-1-end distribution train-2-end distribution Test distribution Train end distributions match test distribution Train start distributions Figure 8.7: Visualization provides evidence that proposed Neural-Sim (NSO) approach generates interpretable outputs. In the shown example, test images are sampled from distribution bin 1 as dominant bin. For NeuralSim optimization (NSO), initial training pose distributions are uniform and bin 4 as dominant bin. Observe the bin distribution at the optimization - the final bin distribution at the end of Neural-Sim training matches with the test bin distribution. chap-8-fig:10 dominant. As described in Section 8.4.3, we consider two starting pose bin distributions for our Neural-Sim approach: a uniform distribution and a randomly selected bin as a dominant bin (e.g., most images come from bin four). After optimization, we visualize the learned object pose distribution (Figure 8.7 (b)). We find that no matter what the starting distributions the Neural-Sim approach used, the learned optimal ψ ∗ is always aligned with the test distribution. This explains the reason why Neural-Sim can improve the downstream object detection performance: it is because Neural-Sim can automatically generate data that will closely matching distribution as the test set. We can find similar interpretable results in camera zoom experiments. chap-8-sec:conclusion 8.5 Discussion and Future Work It has been said that “Data is food for AI”[399]. While computer vision has made wondrous progress in neural network models in the last decade, the data side has seen much less advancement. There has been an explosion in the number and scale of datasets, but the process has evolved little, still requiring a painstaking amount of labor. 156 Synthetic data is one of the most promising directions for transforming the data component of AI. While it has been used to show some impressive results, its wide-spread use has been limited, as creating good synthetic data still requires a large investment and specialized expertise. We believe we have taken a big step towards making synthetic data easier to use for a broader population. By optimizing for how to synthesize data for training a neural network, we have shown big benefits over current synthetic data approaches. We have shown through extensive experiment that the data found by our system is better for training models. We have removed the need for any 3D modeling and for an expert to hand-tune the rendering parameters. This brings the promise of synthetic data closer for those that don’t have the resources to use the current approaches. 
We have handled camera pose, zoom and illumination; and our approach can be extended to other parameters (such as materials, etc.), by incorporating new advances in neural rendering. For future work, we hope to improve the ease of use of our approach, such as performing our optimization using lower quality, faster rendering using a smaller network for the neural rendering component, and then using the learned parameters to generate high quality data to train the final model. We hope that our work in this space will inspire future research. 157 Chapter 9 BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation chapter-9 The systematic evaluation and understanding of computer vision models under varying conditions require large amounts of data with comprehensive and customized labels, which real-world vision datasets rarely satisfy. While current synthetic data generators offer a promising alternative particularly for embodied AI tasks, they often fall short for computer vision tasks due to low asset and rendering quality, limited diversity, and unrealistic physical properties. We introduce the BEHAVIOR Vision Suite (BVS), a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models, based on the newly developed embodied AI environment, BEHAVIOR-1K. BVS supports a large number of adjustable parameters at the scene level (e.g., lighting, object placement), the object level (e.g., joint configuration, attributes such as “filled” and “folded”), and the camera level (e.g., field of view, focal length). Researchers can arbitrarily vary these parameters during data generation to perform controlled experiments. We showcase three example application scenarios: systematically evaluating the robustness of models across different continuous axes of domain shift, evaluating scene understanding models on the same set of images, and training and evaluating simulation-to-real transfer for a novel vision task: unary and binary state prediction. All code and data will be made public. 158 BEHAVIOR Vision Suite Applications 8000+ 3D Objects Customizable Data Generator Camera pose Lighting Object properties Object states Spatial relationships Object States / Relations Prediction Controlled Evaluation of Vision Algorithms Scene Understanding 1000 Scene Instances Controlled generation Figure 9.1: Overview of BEHAVIOR Vision Suite (BVS), our proposed toolkit for computer vision research. BVS builds upon extended object assets and scene instances from BEHAVIOR-1K [323], and provides a customizable data generator that allows users to generate photorealistic, physically plausible labeled data in a controlled manner. We demonstrate BVS with three representative applications. fig:capgen 9.1 Introduction chap-9-sec:intro Large-scale datasets and benchmarks have fueled computer vision research in the past decade [112, 348, 138, 300, 179, 201, 511, 610, 377, 62, 209, 202]. Driven by these datasets and benchmarks, thousands of models and algorithms tackling different perception challenges are being proposed every year, on the topics of object detection [678], segmentation [296], action recognition [581], video understanding [345] and beyond. Despite their success, these real-world datasets have a few inherent limitations. First, the ground-truth object/pixel-level labels are either prohibitively expensive to acquire (e.g. segmentation masks) [349] or inaccurate/noisy (e.g. depth sensing) [398]. 
As a result, each real dataset often only offers a limited set of labels, thus hindering the development and evaluation of computer vision models that perform a wide range of perception tasks on the same image inputs. Even if annotation is free and accurate, real-world datasets are bounded by the availability of source images. For example, images of rare events such as traffic accidents or low-light conditions might be difficult to acquire from the Internet or real-world sensors. Lastly, once 159 collected, these real-world datasets have a fixed data distribution and cannot be easily customized. This last limitation makes it almost impossible for researchers to conduct customized experiments and forces them to adopt the experimental setup decided by dataset creators, often leading models to overfit to the datasets and the eventual obsoletion of the benchmarks. To circumvent this limitation, researchers and practitioners have come up with a variety of ways to generate synthetic datasets that complement the real ones. In the realm of indoor scene understanding, 3D reconstruction datasets [608, 74, 438] provide a promising avenue to generate source images from arbitrary viewpoints and free (geometric) annotations. Due to the imperfect nature of 3D reconstruction techniques, however, the rendered images are not very realistic. With each entire scene as a static mesh, these datasets also offer very limited customizability other than camera trajectories. Recent synthetic indoor datasets (often designed by 3D artists) [461, 333, 340, 153] come to the rescue because they not only provide free annotations (both geometric and semantic) but also allow for object layout reconfiguration (since the objects are usually independent CAD models). However, these datasets do not guarantee physical plausibility (object penetration and levitation happen often) and do not provide any customization capability beyond changing object poses. 3D simulators [543, 298, 109, 159, 500, 322], on the other hand, guarantee physical plausibility because of their underlying physics engines. They also allow users to customize the joint configuration of articulated objects and even more advanced object states such as “cooked” or “sliced” [298, 322]. Yet these 3D simulators generally cater to Embodied AI and robotics researchers, and as a result, they lack photorealism compared to the synthetic datasets mentioned before (usually due to speed constraints) and they don’t provide off-the-shelf tooling to generate customized image/video datasets for computer vision researchers. To overcome the aforementioned challenges, we propose BEHAVIOR Vision Suite (BVS), a customizable data generation tool that allows for systematic evaluation and understanding of computer vision models (see Figure 9.1 for an overview). First, we expand the 3D asset library in BEHAVIOR-1K li2023behavior1k, focusing on enhancing both object diversity and scene variety as well as adding features to increase value 160 of the assets for vision tasks. Then, we introduce Customizable Dataset Generator, which leverages the simulator from the BEHAVIOR-1K benchmark [529, 323] to generate custom vision datasets. We build a versatile and customizable toolbox to generate high-quality synthetic data for systematic model evaluation and understanding. In a nutshell, BEHAVIOR Vision Suite has the following unique combination of desirable features: 1. offers exhaustive image/object/pixel-level labels for free (scene graph, point cloud, depth, segmentation, etc) 2. 
covers a wide variety of indoor scenes and objects (8K+ objects, 1K scene instances, fluid, soft bodies, etc) 3. guarantees high physical plausibility and photorealism 4. provides maximum customization capability in terms of object models, poses, joint configurations, semantic states, lighting, texture, material, camera setting, etc. 5. includes easy-to-use tooling to generate customized data for new use cases. To demonstrate the usefulness of BVS, we showcase three example applications: 1) parametrically evaluating model robustness across different conditions such as lighting and occlusion, 2) evaluating different types of representative computer vision models on the same set of images, and 3) training and evaluating sim2real transfer for object states and relations prediction. By showing these three examples, we hope that BVS can unlock more possibilities for the computer vision community. 161 Camera View Obj Pose Obj State CV Toolkit Real RGB-D Datasets ✗ ✗ ✗ N/A Real 3D Reconstruction Datasets ✓ ✗ ✗ N/A Medium Synthetic Datasets ✓ ✓ ✗ ∼ High 3D Simulators ✓ ✓ ✓ ✗ Low BEHAVIOR Vision Suite (Ours) ✓ ✓ ✓ ✓ High Dataset Category Customizability Visual Quality Table 9.1: Comparison of real and different types of synthetic datasets to BEHAVIOR Vision Suite. Camera View indicates whether images can be rendered from arbitrary viewing angles. Obj Pose indicates whether object layout can be modified. Obj State indicates whether object physical states (e.g. open/close, folded) and semantic states (cooked, soaked, etc) can be modified. CV toolkit indicates whether utility functions are provided to sample camera poses that satisfy certain constraints (those that capture half-open kitchen cabinets filled with grocery items, for instance). Visual Quality indicates how photorealistic the images are. tab:comparison_main 9.2 Related works chap-9-sec:relatedwork In this section, we will compare BEHAVIOR Vision Suite against other real RGB-D datasets, 3D reconstruction datasets, synthetic datasets and 3D simulators in terms of customizability and visual quality (see Table 9.1). 9.2.1 Real Indoor Scene RGB-D Datasets RGB-D image datasets of real indoor scenes [398, 523, 100, 34, 628] have driven advancement in 3D perception and holistic scene understanding. Recent works include ARKitScenes [34] and ScanNet++ [628] that provide dense semantic and 3D annotations. While these real datasets capture image distribution from the real world, they are expensive to annotate and inherently static: users are unable to generate images from new camera views, acquire new types of annotations, or modify the scenes in any way. Our work is thus complementary, offering users a fully customizable generator of photorealistic synthetic data. 9.2.2 3D Reconstruction Datasets 3D reconstruction datasets like Gibson and Matterport [74, 608] allow rendering of novel views. HM3DSem [438, 619] scales up to 1,000 scenes, improves the reconstruction quality, and provides more 162 accurate dense semantic annotations. While these datasets have tremendously benefited the embodied navigation community, their application to computer vision is limited. Each scene is a single 3D mesh and hence prohibits further customization such as object layout. Furthermore, the visual quality of novel view rendering highly depends on reconstruction fidelity—artifacts like glasses still exist. Semantic label acquisition is also very expensive. 
Our work, in contrast, is capable of generating images with customized object layouts, consistent visual quality, together with free, comprehensive labels. 9.2.3 Synthetic Datasets Synthetic datasets offer an alternative approach that saves the cost of semantic labeling. Hypersim [461], 3D-FUTURE [153] and InteriorNet [333] render virtual images from artist-created scenes where objects are independent models. OpenRooms [340] generates scene layouts from real scans and provides configurable rendering options. Objaverse [108, 106] also provides a number of interior scenes along with large-scale 3D object models. However, despite being photorealistic, these datasets do not guarantee physical plausibility: objects often penetrate with each other or slightly levitate in the air. Also, most, if not all, objects are non-articulated and don’t support any semantic state changes. Our work, on the other hand, has more customization capability (e.g. joint configuration, semantic states such as “cooked” or “filled”) and physical plausibility. 9.2.4 3D Simulators A large number of 3D simulators with physical realism have been developed recently. iGibson [322, 500] and Habitat 2.0 [543] introduce reconfigurable indoor scenes with articulated assets, while the former also highlights their support for extended object states such as wetness level. ThreeDWorld [159] emphasizes physical interaction modeling, especially with non-rigid objects. ProcTHOR [109] automates large-scale generation of semantically plausible virtual environments. Since these 3D simulators cater towards the 163 Raw scenes Augmented scene instances Folded: False Open: 100% Filled: Milk, 30% (a) Examples of 3D objects and the semantic properties they support (b) Distribution of scenes, room types and objects (c) Examples of raw scenes and augmented scene instances Figure 9.2: Overview of Extended BEHAVIOR-1K Assets: Covering a wide range of object categories and scene types, our 3D assets have high visual and physical fidelity, and rich annotations of semantic properties, allowing us to generate 1,000+ realistic scene configurations. fig:asset Embodied AI and robotics community, their visual quality is limited. In contrast, we leverage a new 3D simulator called OmniGibson that has been shown to be significantly more photorealistic than all the ones mentioned before, according to a human survey conducted in [323], making our work a better candidate for computer vision research. Furthermore, our work offers many utility functions that allow users to easily generate near-infinite images that suit their specific needs, which most existing 3D simulators lack. 9.3 BEHAVIOR Vision Suite sec:BVisionSuite BEHAVIOR Vision Suite is composed of two main components: the Extended BEHAVIOR-1K Assets and the Customizable Dataset Generator. The assets serve as the foundation, while the generator uses the assets to generate vision datasets that suit downstream tasks of interest. 164 9.3.1 Extended BEHAVIOR-1K Assets sec:BAsset The Extended BEHAVIOR-1K Assets are a collection of 8,841 object models and 1,000 scene instances that are variations of 51 artist-designed raw scenes. Out of the 8,841 objects, 2,156 are structural elements such as walls, floors, ceilings, and 6,685 are non-structural objects belonging to 1,937 categories. These categories are of great variety such as food, tools, electronics, clothing, office supplies, and others. The distribution of the objects into these semantic categories can be seen in Figure 9.2. 
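To make the asset organization concrete, the snippet below sketches one plausible way such an object record could look, with its category and semantic-property annotations; the field names and values are illustrative assumptions, not the actual BEHAVIOR-1K schema.

# Hypothetical asset record illustrating the kind of metadata described above.
# Field names and values are assumptions for illustration, not the real schema.
asset_record = {
    "object_id": "fridge_017",           # hypothetical identifier
    "category": "refrigerator",          # one of ~1,937 object categories
    "is_structural": False,              # walls/floors/ceilings are structural
    "scene_types": ["house", "restaurant", "grocery store"],
    "semantic_properties": {             # states the simulator can set and query
        "openable": True,                # e.g. Open: 100%
        "fillable": True,                # e.g. Filled: Milk, 30%
        "foldable": False,               # e.g. Folded: False
    },
}

def supports_state(record: dict, state: str) -> bool:
    """Check whether an asset supports a given semantic state."""
    return record["semantic_properties"].get(state, False)

assert supports_state(asset_record, "fillable")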
The 51 raw scenes consist of houses (23), offices (5), restaurants (6), grocery stores (4), hotels (3), schools (5), and generic halls (4) as well as a simulated twin of a mock apartment at our research lab. This collection of assets is the result of a year-long effort to extend the BEHAVIOR-1K [323] assets to increase their usability and value for computer vision applications. In terms of quantity, we increased the object count from 5,215 to 8,841 through 1) acquisition of more everyday objects, 2) segmentation of building structure into individual objects (e.g. walls are segmented into linear components to make 3D bounding box labels more useful), 3) procedural generation of sliced fruits and vegetables. We also procedurally generate 1,000 diverse scene instances from the original 51 raw scenes by varying the object models for furniture and inserting additional everyday objects into the scenes. To improve physical realism, we significantly improved the collision mesh quality by first applying V-HACD [373] and CoACD [596] with different parameters and manually selecting the best option, balancing physical accuracy, affordance preservation and simulation efficiency. For over 2,000 objects, this pipeline failed to generate satisfying candidates, so we manually designed their collision meshes. In terms of lighting, we annotate realistic light sources on objects like lamps and ceilings so that the scene is lit up the same way as it will be in the real world. And in terms of semantic property annotation, we further annotate appropriate fillable volumes for containers (e.g. cups, pots) and fluid source/sink locations (e.g. faucets, drains, sprayers) so that we can spawn fluids in the scene realistically. Scene objects were annotated as non-randomizable when necessary, e.g. when they physically support other objects. Similarly, clutter 165 objects in the scenes were annotated as such, allowing them to be removed and replaced with alternative clutter. Altogether, we designed the assets to form a strong basis for custom data generation (discussed in the next section), with a functional organization that allows accurate object randomization, and the annotations to provide a large number of modifiable parameters at both the object and scene levels. 9.3.2 Customizable Dataset Generator sec:BEHAVIOR-Dataset-Generator The Customizable Dataset Generator is the software component of the BEHAVIOR Vision Suite designed to generate synthetic datasets with specific characteristics. Built on OmniGibson [323], it leverages NVIDIA Omniverse’s photorealistic, real-time ray-tracing renderer and OmniGibson’s procedural sampling functions for object states to generate custom images and videos that satisfy arbitrary requirements. The produced datasets include rich, comprehensive annotations (segmentation masks, 2D/3D bounding boxes, depth, surface normals, flows, point clouds) for free. More importantly, users take full control of the dataset generation process by configuring specific scenes, objects, states, camera angles, and lighting conditions, while physical plausibility is guaranteed by the underlying physics engine. 9.3.2.1 Capabilities At the core of the generator are its generative capabilities: • Scene Object Randomization: The generator can swap the objects in a particular scene with other objects in the same category. In the assets, the objects are organized into categories that consist of objects that share similar visual and affordance characteristics. 
By randomizing the objects, we can drastically change the scene appearance while keeping the semantic realism of object layout. • Physically Realistic Pose Generation: The generator can procedurally change the physical states of the objects to satisfy certain predicates. This includes 1) placing objects with respect to other objects 166 in the scene in a certain way (e.g. inside, on top of, under), 2) opening articulated objects, 3) filling containers with fluids, and 4) folding/unfolding pieces of cloth. The generator can generate various valid configurations for the same predicates and ensure physical plausibility. • Predicate-Based Rich Labelling: Beyond providing the usual set of labels (semantic & instance segmentation, bounding boxes, surface normals, depth, etc.), the generator can also label unary predicates for an object (e.g. whether an articulated object is open, or an appliance is toggled on), binary predicates between two objects (e.g. whether or not an object is touching, on top of, next to, etc. another object), binary predicates between an object and a substance (e.g. is an object filled/covered/soaked with a substance), as well as continuous labels (joint openness fraction for articulated objects, filledness fraction for containers, current temperature, etc). • Camera Trajectory Generation: To go from 3D scenes to 2D images, it is necessary to render from a camera, and the placement of such a camera in a 3D scene is challenging: the camera must point at an interesting subject and not be too occluded by any object. The generator uses occupancy grids and hand-crafted heuristics to generate not only static camera poses that satisfy these constraints, for use in image models, but also physically plausible camera trajectories for video/scene understanding models. • Configurable Rendering: The generator also provides an easy API to manipulate rendering parameters, such as lighting and camera intrinsics like aperture and FOV. 9.3.2.2 Dataset Generation Process To generate a BVS dataset, we repeat the following steps: 1. Scene Sampling: We select one of the 51 raw scenes from the user-configured scene category (say, an office). 2. Object Randomization: We randomize the scene objects with similar objects in the same category. 167 3. Object Insertion: We decide what additional objects need to be added into the scene, again depending on the user configuration. We place the objects into the scene using the pose generation capabilities, based on requirements specified by the user. This might include cluttering certain areas (e.g. filling fridges with perishables) or individually manipulating certain objects’ states (making a cabinet open or a table covered with water) for downstream predicate prediction. 4. Updating Camera Pose and Rendering Environment: We then generate a camera pose (or a sequence of poses as a camera trajectory) as well as randomizing the scene’s lighting parameters and the camera’s intrinsics based on the user’s specification. 5. Rendering with Ground Truth Labels: We then render an image (or a sequence of images) and record it alongside all relevant labels requested by the user, including additional modalities (depth/segmentation/etc.), bounding boxes, and predicate and object state values. 
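The five generation steps above map naturally onto a small driver loop. The following self-contained sketch illustrates that control flow with toy data structures; every name here (the scene lists, category swaps, and the functions themselves) is a hypothetical placeholder rather than the actual BVS/OmniGibson API, which operates on simulated scenes instead of plain dictionaries.

import random

# Illustrative, self-contained sketch of the five-step generation loop above.
RAW_SCENES = {"office": ["office_01", "office_02"], "house": ["house_01"]}
CATEGORY_SWAPS = {"chair": ["chair_a", "chair_b"], "monitor": ["monitor_a"]}

def generate_dataset(scene_category, num_samples, seed=0):
    rng = random.Random(seed)
    dataset = []
    for idx in range(num_samples):
        # 1. Scene sampling from the user-configured scene category.
        scene = rng.choice(RAW_SCENES[scene_category])
        # 2. Object randomization: swap objects within their own category.
        furniture = {cat: rng.choice(models) for cat, models in CATEGORY_SWAPS.items()}
        # 3. Object insertion with requested states (placeholder predicates).
        inserted = {"mug_01": {"on_top_of": "desk", "filled": rng.random() > 0.5}}
        # 4. Camera pose and lighting randomization.
        camera = {"yaw": rng.uniform(0, 360), "pitch": rng.uniform(-30, 30)}
        lighting = {"intensity": rng.uniform(0.1, 1.0)}
        # 5. "Render": here we only record the configuration; the real system
        #    would return RGB plus depth/segmentation/bbox/state labels.
        dataset.append({"id": idx, "scene": scene, "furniture": furniture,
                        "objects": inserted, "camera": camera, "lighting": lighting})
    return dataset

print(len(generate_dataset("office", num_samples=3)))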
9.4 Applications and Experiments sec:Use CDG We present three applications and their corresponding experiments to demonstrate the utility of BVS: first, systematically evaluating model robustness against various continuous domain shifts like lighting condition (Section 9.4.1); second, assessing various scene understanding models using a consistent set of images with comprehensive annotations (Section 9.4.2); and third, training and testing the efficacy of simulation-to-real transfer for a new vision task, specifically focusing on object states and relations prediction (Section 9.4.3). 9.4.1 Parametric Model Evaluation sec:App-PME Parametric model evaluation is essential for developing and understanding perception models, as it systematically assesses model performance against various domain shifts. 168 Figure 9.3: Parametric evaluation of object detection models on five example video clips. Selected frames from the clips are shown on the left, with the target object highlighted in magenta. Average Precision (AP) for our baseline models in Section 9.4.2 are plotted on the right. Since BVS allows for full customization of scene layout and camera viewpoints, we can systematically evaluate model robustness to changes in object articulation, lighting conditions, visibility, zoom (object proximity), and pitch (object pose). As we can see, current SOTA models are far from robust to these axes of variation, and we encourage researchers that develop new vision models to use BVS for debugging and parametric evaluation. fig:parametric_dataset_and_plot 169 Axis #Scenes #Video clips Articulation 17 237 Lighting 16 441 Visibility 14 211 Zoom 9 215 Pitch 16 268 Table 9.2: We generate up to 200-500 short video clips with diverse scene configurations for parametric evaluation (Section 9.4.1). Each video clip varies along one continuous axis with respect to a single target object. On average, each video has 300 frames. tab:param_eval_stat Figure 9.4: Mean performance of open-vocab object detection and segmentation models across five axes. The larger the colored envelope is for a model, the more robust it is. With the help of BVS, new vision models can be systematically tested for their robustness along these five dimensions and beyond: our users can easily add new axes of domain shift with only a few lines of code. fig:parametric_eval_radar Task design and dataset generation. We concentrate on five critical parameters that significantly impact model performance but are challenging to rigorously control in real-world datasets: object articulation, lighting, object visibility, camera zoom and camera pitch. For each parameter, we vary along a continuous axis and evaluate our baseline models along the way. For instance, object visibility varies from the target object being fully occluded to fully visible. We generate 200 to 500 videos for each axis, featuring diverse target objects from our over 8,000 3D assets. Each video includes a target object with changes focused on a single parameter under examination. Figure 9.3 shows examples of the target objects with variations in each parameter. (Table 9.2). We control the 170 Traversal Videos Annotations Figure 9.5: Holistic Scene Understanding Dataset. We generate 10,000 videos across 1,000 scene instance, each scene instance with 10 different camera trajectories. For each image, BVS generates a wide variety of labels (scene graphs, segmentation masks, depth, etc) shown on the right. On average, each video is 1 minute long with 3,000+ frames. 
Open-vocab Detection | AP ↑ | APsmall ↑ | APmedium ↑ | APlarge ↑
GLIP [330] | 41.4 | 7.0 | 27.5 | 61.8
RAM [652] | 41.3 | 6.4 | 27.8 | 63.9
Grounding DINO [354] | 44.7 | 11.9 | 31.2 | 66.3

Open-vocab Segmentation | AP ↑ | APsmall ↑ | APmedium ↑ | APlarge ↑
ODISE [616] | 57.1 | 41.0 | 53.2 | 65.0
OpenSeeD [642] | 57.3 | 42.0 | 54.1 | 64.8
Grounding SAM [257] | 59.2 | 42.9 | 54.4 | 65.1

Depth Estimation | RMS ↓ | AbsRel ↓ | Log10 ↓ | δ1 ↑ | δ2 ↑ | δ3 ↑
DPT [441] | 0.66 | 0.14 | 0.05 | 0.09 | 0.15 | 0.20
NVDS [591] | 0.58 | 0.13 | 0.04 | 0.10 | 0.15 | 0.21
iDisc [424] | 0.49 | 0.13 | 0.04 | 0.12 | 0.19 | 0.22

Point Cloud Reconstruction | Completion Ratio ↑ | Completion ↑ | Accuracy ↓
GradSLAM [266] | 50.0 | 14.8 | 29.8
NICE-SLAM [677] | 66.3 | 12.0 | 23.5

Table 9.3: A comprehensive evaluation of SOTA models on four vision tasks. Our synthetic dataset can be a faithful proxy for real datasets, as the relative performance between different models closely correlates with that on the real datasets.

remaining aspects of the environment and systematically synthesize images while varying the main parameter of interest alone. Baselines and metrics. We conduct experiments on two representative object-centric vision tasks: open-vocabulary detection and open-vocabulary segmentation. We believe models developed for these tasks might be sensitive to the object-centric domain shifts that we inject. For baselines, we consider the current SOTA models on real datasets: GLIP [330], RAM [652] and Grounding DINO [354] for detection, and ODISE [616], OpenSeeD [642] and Grounding SAM [257] for segmentation. Results and analysis. In Figure 9.3 and Figure 9.4 we show example images when varying each parameter, as well as the respective detection Average Precision (AP) performance. To measure the model's ability to recognize the target object (highlighted in magenta), we compute AP solely for the target object as the single ground truth. The following is a more detailed analysis of the results. • Articulation varies the joint angles of the articulated target object, ranging from fully closed to fully open. Examples include the opening and closing of drawers, refrigerators, and doors, as well as the folding and unfolding of laptops. Interestingly, we observe a negative correlation between model performance and the degree of articulation. This trend might be attributed to the fact that in existing benchmarks, articulated objects are predominantly depicted in a closed state (e.g., washing machines and microwaves). Consequently, the models are less exposed to scenarios with open articulated objects, leading to decreased performance. • Lighting varies the global illumination of the environment, ranging from dark to bright. We observe an increasing trend in model performance until a brightness level of 0.5, indicating that while current models suffer from low-light conditions, their performance saturates once the brightness level surpasses a certain threshold. • Visibility varies the visibility of the target object, ranging from fully occluded to fully visible. Visibility is computed as the ratio of the target object's visible pixels over its total pixels. We observe that model performance quickly degrades when visibility goes below 0.5, leaving large room for improvement for future models. • Zoom varies how zoomed in the camera is on the target object, ranging from very zoomed-in to very zoomed-out.
When the view is very zoomed in, with the entire image occupied by a partial view of the target object, model performance is poor. This suggests that models rely on surrounding semantic context for detection. As the target object is increasingly zoomed out, it also becomes more challenging to detect due to its reduced size. As expected, peak model performance lies somewhere in the middle. • Pitch varies the pitch angle of the camera facing the target object, ranging from looking up to looking down. Our results indicate that the models are not robust to seemingly benign changes in camera viewpoint and tend to perform better when the camera looks down at the target objects. One potential explanation is that in the large-scale real datasets on which the models are trained, it is more common for objects to be slightly below the camera. To summarize, we observe significant performance variance across the three models on all five axes, indicating the lack of robustness of current SOTA models in extreme or out-of-distribution test environments. By generating large-scale synthetic datasets with controlled variability, BVS provides a unique and powerful test bed to evaluate model performance. Furthermore, our findings align with the observations in Section 9.4.2: relative performance across different models is generally consistent across the five axes. 9.4.2 Holistic Scene Understanding One of the major advantages of synthetic datasets, including BVS, is that they offer various types of labels (segmentation masks, depth maps, bounding boxes) for the same sets of input images. We believe this feature can fuel the future development of versatile vision models that can perform multiple perception tasks at the same time. Since such models are not currently available, we instead evaluate current SOTA methods on a subset of the tasks that BVS supports (see below). This also serves as a validation of the photorealism of our datasets, i.e., models trained on real datasets should perform reasonably without fine-tuning. Task design and dataset generation. Equipped with BVS's powerful generator (see Section 9.3.2), we generate an extensive dataset of 10,000 videos in 1,000 scene instances with per-frame ground-truth annotations in multiple modalities. Figure 9.5 shows an overview of the generated dataset. Baselines and metrics. In Table 9.3 we assess 11 models across four tasks. Specifically, we consider Detection and Segmentation, both in the challenging open-vocabulary setting [642, 354]. We also evaluate Depth Estimation and Point Cloud Reconstruction. Standard metrics are used for all tasks.

Method | Precision | Recall | F1
Zero-shot CLIP | 0.293 | 0.282 | 0.271
Ours | 0.863 | 0.817 | 0.839

Table 9.4: Classification results on the real test set. Task-specific training on synthetic data boosts performance on real images.

Test on | Metric | Open | Close | Ontop | Inside | Under | Avg
Synthetic | Precision | 0.962 | 0.897 | 0.947 | 0.989 | 0.874 | 0.932
Synthetic | Recall | 0.822 | 0.978 | 0.913 | 0.995 | 0.949 | 0.929
Real | Precision | 0.943 | 0.958 | 0.545 | 0.906 | 0.948 | 0.863
Real | Recall | 0.757 | 0.915 | 0.913 | 0.776 | 0.703 | 0.817

Table 9.5: Classification results on the held-out synthetic eval set and the real test set for our method adapted from [382].

Results and analysis. We summarize all our evaluation results in Table 9.3.
We observe that the relative performance of these models on our synthetic dataset has high correlation with that on the real datasets such as MS COCO [349] or NYUv2 [398], indicating that our generated synthetic datasets can be a faithful proxy for real datasets. In summary, we provide a comprehensive benchmark to score and understand a wide range of existing models for each of the four tasks on exactly the same images. While the majority of current vision models focus on single output modality, we hope BEHAVIOR Vision Suite could motivate researchers and practitioners to develop versatile models that concurrently predict multiple modalities in the future, where our benchmarking results of single-task SOTA methods in this section could serve as a useful reference. 9.4.3 Object States and Relations Prediction sec:App-UB BVS’s capabilities extend beyond model evaluation shown in Section 9.4.1 and Section 9.4.2. Users can also leverage BVS to generate training data with specific object configurations that are hard to collect in large quantities in the real world or difficult to label. In this section, we showcase one practical application of using BVS to generate a synthetic dataset of diverse kinematic object states and relations, and then training a 174 Synthetic Real OnTop Under Inside Open / Close Figure 9.6: Sample images of each class from our generated synthetic and collected real datasets. fig:spatial vision model capable of zero-shot transferring to real-world images on the task of object states and relations prediction [222, 168, 174, 71]. Task design and dataset generation. The task of predicting object states and relations, such as open and inside, is an important perception task [300, 375, 382]. Yet, in the real world, it is challenging to collect such data, let alone the costly annotations. We leverage our generator to synthesize 12.5k images with five labels (open, close, ontop, inside, under). Each image contains one or more desired labels, e.g. a toy inside an open cabinet. In addition, we manually collected and labeled 850 real images, with unseen object instances and scenes to test for sim2real performance. Examples are shown in Figure 9.6. Baselines and metrics. Adapting from [382], our model takes an image and the bounding boxes for the target objects as input, and outputs a five-way classification over the five labels. We define open/close as a binary relationship between the movable link and the unmovable base of an articulated object. For example, the model can be queried whether each of the drawer of a cabinet is open separately, offering very fine-grained understanding of the object state. 175 We compare our model with zero-shot CLIP, which doesn’t train on the synthetic dataset. Specifically, by harnessing CLIP’s zero-shot capabilities [434], this baseline outputs a five-way classification prediction by comparing the image embeddings with the five verbalized prompts’ text embeddings. We evaluate our model and zero-shot CLIP baseline in terms of precision, recall, and F1, on the synthetic eval set and the real test set. Results and analysis. Table 9.5 shows the quantitative results on the held-out synthetic dataset and the real dataset for our method. Although there is some performance gap, our model trained on only synthetic data can zero-shot transfer to real images with good overall accuracy. 
This indicates that BVS offers a promising way to obtain realistic synthetic data that researchers can use not only for evaluation (as shown in Section 9.4.1 and Section 9.4.2), but also for training models that can then be transferred to the real world. In fact, from Table 9.4, we observe that task-specific training on synthetic data is crucial for good performance on real images. 9.5 Conclusion chap-9-sec:conclude We introduced the BEHAVIOR Vision Suite (BVS), a novel toolset designed to help systematic evaluation and understanding of computer vision models under varying conditions. BVS allows researchers to control a multitude of parameters at various levels—scene, object, and camera—thereby enabling the creation of highly tailored datasets for specific computer vision tasks. Our experiments highlight the versatility and effectiveness of BVS with three key application scenarios. Firstly, we demonstrated its capability in evaluating model robustness against a range of domain shifts, showcasing its utility in helping understand how models perform under diverse and challenging conditions. Secondly, we provided comprehensive benchmarking results of various scene understanding models on a single, common dataset, to show the potential of developing a multitask method using a single BVS dataset. Finally, we explored the potential of BVS in facilitating sim2real transfer for novel vision tasks, object states and relations prediction. We aim to provide the computer vision community with a powerful tool that addresses the current data generation challenges. BVS demonstrates 176 the potential of synthetic data in advancing the field, offering researchers a means to generate high-quality, diverse, and realistic datasets tailored to their specific needs. 177 Chapter 10 A Peek Into the Reasoning of Neural Networks: Interpreting with Structural Visual Concepts chapter-10 Despite substantial progress in applying neural networks (NN) to a wide variety of areas, they still largely suffer from a lack of transparency and interpretability. While recent developments in explainable artificial intelligence attempt to bridge this gap (e.g., by visualizing the correlation between input pixels and final outputs), these approaches are limited to explaining low-level relationships, and crucially, do not provide insights on error correction. In this work, we propose a framework (VRX) to interpret classification NNs with intuitive structural visual concepts. Given a trained classification model, the proposed VRX extracts relevant class-specific visual concepts and organizes them using structural concept graphs (SCG) based on pairwise concept relationships. By means of knowledge distillation, we show VRX can take a step towards mimicking the reasoning process of NNs and provide logical, concept-level explanations for final model decisions. With extensive experiments, we empirically show VRX can meaningfully answer “why" and “why not" questions about the prediction, providing easy-to-understand insights about the reasoning process. We also show that these insights can potentially provide guidance on improving NN’s performance. 10.1 Introduction With the use of machine learning increasing dramatically in recent years in areas ranging from security [58] to medicine [501], it is critical that these neural network (NN) models are transparent and explainable as this relates directly to an end-user’s trust in the algorithm [199, 7]. 
Consequently, explainable AI (xAI) has emerged as an important research topic with substantial progress in the past few years. Most recent xAI Input Image Reasoning Interpretation Why not Ambulance Why Fire engine 1 2 3 4 1 2 3 4 1 23 4 Why not School bus Original NN prediction Fire engine Visual Reasoning Explanation Negative Positive 1 3 2 4 3 2 1 4 3 1 2 4 (a) (b) (c) Figure 10.1: An example result with the proposed VRX. To explain the prediction (i.e., fire engine and not alternatives like ambulance), VRX provides both visual and structural clues. Colors of visual concepts (numbered circles) and structural relationships (arrows) represent the positive or negative contribution computed by VRX to the final decision (see color scale inset). (a): The four detected concepts (1-engine grill, 2-bumper, 3-wheel, 4-ladder) and their relationships provide a positive contribution (blue) for fire engine prediction. (b, c): Unlike (a), the top 4 concepts, and their relationships, for ambulance/school bus are not well matched and contribute negatively to the decision (green/yellow/red colors). fig:why approaches attempt to explain NN decision reasoning process with visualizations depicting the correlation between input pixels ( or low-level features) and the final output [633, 371, 662, 578, 496, 563, 288, 77, 538, 179 455], with perturbation-based [538, 455] and gradient-based [496, 77] methods receiving particular attention in the community. Despite impressive progress, we identify some key limitations of these methods that motivate our work. First, the resulting explanations are limited to low-level relationships and are insufficient to provide in-depth reasoning for model inference. Second, these methods do not have systematic processes to verify the reliability of the proposed model explanations [287, 186]. Finally, they do not offer guidance on how to correct mistakes made by the original model. We contend that explaining the underlying decision reasoning process of the NN is critical to addressing the aforementioned issues. In addition to providing in-depth understanding and precise causality of a model’s inference process, such a capability can help diagnose errors in the original model and improve performance, thereby helping take a step towards building next-generation human-in-the-loop AI systems. To take a step towards these goals, we propose the visual reasoning explanation framework (VRX) with the following key contributions: • To understand what an NN pays attention to, given an input image, we use high-level category-specific visual concepts and their pairwise relationships to build structural concepts graphs (SCGs) that help to highlight spatial relationships between visual concepts. Furthermore, our proposed method can in-principle encode higher-order relationships between visual concepts. • To explain an NN’s reasoning process, we propose a GNN-based graph reasoning network (GRN) framework that comprises a distillation-based knowledge transfer algorithm between the original NN and the GRN. With SCGs as input, the GRN helps optimize the underlying structural relationships between concepts that are important for the original NN’s final decision, providing a procedure to explain the original NN. 
• Our proposed GRN is designed to answer interpretability questions such as why and why not as they relate to the original NN’s inference decisions, helping provide systematic verification techniques to 180 demonstrate the causality between our explanations and the model decision (Figure 10.1). We provide qualitative and quantitative results to show efficacy and reliability. • As a useful by-product, in addition to visual reasoning explanations, our method can help take a step towards diagnosing reasons for any incorrect predictions and guide the model towards improved performance. Convolutional layer Fully connected layer … Prediction: jeep (a) (b) Build Structural Concept Graph Grad-based attribution Grad-Cam filter Concept 1 Concept 2 Concept 3 Concept 4 ACE(segmentation → img2vec → cluster → sort by TCAV) … … … Visual Concept Extractor Represent image as Structural Concept Graph fire engine? beach wagon? jeep? … eji_ fire engine eji_ beach wagon eji_ jeep Concatenate MLP Graph Reasoning Network Knowledge Distillation Prediction: jeep Why not beach wagon Why jeep Negative Positive 1 2 3 4 1 2 3 4 1 2 3 4 Why not fire engine 3 2 1 4 3 1 2 4 … … … … … 1 2 3 4 Visual Decision Interpreter (c) (d) Figure 10.2: Pipeline for Visual Reasoning Explanation framework. (a) The Visual Concept Extractor (VCE) discovers the class-specific important visual concepts. (b) In original NN, the representation of the top N concepts is distributed throughout the network (colored discs and rectangles). (c) Using Visual Concept Graphs that are specific to each image class, our VRX learns the respective contributions from visual concepts and from their spatial relationships, through distillation, to explain the network’s decision. (d) In this example, the concept graphs colored according to contributions from concepts and relations towards each class explain why the network decides that this input is a Jeep and not others. chap-10-fig:2 10.2 Related Work In this section, we review existing literature relevant to our work interpreting convolutional neural networks, graph neural networks, and knowledge distillation to differentiate our method from others. 181 Interpreting neural networks. The substantial recent increase in the practical adoption of deep learning has necessitated the development of explainability and interpretability methods for neural networks (NNs), and convolutional neural networks (CNNs) in particular. One line of work focuses on pixel-level interpretation [496, 77, 662, 155, 328, 659, 584], producing attention maps to highlight the relevant image regions contributing to the final model decision. These methods can further be categorized into gradient-based and response-based methods. Response-based approaches use an additional computational unit to calculate the importance score of spatial image locations. For example, CAM [662] utilized an auxiliary fully-connected layer to produce the spatial attention map and highlight image pixels contributing to the network decision. On the other hand, gradient-based methods, e.g., Grad-CAM [496], generate class-specific attention maps based on gradients backpropagated to the last convolutional layer given the model prediction. In addition to pixel-level interpretation, several recent works proposed to extract more human-intuitive concept-level explanations for interpreting neural networks [288, 187]. Specifically, Kim et al. 
[288] proposed TCAV where directional derivatives are used to quantify the sensitivity of the network’s prediction with respect to input user-defined concepts. Ghorbani et al. proposed an automatic concept selection algorithm [187] based on the TCAV scores to produce meaningful concept-level explanations. While our framework also produces concept explanations automatically, it goes beyond this and learns explicit inter-concept relationships, producing more insightful interpretations. Graph Networks. Graph neural networks (GNNs) have been successfully applied to tasks ranging from node classification [295, 213, 618], edge classification [419, 205] to graph classification [189, 81]. Based on “message passing", powerful extensions such as GCNs [295], graph attention network (GAT) [566], SAGE [213] and k-GNNs [393] have been proposed. Due to their trackable information-communication properties, GNNs can also be used for reasoning tasks, such as VQA [545, 405] and scene understanding [335]. In this work, we adopt the GCN to learn semantic relationships and interactions between human-interpretable concepts, providing more thorough explanations. 182 Knowledge distillation. Knowledge distillation can effectively learn a small student model from a large ensembled teacher model [236], which finds broad applications in different areas, like model compression [425] and knowledge transfer [423]. In a similar spirit, in this work, we learn an easy-to-understand graph reasoning network (GRN) that produces the same classification decisions as the original NN model while also learning structural relationships between concepts to generate in-depth explanations for the original NN inference decisions. 10.3 Visual Reasoning Explanation Framework chap-10-sec:3 Our proposed visual reasoning explanation framework (VRX) to explain the underlying decision reasoning process of a given NN is visually summarized in Figure 10.2. VRX comprises three main components: a visual concept extractor (VCE) to identify primitive category-specific visual concepts from the given neural network; a graph reasoning network (GRN) to organize category-specific visual concepts, represented as structural concept graphs (SCGs), based on their structural relationships, to mimic the decision of the original NN with knowledge transfer and distillation; and a visual decision interpreter (VDI) to visualize the reasoning process of the neural network given a certain prediction. We next explain each of these components in detail. 10.3.1 Visual Concept Extractor chap-10-sec:3-1 While most existing neural network explanation techniques focus on producing low-level saliency maps, these results may be suboptimal as they may not be intuitive for human users to understand. Inspired by the concept-based explanations (ACE) technique [187], we propose to use visual concepts to represent an input image given class-specific knowledge of the trained neural network to help interpret its underlying decision-making processes. While ACE [187] is reasonably effective in extracting class-specific visual concepts, its performance is dependent on the availability of sufficient image samples for the given class of interest. As we show in 183 Figure 10.3: Concept discovery with and without Grad-Cam filter. fig:filter Figure 10.3 (left), for a class (ambulance here) with a small number of training images (50), the ACE concepts mostly fall on the background region, presenting challenges for a downstream visual explanation. 
To alleviate this issue, given an image I, we propose to use top-down gradient attention [496] to first constrain the relevant regions for concept proposals to the foreground segments, thereby helping rule out irrelevant background patterns. Given the class-specific attention map M, we use a threshold τ to binarize M as M¯ (pixel values lower than τ set to 0, others set to 1), which is used to generate the masked image ¯I = I × M¯ (× is element-wise multiplication) for further processing. Specifically, following ACE, we extract the top-N visual concepts and their mean feature vectors for each class of interest using the original trained NN. Figure 10.3 demonstrates the importance of the proposed gradient attention pre-filtering discussed above using top-3 visual concepts for the ambulance class (concepts with the pre-filtering focus more clearly on the foreground). 184 10.3.2 Graph Reasoning Network 10.3.2.1 Representing Images as SCGs chap-10-sec:3-2-1 Given the aforementioned class-specific visual concepts (see Section 10.3.1), we represent images using structural concept graphs (SCGs), which, as input to our proposed graph reasoning network (GRN), helps learn structural relationships between concepts and produce visual explanations for the original NN. Specifically, given an image, we use multi-resolution segmentation to obtain image patches (also called concept candidates), as inputs to the original NN to compute patch features, and then match these features to the mean concept feature vectors derived above (from Section 10.3.1). For each class of interest, we construct an SCG with concepts/patches detected from the input image, based on the Euclidean distance between patch feature and mean concept feature. Specifically, if the Euclidean distance between image patch feature and mean concept feature is larger than a threshold t, we identify this patch as a detected concept. For undetected concepts, we use dummy node feature representation (all feature values equal to a small constant ϵ), to ensure network dimension consistency. Note that we have n SCGs generated for the same input image considering all n classes of interest. SCG is a fully connected graph (V, E) with bidirectional edges where each node vi ∈ V represents one relevant visual concept. Each directed edge edgeji = (vj , vi) ∈ E has two attributes: 1) a representation of spatial structure relationship between nodes edgeji, initialized with the normalized image locations [xj , yj , xi , yi ] of the two visual concepts it connects and updated in each layer of GRN; 2) a measure of dependency eji (a trainable scalar) between concepts vi , vj (see Figure 10.2 (c) and Figure 10.5 for an overview). Such a design helps our framework not only discover human-interpretable visual concepts contributing to network prediction but also how their underlying interactions (with eji capturing the dependencies) affect the final decision. 185 10.3.2.2 Imitate the Reasoning Process of NN In addition to learning concept representations and capturing the structural relationship between visual concepts we also need to ensure the proposed GRN follows the same reasoning process as the original NN. Since we represent images as SCGs, this problem comes down to optimizing the GRN, with SCG inputs, so it gives the same output/prediction as the original NN with image inputs. We realize this with a distillation-based training strategy. 
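Before the formal definition that follows, the sketch below shows the essence of this distillation step in PyTorch: the GRN's prediction on the SCG hypotheses is pushed toward the frozen original network's prediction on the image via an L1 penalty on normalized outputs (softmax is used here as one possible choice of normalization). The tiny two-layer model is only a stand-in for the graph reasoning network defined next.

import torch
import torch.nn as nn

# Minimal sketch of the distillation objective, assuming node features for the
# SCG hypotheses have already been extracted. The tiny GRN below is a stand-in
# for the graph reasoning network defined formally in the next paragraphs.
n_classes, n_concepts, feat_dim = 3, 4, 16

class TinyGRN(nn.Module):
    def __init__(self):
        super().__init__()
        # One embedding per class hypothesis, fused by an MLP head.
        self.encode = nn.Linear(n_concepts * feat_dim, 32)
        self.head = nn.Linear(n_classes * 32, n_classes)

    def forward(self, scgs):                # scgs: (B, n_classes, n_concepts, feat_dim)
        emb = self.encode(scgs.flatten(2))  # (B, n_classes, 32)
        return self.head(emb.flatten(1))    # (B, n_classes)

grn = TinyGRN()
opt = torch.optim.Adam(grn.parameters(), lr=1e-3)

scgs = torch.randn(8, n_classes, n_concepts, feat_dim)   # SCG hypotheses
teacher_logits = torch.randn(8, n_classes)                # F(I), frozen original NN

# Distillation: match normalized predictions with an L1 penalty.
loss = (torch.softmax(grn(scgs), dim=1)
        - torch.softmax(teacher_logits, dim=1)).abs().mean()
loss.backward()
opt.step()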
Specifically, given an input image $I$ and a trained NN classifier $F(\cdot)$, along with $n$ SCG hypotheses $h = \{h_1, h_2, \ldots, h_n\}$ extracted from the input image, we seek to learn the GRN $\mathcal{G}$ for $h$ such that $\mathcal{G}(h) = F(I)$, i.e., ensuring prediction consistency between the GRN and the original NN. The proposed $\mathcal{G}(\cdot)$ comprises two modules: 1) a GNN $G$ that is applied for all classes, with different class-specific $e_{ji}$, to learn the graph representation of the SCGs; 2) an embedding network $E$ that fuses the multi-category SCGs for the final class prediction, i.e.:

$$ \mathcal{G}(h) = E(G(h)) = F(I) \tag{10.1} $$

Figure 10.2 (b-c) gives an overview of the component relationship between the original NN (b) and the proposed GRN (c), showing how the GRN learns the "embedding" for each hypothesis and, through knowledge distillation, ensures the same prediction as the original NN. We use GraphConv [393] as $G$'s backbone network and modify the aggregation weights. For each graph convolutional layer, we have:

$$ f^{i}_{k+1} = W_1 f^{i}_{k} + \sum_{j \in \mathcal{N}(i)} e^{c}_{ji} W_2 f^{j}_{k} \tag{10.2} $$

where $f^{i}_{k}$ denotes the feature of node $v_i$ (representing a concept) in layer $k$, $W_1$ and $W_2$ denote the shared linear transformation parameters for the center node $v_i$ and the neighbor node $v_j$ respectively, $\mathcal{N}(i)$ denotes the set of neighboring nodes connected to node $i$, and $e^{c}_{ji}$ denotes the aggregation weight from start node $v_j$ to end node $v_i$ for a certain class $c$, indicating the inter-dependency of concept $i$ on $j$. Instead of using shared edges for all classes of interest, the GRN learns class-specific $e^{c}_{ji}$, i.e., different aggregation weights for different classes, to capture the varying structural relationships between class-specific concepts. In order to better capture inter-concept relationships, we concatenate edge features with neighboring node features, denoted as $C(e^{c}_{ji} W_2 f^{j}_{k}, \mathrm{edge}^{ji}_{k})$, and Equation 10.2 becomes:

$$ f^{i}_{k+1} = W_1 f^{i}_{k} + \sum_{j \in \mathcal{N}(i)} W_3\, C(e^{c}_{ji} W_2 f^{j}_{k}, \mathrm{edge}^{ji}_{k}) \tag{10.3} $$

with $\mathrm{edge}^{ji}_{k+1} = W_4\, \mathrm{edge}^{ji}_{k}$, and $W_3$ and $W_4$ denoting one-layer linear transformations for the concatenated message feature and the edge feature, respectively. Since $e^{c}_{ji}$ is a trainable parameter by design in our $G$, it helps learn concept inter-dependency as measured by the overall training objective (see Figure 10.5(b) for a fire engine image example). The embedding network $E$ concatenates all the feature vectors output from $G$ and maps them into an $n$-dimensional vector with an MLP ($n$ is the number of classes of interest). The GRN is then trained to imitate the original NN (see Figure 10.4) by minimizing:

$$ L_d = \left\| \sigma(\mathcal{G}(h)) - \sigma(F(I)) \right\|_{\ell_1} \tag{10.4} $$

where $\sigma(\cdot)$ is a normalization function. To make the imitation robust, we randomly mask out one of the detected visual concepts on the input image. Figure 10.4 demonstrates the prediction comparison between the learned $\mathcal{G}$ and the original NN; {class name}_detect{N} denotes images from category "class name" with concept N masked out.

Figure 10.4: Decision comparison between original NN and proposed GRN.

10.3.3 Visual Decision Interpreter Once our GRN is trained to be a structural-concept-level representation of the original neural network, we can then interpret the original model decisions with our visual decision interpreter (VDI) module. As shown in Figure 10.2(c-d), after feeding an image to both the original NN and the GRN, we obtain the final prediction $y$ representing the probabilities of all classes of interest, $y = E(G(h)) = E(C_{i=1}^{m}(G_i(h_i)))$.
where $G_i$ represents the shared $G$ equipped with class $i$'s aggregation weights $e^{i}_{ji}$, and $G_i(h_i)$ is the graph embedding for the $i$-th hypothesis SCG, composed of the extracted concept node and edge feature representations; $C$ denotes the concatenation operation. For each class of interest $c$, we have a class prediction score $y^c$ and compute the gradients of $y^c$ with respect to the graph embeddings of the $m$ hypotheses as:

$$ \alpha_i = \frac{\partial y^c}{\partial G_i(h_i)}, \quad i = 1, \ldots, m \tag{10.5} $$

where $\alpha_i$ denotes the contribution weight vector of hypothesis $h_i$. The contribution score $s_i$ for each hypothesis $h_i$ w.r.t. the prediction of $y^c$ is computed as the weighted sum of $\alpha_i$ and $G_i(h_i)$:

$$ s_i = \alpha_i^{T} G_i(h_i), \quad i = 1, \ldots, m \tag{10.6} $$

Figure 10.5: (a) Class-specific importance weights $e_{ji}$ highlight the important concept relationships for different classes. (b) $e_{ji}$ reveals the information transformation between concepts, which shows the dependency between concepts: concepts 1 and 2 contribute the most information to other concepts, which makes them the two most discriminating concepts for a fire engine.

Figure 10.6: Visual Reasoning Explanation and logic consistency experiment example.
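As a concrete illustration of Equations 10.5 and 10.6, the snippet below computes per-hypothesis contribution scores with automatic differentiation; the random embeddings and the linear head are placeholders for the trained GRN components, so only the gradient-times-embedding pattern is meant to carry over.

import torch
import torch.nn as nn

# Illustrative computation of the contribution scores in Eqs. 10.5-10.6.
# The embeddings and classifier head below are random stand-ins for the
# trained GRN components; only the autograd pattern is the point.
m, emb_dim, n_classes = 3, 32, 3
graph_embs = [torch.randn(emb_dim, requires_grad=True) for _ in range(m)]
head = nn.Linear(m * emb_dim, n_classes)   # embedding network E (placeholder)

y = head(torch.cat(graph_embs))            # class scores from concatenated G_i(h_i)
c = 0                                      # class of interest

# alpha_i = d y^c / d G_i(h_i)   (Eq. 10.5)
alphas = torch.autograd.grad(y[c], graph_embs, retain_graph=True)

# s_i = alpha_i^T G_i(h_i)       (Eq. 10.6): signed contribution of hypothesis i
scores = [float(a @ g) for a, g in zip(alphas, graph_embs)]
print(scores)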
In this case, we ask ‘why school bus?’ (why the original NN predict this image as a school bus?): from a visual/conceptual perspective, all detected top 4 important concepts have high positive contribution (blue) to the prediction probability of school bus (Row 3 of Figure 10.6 (a)), 190 indicating the network is able to discover meaningful visual regions contributive to the correct prediction; from a structural perspective, the spatial location and relationship between concepts represented by edge arrows also contribute positively (light or dark blue), meaning the network identifies correct spatial correlations between detected visual concepts. Similarly, to answer ‘why not fire engine?’ and ‘why not ambulance?’, VRX identifies nearly all detected concepts negatively contribute to the corresponding prediction class, and all structure relationships between concepts have negative contributions to the class prediction as well. Based on the explanation above, VRX can give a systematically in-depth and easy-to-understand interpretation of the decision-making logic of GoogLeNet, from the visual and structural perspectives respectively. The second example is shown in Figure 10.6 (b) for Xception network. Given an image of a fire engine, both the original Xception and our VRX wrongly predict ambulance as output. To understand why original Xception makes the incorrect prediction, our VRX is able to provide both visual and structural clues as well. From Figure 10.6 (b) Row 1, we can see that the detected visual concepts 3 (wheels of the vehicle) and 4 have negative contribution to the prediction of fire engine class, indicating that the wheel region of the input image is not consistent with the model’s knowledge of fire engine (with negative contribution). To answer "why ambulance", concept 3 and 4 have positive contribution to ambulance prediction, which explains why the original Xception network incorrectly predicts the input image as an ambulance. 10.4.2 Logic Consistency between VRX and NN To verify that the explanation of VRX is logically consistent with the reasoning of Xception, we present two experiments as follows. First, as shown in Figure 10.6 (c), for the wrong prediction example same as Figure 10.6 (b), we substitute the flawed fire engine concept 3, which has a negative contribution (low contribution score), with a good concept 3 (high contribution score) from another fire engine image and form a new modified image. Then, we use Xception to re-predict the class of the modified image, it corrects the error and predicts the input as a fire engine correctly. To show a causal relationship between VRX’s 191 Cause of error Error type total concept structure both Before correction 119 5 6 108 Substitute with Random patches 117 5 6 106 Change good concepts 115 5 6 104 VRX guided correction 5 1 2 2 Table 10.1: VRX model helps correction. Out of 119 images initially misclassified by Xception, only 5 remain misclassified after VRX-guided image editing. Over 30% of the samples have missing concepts and over 95% of them have been correctly explained. In contrast, 117 and 115 images remain misclassified after substituting bad concepts with random image patches, or substituting good concepts with other good concepts from other images from the same class. 
explanation and the reasoning logic of Xception, we perform two additional contrastive experiments: a) Random substitute: if we substitute concept 3 with random patches, Xception does not achieve a correct prediction; b) Substitute good: if we substitute concepts 1 or 2 with other equivalently good patches from other images of fire engines, Xception also does not produce a correct decision. Thus, we conclude that VRX has correctly diagnosed the cause of Xception's error (here, a bad concept 3). Below, we show how this can be used to further guide improved training of the original NN without manually modifying the image. For the wrongly predicted class, ambulance, if we delete a concept patch with a high contribution to the ambulance probability, Xception's prediction shows a decreased probability for the ambulance class and a higher probability for fire engine. In total, we applied this experiment to 119 images that were initially wrongly predicted by Xception (Table 10.1). The results show that with the guidance of VRX (confusing/bad concepts detected), most wrong predictions can be corrected through learning-based modifications of the images. 10.4.3 Interpretation Sensitivity to Visual and Structural Changes We have demonstrated that VRX can help explain why and why not the model makes a decision, and have shown a causal relationship between VRX's explanation and the original NN's decision. In this section, we focus on the sensitivity analysis of VRX's explanation from the visual and structural aspects, respectively.

Figure 10.7: Interpretation from VRX is sensitive to visual and structural aspects: (a) visual sensitivity experiments; (b) structure sensitivity experiments.

We design two experiments accordingly: first, when we substitute a relatively good concept patch (with a high positive contribution score to the corresponding class prediction) with a relatively bad concept patch (with a lower positive or even negative contribution score) in an image, we want to see whether VRX can capture the difference and precisely locate the correct modification, which tests the sensitivity of VRX's visual explanations. Second, when we move one concept's location from a reasonable place to an abnormal location, we want to verify that VRX can precisely capture the structural abnormality and produce a corresponding explanation that correctly matches our modification.
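Both sensitivity tests follow the same recipe: edit a single concept patch, re-run the original classifier and the explainer, and compare contribution scores before and after. The sketch below captures that recipe; classify and explain are hypothetical stand-ins for the original NN and for VRX's contribution-score computation, respectively.

import numpy as np

# Generic recipe for the visual/structural sensitivity checks described above.
# `classify` and `explain` are hypothetical stand-ins for the original NN and
# for VRX's contribution-score computation, respectively.
rng = np.random.default_rng(0)

def classify(image):                      # placeholder classifier: class probabilities
    logits = np.array([image.mean(), image.std(), image.max()])
    return np.exp(logits) / np.exp(logits).sum()

def explain(image, boxes):                # placeholder VRX: one score per concept box
    return np.array([image[y0:y1, x0:x1].mean() for (x0, y0, x1, y1) in boxes])

def perturb_concept(image, box, patch):   # substitute (or relocate) one concept patch
    x0, y0, x1, y1 = box
    edited = image.copy()
    edited[y0:y1, x0:x1] = patch
    return edited

image = rng.random((64, 64))
boxes = [(0, 0, 16, 16), (32, 32, 48, 48)]   # detected concept regions
bad_patch = np.zeros((16, 16))                # a low-scoring replacement patch

edited = perturb_concept(image, boxes[0], bad_patch)
delta_pred = classify(edited) - classify(image)
delta_scores = explain(edited, boxes) - explain(image, boxes)
print(delta_pred, delta_scores)   # the edited concept's score should drop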
193 Figure 10.7(a) demonstrates two visual sensitivity experiment examples. In the top row, given an ambulance image with a correct prediction from a trained Xception (Figure 10.7(a) left), VRX explains that all detected concepts and relative structure relationship have positive contributions to the prediction of ambulance class. We then substitute the original good concept 2 with relatively bad concept 2 from another ambulance image and form a modified ambulance image (Figure 10.7(a) right), to check the sensitivity of our VRX with respect to visual perturbation. From Figure 10.7(a), we can see that after the substitution, the class prediction score from both VRX and original Xception decrease as expected. While VRX gives a clear explanation for this performance decrease due to: less contributive concept 1 and 2 (negative contribution to the ambulance prediction), and invariant structure contributions, which correctly matches our modification in the original image. This proves the sensitivity of our VRX to visual perturbations. The second row of Figure 10.7(a) shows an additional example of visual sensitivity test. Figure 10.7(b) illustrates two structure sensitivity experiments. Given a fire engine image with a correct prediction from trained Xception, VRX shows that concept 3 and the structural relationships of concept 3 to all adjacent concepts are positively contributive for class prediction. We then move concept 3 from the original location to an abnormal location (we move the wheels from the bottom to the sky) and form a modified fire engine image (Figure 10.7(b) right) to test the structural sensitivity of our VRX. Similarly, VRX produces consistent explanation with respect to structure perturbation as well, where the spatial relationship importance score between concept 3 to all adjacent concepts decrease after the substitution, which demonstrates the good sensitivity of our VRX to structural information. A second example in Figure 10.7(b) shows similar results. 10.4.4 Model Diagnosis with VRX With the explainability of VRX, reasoning results generated by VRX can be further utilized to guide improving the performance and generalizability of the original NN. Figure 10.8 shows a 6-class confusion matrix with Xception. With VRX, the type of error Xception makes can be categorized as the following: 194 Figure 10.8: Model diagnosis and improving performance fig:confu (1) Confused visual concepts between classes. The top k concepts of different classes may share certain overlaps. For instance, most vehicles have concepts related to ’wheels’. Hence judging only by this concept, the neural network may confuse one type of vehicle with another. There are existing approaches [329] which can guide the network in growing its attentive region and alleviating the impact from biases in training data. (2) False alarms in concept detection/recognition. To VRX this usually means one or more patches are incorrectly labeled, which means either the neural network’s feature extraction can be improved, or the most important visual concepts for specific classes are not discriminative enough. (3) Variance not seen in training. For instance, the distribution of viewpoints of a class of interest is biased in the training set of the NN. When the same object with an unseen viewpoint is presented to the NN, it may fail to recognize it. 
In these cases, in VRX's decision reasoning, it may appear that most of the detected concepts are very close matches. However, the edge features seem off, suggesting that the structural or spatial relationships between concepts are the cause of the NN's incorrect predictions. Augmenting the training images with more diversity in viewpoints may solve the problem, as shown in the experiment below with the iLab-20M [48] dataset.

To further demonstrate the capability of NN diagnosis, we design an experiment on iLab-20M. iLab-20M is an attributed dataset with images of toy vehicles on a turntable, captured with 11 cameras from different viewpoints. We sampled a subset from iLab-20M with similar identities and poses: we focus on three classes of vehicles: bus, military, and tank. In the training set, each class has 1000 images. We manually introduce biases in the pose of each class: all buses have pose 1, all military vehicles have pose 2, and all tanks have pose 3 (Figure 10.9). We designed an unbiased test set where each kind of vehicle appears in all three poses.

Figure 10.9: Diagnosis and improvement experiment on iLab-20M. (Panels (a) and (b) compare, for a misclassified input, the SCGs of the military and tank classes, including the per-concept visual scores and the summed edge scores.)

Table 10.2: Test-set accuracy comparison showing how VRX guidance boosts the original model's performance. All numbers are in %.

                   original   setting 1   setting 2
Average accuracy      50          60          50

We train a ResNet-18 [220] to classify the 3 types of vehicles on this training set and test the accuracy on the test set (Table 10.2). To explain the reasoning logic of the trained network, we trained a GRN with VRX and examined the logic behind common mistakes made by the ResNet-18. For most incorrectly classified samples in the test set (e.g., in Figure 10.9, a military vehicle is wrongly predicted as a tank), VRX's interpretation shows that most of the detected visual concepts had a positive contribution to the correct class, while the structural relationships between concepts contributed mostly negatively, which leads to the incorrect prediction. To verify this "diagnosis", we designed a follow-up experiment focusing on improving performance for the military class. Setting 1: we add images of additional poses (150 for each of the three poses) for the military class to the training set and test the performance on the test set; setting 2: we add the same number of images (450) as in setting 1, but with the same pose as in the original training set. Table 10.2 shows that the training set augmented under setting 1 yields much higher accuracy than the initial experiment, whereas the follow-up experiment under setting 2 does not bring any improvement. This suggests that VRX can help diagnose the root cause of the mistakes a neural network makes, and can potentially provide useful suggestions to improve the original NN's performance.

10.5 Conclusion

We considered the challenging problem of interpreting the decision process of a neural network for better transparency and explainability. We proposed a visual reasoning explanation framework (VRX) which can extract category-specific primitive visual concepts from a given neural network, and imitate the neural network's decision-making process.
Our experiments showed that the VRX can visualize the reasoning process 197 behind neural network’s predictions at the concept level, which is intuitive for human users. Furthermore, with the interpretation from VRX, we demonstrated that it can provide diagnostic analysis and insights on the neural network, potentially providing guidance on its performance improvement. We believe that this is a small but important step forward towards better transparency and interpretability for deep neural networks. 198 Chapter 11 How to interpret, teach, and interact with neural networks chapter-11 How could humans better teach, understand, and communicate with artificial neural networks to avoid making some mistakes and learn new knowledge? Currently, network reasoning is mostly opaque. Attempts at modifying it are usually through costly addition of new labeled data and retraining, with no guarantee that the desired improvement will be achieved. Here, we develop a framework that allows humans to understand the reasoning logic of a network easily and intuitively in graphical form. We provide means for humans to leverage their broader contextual knowledge, common sense, and causal inference abilities, to simply modify that graph as needed, to correct any underlying flawed network reasoning. Finally, we automatically merge and distill the modified knowledge back into the original network, so that the improved network can exactly replace the original, but performing better thanks to human teaching. We show viability of the approach on large-scale image classification and zero-shot learning tasks. 11.1 Introduction Teaching and learning serve as the cornerstones of societal and technological progress, fostering the growth and development of individuals and communities alike [572]. The current approach to supervised teaching of neural networks, however, resembles the repetitive act of "cramming" for an exam by repeatedly rehearsing content from flash cards (annotated training datasets). The prevailing paradigm thus often positions humans 199 School bus … … c-SCG Ambulance c-SCG … … … … Zebra c-SCG … … Tiger c-SCG School bus image samples Ambulance image samples Zebra image samples Tiger image samples Figure 11.1: Network-to-human path in our approach shows the reasoning logic of a network to a human, using a Structural Concept Graph as "language". Four examples of object classes are shown (one per row). In each one, we highlight the four most important visual concepts according to the original network (different colors), in three instance images. These visual concepts are the ones that most influence the decision of the original network, as discovered through an automated analysis of the network (visual concept extractor). Aggregating these instance-level concepts produces a class-level structural concept graph (c-SCG; rightmost column) for each object class, which captures the most discriminative visual concepts or parts for that class, according to the original neural network, as well as their relationships. Sometimes, the most important concepts for the original network are wrong, possibly because of spurious correlations in the training data, or for some other reasons detailed below (e.g., red circle on top row is a patch of background foliage, which may have often appeared next to school buses during training, but actually is not part of school bus; likewise for a patch of grass with Zebra). chap-11-fig:1 as peripheral to a neural network, emphasizing data collection and model tuning. 
This structure hinders direct and efficient knowledge exchange, making it challenging for humans to convey their knowledge to neural networks and vice versa [50]. As Socrates famously said, "I cannot teach anybody anything. I can only make them think" [28]. This naturally prompts us to explore a new role for humans in teaching neural networks, not only by providing large amounts of annotated data, but also by guiding and refining the network’s thought process. Here, we hence define methods for improving the efficiency of human-neural network interaction, emphasizing the need for a better common language [307] and techniques that enable direct teaching and 200 learning. By bridging the communication gap between humans and neural networks, we can unlock the full potential of this knowledge exchange, aiming towards a Socratic conversation or dialogue between humans and networks, rather than only cramming of annotated datasets. As machine learning (ML) systems become ubiquitous in our lives [501, 103, 58, 46, 274, 468], the need to understand, explain, and trust ML systems is ever growing [635]. Both humans and neural networks have their respective strengths and can offer valuable insights to one another [235]. Especially when a network makes mistakes, a good human teacher may be able to contribute additional prior and domain expertise knowledge, causal inference abilities, and common sense, to help correct these mistakes. However, the lack of an effective knowledge exchange interface has made it hard for a human to locate the reason for a network’s error, not to mention correct the error. There are two main challenges for the interaction between humans and networks: (1) Interpretability [125, 351, 191], for humans, i.e., how to understand the reasoning logic of a network and how to correctly locate the reasons for errors [171]. (2) Changing a network’s logic and decision, once humans locate an error of the network, how to correct it and improve the network’s performance. Interpreting networks has traditionally been approached through "feature-based" explanation methods [187] that involve modifying input features (e.g., pixels, super-pixels, word vectors) by either removing them (through zeroing-out, blurring, shuffling) [455] or perturbing them [538, 517]. These methods aim to approximate the importance of each feature in the model’s predictions. For visual interpretation, class-discriminative attention maps can be generated to highlight image regions that strongly support the network’s decision [663, 495]. However, these approaches have been criticized for reliability issues [186, 8], vulnerability to adversarial perturbations [292], and susceptibility to human confirmation biases [287]. Additionally, they do not necessarily improve human understanding of the model [426]. To address these concerns, recent "concept-based" research provides explanations in the form of highlevel human concepts [665, 287, 186, 187]. These methods focus on extracting or revealing important 201 visual concepts, rather than pixels or features, to explain the original model. However, these approaches do not clarify the network’s reasoning logic or elucidate how spatial relationships and interactions among image regions or concepts may affect decisions. 
The recently proposed visual reasoning explanation [171] mimics the original network’s reasoning logic and provides logical, easy-to-understand explanations for final decisions, but it cannot directly influence the network’s performance to achieve closed-loop interaction. Interacting with neural networks (network) has been a topic of interest in Human-In-the-Loop Machine Learning (HILML) [110] and Interactive Machine Learning (IML) [593]. These ML-centered approaches typically involve a pipeline where models are retrained using human-curated data [110]. In this process, humans essentially act as "servers" around the ML process, by being involved in data production, ML modeling, and model evaluation and refinement [369]. However, this limited role constrains human involvement and the potential for domain expert contributions. Recent research has explored more human-friendly interactions between humans and neural networks. "Tell me where to look" [329] employs an explainable attention map to rectify segmentation errors in networks, but this approach is restricted to pixel-level masks. Revising Neuro-Symbolic [530] attempts to use explanations as feedback for correcting errors or biases in the original network with human intervention. This method necessitates humans to examine each image to identify potential issues and engage in dataset-specific model training, leading to high costs and low effectiveness. Furthermore, the approach lacks generalization to real-world datasets and does not consider spatial relationships. User interaction has also been introduced in image generation tasks. For instance, Interactive Image Generation [389] can repeatedly modify images based on modifications to the scene graph while keeping the contents generated over previous steps. To address these challenges, we introduce the Human-Neural Network Interface (HNI) for knowledge exchange, offering the following key contributions: (1) HNI employs high-level class-specific visual concepts and their relationships to construct a class-specific structural concept graph (c-SCG) for each class of interest in an image classification task (Figure 11.1). The c-SCG represents the key components (concepts; 202 graph nodes) of an object class and their spatial relationships (graph edges). This allows both humans and networks to understand each other using c-SCG as a common "language" for communication, interaction, and knowledge exchange. (2) Through the network-to-human path, the network can utilize c-SCG to present its reasoning logic in a manner that is easily comprehensible to humans. (3) Along the human-to-network path, humans can analyze the network’s reasoning logic (c-SCG) and can modify it with their prior knowledge. HNI then employs a Graph Reasoning Network and partial knowledge distillation to transfer knowledge from humans back to the network, enabling the network to acquire new knowledge from human input. (4) By creating new c-SCGs or modifying existing ones, humans can teach the network to recognize previously unseen objects, thereby establishing a novel pipeline for zero-shot learning. 11.2 Results c11-sec:2 To demonstrate the proposed approach, we focus on image classification. HNI can be used as a generic interface for knowledge exchange between human users and neural networks. We conduct experiments to demonstrate several applications of HNI: (1) network could show their reasoning logic (represented as class-specific structural concept graph (c-SCG)) to human (Section 11.2.1). 
(2) Human users can improve network’s performance by modifying the c-SCG (updating the logic of important concepts and relationships between them; Section 11.2.2). (3) Human users can guide the network in zero-shot learning to learn new objects (Section 11.2.3). The network is first trained in a conventional supervised manner using gradient descent [473]. Through this process, the network learns low-level visual features, intermediate-level representations, and high-level visual concepts that will support its classification decision on novel test images. 203 Human-Network Interface (a) Neural Network Network shows reasoning logic to human Human change Network’s reasoning logic (b) Visual Concept Extractor (d) Human Intelligence (e) Graph Neural Network Mimic human logic (c) Graph Reasoning Network Mimic original network logic (f) Partial knowledge distillation Human modify c-SCG Structural concept graph (c-SCG) Human analysis Figure 11.2: Pipeline for the proposed Human-Network Interface. Top arrow represents the network-tohuman path, which shows the reasoning logic of the original network (a) to a human, using structural concept graphs (SCG) as a language. It consists of a Visual Concept Extractor (b), which discovers the most important visual concepts for the network, and a Graph reasoning Network (GRN; c), which aggregates and summarizes concepts and their relationships from many training images into a single class-level structural concept graph (c-SCG) for each class. Bottom arrow represents the Human-to-network path, which changes the network’s decision making through human intervention, which is made easy and intuitive by allowing humans to interact with the c-SCG. This path consists of three steps: (1) humans (d) inspect and possibly change a given c-SCG, using their common sense, domain knowledge, and understanding of how spurious correlations may cause errors, to fix errors in the c-SCG. For example, at top-right, tree foliage was used by the original network to recognize school buses, but this is likely a spurious correlation in the training set (many school buses were shown in front of trees); conversely, a wheel was used by the original network, but is not ideal because it is not discriminative of other wheeled vehicles. Humans can choose to substitute these visual concepts with others from the pool extracted by the Visual Concept Extractor, initially ranked less important by the network. Humans can also modify the edges of the c-SCG, to add, remove, or correct relationships between visual concepts. (2) The framework then trains GRN with human logic (e), and (3) transfers human knowledge to network by partial knowledge distillation (f). The revised network has exactly the same structure as the original, but its weights have been modified following the human interaction. We show in our results that this pipeline is effective at rapidly (in terms of human effort) and interactively correcting network mistakes. chap-11-fig:2 11.2.1 Neural Network shows reasoning logic to human c11-sec:2.1 The network-to-human path explains the network’s reasoning logic for each decision (instance-level explanation), and, more importantly, the understanding of network for each class, represented as class-specific Structural Concept Graph (c-SCG). Figure 11.1 shows examples of c-SCGs, which capture the most discriminative visual concepts or parts that have been learned by the original neural network through gradient descent, as well as their relationships. 
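Concretely, a c-SCG is a very small directed graph. The following sketch illustrates one possible way to store it (the field names are illustrative, not our actual implementation): nodes hold the mean feature vector of a discovered visual concept, and directed edges carry a spatial-relation feature plus a scalar dependency weight that can be pruned.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ConceptNode:
    concept_id: int
    anchor: np.ndarray        # mean feature vector of the concept cluster
    importance: float         # importance score from the concept extractor

@dataclass
class Edge:
    src: int                  # the start node's position informs the end node
    dst: int
    spatial: np.ndarray       # relative-position feature between the two concepts
    dependency: float         # scalar e_ji; unimportant edges are pruned

@dataclass
class ClassSCG:
    class_name: str
    nodes: dict = field(default_factory=dict)   # concept_id -> ConceptNode
    edges: list = field(default_factory=list)   # directed Edge list

    def prune_edges(self, threshold: float):
        """Keep only the dependencies that matter, mirroring the edge
        selection performed during distillation."""
        self.edges = [e for e in self.edges if abs(e.dependency) >= threshold]
```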
Take ambulance as an example, the 4 most important visual concepts that 204 the original network discovered during training are the corner of front and side window (blue circle), front bumper and light (green circle), side logo (pink circle) and wheels (red circle). Besides visual concepts, their spatial relationships represent a specific structure among them, which the network has also learned during training. An arrow represents that the start node’s information is important to the end node; for instance, the relative positions between the first two concepts are important to each other (knowing one could help find the other one). Here, we extract the learned visual concepts and relationships from the original network by distilling it into a graph neural network [171]. Figure 11.2 shows the operation of our pipeline, using two passes: First, network-to-human (Figure 11.2, top) explains the reasoning logic of network to human users, in the following steps (Section 11.4.1): (i) A Visual Concept Extractor (VCE) reveals the representative visual concepts for each class of interest, as learned by the network. (ii) We then train a Graph Reasoning Network (GRN) to mimic the decision-making process of the original network, using knowledge distillation. The nodes represent the visual concepts that the original network considered most important in identifying the class of interest. The edges (structural relationships and dependencies) of c-SCG between nodes (visual concepts) are fully connected at the beginning and then, during distillation, we select and only keep the important edges that best capture the learned behavior of the original network. After distillation, we obtain the c-SCG as global (class-wise) reasoning logic. Each c-SCG is bound to one class; for example, Figure 11.1 shows the c-SCG of the school bus, ambulance, zebra, and tiger. 11.2.2 Human improves network’s performance with HNI c11-sec:2.2 Through the human-to-network path (Figure 11.2, bottom), humans can modify c-SCGs with their knowledge and then transfer the knowledge back to the original network (Section 11.4.2). This process can improve network’s performance when the network was originally not able to learn generalizable and robust logic during training. This is quite common, especially when training data is scarce or biased. Scarce data can cause a distribution mismatch between the training and test sets, preventing the network from really ’understanding’ 205 the classes of interest. Bias in training data can mislead the network to focus on spurious patterns irrelevant to task objectives. We show below how human users can use the human-to-network path to improve the original network’s performance (Figure 11.2, bottom): (1) Human user modifies c-SCG: After understanding the neural network’s reasoning logic through the network-to-human path, users can verify if the decision logic aligns with their understanding. If not, they can actively correct the decision logic by updating the c-SCG (e.g., deleting a visual concept or modifying the structural relationship between concepts) efficiently. (2) Representing human-modified logic: We use the modified c-SCG as a template to automatically rebuild instance-level SCGs (I-SCGs) for images and train a new Graph Reasoning Network (GRN) with ground truth image labels. 
(3) Enabling the original network to learn human-modified logic: We propose partial knowledge distillation to transfer the logic of the GRN, which incorporates human knowledge and priors, back to the original neural network. We describe these three steps in detail in Section 11.4.2.

Table 11.1: Human improves a network's performance with HNI: experiments on six different image classification tasks (performance is tabulated as percent correct classification).

Dataset (# images)   Classes            # classes   Original   With HNI
Cats (2382)          modified classes   2           88.33      91.64 (+3.31)
                     all classes        12          93.06      93.61 (+0.55)
Cars (8144)          modified classes   2           83.33      91.67 (+8.34)
                     all classes        10          86.33      88.33 (+2.00)
Monkeys (1642)       modified classes   3           78.75      85.00 (+6.25)
                     all classes        10          90.00      93.61 (+3.61)
Flowers (22267)      modified classes   3           78.94      85.76 (+6.82)
                     all classes        10          82.37      85.57 (+3.20)
Fashion (16186)      modified classes   3           61.71      67.14 (+5.43)
                     all classes        6           73.79      74.37 (+0.58)
Buildings (5063)     modified classes   3           47.78      57.78 (+10.00)
                     all classes        17          53.73      54.51 (+0.78)

11.2.2.1 Experimental results on six image classification tasks

We first evaluate the performance of HNI on six different tasks: Cats, Cars, Monkeys, Flowers, Fashion products, and Buildings classification. The model was first trained on the original dataset for each task. We could then inspect the confusion matrix and find the most confused classes as the classes of interest whose logic we could modify. The network-to-human path shows the reasoning logic for those classes of interest with structural concept graphs. Then, humans modified the logic of the classes of interest, and the human-to-network path transferred the human reasoning logic back to the original model. Specifically, in each dataset, an inspection of the confusion matrix after the initial network training revealed the most confused classes (e.g., for Cars classification it was convertible vs. coupe; for Buildings classification it was Cornmarket, Hertford, and Oxford; for Flowers classification it was orchid, lily, and iris). We then asked humans to edit the corresponding c-SCGs efficiently: substitute a visual concept in the graph with another promising concept from the concept pool, or modify the structural relationships between concepts. It is important to note that humans only need to edit one graph per class, and the modifications automatically propagate to every image when the modified logic is transferred back to the original network via the human-to-network path. Humans do not need to modify each image in the dataset individually. Table 11.1 shows the performance before and after humans modified the reasoning logic through HNI to improve the original network's performance. Human modification improved classification accuracy for all modified classes (+3.31% to +10%). In addition, this process did not decrease accuracy on the unmodified classes (in fact, those also improved slightly, likely because they might also have sometimes been confused with the classes of interest). Overall, with typically ∼1 minute of human work per class to inspect and possibly correct one c-SCG per class, a significantly improved network was created that is a direct drop-in replacement for the original network (exactly identical structure as the original).
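The class-selection step described above is mechanical: after the initial training run, the most confused class pairs can be read directly off the confusion matrix. A minimal, self-contained sketch (NumPy only, illustrative rather than part of the HNI codebase) of how such pairs could be ranked:

```python
import numpy as np

def most_confused_pairs(conf_mat, class_names, top_k=5):
    """Rank off-diagonal entries of a confusion matrix to find class pairs
    worth inspecting with HNI. conf_mat[i, j] counts samples of true class i
    that were predicted as class j."""
    cm = np.asarray(conf_mat, dtype=float)
    rates = cm / cm.sum(axis=1, keepdims=True).clip(min=1)  # row-normalize
    np.fill_diagonal(rates, 0.0)                            # ignore correct predictions
    order = np.argsort(rates, axis=None)[::-1]              # largest confusion first
    rows, cols = np.unravel_index(order[:top_k], rates.shape)
    return [(class_names[i], class_names[j], rates[i, j]) for i, j in zip(rows, cols)]

# Toy example: 'convertible' is frequently mistaken for 'coupe'.
cm = [[80, 15, 5],
      [6, 90, 4],
      [2, 3, 95]]
print(most_confused_pairs(cm, ["convertible", "coupe", "pickup"], top_k=2))
```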
[Figure 11.3 panels: (a) ImageNet confusion matrix; (b1)/(b2) confusion matrices of the vehicle and mammal classes; (c1) accuracy before human modification (vehicles): vehicles 68.78, non-vehicles 69.79, all 69.77; (c2) accuracy before human modification (mammals): mammals 76.31, non-mammals 69.67, all 69.77; (d1)/(d2) human modification of the reasoning logic; (e1) accuracy after human modification (vehicles): vehicles 72.78 (+4.00), non-vehicles 69.95 (+0.16), all 69.93 (+0.16); (e2) accuracy after human modification (mammals): mammals 79.39 (+3.08), non-mammals 70.20 (+0.53), all 70.26 (+0.49). All numbers in %.]

Figure 11.3: Humans can improve a network's performance with HNI. We conduct large-scale experiments on the ImageNet dataset, which contains 1,000 real-world classes. (a) Confusion matrix of a 1,000-class original GoogleNet image classification network trained on ImageNet. Most of the errors are within each of 12 super-classes that correspond to groups of related classes (e.g., mammals, vehicles, birds, etc.). There are two main challenges: (1) How to correct the errors and improve accuracy within a super-class (local logic) with the help of human involvement? (2) How to maintain the performance of all other classes among the 1,000 classes? We show results of two large-scale experiments, for the super-classes of vehicles (23 classes, 23,000 training images in total) and mammals (13 classes, 13,000 images), to show how one can use HNI to improve performance within a super-class without degrading the performance of other classes. We first consider the super-class of vehicles (b1), with an original accuracy of 68.78% over these classes (c1). For each class, the network-to-human pass was used to show the reasoning logic of the original network as a c-SCG to a human operator (d1). The operator spotted and corrected any reasoning errors of the network. The human-to-network pass then distilled the human-modified logic back into the original network with the help of the graph neural network and partial knowledge distillation. Performance was improved on the vehicle classes, without degradation on non-vehicle classes (e1), demonstrating how humans could use their own knowledge to correct reasoning errors of the network and improve network accuracy. The same process is also shown for the super-class of mammals (b2, c2, d2, e2).

11.2.2.2 Experimental results on ImageNet classification tasks

We then evaluate the performance of an ImageNet-pretrained GoogleNet classifier [541] on the validation set of ILSVRC2012 (ImageNet) [112]. Figure 11.3(a) shows the confusion matrix (the misclassified samples). We re-order the class sequence based on the class hierarchy proposed in [43]. Specifically, the classes in the same red squares belong to the same super-class; for instance, vehicles, mammals, birds, etc. We use two experiments (the local logic of vehicles and of mammals) to show how humans can improve the performance of the network by modifying its reasoning logic. Most of the time, we do not need to modify the reasoning logic of all classes, but only of the classes of interest, which form a local logic. For instance, as different classes of vehicles are easily confused, we asked human users to help improve the vehicle classification performance (local logic) with our framework (Figure 11.3(b1) shows the confusion matrix of the 23 vehicle classes).
The goal is to improve the performance of the classes of interest (23 classes of vehicles) while avoiding performance degradation of other classes (Figure 11.3(c1)). We first use the network-to-human path of HNI to visualize the reasoning logic of the original network. A human user can then modify the concepts in question. Figure 11.3(d1) shows the c-SCG comparison before and after modification. We then train the GRN with the updated c-SCG (details in the Methods section). To transfer the human logic back to the original network, we use partial knowledge distillation (details in the Methods section). The test results of the modified network are shown in Figure 11.3(e1). We can see that the accuracy on vehicles improved by 4%, the non-vehicle classes improved slightly (we did not explicitly change their logic), and the overall performance also improved. These experimental results show that humans can accurately modify the reasoning logic of the classes of interest to improve the performance of the original network. We conducted a similar experiment on the local logic of mammals and found similarly improved results on both the classes of interest (mammals) and the other classes (non-mammals).

11.2.3 Zero-shot learning: Humans teach the network to learn new objects through HNI

Zero-shot learning [609] is a popular and challenging task where, at test time, a learner needs to classify samples from classes not seen during training. We introduce a novel zero-shot learning pipeline with the proposed Human-Network Interface (HNI). The high-level idea is that the understanding of a new object can be represented as a class-specific c-SCG, consisting of visual concepts (nodes) and concept relationships (edges). Our HNI allows humans to design new c-SCGs for new object categories, using existing primitive visual concepts (nodes) or relationships (edges) discovered from other classes. The new c-SCGs can then be distilled back into the original network, thereby guiding the original network to encode new object categories (i.e., zero-shot learning). While learning to recognize new classes, the original network will not "forget" the old classes. In our experiment, each of the 8 learned objects A, B, C, D, E, F, G, and H has 300 training images and 216 test images, while the new objects I, J, and K have only 216 test images each.

[Figure 11.4 panels: learned objects A-H and new objects I, J, K; (Step 1) the Visual Concept Extractor discovers shared visual concepts, which are matched to build SCGs from A-H images; (Step 2) humans analyze the relationship between I, J, K, the shared concepts and structures, and the learned objects A-H, and build custom SCGs for I, J, K to train a Graph Reasoning Network; (Step 3) partial knowledge distillation, using only A-H images; (Step 4) at inference time, the network obtains the ability to classify I, J, K; (Result) classification performance over objects A-K.]

Figure 11.4: Zero-shot learning: Human users teach the network to encode new objects with HNI. Section 11.2.3 provides an explanation of each step. The bottom panel shows the performance of zero-shot learning with HNI. The original ResNet-18 network (pretrained on ImageNet) trained with images of objects A-H cannot identify the new objects I, J, K in the test set. Humans can teach the ResNet-18 to encode and recognize the new objects I, J, K with HNI.
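The detailed workflow is given in the next paragraph and in Figure 11.4; as a preview of the composition idea behind Step 2 (building an instance-level SCG for a new class from parts of learned classes), a minimal sketch follows. All helper names and containers here are hypothetical illustrations rather than our implementation.

```python
def compose_novel_iscg(node_sources, edge_source, sample_iscg):
    """Assemble an instance-level SCG for a new class by borrowing parts from
    learned classes. `node_sources` maps each concept slot of the new class to
    a learned class that shares that concept (e.g., {1: 'A', 2: 'E', 3: 'G'});
    `edge_source` is a learned class with a similar overall structure (e.g., 'D');
    `sample_iscg(cls)` returns a randomly drawn I-SCG of a learned class, with
    `.nodes` (slot -> node) and `.edges` (list with .src/.dst attributes)."""
    new_nodes, new_edges = {}, []
    for slot, learned_class in node_sources.items():
        donor = sample_iscg(learned_class)
        new_nodes[slot] = donor.nodes[slot]      # reuse the shared concept node
    structural_donor = sample_iscg(edge_source)
    for edge in structural_donor.edges:
        if edge.src in new_nodes and edge.dst in new_nodes:
            new_edges.append(edge)               # copy the analogous structure
    return new_nodes, new_edges
```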
Please refer to Figure 11.4 for the detailed workflow: Step 1: network-to-Human: We train a classifier for 8 objects A to H (Figure 11.4 Step 1). We use VCE to discover the visual concepts and find the shared concepts. We match the shared visual concepts from training images and form image-level SCGs (I-SCG) for all training images of objects A to H. 210 Step 2: Human-to-network: Building new c-SCGs for new objects I, J, K and training new GRN: (Figure 11.4 Step 2) It is straightforward for humans to learn about a new class as they can relate the patterns and components on the new class to those they have seen in the past. We try to implement a similar mechanism here in describing the new class with SCG to GRN. We construct novel I-SCG training instances with visual concepts and relationship from learned classes, in an automated fashion. For instance, to form a I-SCG for new object I, we know some of its components are overlapping with objects A to H. Hence we randomly sample one I-SCG from object A and use its concept 1 as the node of concept 1 in I’s I-SCG. Similarly, we obtain I’s nodes of concept 2 and 3 by randomly sampling I-SCGs of objects E and G. To construct I-SCG edges for object I, we sample I-SCGs of D and form the edges of I based on their similar structures. Building I-SCG for new objects J and K are similar. We form a new I-SCG training set by adding the novel I-SCGs of I,J,K into the original training set and then train a GRN that can classify objects A to H and I, J, K. Step 3: Transfer knowledge from GRN back to original network: (Figure 11.4 Step 3) we use knowledge distillation to transfer the knowledge about new object I,J,K from GRN back to the original network. In this process, we only use the images of A to H as the training set, and only use the soft label form GRN without any hard label to avoid bias toward old classes. Step 4: Network learns the knowledge to encode new objects without forgetting the knowledge about old classes (Figure 11.4 Step 2). Figure 11.4 bottom result shows the performance of zero-shot learning with our HNI. We will make the OBJECT dataset which we created and used here public. 11.3 Discussion chap-11-sec:3 We showed that HNI addresses three key challenges in human and Neural Network (network) knowledge exchange. The first is "what is the language to communicate". The interface needs a "language" to represent the reasoning logic and knowledge in a manner comprehensible to both humans and networks. We propose using Structural Concept Graphs (SCGs) to represent the network’s reasoning logic in a format that is 211 understandable by humans. We demonstrate that the network-to-Human path employs SCGs to offer local and global explanations of the network’s reasoning logic, making it accessible and clear to human users: (1) Class-specific Structural Concept Graphs (c-SCGs) elucidate the network’s understanding of each class by highlighting the critical visual concepts employed during decision-making and the relationships between them (global explanation). The local explanation (image-wise) addresses "Why" and "Why not" questions, revealing the network’s reasoning logic behind its decisions. Furthermore, the Network-to-Human path can impart new knowledge to humans by demonstrating the network’s reasoning process. The second challenge is "how can humans share their knowledge with the network". 
Unlike traditional interaction methods where humans participate in the network’s training process by gathering new data and retraining the network, our Human-to-Network path enables humans to directly modify the c-SCG with their knowledge, thereby altering the network’s reasoning logic. We demonstrate that using Graph Reasoning Network and Partial knowledge distillation, humans could improve the performance of the original network for a local logic of interest, without degrading performance on the unmodified classes. Notably, human modification of c-SCG is highly efficient, because one only needs to inspect one graph with a few nodes and edges for each class. This efficiency significantly reduces time and monetary costs compared to traditional methods, and, importantly, it also offers the intuitive interpretability advantage of our HNI. The Third challenge is "how humans can teach networks new knowledge". For instance, humans may want to teach networks about novel classes. Compared with mainstream zero-shot learning settings ([181, 154, 609]), which provide attribute descriptions for images (e.g., stripes, horse-like shape, big four-legged animal), our method considers a more general setting and we make no assumption on the availability of attribute labels. Instead, we rely on the unsupervised mining of primitive concepts from the training dataset, without requiring attribute or concept labels. By combining different subsets of these learned primitive concepts and varying the structural relationships between them, we can use GCN to represent novel classes and eventually guide the network to learn to encode them by HNI. While our method has limitations, particularly when a 212 new class cannot be easily represented by the learned primitive visual concepts, it offers flexibility in defining new structural or spatial relationships based on learned relationships. This strength makes our method more extensible and capable of encoding novel objects with fewer assumptions, not being confined to a given list of attributes for describing relationships or structures of components and parts of novel objects. 11.4 Methods: Human-network Interface c11-sec:4 Our proposed Human-network Interface (HNI) to bridge the interaction between human and neural networks is visually summarized in Figure 11.2. There are two main paths. 11.4.1 Network-to-Human c11-sec:3-1 The network-to-human path explains the network’s reasoning logic, the understanding of the network for each class, represented as class-specific Structural Concept Graph (c-SCG). Each c-SCG is bound to one class (Figure 11.2 shows the c-SCG of school bus), where the nodes represent the important visual concepts that the original network considered most important in identifying the class of interest, and edges represent the pairwise structural relationships (dependencies) between concepts. As shown in Figure 11.2 (top), given a trained network, there are two main steps to explain the reasoning logic of the network to human users: (1) Using Visual Concept Extractor (VCE) to discover representative visual concepts for each class of interest. The detailed procedure follows [171]: To discover concepts for each class, we collect 50 to 100 images of the class. We first use top-down gradient attention (Grad-Cam [495]) to constrain the relevant regions for concept proposals to the foreground segments, thereby ruling out irrelevant background patterns for this class. 
Then we follow the same workflow as the ACE paper [187]: multi-resolution segmentation, feature extraction, clustering patches in latent space to obtain the concept candidates, and sorting these concept candidates based on an importance score, similar to [287]. After that, we obtain the concept pool (each concept is represented by one mean feature vector, sorted by importance score) for each class, which 213 will serve as a source of concept candidates when human users modify concepts (nodes) for c-SCG. To build a c-SCG that reveals the reasoning logic for each class of interest of the original network, we select the top k (k=4 in our experiments) important concepts and their mean feature vectors as nodes, as well as edges between them. Figure 11.1 shows the c-SCG which represents the understanding of the original network to each class. Each directed edge in an SCG edgeji = (vj , vi) has two attributes: 1) representation of the spatial structural relationship between nodes; 2) dependency eji (a trainable scalar) between concepts vi , vj . The edge features can reveal the importance of interactions between visual concepts crucial for the final decision. c-SCG extracts the relationship between visual concepts during training (next step) and incorporates them as edge features in c-SCG. By showing c-SCG for each class of interest, the network-to-human path provides easy-to-understand insights for human users on the reasoning process of the network, which is also a foundation for the Human-to-network path. The edges (structural relationship and dependency) are fully connected at the beginning, and then, using the learning in the following step 2, we select and only keep the important edges. (2) Using Graph Reasoning Network (GRN) to mimic the decision-making process of the original network with knowledge distillation. Specifically, we distill the reasoning logic of the original network into a graph-based network, GRN, which uses more interpretable visual concepts. As a Graph Neural Network (GNN) based network, GRN takes a graph as input but has a similar decision as the original network (CNN). During knowledge distillation, To train GRN based on the built c-SCGs (one for each of n classes of interest), we need to establish connections between training images and the c-SCGs. The whole pipeline is similar to Figure 11.5, while the c-SCG is not human-modified. Also, the ground truth is the original network prediction (knowledge distillation). To this end, we create, for each training image I, a set of up to n image-level structural concept graphs (I-SCGs). Each I-SCG is computed from both the training image I and one of the n c-SCGs: Given the input image I, we use multi-resolution segmentation SLIC [5] to break the image into patches, which become the concept candidates (similar to Figure 11.5 step 1). In the concept matching step 214 (similar to Figure 11.5 step 2), for each class of interest c, we match features of the segmented patches to the stored anchor representation (i.e., mean feature vectors) of top k concepts deemed important for c using a similarity metric (e.g. Euclidean distance). When at least one of the top k concepts of class c is detected in image I, an I-SCG for class c will be constructed based on the template from the c-SCG of class c. Here I-SCG uses patch features instead of the concept anchor features as node features and we calculate the edge features based on the spatial relationship between detected concepts in image I. 
This way, we can generate up to n I-SCGs for the input image I considering all n classes of interest. GRN takes the I-SCGs as input, and we use knowledge distillation to transfer the decision-making logic of the original network to GRN. Using SCG to effectively reveal the original network’s reasoning logic effectively has been validated with extensive experiments in [171], in which the authors evaluated the logical consistency and faithfulness between SCG explanation and the original network. 11.4.2 Human-to-network c11-sec:3-2 Human-to-network path transfers human’s knowledge to network, in order to improve the original network’s performance and generalizability. There are three main steps: (1) User modifies c-SCG (Section 11.4.2.1): after understanding the network’s reasoning logic with network-to-human path, users can verify whether the decision logic is reasonable or consistent with their understandings. If not, human users are able to actively correct the decision logic by updating the c-SCG (e.g., deleting a visual concept and changing the structural relationship between concepts) efficiently. (2) To represent human-modified logic, we use the modified c-SCG as a template to automatically rebuild I-SCGs for images and train a new Graph Reasoning Network (GRN)(Section 11.4.2.2), with ground truth image labels. (3) To let the original network learn human-modified logic, we propose partial knowledge distillation (Section 11.4.2.3) to transfer the logic of GRN, which has incorporated the knowledge and prior from human users, back to the original network. We describe the three steps in detail in the following subsections. 215 Concept match based on human modified c-SCG … … ejifire engine ejiambulance ejischool bus Concatena te MLP … Multi-resolution segmentation fire engine? ambulance? school bus? Groun d Truth Cross Entropy Loss Apply important edges modified by human to build I-SCG … Human Modified c-SCG Graph Neural Network Figure 11.5: Pipeline of training Graph Reasoning Network with Human modified c-SCG. Given input I, we conduct multi-resolution segmentation and concept match based on human-modified c-SCG. In the concept matching step, we attempt to match the c-SCG of each class of interest to the concepts extracted from the current input image. Color circles represent the matched concepts for each class of interest. Black dummy nodes denote undetected concepts. For example, for the input image shown, all concepts for the Fire Engine class were matched, but only 2 concepts of Ambulance could be found in the image, and only 1 concept of School Bus. Subsequently, the GRN aggregates all matched concepts and uses those to support its predictions. (Section 11.4.2.2) chap-11-fig:5 11.4.2.1 Human modifies c-SCG c11-sec:4.2.1 The Network’s understanding of any specific class can be shown as a single c-SCG: the nodes (visual concepts) represent the crucial visual evidence or clue for network to identify this class; the edges encode the structure relationships and dependency between concepts. After understanding the meaning of c-SCG, human users can then intuitively make modifications to c-SCGs (e.g., removing the incorrect nodes/edges) based on their knowledge or other priors, in order to improve the network’s performance. There are two main types of modification: nodes and edges, corresponding to changing the visual concepts and the relationships between concepts respectively. Figure 11.2 bottom path and Figure 11.3 (d1, d2) show examples of human users modifying c-SCG. 
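The two kinds of edits described next operate directly on this small per-class graph. As an illustration only (reusing the hypothetical containers sketched earlier, with nodes, edges, anchor, src, and dst fields), a node substitution and an edge removal could look like:

```python
def substitute_concept(c_scg, concept_id, new_anchor, new_importance):
    """Node modification: swap a non-causal concept (e.g., background foliage)
    for a better candidate chosen from the VCE's concept pool. Only the mean
    feature vector (anchor) changes; every I-SCG rebuilt from this template
    will then match the new concept instead."""
    node = c_scg.nodes[concept_id]
    node.anchor = new_anchor
    node.importance = new_importance
    return c_scg

def remove_edge(c_scg, src_id, dst_id):
    """Edge modification: drop a dependency that reflects a dataset bias
    rather than a stable real-world relationship."""
    c_scg.edges = [e for e in c_scg.edges
                   if not (e.src == src_id and e.dst == dst_id)]
    return c_scg
```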
Node (concept) modification: human users can easily identify non-casual concepts in c-SCG. Figure 11.3(d1) shows an example of node modification. In some cases, nodes may be irrelevant to the class-of-interest (e.g., 216 a background object always appears together with the object of interest; e.g., see branch and fire engine), or not representative/unique (e.g., an object part that is common among many classes; see the wheel in different vehicle classes). To substitute these two concepts with more representative and discriminative ones, human users can go back to the concept pool extracted by the VCE in the network-to-human path and select better visual concepts to improve the c-SCG (Figure 11.3(d1, d2)). Edge (concept relationship) modification: Edges shown by c-SCG are the important dependencies selected based on the values of eji. Humans can modify them to remove non-stable or independent relationships between concepts. This may happen when substantial biases exist in training, when network may discover stable and dependent relationships between concepts which in fact do not always hold in real-world scenarios (e.g., the relative position of a cheetah’s body and tail in Figure 11.3(d2)). Human users can remove this edge on the c-SCG to correct the bias. In practice, modifications of nodes and edges can happen simultaneously to handle more complex situations. Note that to modify the decision reasoning logic for one class, human users only need to modify the corresponding c-SCG once: e.g., they can substitute one concept with another by changing the mean vector, or add/delete edges. They do not need to modify image-level I-SCG for every training image. After updating the c-SCG, our framework automatically applies this modification to all image-level I-SCGs. Human modification of c-SCGs is the first step in the human-to-network path, where our framework provides an intuitive way for human users to convert their knowledge, common sense, and priors into a description in the same language that our framework uses. Next, we will show how the knowledge of human users can be transferred back to the network in the following subsections. 11.4.2.2 Training Graph Reasoning Network (GRN) with Human’s Logic c11-sec:4.2.2 Typically, the set SI of classes that require human intervention is a subset of the set of all S classes (SI ⊂ S). This setting is flexible and efficient: no matter how many classes the original network can predict (e.g., 1,000 classes in ImageNet), users may only want to modify the logic of a small subset of classes in question (e.g., 217 some vehicles are easily confused with each other). In this case, we build a GRN that targets the logics of these classes only, which is more efficient to users. For each class of interest c ∈ SI , the network-to-human path reveals its reasoning logic c-SCGc network. After human’s analysis and modification, some classes may have updated c-SCGs, c-SCGc H after incorporating human user’s knowledge. c-SCG as a class representation cannot produce final decisions by itself; hence we reuse the GRN from network-to-Human path to infer a prediction. Figure 11.5 shows the pipeline of training GRN with the c-SCG updated by human users. Given an input image I, the first two steps of the processing (Multi-resolution segmentation and Concept match to build I-SCG) are the same as the GRN training in network-to-human path, while the objective is different. Here our goal is to obtain better performance by incorporating human user knowledge. 
The matched I-SCGs go through the graph convolution backbone and MLP in the GRN and finally predict the image label, with cross-entropy loss as the objective function. The trained GRN can then produce decisions based on the human-corrected reasoning logic, because the input I-SCGs are derived from c-SCG^c_H.

11.4.2.3 Transfer Reasoning Logic to the Network with Partial Knowledge Distillation

The last step is to transfer the user-updated reasoning logic and knowledge in the GRN back to the original network. To avoid catastrophic forgetting and a negative impact on the classification performance of other classes, we developed partial knowledge distillation as a new knowledge transfer method. Figure 11.6 illustrates the process of partial knowledge distillation, which transfers the human user's knowledge from the GRN back to the original network. As described in Section 11.4.2.2, the modified classes S_I are a subset of all classes S. For the set of unmodified classes S_U = S \ S_I, we want to maintain their reasoning logic while we update that of S_I in the original network. Hence, two teacher models provide soft labels together: the GRN Net_T1 provides the probabilities of the modified classes, and the original network Net_T2, with fixed parameters, provides the probabilities of the unmodified classes. The student model Net_S shares the same architecture as the original network and is initialized with the weights of the original network.

Figure 11.6: The pipeline of partial knowledge distillation. Different from traditional knowledge distillation, partial knowledge distillation adopts two teachers with different expertise: the GRN (teacher 1) focuses on the classes of interest (6 classes in this example), and the fixed original network (teacher 2) focuses on the remaining classes (the classes we do not want to change; 14 classes in this example). After distillation with different temperatures and concatenation, we can use both soft labels and hard labels to train the student model.

Formally, the overall loss during partial knowledge distillation is:

L = \alpha L_{soft} + \beta L_{hard}   (11.1)

where \alpha and \beta are the weightings of the two terms during distillation. For the soft-label term:

L_{soft} = - \sum_{c=1}^{N} \hat{p}^{T_T}_c \log\big(q^{T_S}_c\big), \qquad q^{T_S}_c = \frac{\exp(z_c / T_S)}{\sum_{k=1}^{N} \exp(z_k / T_S)}   (11.2)

where \hat{p}^{T_T}_c denotes the probability of class c in the combined soft label with temperature T_T from the two teacher models, and q^{T_S}_c denotes the probability of class c in the student prediction vector with temperature T_S. N = |S| denotes the number of classes in the original network, and z_c denotes the logits (un-normalized predictions) of Net_S. The combined soft label \hat{p}^{T_T} is the combination of two soft labels, p^{T_{T1}} and p^{T_{T2}}, from the two teacher models Net_T1 and Net_T2. p^{T_{T2}} is a vector of length N (p^{T_{T2}} \in \mathbb{R}^N), while p^{T_{T1}} is a vector of length n (p^{T_{T1}} \in \mathbb{R}^n), where n = |S_I| denotes the number of classes in the GRN, which is also the number of modified classes; v^1_c and v^2_c denote the logits of the teacher models Net_T1 and Net_T2, respectively:

\hat{p}^{T_T}_c =
\begin{cases}
p^{T_{T1}}_c \cdot Pr, & c \in S_I \\
p^{T_{T2}}_c, & c \in S \setminus S_I
\end{cases}
\quad \text{s.t.} \quad
p^{T_{T1}}_c = \frac{\exp(v^1_c / T_{T1})}{\sum_{k=1}^{n} \exp(v^1_k / T_{T1})}, \qquad
p^{T_{T2}}_c = \frac{\exp(v^2_c / T_{T2})}{\sum_{k=1}^{N} \exp(v^2_k / T_{T2})}   (11.3)

in which Pr \in (0, 1]. To obtain the combined soft label \hat{p}^{T_T}_c, we first compute the sum of the probabilities of all classes of interest in p^{T_{T2}}, Pr = \sum_{c \in S_I} p^{T_{T2}}_c, which represents the probability proportion of the n modified classes with respect to all N classes in the original network Net_T2. We then replace the value of each class of interest in p^{T_{T2}} with the scaled value from p^{T_{T1}} to form the combined soft label. The predictions of the teachers may be erroneous, so we use ground-truth labels as hard labels to provide a stronger constraint on Net_S and correct these errors from the teacher models:

L_{hard} = - \sum_{c=1}^{N} g_c \log\big(q^{1}_c\big)   (11.4)

where g_c denotes the ground-truth label for class c, and q^{1}_c is the probability of class c in the student prediction vector under temperature 1. Equation 11.1 transfers knowledge from the GRN back to the original network.
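The combination rule in Equation 11.3 is the step that is easiest to get wrong in an implementation, so a compact sketch may help. The code below illustrates Equations 11.2-11.3 in PyTorch-style pseudocode (tensor names, temperatures, and the assumption that the GRN's class ordering matches interest_idx are ours, not taken from a released implementation):

```python
import torch

def combined_soft_label(grn_logits, orig_logits, interest_idx, t_t1=4.0, t_t2=4.0):
    """grn_logits: (n,) logits of teacher 1 (GRN) over the n modified classes;
    orig_logits: (N,) logits of teacher 2 (frozen original network) over all N classes;
    interest_idx: LongTensor with the n class indices of S_I inside the N classes,
    in the same order as the GRN's output classes."""
    p1 = torch.softmax(grn_logits / t_t1, dim=-1)   # p^{T_T1}, length n
    p2 = torch.softmax(orig_logits / t_t2, dim=-1)  # p^{T_T2}, length N
    pr = p2[interest_idx].sum()                     # Pr: probability mass of S_I under teacher 2
    combined = p2.clone()
    combined[interest_idx] = p1 * pr                # rescale teacher 1 into that mass
    return combined                                 # \hat{p}^{T_T}; still sums to 1

def soft_loss(student_logits, soft_target, t_s=4.0):
    """Soft-label term of Eq. 11.2: cross-entropy with the combined soft label."""
    log_q = torch.log_softmax(student_logits / t_s, dim=-1)
    return -(soft_target * log_q).sum()
```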
To summarize, we use the Graph Reasoning Network (GRN) in both the network-to-human path and the human-to-network path (Figure 11.2), with the same structure. In the network-to-human path, the GRN simulates the reasoning logic of the original network via knowledge distillation. In the human-to-network path, after human users modify the c-SCGs for the classes of interest, the GRN uses the modified c-SCGs to automatically derive I-SCGs for each image, which are then used to train the GRN with ground-truth labels (Figure 11.5). Once the training of the GRN is done, we transfer the knowledge of the GRN back to the original network, for which we proposed partial knowledge distillation, where the newly trained GRN becomes the teacher model and trains the original network using the GRN's outputs as soft labels (Figure 11.6).

Chapter 12
Contributions of Shape, Texture, and Color in Visual Recognition

We investigate the contributions of three important features of the human visual system (HVS) — shape, texture, and color — to object classification. We build a humanoid vision engine (HVE) that explicitly and separately computes shape, texture, and color features from images. The resulting feature vectors are then concatenated to support the final classification. We show that HVE can summarize and rank-order the contributions of the three features to object recognition. We use human experiments to confirm that both HVE and humans predominantly use some specific features to support the classification of specific classes (e.g., texture is the dominant feature to distinguish a zebra from other quadrupeds, both for humans and HVE). With the help of HVE, given any environment (dataset), we can summarize the most important features for the whole task (task-specific; e.g., color is the most important feature overall for classification with the CUB dataset), and for each class (class-specific; e.g., shape is the most important feature to recognize boats in the iLab-20M dataset). To demonstrate the further usefulness of HVE, we use it to simulate the open-world zero-shot learning ability of humans with no attribute labeling. Finally, we show that HVE can also simulate the human imagination ability by combining different features.

[Figure 12.1 panels: (a) the Humanoid Vision Engine, asking which of shape, texture, or color contributes most (Q: to distinguish horse and zebra? A: texture; Q: to distinguish zebra and zebra car? A: shape); (b) contribution attribution, with iLab-20M at shape 35%, texture 32%, color 33%, and CUB at shape 21%, texture 21%, color 58%.]
Figure 12.1: (a) Contributions of shape, texture, and color may differ across scenarios and tasks. Here, texture is most important to distinguish zebra from horse, but shape is most important for zebra vs. zebra car. (b) The Humanoid Vision Engine takes a dataset as input and summarizes how shape, texture, and color contribute to the given recognition task in a purely learned manner (e.g., in ImageNet classification, shape is the most discriminative feature and contributes most to visual recognition).

12.1 Introduction

The human vision system (HVS) is the gold standard for many current computer vision algorithms on various challenging tasks: zero/few-shot learning [412, 310, 539, 436, 520], meta-learning [19, 285], continual learning [489, 548, 598], novel view imagination [673, 165], etc. Understanding the mechanism, function, and decision pipeline of the HVS is becoming more and more important. The vision systems of humans and other primates are highly differentiated. Although the HVS provides us with a unified picture of the world around us, this picture has multiple facets or features, like shape, depth, motion, color, texture, etc. [163, 203]. To understand the contributions of the three most important features — shape, texture, and color — to visual recognition, some research compares the HVS with artificial convolutional neural networks (CNNs). A widely accepted intuition about the success of CNNs on perceptual tasks is that CNNs are the most predictive models for human ventral-stream object recognition [63, 621]. On the question of which feature is more important for CNN-based recognition, recent work shows promising results: ImageNet-trained CNNs are biased towards texture, while increasing shape bias improves accuracy and robustness [336]. Given the superb success of the HVS on various complex tasks [412, 19, 489, 673, 171], human bias may also represent the most efficient way to solve vision tasks, and it is likely task-dependent (Figure 12.1). Here, inspired by the HVS, we wish to find a general way to understand how shape, texture, and color contribute to a recognition task by pure data-driven learning. The summarized feature contribution is important both for the deep learning community (to guide the design of accuracy-driven models [336, 180, 162, 54]) and for the neuroscience community (to understand the contributions or biases in human visual recognition) [407, 576]. It has been shown by neuroscientists that there are separate neural pathways to process these different visual features in primates [16, 118]. Among the many kinds of features crucial to visual recognition in humans, shape is the one that we primarily rely on in static object recognition [163]. Meanwhile, some previous studies show that surface-based cues also play a key role in our vision system. For example, [178] shows that scene recognition is faster for color images than for grayscale ones, and [428, 421] found a special region in our brain for analyzing textures. In summary, [67, 66] propose that shape, color, and texture are three separate components used to identify an object. To better understand the task-dependent contributions of these features, we build a Humanoid Vision Engine (HVE) to simulate the HVS by explicitly and separately computing shape, texture, and color features to support image classification in an objective learning pipeline.
HVE makes the following key contributions: (1) Inspired by the specialized separation of the human brain across different features [16, 118], for each feature among shape, texture, and color, we design a specific feature extraction pipeline and representation learning model. (2) To summarize the contribution of features by end-to-end learning, we design an interpretable humanoid neural network (HNN) that aggregates the learned representations of the three features and achieves object recognition, while also showing the contribution of each feature during the decision. (3) We use HVE to analyze the contribution of shape, texture, and color on three different tasks subsampled from ImageNet. We conduct human experiments on the same tasks and show that both HVE and humans predominantly use some specific features to support object recognition of specific classes. (4) We use HVE to explore the contribution, relationship, and interaction of shape, texture, and color in visual recognition. Given any environment (dataset), HVE can summarize the most important features (among shape, texture, and color) for the whole task (task-specific) and for each class (class-specific). To the best of our knowledge, we provide the first fully objective, data-driven, and indeed first-order, quantitative measure of the respective contributions. (5) HVE can help guide accuracy-driven model design and serves as an evaluation metric for model bias. As further applications, we use HVE to simulate the open-world zero-shot learning ability of humans, which needs no attribute labels, and to simulate human imagination ability across features.

12.2 Related Works

In recent years, more and more researchers have focused on the interpretability and generalization of computer vision models such as CNNs [514, 220] and vision transformers [126]. For CNNs, many researchers have explored what kind of information is most important for models to recognize objects. Some papers show that CNNs trained on ImageNet are more sensitive to texture information [180, 162, 54]. However, these works do not quantitatively explain the contributions of shape, texture, and color as different features, comprehensively across various datasets and situations. While most recent studies focus on the biases of neural networks, exploring the biases of humans, or of a humanoid learning manner, remains under-explored and inspiring. In addition, many researchers contribute to the generalization of computer vision models and focus on zero/few-shot learning [412, 310, 568, 520, 170, 88], novel view imagination [673, 165, 172], open-world recognition [36, 271, 263], etc. Some of them tackle these problems by feature learning, representing an object by different features, and have made significant progress in this area [552, 427, 673]. However, there is still no clear definition of what these properties look like, nor a uniform design of a system that can perform humanoid tasks such as generalized recognition and imagination.

12.3 Humanoid Vision Engine

The goal of the humanoid vision engine (HVE) is to summarize the contribution of shape, texture, and color in a given task (dataset) by separately computing the three features to support image classification, similar to how humans recognize objects.

Figure 12.2: Pipeline of the humanoid vision engine (HVE). (a) How the human vision system deals with an image: after the eyes perceive the object, different parts of the brain are activated, and the brain organizes and summarizes that information to reach a conclusion. (b) How we design HVE to correspond to each part of the human vision system (entity segmentation and saliency map; depth estimation for the shape encoder, texture patches for the texture encoder, phase scrambling for the color encoder; interpretable decision module for contribution attribution).
During the pipeline and model design, we borrow the findings of neuroscience on the structure, mechanism, and function of HVS [16, 118, 163, 178, 428, 421]. We use end-to-end learning with backpropagation to simulate the learning process of humans and to summarize the contribution of shape, texture, and color. The advantage of end-to-end training is that we avoid human bias, which may influence the objectivity of contribution attribution (e.g., we avoid handcrafted elementary shapes as done in Recognition by Components [42]). We only use data-driven learning, a straightforward way to understand the contribution of each feature from an effectiveness perspective, and we can easily generalize HVE to different tasks (datasets). As shown in Figure 12.2, HVE consists of (1) a humanoid image preprocessing pipeline, (2) feature representations for shape, texture, and color, and (3) a humanoid neural network that aggregates the representation of each feature and achieves interpretable object recognition.

12.3.1 Humanoid Image Preprocessing and Feature Extraction

As shown in Figure 12.2 (a), humans (or primates) can localize an object intuitively in a complex scene before recognizing what it is [272]. Also, there are different types of cells or receptors in our primary visual cortex extracting specific information (like color, shape, texture, shading, motion, etc.) from the image [163]. In our HVE (Figure 12.2 (b)), for an input raw image I ∈ R^{H×W×C}, we first parse the object from the scene as preprocessing and then extract our defined shape, texture, and color features Is, It, Ic for the following humanoid neural network.

Image Parsing and Foreground Identification. As shown in the preprocessing part of Figure 12.2 (b), we use the entity segmentation method [429] to simulate the process of parsing objects from a scene in our brain. Entity segmentation is an open-world model and can segment objects from an image without labels. This method aligns with human behavior, which can (at least in some cases, e.g., autostereograms [272]) segment an object without deciding what it is. After we obtain the segmentation of the image, we use a pre-trained CNN and Grad-CAM [495] to find the foreground object among all masks. We design three different feature extractors after identifying the foreground object segment, one each for shape, texture, and color, similar to the separate neural pathways in the human brain that focus on specific properties [16, 118]. The three extractors focus only on their corresponding features, and the extracted features, shape Is, texture It, and color Ic, are disentangled from each other.

Shape Feature Extractor. For the shape extractor, we want to keep both 2D and 3D shape information while eliminating texture and color information. We first use a 3D depth prediction model [442, 441] to obtain the 3D depth information of the whole image. After element-wise multiplying the 3D depth estimation and the 2D mask of the object, we obtain our shape feature Is. Note that this feature only contains 2D shape and 3D structural information (the 3D depth) and no color or texture information (Figure 12.2(b)).
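A minimal sketch of the shape feature described above, assuming a monocular depth map and a binary foreground mask are already available from the depth and segmentation models (the exact models, normalization, and mask format in the real pipeline may differ):

```python
import numpy as np

def extract_shape_feature(depth_map: np.ndarray, object_mask: np.ndarray) -> np.ndarray:
    """Keep 2D silhouette + 3D structure, discard color and texture.

    depth_map:   (H, W) per-pixel depth estimate from a monocular depth model.
    object_mask: (H, W) binary foreground mask from entity segmentation.
    """
    # Element-wise product: depth values survive only inside the object
    # silhouette, so the result carries shape/structure but no appearance.
    shape_feature = depth_map * object_mask.astype(depth_map.dtype)

    # Optional (assumption): normalize depth inside the mask so the feature
    # is less sensitive to the object's absolute distance from the camera.
    inside = shape_feature[object_mask > 0]
    if inside.size > 0:
        shape_feature[object_mask > 0] = (inside - inside.min()) / (inside.ptp() + 1e-8)
    return shape_feature
```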
Texture Feature Extractor. For the texture extractor, we want to keep both local and global texture information while eliminating shape and color information. Figure 12.3 visualizes the extraction process. First, to remove color information, we convert the RGB object segment to a grayscale image. Next, we cut this image into several square patches with an adaptive strategy (the patch size and location adapt to the object size to cover more texture information). If the overlap ratio between a patch and the original 2D object segment is larger than a threshold τ, we add that patch to a patch pool (we set τ to 0.99 in our experiments, meaning that over 99% of the area of the patch belongs to the object). Since we want to extract both local (one patch) and global (whole image) texture information, we randomly select 4 patches from the patch pool and concatenate them into a new texture image It.

Figure 12.3: Pipeline for extracting the texture feature: (a) crop images and compute the overlap ratio between the 2D mask and patches; patches with overlap > 0.99 are shown shaded in green. (b) Add the valid patches to a patch pool. (c) Randomly choose 4 patches from the pool and concatenate them to obtain a texture image It.

Color Feature Extractor. To represent the color feature of I, we use phase scrambling, which is popular in psychophysics and in signal processing [410, 546]. Phase scrambling transforms the image into the frequency domain using the fast Fourier transform (FFT). In the frequency domain, the phase of the signal is then randomly scrambled, which destroys shape information while preserving color statistics. We then use the inverse FFT to transfer back to image space and get Ic ∈ R^{H×W×C}. Ic and I have the same distribution of pixel color values (Figure 12.2(b)).

12.3.2 Humanoid Neural Network

After preprocessing, we have three features, i.e., shape Is, texture It, and color Ic of an input image I. To simulate the separate neural pathways in the human brain for different feature information [16, 118], we design three feature representation encoders for shape, texture, and color, respectively. The shape feature encoder Es takes the 3D shape feature Is as input and outputs the shape representation Vs = Es(Is). Similarly, the texture encoder Et and color encoder Ec take the texture patch image It or the phase-scrambled color image Ic as input; after embedding by Et (or Ec), we get the texture feature Vt and color feature Vc. We use ResNet-18 [220] as the backbone for all feature encoders to project the three types of features to corresponding well-separated embedding spaces. It is hard to define a ground-truth label for the distance between features. Given that objects from the same class are relatively consistent in shape, texture, and color, the encoders can instead be trained independently on the classification problem, with the supervision of class labels. After training our encoders as classifiers, the feature map of the last convolutional layer serves as the final feature representation. To aggregate the separated feature representations and conduct object recognition, we freeze the three encoders and train a contribution-interpretable aggregation module Aggr_θ, which is composed of two fully-connected layers (Figure 12.2 (b) right). We concatenate Vs, Vt, Vc and send the result to Aggr_θ. The output is denoted as p ∈ R^n, where n is the number of classes. So we have p = Aggr_θ(concat(Vs, Vt, Vc)).
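A minimal sketch of the aggregation module described above, assuming the three frozen ResNet-18 encoders have already been reduced to fixed-length vectors Vs, Vt, Vc (the dimensions and layer sizes below are illustrative, not the exact configuration used in the experiments):

```python
import torch
import torch.nn as nn

class Aggregator(nn.Module):
    """Two fully-connected layers on top of the concatenated
    (frozen) shape / texture / color representations."""
    def __init__(self, feat_dim: int = 512, hidden_dim: int = 512, n_classes: int = 12):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, v_shape, v_texture, v_color):
        # p = Aggr_theta(concat(Vs, Vt, Vc))
        return self.mlp(torch.cat([v_shape, v_texture, v_color], dim=-1))

# Usage sketch: only the aggregator is trained; the three encoders stay frozen.
# logits = Aggregator()(Vs, Vt, Vc)
```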
We also propose a gradient-based contribution attribution method to interpret the contributions of shape, texture, and color to the classification decision. Taking the shape feature as an example, given a prediction p and the probability of class k, namely p^k, we compute the gradient of p^k with respect to the shape feature V_s. We define these gradients as the feature importance weights:

\alpha^k_s = \frac{\partial p^k}{\partial V_s}, \quad \alpha^k_t = \frac{\partial p^k}{\partial V_t}, \quad \alpha^k_c = \frac{\partial p^k}{\partial V_c}.

Then we take the element-wise product between V_s and \alpha^k_s and aggregate it to get the final shape contribution S^k_s, i.e.,

S^k_s = \mathrm{ReLU}\Big(\sum \alpha^k_s \odot V_s\Big).

In other words, S^k_s represents the "contribution" of the shape feature to classifying this image as class k. We do the same to get the texture contribution S^k_t and the color contribution S^k_c. After getting the feature contributions for each image, we calculate the average over all images of a class to assign feature contributions to that class (class-specific bias) and the average over all classes to assign feature contributions to the whole dataset (task-specific bias).

12.4 Experiments

In this section, we first show the effectiveness of the feature encoders for representation learning (Section 12.4.1); then we show the contribution interpretation performance of the humanoid NN on different feature-biased datasets from ImageNet (Section 12.4.2); we use human experiments to confirm that both HVE and humans predominantly use some specific features to support the classification of specific classes (Section 12.4.3); finally, we use HVE to summarize the contribution of shape, texture, and color on different datasets (CUB [575] and iLab-20M [48]) (Section 12.4.4).

Figure 12.4: T-SNE results of the (1) shape, (2) texture, and (3) color encoders on the (a) shape-biased, (b) texture-biased, and (c) color-biased datasets.

12.4.1 Effectiveness of Feature Encoders

To show that our three feature encoders focus on embedding their corresponding sensitive features, we handcrafted three subsets of ImageNet [302]: a shape-biased dataset (Dshape), a texture-biased dataset (Dtexture), and a color-biased dataset (Dcolor). The shape-biased dataset contains 12 classes, chosen because they are intuitively strongly determined by shape (e.g., vehicles are defined by shape more than color). The texture-biased dataset uses 14 classes that we believe are more strongly determined by texture. The color-biased dataset includes 17 classes. The intuition behind the class selection for all three datasets is verified by our results in Table 12.1, with further illustration in Section 12.4.2. All these datasets are randomly split into around 800 training images and 200 testing images. The class details of the biased datasets are shown in Figure 12.4.

Table 12.1: The "original" column is the accuracy of a ResNet-18 on the original images, serving as an upper bound. The shape, texture, and color columns give the accuracy of the corresponding feature nets. "All" is the result of our HNN that combines the 3 feature nets; it approaches the upper bound, suggesting that the split into 3 feature nets preserves most of the information needed for image classification.

accuracy                original  shape  texture  color  all
Shape-biased dataset      97%      90%     84%     71%   95%
Texture-biased dataset    96%      64%     81%     65%   91%
Color-biased dataset      95%      70%     73%     82%   92%
If our feature extractors have actually learned their feature-constructive latent spaces, their T-SNE results should show clear clusters on the feature-biased datasets. "Bias" here means that we can classify the objects easily based on the biased feature, but it is more difficult to make decisions based on the other two features. After pre-processing the original images and obtaining their feature images, we input the feature images into the feature encoders and get the T-SNE results shown in Figure 12.4. Each row represents one feature-biased dataset, each column is bound to one feature encoder, and each image shows the result of one combination. T-SNE results are separated cleanly on the corresponding datasets (diagonal) but not as well on the other datasets (off-diagonal), which shows that our feature encoders are predominantly sensitive to their corresponding features.

12.4.2 Effectiveness of Humanoid Neural Network

We can use the feature encoders as classifiers after adding fully-connected layers. As these classifiers classify images based on the corresponding feature representation, we call them feature nets. We tested the accuracy of the feature nets on the three biased datasets. As shown in Table 12.1, a ResNet-18 trained on the original segmented images (without explicitly separated features, e.g., the tiger without background in Figure 12.2 (b)) provides an upper bound for the task. We find that each feature net consistently obtains the best performance on its own biased dataset (e.g., on the shape-biased dataset, the shape net's classification performance is better than that of the color net or texture net). If we combine these three feature nets with the interpretable aggregation module, the classification accuracy is very close to the upper bound, which means our vision system can classify images based on these three features almost as well as based on the full original color images. This demonstrates that we can obtain most of the information of the original images with our feature nets, and that our aggregation and interpretable decision module actually learned how to combine the three features by end-to-end learning. Table 12.2a shows the quantitative contribution summary produced by the humanoid NN (Section 12.3.2). For task-specific bias, shape plays a dominant role in the shape-biased task, and texture and color likewise contribute most to their related biased tasks.

Table 12.2: Contributions of features from HVE and humans' recognition accuracy.
(a) Contributions of features for different biased datasets summarized by HVE.

contribution ratio        shape  texture  color
Shape-biased dataset       47%     34%     19%
Texture-biased dataset      5%     65%     30%
Color-biased dataset       11%     19%     70%

(b) Humans' accuracy on different feature images for different biased datasets.

accuracy                  shape  texture  color
Shape-biased dataset      90.0%   49.0%   16.8%
Texture-biased dataset    33.1%   40.0%   11.1%
Color-biased dataset      32.3%   19.7%   46.5%

12.4.3 Human Experiments

Intuitively, we expect that humans may rely on different features to classify different objects (Figure 12.1). To show this, we designed human experiments that asked participants to classify reduced images with only shape, texture, or color features. If an object is mainly recognizable based on shape for humans, we can then check whether the same holds for HVE, and likewise for color and texture.

Experiments Design. The three datasets in Table 12.1 have a clear bias towards their corresponding features (Figure 12.4).
We asked the participants to classify objects in each dataset based on one single feature image computed by one of our feature extractors (Figure 12.5). Participants were asked to choose the correct class label for the reduced image (from 12/14/17 classes in shape/texture/color datasets). 232 (b) (a) shape texture color Jeep Beach wagon Convertible … (12 classes in total) Figure 12.5: Sample question for the human experiment. (a) A test image (left) is first converted into shape, color, and texture images using our feature extractors. (b) On a given trial, human participants are presented with one shape, color, or texture image, along with 2 reference images for each class in the corresponding dataset. Participants are asked to guess the correct object class from the feature image. fig:sample_question (a) CUB dataset (b) iLab-20M dataset Figure 12.6: Processed CUB and iLab-20M dataset examplesfig:overview_dataset Human Performance Results The results here are based on 3270 trials, 109 participants. The accuracy for different feature questions on different biased datasets can be seen in Table 12.2b. Human performance is similar to our feature nets’ performance (compare Table 12.1 with Table 12.2b). On shape-biased dataset, both human and feature nets attain the highest accuracy with shape. The same for the color and texture biased datasets. Both HVE and humans predominantly use some specific features to support recognition of specific classes. Interestingly, humans can perform not badly on all three biased datasets with shape features. 12.4.4 Contributions Attribution in Different Tasks c12-sec:4.4 With our vision system, we can summarize the task-specific bias and class-specific bias for any dataset. This enables several applications: (1) Guide accuracy-driven model design [336, 180, 162, 54]; Our method provides objective summarization of dataset bias. (2) Evaluation metric for model bias. Our method can help correct an initially wrong model bias on some datasets (e.g., that most CNN trained on ImageNet are 233 texture biased [180, 336]). (3) Substitute human intuition to obtain more objective summarization with end-to-end learning. We implemented the biased summarization experiments on two datasets, CUB [575] and iLab-20M [48]. Figure 12.1(b) shows the task-specific biased results. Since CUB is a dataset of birds, which means all the classes in CUB have a similar shape with feather textures, hence color may indeed be the most discriminative feature (Figure 12.6 (a)). As for iLab (Figure 12.6 (b)), we also conduct the class-specific biased experiments on iLab and summarize the class biases in Table 12.3. It is interesting to find that the dominant feature is different for different classes. For instance, boat is shape-biased while military vehicle (mil) is color-biased. 12.5 More Humanoid Applications with HVE To further explore more applications with HVE, we use HVE to simulate the visual reasoning process of humans and propose a new solution for conducting open-world zero-shot learning without predefined attribute labels (Section 12.5.1). We also use HVE to simulate human imagination ability through cross-feature retrieval and imagination (Section 12.5.2). 12.5.1 Open-world Zero-shot Learning with HVE c12-sec:5.1 Zero-shot learning needs to classify samples from classes never seen during training. Most current methods [412, 310, 154] need humans to provide detailed attribute labels for each image, which is costly in time and energy. 
However, given an image from an unseen class, humans can still describe it with their learned knowledge. For example, we may use a horse-like shape, panda-like color, and tiger-like texture to describe an unseen class such as zebra. In this section, we show how our HVE can simulate this feature-wise open-world image description by feature retrieval and ranking. Based on these image descriptions, we propose a feature-wise open-world zero-shot learning pipeline with the help of ConceptNet [525], resembling the reasoning or consulting process of humans. The whole process is shown in Figure 12.7.

Table 12.3: Class-specific bias for each class in iLab-20M.

ratio     boat  bus  car  mil  monster  pickup  semi  tank  train  van
shape     40%   35%  44%  18%    36%      28%    40%   36%   31%   40%
texture   32%   31%  40%  30%    34%      20%    31%   32%   34%   27%
color     28%   34%  16%  52%    30%      53%    29%   32%   35%   33%

Figure 12.7: The zero-shot learning method with HVE. (a) Open-world image description: we first describe the novel image from the perspective of shape (e.g., "it looks like a horse or donkey"), texture (e.g., "it looks like a tiger or piano keys"), and color (e.g., "it looks like a panda or penguin"). (b) Reasoning for zero-shot learning: we then use ConceptNet as common knowledge to reason and predict the label.

Step 1: Description. We use HVE to provide feature-wise descriptions for any unseen-class image without predefined attribute labels. First, to represent learnt knowledge, we use the three trained feature extractors (described in Section 12.3.2) to get the shape, texture, and color representation images of each seen class k. Then, given an unseen-class image Iun, we use the same feature extractors to get its feature-wise representation. To retrieve learnt classes as descriptions, we calculate the average distance between Iun and the images of each seen class k in the latent space of the shape, texture, and color features. In this way, we can find the top K closest classes to Iun from the perspective of each feature, and we call these K classes the "roots" of each feature. Now we can describe Iun using our three sets of roots. For example, as shown in Figure 12.7(a), for the unseen class zebra, we can describe its shape by {horse, donkey}, texture by {tiger, piano keys}, and color by {panda}.
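A minimal sketch of the feature-wise description step, assuming each seen class is summarized by a mean ("prototype") embedding per feature; the data structures and top-K choice below are illustrative rather than the exact implementation:

```python
import torch

def describe_unseen(image_feats, class_prototypes, top_k=2):
    """Feature-wise open-world description (Step 1).

    image_feats:      dict feature -> embedding of the unseen image,
                      e.g. {"shape": Vs, "texture": Vt, "color": Vc}
    class_prototypes: dict feature -> {class_name: mean embedding of that
                      seen class under the same encoder}
    Returns the K closest seen classes ("roots") per feature.
    """
    roots = {}
    for feat, v in image_feats.items():
        names = list(class_prototypes[feat])
        protos = torch.stack([class_prototypes[feat][n] for n in names])
        dists = torch.cdist(v.unsqueeze(0), protos).squeeze(0)   # L2 distances
        roots[feat] = [names[i] for i in torch.topk(-dists, top_k).indices]
    return roots

# For a zebra image this might return something like
# {"shape": ["horse", "donkey"], "texture": ["tiger", "piano keys"], "color": ["panda", ...]}
```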
Step 2: Open-world classification. To further predict the actual class of Iun based on the feature-wise description, we use ConceptNet as common knowledge to conduct reasoning. As shown in Figure 12.7(b), for each set of feature roots, we retrieve their common attributes in ConceptNet (e.g., "stripe" is the common attribute root of {tiger, piano keys}). We form a reasoning root pool R* consisting of the classes from the feature roots obtained during image description and the shared attribute roots. The reasoning roots serve as our evidence for reasoning. For every root in R*, we search its neighbors in ConceptNet, which are treated as possible candidate classes for Iun. All candidates form a candidate pool P, which contains all hypothesis classes. Now we have two pools, the root pool R* and the candidate pool P. For every candidate p_i ∈ P, we calculate its ranking score over the roots r_j ∈ R* as

\bar{S}(p_i) = \sum_{r_j \in R^*} \cos(E(p_i), E(r_j)),

where E(·) is the word embedding in ConceptNet and cos(A, B) is the cosine similarity between A and B. We choose the candidate with the highest score as our predicted label. In our prototype zero-shot learning dataset, we select 34 seen classes as the training set and 5 unseen classes as the test set, with 200 images per class. We report the accuracy on the test set (Table 12.4a). As a comparison, we run prototypical networks [519] in their one-shot setting.

Table 12.4: Open-world zero-shot accuracy and FID of cross-feature imagination.
(a) Accuracy on unseen classes for zero-shot learning (one-shot for the Prototype baseline, zero-shot for ours).

Method      fowl  zebra  wolf  sheep  apple
Prototype   19%    16%   17%    21%    74%
Ours        78%    87%   63%    72%    98%

(b) Cross-feature imagination quality comparison between our HVE-based model and three pix2pix GAN baselines.

FID (↓)     shape input  texture input  color input
Baselines     123.915       188.854       203.527
Ours           96.871       105.921        52.846

12.5.2 Cross Feature Imagination with HVE

We show that HVE has the potential to simulate the human imagination ability. Humans can intuitively imagine an object when seeing one aspect of a feature, especially when this feature is prototypical (contributes most to classification). For instance, we can imagine a zebra when seeing its stripes (texture). This process is similar to, but harder than, the classical image generation task, since the input feature modality here is dynamic and can be any feature among shape, texture, or color. To solve this problem, using HVE, we separate this procedure into two steps: (1) cross-feature retrieval and (2) cross-feature imagination.

Figure 12.8: (a) The structure and training process of the cross-feature retrieval model. Es, Et, Ec are the same encoders as in Section 12.3.2; the feature-agnostic net then projects their outputs to a shared feature space for retrieval. (b) The process of cross-feature imagination. After retrieval, we design a cross-feature pixel2pixel GAN model to generate the final image.

Given any feature (shape, texture, or color) as input, cross-feature retrieval finds the two most likely other features. Cross-feature imagination then generates a whole object based on the group of shape, texture, and color features.

Cross Feature Retrieval. We learn a feature-agnostic encoder that projects the three features into the same feature space and makes sure that features belonging to the same class lie in nearby regions. As shown in Figure 12.8(a), during training, the shape Is, texture It, and color Ic are first sent into the corresponding frozen encoders Es, Et, Ec, which are the same encoders as in Section 12.3.2. All of the outputs are then projected into a cross-feature embedding space by a feature-agnostic net M, which contains three convolution layers. We also add a fully-connected layer to predict the class labels of the features. We use a cross-entropy loss Lcls to regularize the label prediction and a triplet loss Ltriplet [493] to regularize the projection of M. For any input feature x (e.g., a shape of bird A), a positive sample xpos is either the same class and same modality (another bird A shape) or the same class and a different feature modality (a bird A texture or color); a negative sample xneg is any feature from a different class. Ltriplet pulls the embedding of x closer to that of the positive sample xpos and pushes it apart from the embedding of the negative sample xneg. The triplet loss is defined as

L_{triplet} = \max(\|F(x) - F(x_{pos})\|_2 - \|F(x) - F(x_{neg})\|_2 + \alpha, 0),

where F(·) := M(E(·)) and E is one of the feature encoders.
α is the margin size in the feature space between classes, ∥ · ∥2 represents ℓ2 norm. We test the retrieval model in all three biased datasets (Figure 12.4) separately. During retrieval, given any feature of any object, we can map it into the cross feature embedding space by the corresponding encoder net and the feature agnostic net. Then we apply the ℓ2 norm to find the other two features closest to the input one as output. The output is correct if they belong to the same class as the input. For each dataset, we retrieve the three features pair by pair. The retrieval performs better when the input feature is the dominant of the dataset, which again verifies the feature bias in each dataset. Cross Feature Imagination. To stimulate imagination, we propose a cross-feature imagination model to generate plausible final images with the input and retrieved features. The procedure of imagination is shown in Figure 12.8(b). Inspired by the pixel2pixel GAN[261] and AdaIN[253], we design a cross-feature pixel2pixel GAN model to generate the final image. The GAN model is trained and tested on the three biased datasets. In Figure 12.9, we show more results of the generation, which show that our model satisfyingly generates the object from a single feature. From the comparison between (c) and (e), we can clearly find that they are alike from the view of the corresponding input feature, but the imagination results preserve the retrieval features. The imagination variance also shows the feature contributions from a generative view: if the given feature is the dominant feature of a class (contribute most in classification. e.g., the stripe of zebra), then the retrieved features and imagined images have smaller variance (most are zebras); While non-dominant given feature (shape of zebra) lead to large imagination variance (can be any horse-like animals). We create a baseline generator by using three pix2pix GANs where each pix2pix GAN is responsible for one specific feature (take one modality of feature as input and imagine the raw image). The FID comparison is in Table 12.4b. 238 (I) Input shape (II) Input texture (III) Input color Figure 12.9: Imagination with shape, texture, and color feature input (columns I, II, III). Line (a): input feature. Line (b): retrieved features given (a). Line (c): imagination results with HVE and our GAN model. Line (d): results of baseline 3 pix2pix GANs. Line (e): original images to which the input features belong. Our model can reasonably “imagine" the object given a single feature. c12-fig5.3 12.6 Conclusion To explore the task-specific contribution of shape, texture, and color features in human visual recognition, we propose a humanoid vision engine (HVE) that explicitly and separately computes these features from images and then aggregates them to support image classification. With the proposed contribution attribution method, given any task (dataset), HVE can summarize and rank-order the task-specific contributions of the three features to object recognition. We use human experiments to show that HVE has a similar feature contribution to humans on specific tasks. We show that HVE can help simulate more complex and humanoid abilities (e.g., open-world zero-shot learning and cross-feature imagination) with promising performance. These results are the first step towards better understanding the contributions of object features to classification, zero-shot learning, imagination, and beyond. 
239 Chapter 13 Improving Zero-shot Generalization and Robustness of Multi-modal Models chapter-13 Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks and their zero-shot generalization ability is particularly exciting. While the top-5 zero-shot accuracies of these models are very high, the top-1 accuracies are much lower (over 25% gap in some cases). We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts. First, we develop a simple and efficient zero-shot post-hoc method to identify images whose top-1 prediction is likely to be incorrect, by measuring consistency of the predictions w.r.t. multiple prompts and image transformations. We show that our procedure better predicts mistakes, outperforming the popular max logit baseline on selective prediction tasks. Next, we propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy; specifically we augment the original class by incorporating its parent and children from the semantic label hierarchy, and plug the augmentation into text prompts. We conduct experiments on both CLIP and LiT models with five different ImageNet-based datasets. For CLIP, our method improves the top-1 accuracy by 17.13% on the uncertain subset and 3.6% on the entire ImageNet validation set. We also show that our method improves across ImageNet shifted datasets, four other datasets, and other model architectures such as LiT. The proposed method is hyperparameter-free, requires no additional model training and can be easily scaled to other large multi-modal architectures. Code is available at https://github.com/gyhandy/Hierarchy-CLIP. 240 13.1 Introduction Vision-language multi-modal models trained on large-scale data have achieved significant success in numerous domains and have demonstrated excellent zero-shot generalization ability [434, 636, 422, 267, 440, 172]. Given a test image and a set of candidate class labels, one can compute the similarity between the embedding of the image and the embedding of each candidate class labels, and predict the class as the one with the highest similarity. The zero-shot top-1 accuracy for ImageNet [112] using CLIP variants (CLIP ViT-L) matches the performance of the original ResNet model trained from scratch. Recently, CLIP has been found to be more robust to distribution shift than ResNet, achieving good performance on ImageNet-V2 [447], ImageNet-R [225], ImageNet-A [230], and ImageNet-Sketch [579]. We noticed a large gap between the top-1 accuracy and top-5 accuracy, 64.2% vs. 89.4% respectively, revealing potential headroom for improvement. We investigated the cases where the top-1 prediction was incorrect but the top-5 prediction was correct, and identified several typical failure modes. Despite the well-known multi-label issues in ImageNet [39], we found many of the remaining failure cases are caused by noise and ambiguous text prompts related to the WordNet hierarchical structure of ImageNet. Some class names are quite general so that the model cannot correctly match images from their specific subclasses. For example, the hot-air balloon images belonging to the “balloon” class were misclassified as “airship”, see Figure 13.1 middle. On the other hand, some class names are too specific such that the model fails to correlate them with their more generic super-classes. 
For example, 96% of images with the ground-truth label "tusker" are wrongly classified as other elephant classes such as "Asian elephant"; see Figure 13.1 left. The failure-mode analysis suggests that the text encoder is very sensitive to its inputs and, as a result, the overall classification lacks robustness.

Figure 13.1: Typical failure modes in the cases where the top-5 prediction was correct but the top-1 was wrong.

Inspired by these observations, we propose to first identify the subset of images whose top-1 prediction is likely to be incorrect, and then improve the accuracy for those images with a principled framework that augments their class labels using the WordNet hierarchy. To estimate whether an image has an incorrect prediction, i.e., to estimate the prediction confidence, we use the consistency of predictions under different text prompt templates and image augmentations as a signal for prediction confidence estimation. Although prediction confidence estimation has been well studied for single-modal classification models, we found that the commonly used confidence scores, maximum softmax probability [227] and maximum logit score [224], are not always reliable for the multi-modal CLIP and LiT models due to the poor calibration of the logit scores. For example, among the 1K classes in ImageNet, the class with the greatest mean logit value (computed as the cosine similarity between image and text embeddings) is "fig" (the fruit). Though we do not have access to CLIP's private training data, we hypothesize that this might be because "fig" is a common abbreviation for "figure", which frequently occurs in the training data and thus includes many non-fruit illustrations. In this work, we first propose a simple yet efficient zero-shot confidence estimation method better suited for CLIP, based on predictions' self-consistency over different text prompts and image perturbations. [589] proposed using self-consistency among multiple model outputs to improve the reasoning accuracy of large language models. Here we extend the idea to confidence estimation in multi-modal models by measuring the consistency of predictions under multiple input text prompts and image transformations. Our method is effective at predicting mistakes; the identified low-confidence subset has significantly lower top-1 accuracy (21.58%) than the average accuracy (64.18%). Next, to improve the accuracy on the low-confidence subset, we develop a label augmentation technique using the WordNet label hierarchy. Our method leverages semantic information from ancestors (top-down) as well as children (bottom-up) and improves the top-1 accuracy of the subset to 38.71% (a 17.13% improvement). Our method not only improves model accuracy but also model robustness, improving on ImageNet variants with distribution shift such as ImageNet-v2, ImageNet-R, ImageNet-Adversarial, and ImageNet-Sketch. The main contributions of this work are: We identified several failure modes for zero-shot ImageNet classification using multi-modal models, and our findings suggest that the text encoder is very sensitive to prompts. To improve the prediction accuracy, prompts need to be better designed.
By applying the label augmentation to the previously identified low confidence subset of images, we significantly improve their prediction accuracy. 13.2 Related work Confidence estimation. Reliably estimating the confidence of a prediction is helpful for downstream decision making and can ensure the safe deployment of machine learning models. A well-calibrated confidence estimation should assign low scores for incorrect predictions and high score for correct predictions. Maximum softmax probability [227] and maximum logit [224] are the most commonly used confidence scores for classification problems, because of their simplicity and computational efficiency. Recent works propose more sophisticated confidence estimation methods which either involve modifications to the classification models or significantly increase the inference time. For example, Bayesian approaches such as Gaussian Process layer [353] and dropout-based variational inference [158] assume the weights in the neural networks are random variables such that the final prediction follows a distribution. A large variance of a prediction indicates the low confidence of the prediction. Non-Bayesian methods such as ensemble-based methods which aggregate the predictions from multiple models to improve the robustness of the confidence estimation [308, 599]. Those sophisticated methods were developed and studied in the single-modal models, and the application to multi-modal models is not straightforward. In addition, those methods mostly require modification to the model and additional training, which becomes challenging to multi-modal models since the training data are generally not publicly available. In our work, we focus on a zero-shot confidence estimation that is exclusively designed for multi-modal models. Our method does not require additional training, and is simple, efficient, and effective. Prompt engineering. Prompt engineering and learning has attracted much attention in vision and learning since the introduction of image-text models [434, 267, 636]. The image-text models align images and their text descriptions into a common space, which facilitates model generalization to unseen categories at inference time. However, it has been observed that downstream image classification accuracy highly depends on the specific input prompts. This motivates researchers to either fine-tune or auto-learn prompts when adapting multi-modal models to downstream vision tasks. 243 [667, 666] propose CoOp and CoCoOp to automatically learn the prompt word embeddings in the few-shot settings, and show significant improvements over the vanilla zero-shot image classification based-on prompting. These are learning based approaches, requiring supervised data from downstream tasks, while our proposed method is zero-shot and post-hoc without using any supervised data. In concurrent work, [510] proposes learning prompt embeddings in an unsupervised manner by minimizing the entropy of the averaged prediction probability distribution, where each prediction is based on a random augmentation applied to the input image. Our work differs from [510] in the sense that we do not learn an input-dependent prompt embedding. Instead we only selectively modify the prompts using knowledge hierarchy for images that have unreliable predictions, and our modified new prompt is natural language rather than a numerical embedding. Label hierarchy. Label hierarchy or label ontology are relational graphs among semantic labels. 
WordNet is one of the most widely used concept ontologies, and it has been used for visual recognition problems. Fergus et al. [145] leverage the WordNet hierarchy to define a semantic distance between any two categories and use this semantic distance to share labels. Deng et al. [111] propose a hierarchy-and-exclusion graph to explicitly model the semantic relations among labels, and significantly improve object classification by exploiting the rich label hierarchy. The idea of a semantic distance defined on the WordNet ontology graph is also used in [464, 465] for transferring knowledge in zero-shot learning problems. We are similar to the above work in that we also utilize the label semantics encoded by the label hierarchy, but the label hierarchy in our case is used in a multi-modality scenario: textual labels and visual images are represented in the same latent space, so the hierarchy structure is directly exploited in the representation space to steer the recognition process.

13.3 Zero-shot inference failure case analysis

Given that the top-1 accuracy (64.2%) is much lower than the top-5 accuracy (89.4%) for zero-shot ImageNet classification using CLIP, we investigated the failure cases that are "top-5 correct but top-1 wrong" (12605 images, 25.2% of all test images). The failure modes are summarized as: (1) Class name does not specify the super-class name: some classes whose class names do not include their WordNet ancestor (e.g., "tusker", one of the 1k ImageNet classes, does not have its parent "elephant" in the class name) may have a relatively lower score than other classes that explicitly have the ancestor present in the class name (e.g., "Asian elephant"). See examples in Figure 13.1 (left). (2) Class name does not specify the sub-class name: if the class name is too abstract, then its CLIP embedding is not necessarily close to the image embedding; e.g., CLIP wrongly classifies most images from the "balloon" class as airship, see Figure 13.1 (middle). That is because there are distinct kinds of balloons, each belonging to a different semantic subgroup. Relying on the text embedding of the fine-grained children's class names (e.g., using "hot-air balloon") often fixes these errors. [39] reported a similar issue of label ambiguity in ImageNet. (3) Inconsistent naming between class names: some ImageNet class names are nouns, but others are adjective-prefixed nouns. This may bias the CLIP text embedding; see one example in Figure 13.1 (right) where images from the "screw" class are misclassified as "metal nail".

13.4 Proposed Method

As shown in Section 13.3, CLIP models can be sensitive to different text prompts for images in certain classes. In this section, we first propose a confidence estimation method to identify low-confidence predictions. We show that the identified subset has much lower accuracy than the average (Section 13.4.1). We next develop a principled method that utilizes the knowledge hierarchy to improve the accuracy on the low-confidence subset, and consequently improve the overall accuracy on the whole dataset (Section 13.4.2).

Figure 13.2: Our zero-shot classification pipeline consists of 2 steps: confidence estimation via self-consistency (left block) and top-down and bottom-up label augmentation using the WordNet hierarchy (right block). See Algorithms 4 and 5 for pseudocode.
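For reference, the zero-shot prediction that both steps build on is a cosine-similarity comparison between the image embedding and the text embeddings of the class names. A minimal sketch using the open-source CLIP package (model choice, prompt template, and helper name are illustrative, not the exact setup used in the experiments):

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

def zero_shot_predict(image_path, class_names, template="a photo of a {}."):
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([template.format(c) for c in class_names]).to(device)
    with torch.no_grad():
        z_img = model.encode_image(image)
        z_txt = model.encode_text(text)
        z_img = z_img / z_img.norm(dim=-1, keepdim=True)
        z_txt = z_txt / z_txt.norm(dim=-1, keepdim=True)
        logits = (z_img @ z_txt.T).squeeze(0)   # cosine similarity per class
    return class_names[int(logits.argmax())], logits
```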
13.4.1 Self-consistent zero-shot confidence estimation

Given an image x and a candidate class name c, where c ∈ C and |C| = 1000, the CLIP model encodes x and c respectively with its image encoder f_image and text encoder f_text, denoted as z_m = f_image(x) and z_c = f_text(c). The prediction logit score is defined as logit(x, c) = cos(z_m, z_c), where cos(·, ·) is the cosine similarity between two vectors, and the predicted class is arg max_{c∈C} logit(x, c). We estimate the confidence by the self-consistency rate when applying different context prompts and image augmentations.

Confidence estimation via text prompts. To improve the zero-shot classifier's performance, the CLIP paper [434] hand-crafted various context prompts (e.g., "A photo of a big {label}" and "A photo of a small {label}") for different datasets for the purpose of prompt ensembling: for an image x, given a set of context prompts T, the ensembled logit score is logit(x, T(c)) = (1/|T|) Σ_{t∈T} logit(x, t(c)), where t(c) denotes the new prompt after applying context prompt t(·) to c. Here, instead of using the prompts for ensembling, we use them to define our confidence score. Given a set of prompts T, we apply each prompt t(·) to the classifier and check whether the top-1 prediction is the same as when applying no prompt. We use the percentage of prompts whose top-1 prediction is consistent with the no-prompt prediction as the confidence score S_T(x), i.e.,

S_T(x) = \frac{\sum_{t \in T} \mathbf{1}\{\hat{c}(x, t) = \hat{c}(x, \emptyset)\}}{|T|}    (13.1)

where ĉ(x, ∅) = arg max_{c∈C} logit(x, c) is the top-1 prediction using the pure class name, and ĉ(x, t) = arg max_{c∈C} logit(x, t(c)) is the top-1 prediction when applying prompt t(·). Intuitively, a reliable prediction should have highly consistent top-1 predictions whether or not context prompts are applied, and therefore should have a high confidence score S_T(x) with respect to the prompt set T, and vice versa.

Confidence estimation via image perturbation. We can also estimate the confidence of a prediction based on self-consistency when applying different perturbations to the input image. Intuitively, if the top-1 predictions are inconsistent under different image perturbations, the prediction is unreliable. Specifically, we consider common image transformations (left-right flip, rotation, crop, etc.), apply a perturbation b(·) to the input image, and infer the predicted class as ĉ(x, b) = arg max_{c∈C} logit(b(x), c). We define the confidence score with respect to a set of image perturbations B as

S_B(x) = \frac{\sum_{b \in B} \mathbf{1}\{\hat{c}(x, b) = \hat{c}(x, \emptyset)\}}{|B|}    (13.2)

We expect a high-confidence prediction to have highly consistent predictions under different image perturbations, and therefore a high confidence score S_B(x) with respect to the image perturbation set B.

Determining the low-confidence subset by combining the two confidence estimations. The confidence scores proposed in Equation 13.1 and Equation 13.2 are continuous values. A threshold needs to be determined if we want to select a subset of low-confidence examples using a continuous confidence score. In practice, the threshold can be chosen based on the recall and precision trade-off required by the application. In our study, to bypass threshold selection, we propose a binary criterion for determining the low-confidence set.
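Before detailing that criterion, a minimal sketch of the two consistency scores defined in Equations 13.1 and 13.2; it assumes a generic `predict(image, prompt_fn)` helper that returns the top-1 class index, so the function names and perturbation choices are illustrative:

```python
def prompt_consistency(predict, image, prompts):
    """S_T(x): fraction of context prompts whose top-1 prediction
    matches the prediction obtained with the bare class name."""
    base = predict(image, prompt_fn=None)               # "{label}" only
    agree = sum(predict(image, prompt_fn=t) == base for t in prompts)
    return agree / len(prompts)

def perturbation_consistency(predict, image, perturbations):
    """S_B(x): same idea, but over image transformations
    (e.g. left-right flip, rotation, crop)."""
    base = predict(image, prompt_fn=None)
    agree = sum(predict(b(image), prompt_fn=None) == base for b in perturbations)
    return agree / len(perturbations)

# An example is flagged as low confidence when a score falls below its
# threshold, or, as in the binary criterion described next, when the top-1
# predictions of a few prompt subsets / the flipped image simply disagree.
```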
For the ImageNet dataset, the CLIP paper [434] designed a total of 80 context prompts. We define four sets based on these 80 prompts: the first 40 prompts T1, the last 40 prompts T2, all 80 prompts T3, and no prompt, T4 = ∅. We apply the four different sets of prompts to the classifier and check whether their top-1 predictions are all consistent, i.e., ĉ(x, T1) = ĉ(x, T2) = ĉ(x, T3) = ĉ(x, T4). We then determine the low-confidence subset O_T as those examples that have inconsistent predictions among the 4 prompt sets. We studied other choices, such as using a random set of 40 prompts as T1 or splitting the 80 prompts into more subgroups, and found the results to be very similar. Similarly, we also determine a low-confidence subset O_B based on image perturbations. In practice, we found that the left-right flip works best among the above-mentioned perturbations. Thus, for simplicity, we compare the top-1 prediction when applying the left-right flip to the input image with the top-1 prediction on the raw image; if the two predictions are not consistent, that example is included in the low-confidence set O_B. Finally, we use the union of the two low-confidence sets, O_T identified using the text prompts and O_B identified using the image perturbations, as the final low-confidence subset O in the following experiments. Algorithm 4 shows the low-confidence set generation process.

Algorithm 4: Zero-shot confidence estimation
Input: input images X = {x_i}_{i=1}^{N}, candidate class set C, image encoder f_image and text encoder f_text, text threshold τ_t, image threshold τ_i
Output: low-confidence set O
1:  O_T ← ∅                                          ▷ Confidence estimation via text prompts
2:  Sample L different context prompts t_1, t_2, ..., t_L
3:  for x_i ∈ X do
4:      Compute S_T(x_i) based on Equation 13.1
5:      if S_T(x_i) > τ_t then x_i has a high-confidence prediction else O_T ← O_T ∪ {x_i}
6:  O_B ← ∅                                          ▷ Confidence estimation via image perturbation
7:  Sample M perturbation methods b_1, ..., b_M
8:  for x_i ∈ X do
9:      Compute S_B(x_i) based on Equation 13.2
10:     if S_B(x_i) > τ_i then x_i has a high-confidence prediction else O_B ← O_B ∪ {x_i}
11: O ← O_T ∪ O_B

13.4.2 Top-down and bottom-up label augmentation using the WordNet hierarchy

Through extensive analysis of the incorrect predictions among the identified unreliable predictions, we found that many of them are caused by CLIP's lack of robustness to prompts. Instead of tuning the prompt templates, we focus on how to augment {label} in "A photo of a {label}". A proper prompt that specifies both the generic type and the more specific sub-types of a class is very important for correctly classifying the image. However, the ImageNet [112] class names are not all defined with similar specificity, and some classes are more abstract than others; e.g., 350 classes have children, while the rest of the classes have no children. To make the ImageNet classification problem better suited to CLIP, we leverage the underlying WordNet hierarchy and develop a top-down and bottom-up class name augmentation method to improve zero-shot prediction accuracy for unreliable predictions. The WordNet hierarchy is a semantic concept ontology, with nodes being cognitive synonyms indicating different concepts, and edges indicating the super-subordinate relations between concepts. Traveling upward from the leaf nodes to the root, the concepts go from the very specific to the generic. For example, starting from the leaf node "strawberry", the path to the root is "berry", "edible fruit", "produce", "food", "solid", "matter", and "physical entity" (the root).
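The parent and children of a class name can be read directly off WordNet; a minimal sketch using the NLTK interface (the actual pipeline may map ImageNet synset IDs rather than plain words, so this lookup is only illustrative):

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def parent_and_children(class_name: str):
    """Return (parent lemma, list of child lemmas) for the first noun
    synset matching class_name, or (None, []) if it is not in WordNet."""
    synsets = wn.synsets(class_name.replace(" ", "_"), pos=wn.NOUN)
    if not synsets:
        return None, []
    s = synsets[0]
    parents = s.hypernyms()       # top-down: the more generic concept
    children = s.hyponyms()       # bottom-up: the more specific concepts
    parent = parents[0].lemma_names()[0].replace("_", " ") if parents else None
    kids = [h.lemma_names()[0].replace("_", " ") for h in children]
    return parent, kids

# e.g. parent_and_children("balloon") should return an aircraft-like parent
# and children including "hot-air balloon" (exact lemmas depend on the
# WordNet version and which synset is matched first).
```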
As we have seen in the failure mode analysis, many of the imageNet class names suffer from either being too abstract or being too specific, so that their concepts do not align well with the visual concepts the CLIP model learned in training. We 248 propose using the WordNet knowledge hierarchy to augment the class labels in prompts so that the CLIP model has a better match between the image and prompts. Top-down: augmenting class names with parent. As shown in failure case analysis, adding the super-class name to reduce ambiguity and to encourage the model’s attention on the generic concept is helpful for improving the accuracy. Therefore we propose using WordNet to find the parent node of the raw class name, and concatenate it to the class name, i.e. logit(x, c) = logit(x, [c; p(c)]) where p(c) is the parent node’s name of the class name c, and [c; p(c)] means the string concatenation of the class name and the parent name. We apply the method to top-5 predicted classes. Using the newly defined class names, we are able to re-rank the top-5 predictions for the identified unreliable subset of images. Note that WordNet contains a few very abstract class names for nodes, such as “physical entity”, “artifact”, “matter”, etc. We found that such parent nodes are not informative, hence we remove them. There are also many academic words in WordNet, for example the parent node of sea anemone is “anthozoan”, which can be rare in CLIP training data. Adding those academic words to class name makes the prediction even less robust. So we simplify the WordNet by pruning based on an estimation of the word frequency in CLIP training data by using embedding norm. Bottom-up: augmenting class names with children. Some ImageNet class names are generally abstract, but the ImageNet images may belong to a specific subtype of the class. For example, “balloon” is a class name in ImageNet, but most balloon images in ImageNet are actually “hot-air balloon”, which is a child of “balloon” in WordNet hierarchy. The logit score for a parent class is not necessarily higher than the score for its child classes, mismatching with hierarchy prior. To accurately classify the images using CLIP, we need to augment the class name with fine-grained child subclasses. For each class c having children in the WordNet hierarchy, we redefine the logit score as the max score over itself and all its children, i.e., logit(x, c) = max{logit(x, c), logit(x, c1), . . . , logit(x, cr)}, where c1 . . . cr are the r children of the node c in the WordNet hierarchy. We apply this bottom-up method to top-5 predicted class names, and re-rank the top predictions. Combining Top-down and bottom-up. In practice, we use both children and the ancestor(parent) to augment each class c, to transfer semantic information bidirectionally in both top-down and bottom-up way: the ancestor(parent) class is more generic than c, and has better chance to disambiguate instance from a more abstract level; on the other hand, 249 Algorithm 5 Top-down and bottom-up class label augmentation using WordNet hierarchy alg:hierarchy Input: Input image x ∈ O, top-5 candidate class set Ctop5, sparse WordNet hierarchy H, image encoder fimage and text encoder ftext Output: Predicted class of x 41 Candidate class set C ← ∅ 42 for c ∈ Ctop5 do C ← C ∪ [c; parent(c)], where parent(c) is the parent of c in H ▷Top-down 43 if c has r ≥ 1 children c1 . . . 
cr in H then C ← C ∪ {[cj ; parent(c)]} r j=1 ▷Bottom-up 44 cˆ ← arg maxc∈C logit(x, c) if cˆ ∈ Ctop5 then final prediction ← cˆ else final prediction ← parent(ˆc) children categories have more specific attribute description, and the attribute descriptions are semantically meaningful representations bridging the gap between the image embedding and its abstract class concept c. Then the final logit score between x and c is: logit(x, c) = max{logit(x, [c; p(c)]), (13.3) logit(x, [c1; p(c)]), . . . , logit(x, [cr; p(c)])} (13.4) {eq:logit_tp_bu} eq:logit_tp_bu where p(c) is parent of c, and c1 . . . cr are c’s children. The cˆ, where cˆ ∈ Ctop5, with the maximal logit score is the predicted class of x. See Algorithm 5 for details. 13.5 Experiments and Results Our proposed method is composed of two steps and we conduct experiments to verify the effectiveness of each step: (1) Use zero-shot confidence estimation to identify the low confidence subset of samples (see Figure 13.3 for the results), and (2) Augment the class label using top-down and bottom-up strategies based on the sparsified WordNet on the low confidence subset to improve the accuracy (See Table 13.1 and Table 13.2 for the results). 13.5.1 Our proposed confidence score is better suited for selective prediction than baselines A well-calibrated confidence estimator should score high for those correct predictions, and low for incorrect predictions. As a result, a good confidence estimator should be a good predictor for prediction correctness. We plot the receiver 250 (a) CLIP: Calibration ROC and AUC 0.0 0.2 0.4 0.6 0.8 1.0 False Positive Rate 0.0 0.2 0.4 0.6 0.8 1.0 True Positive Rate Ours | AUC:0.84 Max logits (baseline)| AUC:0.67 Max logits with prompt (baseline)| AUC:0.68 fig:clip_auc (b) CLIP: Selective prediction 0.0 0.2 0.4 0.6 0.8 1.0 Abstention Rate 0.65 0.70 0.75 0.80 0.85 0.90 Top-1 Accuracy Max logits (baseline) Max logits with prompt (baseline) Ours fig:clip_sp (c) LiT: Calibration ROC and AUC 0.0 0.2 0.4 0.6 0.8 1.0 False Positive Rate 0.0 0.2 0.4 0.6 0.8 1.0 True Positive Rate Ours | AUC:0.81 Max logits (baseline)| AUC:0.70 Max logits with prompt (baseline)| AUC:0.69 fig:lit_auc (d) LiT: Selective Prediction 0.0 0.2 0.4 0.6 0.8 1.0 Abstention Rate 0.70 0.75 0.80 0.85 0.90 Top-1 Accuracy Max logits (baseline) Max logits with prompt (baseline) Ours fig:lit_sp Figure 13.3: ROC plots (left column) show that our proposed confidence score is better at distinguishing correct and incorrect predictions and results in higher AUC scores than baselines for both CLIP (ViT-B/16) (a) and LiT (ViT-B/32)(c). Selective prediction curves (right column) show that our proposed confidence score is better at abstaining incorrect predictions and as a result the accuracy of the remaining set is higher than the baselines for both CLIP (ViT-B/16) (b) and LiT (ViT-B/32) (d). fig:conf_esti Table 13.1: CLIP (ViT-B/16) and LiT (ViT-B/32) zero-shot top-1 accuracy comparison between baseline and ours (w/ hierarchy). CLIP (Ours) Hierarchy-CLIP LiT (Ours) Hierarchy-LiT ImageNet [112] Low conf. set 21.58% 38.71% 31.18% 37.25% Full set 64.18% 67.78% 68.26% 69.41% ImageNet-v2 [447] Low conf. set 17.77% 32.50% 27.08% 31.45% Full set 58.06% 61.07% 60.11% 61.11% ImageNet-R [225] Low conf. set 16.79% 27.91% 21.82% 22.93% Full set 56.88% 59.46% 66.54% 66.75% ImageNet-Adversarial [230] Low conf. 
13.5 Experiments and Results

Our proposed method is composed of two steps, and we conduct experiments to verify the effectiveness of each step: (1) use zero-shot confidence estimation to identify the low-confidence subset of samples (see Figure 13.3 for the results), and (2) augment the class labels of the low-confidence subset using the top-down and bottom-up strategies based on the sparsified WordNet to improve accuracy (see Table 13.1 and Table 13.2 for the results).

13.5.1 Our proposed confidence score is better suited for selective prediction than baselines

A well-calibrated confidence estimator should score high for correct predictions and low for incorrect predictions. As a result, a good confidence estimator should be a good predictor of prediction correctness. We plot the receiver operating characteristic (ROC) curve and compute the area under the ROC curve (AUC) as a quantitative measure to compare our proposed confidence estimation with the baselines. An AUROC of 1.0 indicates perfect separation between correct and incorrect predictions, and 0.5 means the two groups are not distinguishable. The maximum logit score, max_{c∈C} logit(x, c), is one of the most commonly used confidence scores for classification problems in single-modal models [224], so we consider it as our baseline. Figure 13.3a and Figure 13.3c clearly show that our confidence score is significantly better than the baseline method at distinguishing between correct and incorrect predictions, for both CLIP and LiT models. The AUC for our proposed method is above 0.8, while that of the baseline method is around 0.7.

[Figure 13.3 panels: (a) CLIP calibration ROC and AUC (ours 0.84 vs. max-logits baseline 0.67 and max-logits-with-prompt baseline 0.68); (b) CLIP selective prediction; (c) LiT calibration ROC and AUC (ours 0.81 vs. 0.70 and 0.69); (d) LiT selective prediction.]
Figure 13.3: ROC plots (left column) show that our proposed confidence score is better at distinguishing correct and incorrect predictions and results in higher AUC scores than the baselines for both CLIP (ViT-B/16) (a) and LiT (ViT-B/32) (c). Selective prediction curves (right column) show that our proposed confidence score is better at abstaining from incorrect predictions, and as a result the accuracy on the remaining set is higher than the baselines for both CLIP (ViT-B/16) (b) and LiT (ViT-B/32) (d).

We also compare our method with the baseline in the scenario of selective prediction. Given a budget of abstention rate α%, the best strategy is to abstain on the α% of samples with the lowest confidence scores. If the confidence score is well calibrated, the accuracy on the abstained set will be low and, as evidence of this, the accuracy on the remaining set will be high. We plot the selective prediction curves [308], which report the accuracy on the remaining set as a function of the abstention rate. Figure 13.3b and Figure 13.3d show that our proposed confidence score results in higher accuracy than the baseline maximum logit score at all abstention rates for both CLIP and LiT. Prompt ensembling has been shown to improve the accuracy and robustness of predictions, so we also compare ours with the maximum logit score after applying prompt ensembling. As shown in the selective prediction curves, although prompt ensembling indeed achieves higher accuracy (dashed line) than using the pure class name (solid line), it is still inferior to our proposed method.

13.5.2 Using hierarchy to help improve zero-shot accuracy on the low confidence subset

Using top-down and bottom-up label augmentation significantly improves the accuracy on the low confidence subset. We apply the top-down and bottom-up label augmentation to the low confidence subset; to better combine the child and parent names, we create a prompt template that transforms each child-parent name pair into a new class name c̃ in natural language: "{child} which is a kind of {parent}" (different prompt templates may give different results). Table 13.1 shows an improvement of 17.13% in top-1 accuracy (from 21.58% to 38.71%) on the identified low confidence subset of samples, and an overall improvement of 3.6% in top-1 accuracy (64.18% to 67.78%) on all samples in ImageNet. We show similar improvements in zero-shot accuracy on the ImageNet shifted datasets. To investigate whether our method works for other multi-modal models, we apply it to the LiT [636] model and observe that our method improves accuracy for LiT as well.

Table 13.1: CLIP (ViT-B/16) and LiT (ViT-B/32) zero-shot top-1 accuracy comparison between baseline and ours (w/ hierarchy).
Dataset                       Subset           CLIP      Hierarchy-CLIP (ours)   LiT       Hierarchy-LiT (ours)
ImageNet [112]                Low conf. set    21.58%    38.71%                  31.18%    37.25%
                              Full set         64.18%    67.78%                  68.26%    69.41%
ImageNet-v2 [447]             Low conf. set    17.77%    32.50%                  27.08%    31.45%
                              Full set         58.06%    61.07%                  60.11%    61.11%
ImageNet-R [225]              Low conf. set    16.79%    27.91%                  21.82%    22.93%
                              Full set         56.88%    59.46%                  66.54%    66.75%
ImageNet-Adversarial [230]    Low conf. set    10.13%    18.44%                  7.19%     8.95%
                              Full set         26.12%    29.23%                  13.93%    14.56%
ImageNet-Sketch [579]         Low conf. set    13.74%    23.18%                  21.51%    24.42%
                              Full set         44.71%    47.28%                  52.47%    53.17%

Generalizability to non-ImageNet datasets. To show the generalizability of our method on non-ImageNet datasets, we conducted experiments on 4 additional datasets: Caltech-101 [325] (101 categories), Flower-102 [404] (102 flower categories), Food-101 [49] (101 food categories), and Cifar-100 [301] (100 categories).
For each dataset, a subset of the categories exists in and aligns with the WordNet hierarchy; we only apply our method to those WordNet-aligned class names, for which we can find their ancestors and children. We keep the other class names unmodified. We use CLIP (ViT-B/16) as the multi-modal model. Table 13.2 shows that our method consistently improved accuracy on the low-confidence set (low) and the entire set (full).

Table 13.2: Generalizability to non-ImageNet datasets (CLIP (ViT-B/16) zero-shot top-1 accuracy).
Dataset             orig (low)   ours (low)        orig (full)   ours (full)
Caltech-101 [325]   10.6%        27.2% (+16.6%)    74.1%         77.1% (+3.0%)
Flower-102 [404]    20.0%        29.4% (+9.4%)     63.7%         65.3% (+1.6%)
Food-101 [49]       28.2%        49.0% (+20.8%)    84.7%         86.8% (+2.1%)
Cifar-100 [301]     9.4%         17.5% (+8.1%)     31.8%         35.2% (+3.4%)

13.5.3 Ablation study

Generalizability to other backbones. To study the generalization of our method to different model architectures and sizes, we used 4 additional CLIP backbones, including convolutional neural network (CNN) based backbones (ResNet-50, ResNet-101) and vision transformer (ViT) based backbones (ViT-B/32, ViT-B/16, and ViT-L/14). Table 13.3 shows the improved accuracy after using our method on ImageNet with CLIP models of different backbones. Our method achieves consistently improved accuracy.

Table 13.3: Generalizability to different backbones with CLIP.
Backbone     ResNet-50   ResNet-101   ViT-B/32   ViT-B/16   ViT-L/14
ACC (low)    +14.25%     +12.97%      +15.12%    +17.13%    +18.89%
ACC (full)   +3.73%      +3.71%       +3.65%     +3.60%     +3.23%

Our hierarchy-based label augmentation is complementary to prompt ensembling. Prompt ensembling (PE) [434] requires a set of manually crafted prompt templates, and the zero-shot performance is sensitive to the set of prompts the model uses. In contrast, our proposed method does not require dedicated tuning of the prompt templates; we directly augment the class name with knowledge of the hierarchy from WordNet. In addition, PE is computationally intensive because it needs to infer the embeddings of 80 prompt templates, each applied to the 1000 ImageNet classes, while our method only needs one inference for each of the predicted top-5 labels. Our method is also more straightforward and interpretable, given that it clearly shows the contribution of the parent/child in the decision. Intuitively, PE typically focuses on fixing {class} and augmenting the contextual templates, while our method augments the {class} with a fixed contextual template. To verify whether our hierarchy-based method is complementary to prompt ensembling, we apply prompt ensembling after applying our top-down and bottom-up label augmentation. For the low confidence set, we first create a prompt template to transform the child and parent name pairs into a new class name c̃ in natural language: "{child} which is a kind of {parent}". Then we apply the 80 prompts designed by the CLIP paper [434] individually to the new class name c̃, and then ensemble them.
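The following is a minimal sketch of this combination of hierarchy-augmented class names with prompt ensembling, under stated assumptions: templates stands for the 80 CLIP prompt templates (assumed to contain a "{}" placeholder), and encode_text / encode_image are hypothetical wrappers around the CLIP encoders that return L2-normalized embeddings. Averaging the per-template text embeddings follows the standard prompt-ensembling recipe.

```python
# Hedged sketch: prompt ensembling over a hierarchy-augmented class name.
import numpy as np

def ensembled_class_embedding(child, parent, templates, encode_text):
    name = f"{child} which is a kind of {parent}"      # hierarchy-augmented class name
    embs = np.stack([encode_text(t.format(name)) for t in templates])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)                 # re-normalize the ensembled embedding

def ensembled_logit(image, child, parent, templates, encode_text, encode_image):
    z_img = encode_image(image)
    z_txt = ensembled_class_embedding(child, parent, templates, encode_text)
    return float(z_img @ z_txt)                        # cosine similarity (unit-norm embeddings)
```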
For the high confidence set, since we do not modify the class name using hierarchy information, we only apply the prompt ensemble. The performance is shown in Table 13.4. We compare the zero-shot accuracy of the vanilla prompt ensembling method proposed in CLIP with the zero-shot accuracy of our combined version of hierarchy-based class name augmentation and prompt ensembling. As shown in the table, using both hierarchy and prompt ensembling achieves better or on-par accuracy compared with the prompt ensemble alone, suggesting that the two methods can be combined. Considering that prompt ensembling requires manually designed prompt templates and much greater inference time, our hierarchy-based class name augmentation is simple, efficient, and effective. We also computed the IoU of corrected low-confidence instances (low set) between PE and our method: the IoU is 0.55, which implies the two methods are complementary for fixing errors.

Table 13.4: CLIP (ViT-B/16) zero-shot top-1 accuracy comparison with prompt ensemble.
Dataset                       Subset           Ensemble only   Hierarchy and Ensemble
ImageNet [447]                Low conf. set    41.05%          42.09%
                              Full set         68.48%          68.86%
ImageNet-v2 [447]             Low conf. set    36.39%          36.34%
                              Full set         62.02%          62.00%
ImageNet-R [225]              Low conf. set    35.13%          36.12%
                              Full set         60.21%          60.62%
ImageNet-Adversarial [230]    Low conf. set    21.13%          22.00%
                              Full set         30.59%          31.07%
ImageNet-Sketch [579]         Low conf. set    27.13%          26.56%
                              Full set         48.52%          48.26%

Effect of the confidence score threshold on zero-shot accuracy. In Table 13.1 we use a binary criterion to determine the low confidence set. We can alternatively use the continuous confidence score by choosing a threshold based on the trade-off between precision and recall. Changing the threshold of the confidence score leads to different numbers of samples in the low confidence set. We study the effect of the threshold on zero-shot accuracy. Table 13.5 shows the overall accuracy with different thresholds. We find that the overall accuracy is relatively robust to the threshold selection over the wide range from 0.47 to 0.70.

Table 13.5: Effect of the threshold of the confidence score on zero-shot accuracy.
Threshold   Low conf. set size   Acc on low conf. set   Acc on full set
0.47        10000                19.40%                 68.72%
0.52        11000                20.82%                 68.78%
0.57        12000                22.06%                 68.82%
0.62        13000                23.58%                 68.85%
0.66        14000                25.01%                 68.88%
0.70        15000                26.51%                 68.86%

13.6 Conclusion

Multi-modal models' generalization and robustness are critical for deployment. Motivated by the big gap between top-1 and top-5 accuracy in ImageNet zero-shot classification, we investigated the failure modes and found that the model's prediction is very sensitive to text prompts. We describe a simple but efficient zero-shot post-hoc method that identifies a subset of samples that are most likely to be predicted wrongly, via a measure of self-consistency. For those in the low confidence subset, we use the WordNet hierarchy to augment class labels to enhance robustness, resulting in up to a 17.13% accuracy improvement on ImageNet. We show that our method provides consistent improvements on other distribution-shifted datasets (ImageNet variants) and four additional datasets, and is generalizable to other image-text models and different backbones.

Chapter 14
Building One-class Detector for Anything: Open-vocabulary Zero-shot OOD Detection Using Text-image Models

We focus on the challenge of out-of-distribution (OOD) detection in deep learning models, a crucial aspect of ensuring reliability. Despite considerable effort, the problem remains significantly challenging in deep learning models due to their propensity to output over-confident predictions for OOD inputs. We propose a novel one-class open-set OOD detector that leverages text-image pre-trained models in a zero-shot fashion and incorporates various descriptions of in-domain and OOD data. Our approach is designed to detect anything not in-domain and offers the flexibility to detect a wide variety of OOD, defined via fine- or coarse-grained labels, or even in natural language.
We evaluate our approach on challenging benchmarks including large-scale datasets containing fine-grained, semantically similar classes, distributionally shifted images, and multi-object images containing a mixture of in-domain and OOD objects. Our method shows superior performance over previous methods on all benchmarks. Code is available at https://github.com/gyhandy/One-Class-Anything

14.1 Introduction

Out-of-distribution (OOD) detection is essential for ensuring the reliability of machine learning systems. When a machine learning system is deployed in the real world, it may encounter unexpected abnormal inputs that are not from the same distribution as the training data. Detection and removal of OOD inputs prevents the machine learning (ML) system from making incorrect predictions that could otherwise lead to serious failures, especially in safety-critical applications. For example, a person classification model is trained to localize people in images. If an image does not contain a person, but instead contains an animal or a sculpture, the model may erroneously label the non-person as a person. Accurate and reliable one-class detection is of paramount importance in life-critical applications; for example, correctly perceiving persons in autonomous driving systems. Although OOD detection has been studied previously in traditional ML models [490, 133], deep learning models are known to output over-confident predictions for OOD inputs, making OOD detection in deep learning much more challenging. Recent efforts have focused on developing methods to correct the naive softmax probability [227, 342, 224, 355], using deep neural representations to measure the distance to the training distribution [317, 449, 535, 148], leveraging OOD data to learn a more precise decision boundary between in-domain and OOD [228, 472], and using deep density models to measure the likelihood under the training distribution [91, 450, 397, 392].

[Figure 14.1 elements: known and unknown in-domain classes, known and unknown OOD classes, covariate shift, and multi-object images mixing in-domain and OOD.]
Figure 14.1: We study the one-class OOD detection problem where OOD can be anything not in-domain. Example of building a dog detector with some known dogs Cin = {Husky, Papillon, Dobermann} and known non-dogs Cout = {Cat, Bird, Person}. When deploying such a one-class detector in the real world, it is important for it to be robust to several different types of shifts: (1) unknown in-domain classes, such as new species of dogs, and unknown OOD, such as wolves, bunnies, and violins; (2) multi-object cases (cats along with dogs, persons along with dogs); (3) covariate shift (drawings of dogs, a painting of a bird, a cartoon car, etc.).

For evaluation, most OOD detection methods use one dataset as in-domain and another dataset as OOD, for example the CIFAR100 vs CIFAR10 [450, 449, 148, 317] or ImageNet vs Places365 [224, 387] benchmarks; both have in-domain data that consists of a set of classes, and OOD data comprised of a different set of classes that are non-overlapping with the in-domain classes. That scenario treats the OOD classes as a closed set. However, in realistic scenarios, in-domain data often consists of a set of classes belonging to a unified high-level superclass, and OOD data is anything that is not in that superclass, which can be viewed as an open-set one-class anomaly detection problem [284, 406]. For example, OOD detection in the person identification scenario is to detect anything that is non-person.
The most straightforward approach for building a one-class OOD detector is to train a binary classifier for in-domain and OOD classes [45]. However, since the OOD space is large (anything that is not in-domain) and often includes classes that follow a long-tail distribution [472], it is impossible to include all the possible OOD data when training. Consequently, a model trained on a subset of OOD data may suffer from poor generalization to the unseen OOD at test time. Another approach is to learn the distribution of the in-domain data without using prior knowledge of OOD [475, 72, 655, 245, 414, 392, 73], but that approach cannot leverage the knowledge of known OOD data which can help with learning a more precise boundary. Another issue with the existing one-class OOD work is that the datasets used for evaluation are fairly simple and small scale as MNIST and SVHN. Usually one class serves as in-domain and the remaining classes are OOD (such as using 1 as in-domain and the numbers 2 through 9 as OOD). That testing regime does not fully explore a setting with the open-set assumption that OOD can be anything not in-domain, including, for example, images of non-numbers. Recently, large pretrained text-image models such as CLIP (Contrastive Language-Image Pretraining) [434], and LiT [636] learn the image and text representations simultaneously from massive image captioning data, and can be used as a zero-shot classifier at inference time by comparing the similarity between the class label and the image in the embedding space. It has also shown that the maximum softmax score from the zero-shot classifier is a reasonably good confidence score for OOD detection [387]. However, that method only takes the in-domain class labels without leveraging the OOD information. [148, 135] propose a weaker form of outlier exposure (OE) by including OOD class names into the label set without any accompanying images, and use the sum of the softmax over the OOD class labels as the OOD score. Since the methods were only evaluated on closed-set tasks, it is unclear how well the methods would perform on one-class OOD detection problem which evaluate the generalization ability with unseen classes. We are also interested in evaluating on large-scaled datasets such as ImageNet, and its variants to see how well the methods perform on fine-grained and semantically similar classes, and on distributional shifted data. In this work, we develop a one-class open-set OOD detector using text-image pretrained models in a zero-shot fashion. Our one-class open-set OOD detector detects anything that is not in-domain, contrary to methods that specify a restricted set of predefined OOD classes. Our method can be used to detect any type of OOD, defined either in 258 fine-grained or coarse-grained labels, or even in natural language, through customizing the text labels in the text-image zero-shot classifier. We evaluate our method on large-scaled challenging benchmarks to mimic real-world scenarios, and test with (1) images from unseen classes, (2) distribution shifted images, (3) multi-object images that are may contain a mixture of in-domain and OOD objects. See Figure 14.1 for an overview. We show our method consistently outperforms the previous methods on all the challenging benchmarks. Our contributions are the following: • We find that previous methods [148, 387] do not work well on OOD detection for samples outside of the predefined class sets. 
We propose better OOD scores that utilize the in-domain and OOD labels and show that they consistently perform best for detecting hard samples from long-tail unseen classes and under distribution shifts.
• Our proposed method is flexible enough to incorporate various definitions of in-domain and OOD. Because our method is based on text-image models, users can easily customize the definitions of in-domain and OOD via text labels, for example using class names at different hierarchical levels, or even including natural language sentences.
• We tackle the challenging OOD detection for multi-object images that contain a mixture of in-domain and OOD objects. Integrating our scores into powerful segmentation models [296, 354], we are able to identify images with mixed in-domain and OOD objects, outperforming the baselines.

14.2 Methods

14.2.1 Background

Contrastively pre-trained text-image models can be used as zero-shot classification models [434]. Text-image models consist of an image encoder f_img(·) and a text encoder f_txt(·). Given an input image x and an input text t, the encoders produce embeddings z_img = f_img(x) and z_txt = f_txt(t) for the image and the text, respectively. The model is trained to maximize the cosine similarity between z_img and z_txt for paired {image, caption} data, and to minimize the cosine similarity for unpaired data. At test time, to predict the class of an image x, we first encode the candidate class names C = {c_1, ..., c_n} individually using the text encoder, obtaining {z^1_txt, ..., z^n_txt}. Then we compute the cosine similarity between the image x and the set of candidate class names C,

logits(x, C) = [z_img · z^1_txt, ..., z_img · z^n_txt].

The predicted class for this image x is c^(x) = argmax_c logits(x, C).

For the problem of OOD detection, given a set of in-domain class labels Cin = {c^in_1, ..., c^in_N}, [387] propose to use max_c p(c|x, Cin) as the confidence score, where p(c|x, Cin) is the element of softmax(logits(x, Cin)) corresponding to the label c, i.e., p(c|x, Cin) = e^{w_c} / Σ_{j∈Cin} e^{w_j}, with w_c = z_img · z^c_txt. The logits can be further scaled by a temperature factor. A high confidence score indicates that the input image is likely to be from one of the Cin classes, and thus in-domain. The corresponding OOD score is defined as

S-max_prob(x) = − max_{c∈Cin} p(c|x, Cin).   (14.1)

The above method assumes that only a set of in-domain class labels is available, i.e., without exposure to OOD labels. However, although it can be difficult to obtain OOD images, it is often very easy to produce a set of possible OOD labels. In that setting, [148, 135] propose to include the OOD labels Cout = {c^out_1, ..., c^out_M} in the candidate label set C = Cin ∪ Cout, to utilize this knowledge as a weak form of outlier exposure, without using any OOD images for training. Then the logits are

logits(x, Cin ∪ Cout) = [z_img · z^1_txt, ..., z_img · z^N_txt, z_img · z^{N+1}_txt, ..., z_img · z^{N+M}_txt],

and the OOD score is defined as

Ssum_out_prob(x) = Σ_{c∈Cout} p(c|x, Cin ∪ Cout),   (14.2)

where p(c|x, Cin ∪ Cout) = e^{w_c} / (Σ_{j∈Cin} e^{w_j} + Σ_{k∈Cout} e^{w_k}) and w_c = z_img · z^c_txt.
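A minimal sketch of the zero-shot logits and the two baseline OOD scores above (Eq. 14.1 and Eq. 14.2) follows. The encode_image and encode_text functions are hypothetical wrappers around a CLIP-style model that return L2-normalized embeddings; they are assumptions, not part of a specific library API.

```python
# Hedged sketch of zero-shot logits and the baseline OOD scores.
import numpy as np

def logits(image, class_names, encode_image, encode_text):
    z_img = encode_image(image)                                  # shape (d,)
    z_txt = np.stack([encode_text(c) for c in class_names])      # shape (n, d)
    return z_txt @ z_img                                         # cosine similarities w_c

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

def s_neg_max_prob(image, c_in, encode_image, encode_text):
    # Eq. 14.1: softmax over the in-domain labels only, no outlier exposure.
    p_in = softmax(logits(image, c_in, encode_image, encode_text))
    return -p_in.max()

def s_sum_out_prob(image, c_in, c_out, encode_image, encode_text):
    # Eq. 14.2: softmax over Cin ∪ Cout, summed over the OOD labels.
    p = softmax(logits(image, c_in + c_out, encode_image, encode_text))
    return p[len(c_in):].sum()
```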
[Figure 14.2 panels: image-to-text similarity (softmax normalized) for a butterfly image, comparing "consider in-domain classes only" (Husky, Papillon, Dobermann) against "consider both in-domain and OOD classes", with OOD labels "butterfly", "red admiral butterfly", and "a photograph of red admiral butterfly"; 2D histograms on the left contrast the separation achieved by S-max_in_prob and S-max_prob.]
Figure 14.2: Our methods utilize in-domain and OOD label sets. When the in-domain classes Cin are comprised only of dog breeds, a butterfly may be mistaken for a Papillon dog, possibly due to its similar shape and color to the dog's ears. However, when an OOD set Cout is included consisting of the class name "butterfly", a more precise butterfly type "red admiral butterfly", or a text description "a photograph of red admiral butterfly", the image embedding's similarity with this label pushes down the probabilities of the in-domain dog breeds, correctly identifying the image as OOD. Thus our method S-max_in_prob has better separation between in-domain and OOD compared to the baseline S-max_prob, as shown in the 2D histograms on the left.

14.2.2 Our methods: OOD scores utilizing in-domain and OOD label sets

In this work, we follow the same setting as [148], because in real scenarios it is generally easy to produce a set of OOD class labels. For example, for the problem of detecting non-persons in a person detection system, it is easy to create a list of non-person labels that commonly appear in photos, Cout = {animals, buildings, cars, food, ...}. We would like to exploit these labels to improve the decision boundary between in- and out-of-distribution. Note that Cout may not cover all the sub-types in the OOD space due to the long-tail distribution. That is why we assume there are unseen classes C′out. Note that our methods are only based on Cout, not C′out. We will show the good generalization of our methods to unseen classes in Section 14.3.1.

Suppose we have an in-domain label set Cin and an OOD label set Cout. Inspired by [387], we first propose the maximum softmax probability over Cin as the confidence score, and its negative as the OOD score,

S-max_in_prob(x) = − max_{c∈Cin} p(c|x, Cin ∪ Cout).   (14.3)

Note that our score is different from [387] in the sense that we apply the softmax normalization over all labels C = Cin ∪ Cout, while [387] only applies the softmax over Cin. Including Cout in the label set is important for OOD detection, as shown in Figure 14.2. An OOD image may have relatively high similarity to one of the Cin classes due to spurious features. Once Cout is included, the OOD image's similarity with the OOD labels pushes down the probabilities of the in-domain Cin classes, correctly identifying the image as OOD. Alternatively, we have the maximum softmax probability over Cout as another candidate OOD score,

Smax_out_prob(x) = max_{c∈Cout} p(c|x, Cin ∪ Cout).   (14.4)

We also consider a score based on logits without softmax normalization, as [224] previously showed that in larger-scale and real-world settings, the un-normalized maximum logit outperforms the normalized maximum softmax probability for OOD detection in single-modal models. We propose the OOD score

Smax_logit_diff(x) = max_{d∈Cout} w_d − max_{c∈Cin} w_c.   (14.5)

The score measures whether a test image has a higher similarity to any of the classes in Cout than to any of the classes in Cin.
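A minimal sketch of the three scores just introduced (Eq. 14.3–14.5) is given below. Here w_in and w_out are assumed to be the image-text similarities (logits) for the in-domain labels Cin and the OOD labels Cout, computed as in the previous sketch; the softmax is taken jointly over Cin ∪ Cout.

```python
# Hedged sketch of the proposed OOD scores.
import numpy as np

def proposed_scores(w_in, w_out):
    w = np.concatenate([w_in, w_out])
    p = np.exp(w - w.max())
    p /= p.sum()                                         # softmax over Cin ∪ Cout
    n_in = len(w_in)
    s_neg_max_in_prob = -p[:n_in].max()                  # Eq. 14.3
    s_max_out_prob = p[n_in:].max()                      # Eq. 14.4
    s_max_logit_diff = w[n_in:].max() - w[:n_in].max()   # Eq. 14.5
    return s_neg_max_in_prob, s_max_out_prob, s_max_logit_diff
```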
Having a reference for the similarity to Cin is helpful for understanding whether the similarity to Cout is truly high or not. Smax_logit_diff(x) > 0 suggests the image is more similar to classes in Cout, and thus it can be inferred to be OOD. Otherwise, it is more similar to classes in Cin and is inferred to be in-domain. Though S-max_in_prob(x) does not explicitly use the difference between the probability over Cin and that over Cout, the softmax normalization implicitly considers the difference between the probability of the in-domain class and that of the remaining classes, including the OOD class with the maximum probability. Therefore, the two scores S-max_in_prob(x) and Smax_logit_diff(x) measure similar quantities. In Section 14.2.5 we show the connection between the two. In summary, our proposed scores, along with the baseline methods, are listed in Table 14.1. Note that the proposed scores are computed based only on Cin and Cout, not C′in and C′out. C′in and C′out are only used at test time for evaluating the performance on unknown classes.

Table 14.1: Comparison between the proposed scores and the baseline methods.
Score                       Uses Cout   Uses softmax   Definition
S-max_prob [387]            No          Yes            − max_{c∈Cin} p(c|x, Cin)
Ssum_out_prob [148, 135]    Yes         Yes            Σ_{c∈Cout} p(c|x, Cin ∪ Cout)
Smax_out_prob               Yes         Yes            max_{c∈Cout} p(c|x, Cin ∪ Cout)
S-max_in_prob (ours)        Yes         Yes            − max_{c∈Cin} p(c|x, Cin ∪ Cout)
Smax_logit_diff (ours)      Yes         No             max_{d∈Cout} w_d − max_{c∈Cin} w_c
where p(c|x, Cin) = e^{w_c} / Σ_{j∈Cin} e^{w_j}, p(c|x, Cin ∪ Cout) = e^{w_c} / (Σ_{j∈Cin} e^{w_j} + Σ_{k∈Cout} e^{w_k}), and w_c = z_img · z^c_txt.

14.2.3 Extension to customized in- and out-of-distribution label sets

To compute our scores, one only needs a definition of Cout and Cin. In fact, we can extend Cout and Cin to be any label sets that are mutually exclusive. For example, the label sets can be defined at different hierarchical levels. One can use high-level superclass names to define Cout and Cin; for example, for one-class person OOD detection, Cin = {person}, Cout = {animals, cars, ...}. If the fine-grained classes are known, one can define Cout and Cin in a more precise way, for example Cin = {children, adults, ...}, Cout = {dogs, cats, trucks, buses, ...}. Because the text-image models can take any natural language as the text input, one can even use natural language to describe the sets via customized prompts, for example 'A photo of a {class} {doing}'. Therefore, our method can be easily extended to any customized definitions of in- and out-of-distribution.

14.2.4 One-class OOD detection in mixed in- and out-of-distribution multi-object images

Detecting OOD in a mixed image that contains both in-domain and OOD objects is challenging. The OOD score for the image can be low due to the confounding of the in-domain objects, causing false negatives in OOD detection. Examples include an image of a person with a pet, or a dog on a chair. To better detect in-domain and OOD mixed images, we first need to detect the multiple objects in the images. Grounding-DINO [354, 296] is one of the most powerful open-vocabulary detection models for detecting objects that correspond to an input text. For an image and a list of text labels, the output of Grounding-DINO contains a list of bounding boxes and a confidence score for each box and each text label, i.e., score(i, j) for the i-th bounding box and the j-th text label. ImageNet-1K is known to contain multi-object images [498, 562]. We apply Grounding-DINO to the images in the ImageNet-Multilabel dataset. We use Cout and Cin as the text input to Grounding-DINO.
The output scores score(i, j) for each bounding box are then treated as the logits, and we compute the OOD score for each bounding box. To identify the in- and out-of-distribution mixed images, we propose a mixture score g(x), indicating the confidence that the image contains both in-domain and OOD objects, as the greatest score difference among the bounding boxes:

g(x) = max_{b∈boxes} S_b − min_{b∈boxes} S_b,   (14.6)

where S can be any of the scores in Table 14.1. An in- and out-of-distribution mixed image will have a high g(x).

14.2.5 The connection between S-max_in_prob and Smax_logit_diff

In this section we show that the scores S-max_in_prob and Smax_logit_diff share similar components. To unify the two, we first take the logarithm of the softmax score. Since the logarithm is a monotonic function, the transformation preserves the order of the values. Since S-max_in_prob = − max_{c∈Cin} p(c|x, Cin ∪ Cout) < 0, to take the logarithm we reverse its sign:

log max_{c∈Cin} p(c|x, Cin ∪ Cout) = max_{c∈Cin} w_c − log(Σ_{j∈Cin} e^{w_j} + Σ_{k∈Cout} e^{w_k}).

The second term can be decomposed into the sum of max_{q∈Cout} e^{w_q} and the rest, thus

log(Σ_{j∈Cin} e^{w_j} + Σ_{k∈Cout} e^{w_k}) = log(max_{q∈Cout} e^{w_q} · (1 + r)) = max_{q∈Cout} w_q + log(1 + r),

where r = (Σ_{l∈Cin∪Cout, l≠q} e^{w_l}) / max_{q∈Cout} e^{w_q}. Then we have

− log max_{c∈Cin} p(c|x, Cin ∪ Cout) = max_{d∈Cout} w_d − max_{c∈Cin} w_c + log(1 + r) = Smax_logit_diff(x) + log(1 + r).

The ratio r measures the peakiness of the predicted probability distribution. When the predicted probability distribution is concentrated at the predicted OOD class, r ≈ 0 and thus log(−S-max_in_prob) ≈ −Smax_logit_diff. When the predicted probability distribution is spread out over a wide range of values, r is large and then log(−S-max_in_prob) < −Smax_logit_diff. A greater OOD score favors OOD samples over in-domain samples. So, depending on the use case, the two scores can have different advantages.

14.3 Experimental evaluation

We evaluate our proposed methods, along with several baseline methods, on large-scale datasets and challenging real-world problems. Here are some of the challenges we considered:
• Unseen classes: We evaluate scenarios where the test images belong to none of the classes in Cout and Cin. For example, for the problem of person detection, we set Cout = {animal, car, food, ...}, but at test time we have images of C′out = {toy, tree, ...}, with Cout ∩ C′out = ∅. Similarly, we can have C′in = {infant, senior, ...}, which are not seen in Cin = {children, adults, ...}.
• Distributional shift: We evaluate scenarios where there is a covariate shift in the inputs while the conditional distribution of classes is unchanged [534]. For example, a drawing of a person is a shift from natural person images. The distributional shift datasets we evaluate with are ImageNet-V2 [447], ImageNet-A [229], ImageNet-R [226], and ImageNet-Sketch [579].
• Multi-object images: We evaluate on images that contain a mixture of in-domain and OOD objects, using the ImageNet-Multilabel dataset [498, 562]. In the real world, images may contain multiple objects, sometimes a mixture of in-domain and OOD objects, for example a person with a dog (non-person). These in- and out-of-distribution mixed images are hard examples for OOD detection.

Datasets. We evaluate our model on the ImageNet-1K validation split [480]. We group the images in the dataset by their class labels, following the Pascal Visual Object Classes (VOC) WordNet hierarchy [385, 324].
The Pascal VOC provides a mapping from the ImageNet-1K classes to a few common superclasses such as dog, cat, bird, etc. The number of subclasses in each superclass is as follows, {dog: 118, bird: 59, boat: 6, bottle: 7, bus: 3, car: 10, cat: 7, chair: 4, diningtable: 1, horse: 1, person: 3, sheep:1, train: 1, aeroplane: 1, bicycle: 2}, in total 224 classes. The remaining 776 classes are from rare categories such as “fox squirrel”, “snow leopard”, “cowboy hat”, “electric guitar”, forming a long-tail distribution. None of the 776 classes are in the common categories. Based on the class hierarchy, we evaluate the one-class OOD detection problems for the superclasses dog, bird, bus, car, cat, chair, person individually. For each of the one-class OOD problem, we use the images of classes belonging to the superclass as the in-domain data, and the images of the rest classes as the OOD data. Since we would like to evaluate the unseen classes, the set of in-domain classes are randomly split into equal size non-overlapping Cin and C ′ in. To make the results reproducible, we here use the first half of classes as Cin and the second half as C ′ in. We use the OOD classes belonging to the common super categories as Cout, and the OOD classes belonging to the remaining 776 classes as C ′ out. For example, for the dog vs non-dog problem, we split the 118 dog classes into Cin and C ′ in. Cout consists of the classes belonging to the common categories {bird, boat, bottle, bus, . . . , person, chair}, and C ′ out consists of the 776 rare classes. 265 To compute our scores, we assume we only have Cin and Cout. To evaluate the performance on OOD detection, we consider how well the scores can separate in-domain and OOD images belonging to (1) Cin vs Cout, and (2) C ′ in vs C ′ out. To evaluate the performance on distribution shifted data, we use ImageNet-V2, ImageNet-A, ImageNet-R, and ImageNet-Sketch in the corresponding classes to construct the test data. Besides the OOD detection using the superclasses as in-domain, we also consider a narrower in-domain OOD detection problem. The goal is to show that our method is general for both wide and narrow one-class problem. We use “terrier”, a dog sub-type, to construct the this narrower one-class OOD detection problem. Among the 118 dog classes, 23 of them are terrier. We randomly split the 23 classes into non-overlapping Cin and C ′ in. Again Cout consists of the classes belonging to the common categories, and C ′ in consists of the 776 rare classes. The rest 95 non-terrier dog classes are considered as near-OOD C near in . Evaluation metrics To evaluate the performance on one-class OOD detection, we use the area under the ROC curve (AUROC) between the scores of in-domain and that for OOD. The higher the AUROC score suggests a better separation between in- and out-of-distribution. Models We use CLIP ViT-B/16 as the main model for evaluating our method. We did ablation study on ViT-L/14, and the conclusions were consistent. 14.3.1 Our scores outperform the baselines on one-class OOD detection tasks sec:results_compare_methods We evaluate the performance of the listed scores in Table 14.1 on one-class OOD detection tasks using ImageNet and its variants. We are in particular interested in the generalization ability of the scores on unseen samples, and distribution shifted samples. 
As shown in Table 14.2, evaluated on the tasks of dog vs non-dog, car vs non-car, and person vs non-person, our proposed scores S-max_out_prob and Smax_logit_diff consistently outperform the baselines, achieving the highest AUROCs. The two proposed scores have similar high performance. For all the three tasks, our scores’ AUC approaches 1.0 on the images from pre-defined classes Cin and Cout. It is more challenging to detect images from the unseen classes C ′ in and C ′ out. The most challenging task is person vs non-person detection on unseen data, with the best AUC only 0.78 on ImageNet. That is possibly because (1) the person superclass consists of a set very diverse subtypes so that having one subtype (such as a groom) in Cin does not help much on detecting another subtype (such as a scuba diver). (2) There are only 3 person classes {baseball player, groom, scuba diver} in ImageNet so that Cin has 266 very limited coverage of the in-domain. Including more person class names in Cin would help to reduce the ambiguity between in-domain and OOD. Our scores also perform the best on other ImageNet variant datasets, ImageNet-V2, ImageNet-R, ImageNet-A, and ImageNet-Sketch. ImageNet-A in appears to be the most difficult dataset, since even our best score has 0.06 and 0.09 drop on AUC compared with ImageNet. The proposed two scores perform similarly well on most of the tasks, except for ImageNet-A. S-max_in_prob has AUC 0.1 higher than Smax_logit_diff on detecting unseen samples for the car vs non-car task. On the other hand, Smax_logit_diff is better than S-max_in_prob by 0.08 AUC on detecting dog vs non-dog. Note that S-max_out_prob does not perform as well as S-max_in_prob, especially on the unseen OOD. That is because Cout does not cover all the possible categories in OOD space. An unseen OOD that is not similar to any Cout will have the S-max_out_prob small. We also evaluate the performance on other one-class tasks such as bird vs non-bird. Table 14.2: One-class OOD detection across datasets for various in-domain cases evaluated using AUC↑. Our scores consistently outperform the baselines for detecting samples from unseen classes and under distribution shift. Note that ImageNet-A does not have person images (N/A in Table below). 
(columns, per task: Cin vs Cout and C′in vs C′out, for Dog vs non-dog, Car vs non-car, and Person vs non-person)
Dataset                  Score                     Dog: Cin/Cout  Dog: C′in/C′out  Car: Cin/Cout  Car: C′in/C′out  Person: Cin/Cout  Person: C′in/C′out
ImageNet *               S-max_prob                0.9687         0.8357           0.8209         0.5209           0.7781            0.5096
                         Ssum_out_prob             0.7715         0.8128           0.8957         0.7440           0.9859            0.6605
                         Smax_out_prob             0.9971         0.7321           0.9723         0.3652           0.9885            0.7803
                         S-max_in_prob (ours)      0.9979         0.9847           0.9944         0.9835           0.9995            0.4900
                         Smax_logit_diff (ours)    1.0000         0.9896           0.9996         0.9360           0.9997            0.6974
ImageNet-v2              S-max_prob                0.7163         0.7384           0.9081         0.6649           0.9589            0.7131
                         Ssum_out_prob             0.9590         0.8083           0.8570         0.4885           0.7567            0.4104
                         Smax_out_prob             0.9923         0.7067           0.9449         0.3651           0.9726            0.6993
                         S-max_in_prob (ours)      0.9945         0.9795           0.9870         0.9729           0.9975            0.6436
                         Smax_logit_diff (ours)    0.9994         0.9836           0.9988         0.9272           0.9997            0.7172
ImageNet-R *             S-max_prob                0.8148         0.6334           0.8902         0.5690           0.9723            0.4824
                         Ssum_out_prob             0.9616         0.7812           0.9247         0.3688           0.8639            0.5818
                         Smax_out_prob             0.9758         0.6950           0.9584         0.2464           0.9662            0.5917
                         S-max_in_prob (ours)      0.9903         0.9733           0.9976         0.9420           0.9979            0.5924
                         Smax_logit_diff (ours)    0.9990         0.9726           0.9998         0.8177           0.9990            0.6300
ImageNet-Adversarial     S-max_prob                0.3544         0.3991           0.8668         0.6388           N/A               N/A
                         Ssum_out_prob             0.9399         0.6848           0.8528         0.3853           N/A               N/A
                         Smax_out_prob             0.9307         0.6778           0.8486         0.3886           N/A               N/A
                         S-max_in_prob (ours)      0.8963         0.9261           0.9792         0.8860           N/A               N/A
                         Smax_logit_diff (ours)    0.9769         0.9333           0.9935         0.7880           N/A               N/A
ImageNet-Sketch          S-max_prob                0.7676         0.8073           0.9179         0.6674           0.9678            0.3821
                         Ssum_out_prob             0.9643         0.8232           0.9150         0.4945           0.7934            0.5635
                         Smax_out_prob             0.9820         0.6979           0.9487         0.3672           0.9869            0.6203
                         S-max_in_prob (ours)      0.9851         0.9868           0.9967         0.9785           0.9985            0.6528
                         Smax_logit_diff (ours)    0.9993         0.9850           0.9993         0.9199           0.9999            0.6790

14.3.2 Customized in-domain and OOD label sets help to improve performance

To compute our scores, one only needs the two sets of labels Cin and Cout. The label sets can be defined at different hierarchical levels, or even include natural language. Here we explore a narrow in-domain OOD detection problem, terrier (a sub-type of dog) vs non-terrier, to demonstrate that customized label sets help to improve the performance. This problem also helps us evaluate the performance on near-OOD, since the non-terrier dogs naturally define a near-OOD set. Both Cin and Cout can be defined at coarse- or fine-grained levels. The coarse-level C^coarse_in = {terrier} or the fine-grained-level C^fine_in = {Boston terrier, Norwich terrier, ...} is paired with the coarse-level C^coarse_out = {bird, boat, ...} or the fine-grained-level C^fine_out = {robin, ..., canoe, ...} to compute our scores. To better detect near-OOD, we also consider adding the near-OOD classes C^near_out. As shown in Table 14.3, providing more precise fine-grained labels helps to improve the performance for both scores. Adding the near-OOD class labels further helps to improve the performance, particularly on near-OOD.

Table 14.3: One-class OOD detection for the dog sub-type terrier using different Cin and Cout label sets. More fine-grained label sets help to improve the performance for both scores.
Score              Label sets                    Cin vs Cout   C′in vs C′out   Cin vs C^near_out   C′in vs C^near_out   Average
S-max_in_prob      C^coarse_in ∪ C^coarse_out    0.9967        0.9982          0.8333              0.8722               0.9251
                   C^fine_in ∪ C^coarse_out      0.9954        0.9960          0.8770              0.7939               0.9156
                   C^fine_in ∪ C^fine_out        0.9998        0.9992          0.9051              0.8584               0.9406
                   + C^near_out                  0.9984        0.9974          0.9205              0.8704               0.9467
                   + add 80 prompts              0.9986        0.9981          0.9243              0.8687               0.9474
                   + add actions                 0.9981        0.9977          0.9196              0.8710               0.9466
Smax_logit_diff    C^coarse_in ∪ C^coarse_out    0.9994        0.9956          0.8170              0.8390               0.9128
                   C^fine_in ∪ C^coarse_out      1.0000        0.9987          0.9036              0.8613               0.9409
                   C^fine_in ∪ C^fine_out        1.0000        0.9995          0.9072              0.8767               0.9458
                   + C^near_out                  0.9997        0.9480          0.9758              0.9327               0.9640
                   + add 80 prompts              0.9998        0.9411          0.9769              0.9346               0.9631
                   + add actions                 0.9998        0.9400          0.9779              0.9350               0.9632

Since the CLIP model can take any form of text as input, natural language that describes the in-domain and OOD classes can also be used for computing our scores. A simple way to generate sentences from the class names is to use a prompt template such as 'A photo of {}'. We apply the 80 hand-crafted prompts of [434] to each of the class names in the label set, and take the average of their embeddings per class name as the representation of that class. We also consider adding dogs' actions such as playing, running, sleeping, and walking, using the prompt template 'A photo of a {class} {doing}'. The results show that adding the 80 prompts can further improve the performance for S-max_in_prob. However, adding actions does not have much effect on AUC.

14.3.3 OOD detection in mixed in-domain and OOD multi-object images

Images with mixed in-domain and OOD objects are difficult to detect because the in-domain object can lower the OOD score, causing the images to be misclassified. We aim to flag those mixed images such that possible post-processing can be executed to correctly classify them. We use the ImageNet-Multilabel dataset [498, 562] to evaluate the performance of OOD detection in mixed in-domain and OOD images. There are 1743 ImageNet images with more than one bounding box prediction. We want to identify images that contain both in-domain and OOD objects (Figure 14.3).

[Figure 14.3: example images with bounding boxes labeled as in-domain (ID) or OOD.]
Figure 14.3: Our methods detect OOD at the bounding box level. Images containing a mixture of in-domain and OOD objects are identified.

As shown in the bottom section of Table 14.4, none of the single-image OOD scores are able to identify the mixed in- and out-of-domain images, with AUCs between pure and mixed images around 0.5.

Table 14.4: Identifying in-domain and OOD mixed multi-object images using the mixture score g(x) defined based on different OOD scores. None of the single scores can identify mixed images. New scores based on bounding box detection improve the performance, and our scores outperform the baselines.
                                                 Dog                                  Bird
                     Score                       Pure in vs mix   Pure OOD vs mix     Pure in vs mix   Pure OOD vs mix
Scores using bbox    g(Ssum_out_prob)            0.6598           0.6918              0.8557           0.7844
                     g(S-max_in_prob) (ours)     0.6836           0.9570              0.7211           0.9833
                     g(Smax_logit_diff) (ours)   0.6861           0.8672              0.8846           0.8460
Single score         S-max_prob                  0.4907           0.5127              0.4869           0.4787
                     Ssum_out_prob               0.5492           0.4940              0.4908           0.4990
                     Smax_out_prob               0.5445           0.5152              0.4998           0.4872
                     S-max_in_prob               0.5527           0.5089              0.5169           0.4946
                     Smax_logit_diff             0.5794           0.5191              0.4886           0.4639

Image segmentation and object detection are needed for identifying those mixed images. We use Grounding-DINO to localize the multiple objects in bounding boxes along with their confidence scores.
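A minimal sketch of the per-box scoring and the mixture score g(x) (Eq. 14.6) follows. Here box_logits is assumed to be an array of shape (num_boxes, num_labels) of per-box detection confidences for the labels in Cin followed by Cout, as produced by an open-vocabulary detector; the detector call itself is not shown, and Smax_logit_diff is used as one possible choice of per-box score.

```python
# Hedged sketch of the mixture score for multi-object images.
import numpy as np

def box_ood_scores(box_logits, n_in):
    # Per-box Smax_logit_diff: max OOD logit minus max in-domain logit.
    return box_logits[:, n_in:].max(axis=1) - box_logits[:, :n_in].max(axis=1)

def mixture_score(box_logits, n_in):
    s = box_ood_scores(box_logits, n_in)
    return float(s.max() - s.min())   # a large gap suggests a mix of in-domain and OOD boxes
```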
Since Grounding-DINO has a limitation on input text length, we decided to use the high level class names for Cin ={dog} and Cout ={bird, boat, person, . . . }. As described in Section 2.4, for each bounding box, we have a list of confidence scores corresponding to the list of class names. Then we compute our proposed OOD scores separately for each bounding box. To find the mixed in-domain and OOD multi-object images, we define the mixture score g(x) in Equation 14.6 for each image as the greatest score difference among the bounding boxes. Table 14.4 shows our proposed mixture score g(x), is able to distinguish the mixed images from pure in-domain and pure OOD images, for both dog and bird datasets, having AUCs higher than the baseline. 14.4 Related work One-class anomaly detection Anomaly detection can be formulated as a one-class classification problem [284], which aims to learn the distribution of the normal data only, and then predicts anomalies as data points that are out of the normal distribution. SVM based one-class classification, also called support vector data description (SVDD), fits a hypersphere with the minimum volume that includes most of the normal data points [406]. DeepSVDD leverages the ability of deep neural networks to first learn a good representation before mapping the data to a hypersphere [475]. Later works propose hybrid models that use an autoencoder to learn the data representation, and then map the representation to a one-class classification model [72, 655, 245, 414]. The existing one-class anomaly detection methods have limitations that (1) they need to be trained so they are not zero-shot, (2) they cannot leverage the abnormal data into training, and (3) they are evaluated on simple datasets such as MNIST and CIFAR-10 and on pre-defined closed OOD sets. Multi-label OOD detection To the best of our knowledge, all the existing work on multi-label OOD detection is to detect the images that contain none of the in-domain objects [580, 583, 537]. For example, if the in-domain classes are {dog, cat}, the goal is to detect images that do not contain any instances of dogs or cats, such as an image of a chair. In comparison, our work aims to detect the images that contain a mixture of in-domain and OOD objects, such as a dog on a chair. It is challenging to detect the images with mixed in-domain and OOD objects, because the in-domain 270 objects can confound the OOD score. We believe our work is the first to address this problem for in- and out- mixed multi-object out-of-distribution detection. 14.5 Conclusion and discussion We propose a novel one-class open-set OOD detector that leverages text-image pre-trained models in a zero-shot fashion and incorporates various descriptions of in-domain and OOD. Unlike prior work, we focus on more challenging and realistic settings for OOD detection. We evaluate on images that are from the long-tail of unseen classes, distribution shifted images, and in-domain and OOD mixed multi-object images. Our method is flexible enough to detect any types of OOD, defined with fine- or coarse-grained labels. Our method shows superior performance over previous baselines on all benchmarks. Nonetheless, our method has room for more improvement. We are interested in additional ways to incorporate natural language to define in-domain and OOD beyond prompts. An additional question is how to effectively use negation to define OOD. Better understanding of the advantages and trade-offs of the proposed two scores is also part of the future work. 
271 Chapter 15 Invariant Structure Learning for Better Generalization and Causal Explainability chapter-15 Learning the causal structure behind data is invaluable for improving generalization and obtaining high-quality explanations. Towards this end, we propose a novel framework, Invariant Structure Learning (ISL), that is designed to improve causal structure discovery by utilizing generalization as an indication in the process. ISL splits the data into different environments, and learns a structure that is invariant to the target across different environments by imposing a consistency constraint. The proposed aggregation mechanism then selects the classifier based on a graph structure that reflects the causal mechanisms in the data more accurately compared to the structures learnt from individual environments. Furthermore, we extend ISL to a self-supervised learning setting, where accurate causal structure discovery does not rely on any labels. Self-supervised ISL utilizes proposals for invariant causality, by iteratively setting different nodes as targets. On synthetic and real-world datasets, we demonstrate that ISL accurately discovers the causal structure, outperforms alternative methods, and yields superior generalization for datasets with significant distribution shifts. We open-source our code at https://github.com/AaronXu9/ISL.git. 15.1 Introduction High capacity machine learning models such as deep neural networks (DNNs) have fueled transformational progress in numerous domains where the i.i.d. assumption is mostly valid [640] as they can be very effective in fitting to available training data. However, as a severe blind spot in conventional machine learning, the performance of such models can be much worse on the out-of-distribution (OOD) test data. This ‘overfitting’ phenomena can be attributed to 272 over-parameterized models such as DNNs absorbing spurious correlations as shown in Figure 15.1, from the training data and resulting in biases unrelated to the causal relationships that truly drive the input-output mapping for both training and test samples [639, 492, 463, 99, 33]. In most cases, the machine learning problems are underspecified, i.e. there are multiple distinct solutions that solve the problem by achieving equivalent held-out performance on i.i.d. data. Underspecification in practice can be an obstacle to reliable real-world deployment of high capacity machine learning models, as such models can exhibit unexpected behavior when the test data deviate from the training data [99, 23]. Various methods have been proposed towards reducing mitigating underspecification and overfitting: regularization approaches [25, 400, 551, 243] constrain the flexibility of models; data augmentation methods [620, 302] generate artificially-transformed samples invariant to labels; judicious DNN designs [22, 408, 262, 344] introduce appropriate inductive biases for the data types of interest. Notably, the CASTLE [304] approach introduces a novel regularization method that incorporates causal relationships into the regularization process. These approaches have shown some progress in improving generalization, and in some cases they yield significant improvements in test accuracy, however, underlying systematic framework beneath them is missing – they do not tackle the fundamental challenge of discovering causal relationships that are consistent across the training and test data and basing the decision making on them. 
Thus, their improvements remain restricted to specific scenarios – consistently showing significant OOD generalization improvements requires discovery of causal relationships. Accurate discovery of causal relationships would not only improve accuracy and reliability, but also enable explainable decision making, which is crucial for high-stakes applications such as healthcare or finance [508]. Learning the true causal relationships is fundamentally very challenging [491]. It is infeasible to consider all combinations of the factors of variation (such as the shape, size, and color of an image), as the number of combinations is exponential (N^M combinations with M categorical features where each can take N different values). Effective methods should reduce the prohibitively high search cost and data inefficiency while accurately discovering the underlying mechanisms.

Causal discovery has been studied using various approaches. There are methods based on interventional experiments via randomized controlled trials [417], but they are often prohibitively costly. A more realistic setting is learning from observational data. Constraint-based algorithms [526, 527] directly conduct independence tests to detect causal structure. Score-based algorithms [90, 251] adopt score functions consistent with the conditional independence statistics; however, these can only find the Markov-equivalence class [207]. Functional causal models [507, 420] aim to identify the causal structure from the equivalence class, but the heuristic directed acyclic graph (DAG) search methods suffer from high computational cost and local optimality as the number of nodes increases. To address this problem, NOTEARS [660] proposes a differentiable optimization framework. NOTEARS-MLP [661] and Gran-DAG [305] extend NOTEARS to non-linear modeling with DNNs. NoFear [594] re-evaluates the NOTEARS continuous optimization framework and subsequently introduces a local search algorithm that enhances its performance.

[Figure 15.1 panels: (a) structural causal model over X1 (shape), X2 (texture), X3 (background), and Y (label); (b) samples from three different environments; (c) discovered DAGs with MSE: ground truth, NOTEARS-MLP baseline (MSE 0.12), and ISL (ours, MSE 0.02).]
Figure 15.1: A motivational example. (a) For the image label Y (1 means the label is "cow" and 0 otherwise), X1 and X2 represent the causal parents describing the image details (here, shape and texture), and X3 (background type, where 1 indicates the presence of grass and 0 otherwise) represents a factor that is not causal to Y. S(·) is the sigmoid function. In this example, texture (X2) is twice as causal to Y as shape (X1). (b) The relationship between Y and X3 varies across environments; since the conditional dependence is not consistent across environments, X3 should not be treated as a major causal factor for Y. (c) We utilize the Mean Squared Error (MSE) as a metric to assess the prediction error for Y. This is carried out by using the projected causal parents of Y as features input to a two-layer neural network. A smaller MSE implies that the causal parent variables used for prediction are more accurate. Our proposed method ISL yields more accurate discovery of the underlying causal relations – here, it correctly identifies X1 and X2, but not X3, as the causal factors of Y, improving both explanation quality and prediction accuracy.
GOLEM [401] propose a likelihoodbased structure learning method that applies soft sparcity and DAG constraints. DARING [223] uses constraints on independent residuals to facilitate DAG learning. One common shortcoming of these approaches is relying on empirical risk minimization. Recent work has shown on the other hand that invariant risk minimization [23] can be very powerful in preventing the absorption of spurious correlations during DAG learning. Our goal in this paper is to push the state-of-the-art in accurate discovery of structural causal models (SCM), and as a consequence, improve the accuracy and reliability of models especially in the presence of severe distribution 274 shifts. Motivated by the limitations of existing work mentioned above, we propose Invariant Structure Learning (ISL), a framework that yields causal explainability, based on tying generalization and SCM learning. Intuitively, better generalization should lead to more accurate SCM learning, and an accurate causal structure should yield improved robustness and generalization. ISL encourages reinforcement between these two goals. Specifically, ISL uses generalization accuracy as a constraint to learn the invariant SCM (as a DAG) that represents the causal relationship among variables. Take Figure 15.1 as an example, where we simplify the object recognition task by using variables to represent the key factors: X1: object shape, X2: object texture (including color), X3: image background (as context), with the output label Y . Figure 15.1 (a) shows the ground truth (GT) Structural Causal Model (SCM). During training, the data (Figure 15.1 (b)) consist of samples from different environments. Baseline methods such as NOTEARS-MLP and CASTLE directly estimate the underlying causal structure, which leads to spurious correlations being absorbed, which in turn results in sub-optimal test accuracy. Our method ISL, on the other hand, learns the invariant structure that correctly identifies the SCM and yields better test accuracy. Overall, our contributions are highlighted as: • We propose Invariant Structure Learning (ISL), a novel learning framework that yields accurate causal explanations by mining the invariant causal structure underlying the training data, and generalizes well to unknown out-of-distribution test data. • We generalize ISL to self-supervised causal structure learning, which first treats the discovered invariant correlations as potential causal edges, and then uses a DAG constraint to finalize the causal structure. • We demonstrate the effectiveness of ISL on various synthetic and real-world datasets. ISL yields state-of-the-art SCM discovery (clearly outperforming alternatives on real-world data) with a particularly prominent improvement for complex graphs structures. In addition, ISL improves the test prediction accuracy throughout, with especially large improvements in cases with significant data drifts (up to ∼ 80% MSE reduction compared to alternatives). 15.2 Related Works Improving machine learning generalization. Many different approaches have been studied to improve generalization (i.e. bringing the test performance closer to training). Regularization methods [25, 400, 551, 243, 237, 573], early stopping [195], gradient clipping [416], batch normalization [259], data augmentation [620, 302] are among the most popular ones. These aren’t based on discovering the true relationship between the features. 
Towards generalization improvement via input feature discovery, supervised auto-encoders [313] add a reconstruction loss for the input features as a regularizer. Recently, some works [265, 29] combine causal discovery with model regularization for better generalization. CASTLE [304] implicitly uses the underlying Structural Equation Model (SEM) reconstruction as a regularizer to improve model generalization. However, it can't explicitly yield a DAG for the causal structure and it can't completely prevent learning spurious correlations. ISL addresses these two challenges by learning the invariant structure across environments and outputting a DAG to describe the causal data structure, eventually showing better generalization.

Causal structure discovery. Constraint-based causal discovery algorithms [526, 527] and some score-based methods [90, 251, 507, 420] conduct exhaustive or heuristic search for the DAG structure, yielding a combinatorial explosion when they scale up to a larger number of nodes. NOTEARS [660] proposes directly applying a standard numerical solver for constrained optimization to achieve a globally approximate solution, overcoming the scalability bottleneck. It formulates the structure learning problem as maximum likelihood estimation over observational data with the additional constraint that the weight matrix has to represent a DAG with the acyclicity and sparsity properties. NOTEARS-MLP [661] and Gran-DAG [305] extend NOTEARS to non-linear functions by using DNNs. RL-BIC [674] uses Reinforcement Learning to search for the DAG with the best score. GOLEM [401] applies a likelihood-based objective with soft sparsity and DAG constraints. These methods don't consider using a generalization quantification as an indicator or constraint during DAG learning, which makes the learned DAG sometimes absorb biases and spurious correlations from the data.

Invariant Learning. The field of invariant learning is increasingly gaining traction, largely due to its implications for causality. EIIL [98] focuses on the inference of environments to facilitate invariant learning, while ZIN [350] delves into the conditional feasibility and methods for environment inference. Another noteworthy contribution is GALA [86], which advocates for invariant learning without the need for explicit environment partitioning. Despite these advances, our work stands apart in its capability for causal discovery. Unlike the aforementioned studies, our methodology uniquely allows for simultaneous learning of causal structures and predictions.

15.3 Methodology

In Section 15.3.1, we first present the problem definition and the motivation behind our work by discussing how spurious correlation can affect model generalization and how its influence can be alleviated. Section 15.3.2 describes the proposed ISL framework for a supervised learning setting. In Section 15.3.3, we extend our discussion to the generalization of our approach to a self-supervised setting.

15.3.1 Motivations

Problem definition. Standard supervised learning is defined for a dataset with given variables $\hat{X} = (Y, X_1, X_2, \ldots, X_d)$, including features $X = \{X_i\}_{i=1}^{d} \in \mathcal{X}$ and target $Y \in \mathcal{Y}$; the goal is to learn a predictive model $f_Y : \mathcal{X} \rightarrow \mathcal{Y}$. $P_{X,Y}$ denotes the joint distribution of the features and target, $D_{train}$ denotes the training data with $N$ samples, and $D_{test}$ denotes the test data. Ideally, we expect both $D_{train}$ and $D_{test}$ to be i.i.d. samples from the same distribution $P_{X,Y}$.
However, it is hard to satisfy this condition for real-world data. The problem becomes more severe when the model overfits to the training set or the training set does not reveal the underlying distribution $P_{X,Y}$.

Spurious correlations and causality. One perspective to explain poor generalization due to overfitting is models learning spurious correlations. Broadly, a correlation can be considered spurious when the relationship does not hold across all samples in the same manner [23]. For example, for the image recognition task in Figure 15.1(a), the model may use the green color of the grass to recognize cows, instead of the complete profile of their shape. The correlation between green-colored grass and the cow label would be spurious and not consistent. In contrast, with causal learning, our goal is to learn stable and invariant relationships that generalize well. Let us consider that there is an SCM defining how the random variables $\hat{X} = (Y, X_1, X_2, \ldots, X_d)$ are generated from each other. The target variable $Y$ is generated by a function $f_Y(Pa(Y), u_Y)$, where $Pa(Y)$ denotes the causal parents of $Y$ in the SCM. Non-parametric SEM [418] proves that if we use the causal parents of $Y$ as inputs to predict $Y$, the learned model would be optimal and generalize well on the unknown test set. Separating the causal parents of $Y$ from spurious correlations is the key to obtaining better generalization and causal explainability.

Invariant structure across environments: An environment is used to distinguish different properties of the data (such as the generative source characteristics), and can help reveal reasons for spurious correlations. Examples of environments can be different devices for capturing the images, or the hospitals at which the patient data are collected. Broadly, it can be considered as a set of conditions, interpreted as the 'context' of the data [391]. As an important criterion for distinguishing causality from spurious correlations, the causal structure of $Y$ should be invariant across all possible environments. Our goal is to learn such an invariant structure across environments, which should yield better generalization.

15.3.2 Learning framework

Figure 15.2: Top: The proposed Invariant Structure Learning (ISL) framework. Given raw data, we build different environments using unsupervised clustering, unless the data source information is provided. For different environments, each ISL module outputs a summarized DAG to represent the learned invariant structure. An aggregation mechanism then selects the optimal predictor based on a graph structure that reflects the causal mechanisms in the data more accurately. During training, the constraint on the $Y$ prediction across environments helps learn an invariant structure. Consequently, the learned DAG leads to a superior predictor. Bottom: Details of the ISL module. $\theta_1^Y$ is the invariant structure of $Pa(Y)$ shared across all modules.

Figure 15.2 shows the overall Invariant Structure Learning (ISL) framework. Given raw training data, we first build different environments. If we had information on the data generation and collection process, it would be simple to build different environments based on it.
Without such information, we propose to build environments in an unsupervised way using K-means clustering [361], which groups the raw data into clusters that each represent an environment. The clustering can be done on the raw data or on learned representations, which would require an encoder to map the raw data. To determine the number of clusters, we use the Elbow [547] and Silhouette [471] methods. To balance the data size across environments, we employ upsampling: we augment the data size of each environment to reach $n$, the largest number of data samples in any environment. Figure 15.2 (top) depicts the environment-building process, where each color represents a different source or generation method for the sample. In general, at least two diverse environments are sufficient to learn the invariant structure [23]. We show that ISL is robust to the number of environments and typically an intermediate value is optimal.

After assigning data samples to different environments, our goal is to learn the invariant structure that results in superior generalization for predicting $Y$. In each environment, using that environment's data, an ISL module (see Figure 15.2 (bottom)) independently learns a DAG which defines the variable relationships for that specific environment. To learn the invariant structure for predicting $Y$, we add a constraint among ISL modules that the parameters to reconstruct $Y$ should be identical across environments. The desideratum for invariance is expressed as a loss function over all environments in the training data. Ideally, the generalization goal would be minimization of an OOD risk $R^{OOD}(f) = \max_{e \in \epsilon_{all}} R^{e}(f(X))$ over all possible environments, not only the ones in the training data. We aim to approximate this with a tractable objective. We define the risk under a certain environment $e$ as $R^{e}(f) = \mathbb{E}_{X^{e}, Y^{e}}[\ell(f(X^{e}), Y^{e})]$, where $\ell$ denotes the loss function. We decompose the objective function $f(\cdot)$ into two components. The first is to find the representation of the causal parents of $Y$ from the given $X$, which can be considered as the invariant structure for $Y$ across environments. The second is to optimize the classifier with the learned $Pa(Y)$ as the input. $\theta$ is a multi-layer perceptron used to approximate $f(\cdot)$. It consists of: (i) $g(\cdot)$ with parameters $\theta_1^Y$ to learn a representation of $Pa(Y)$, and (ii) $h(\cdot)$ that takes this representation as input and yields the prediction for $Y$. To learn the representation of $Pa(Y)$, $g(\cdot)$ should follow the causal structure of $Y$. Overall, the proposed objective to learn the invariant causal structure (as a DAG) is summarized as:

$$\min_{\theta_1^Y, h} \sum_{e \in \epsilon_{all}} R^{e}(h \circ \theta_1^Y(X)), \quad \text{s.t.} \quad \theta_1^Y, \theta_r^Y, \theta^X = \arg\min_{\hat{\theta}} \sum_{e \in \epsilon_{all}} R_{DAG}(X, \hat{\theta}). \tag{15.1}$$

This objective is in bi-level optimization form. The outer loop serves the final goal of obtaining a predictive model for $Y$ that generalizes well on all environments, requiring that $\theta_1^Y$ extract the representation of $Pa(Y)$. The inner loop adds the constraint that $\theta_1^Y$ should be invariant across all environments while learning their DAGs. As shown in Figure 15.2, we use a DNN to learn the SCM (represented as a DAG) of the dataset. The parameters $\theta$ consist of three parts: the first layer to reconstruct variable $Y$, $\theta_1^Y$; the remaining layers to reconstruct variable $Y$, $\theta_r^Y$; and the layers to reconstruct the other variables $X$, $\theta^X$.
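To make the environment-building step described above more concrete, here is a minimal sketch (the helper and variable names are our own illustrative choices, not the original implementation): it clusters the training samples with K-means, picks the number of environments with a silhouette-based heuristic (an elbow test on the within-cluster sum of squares could be used instead, as in the text), and upsamples each environment to the size of the largest one.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_num_environments(X, k_min=2, k_max=6):
    """Pick k by the silhouette score (an elbow test on KMeans inertia works similarly)."""
    best_k, best_score = k_min, -1.0
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

def build_environments(X, Y):
    """Cluster samples into environments, then upsample each to the largest environment size."""
    k = choose_num_environments(X)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    n_max = max((labels == e).sum() for e in range(k))
    rng = np.random.default_rng(0)
    envs = []
    for e in range(k):
        idx = np.flatnonzero(labels == e)
        idx = rng.choice(idx, size=n_max, replace=True)  # upsample to balance environments
        envs.append((X[idx], Y[idx]))
    return envs
```

In ISL, each of these environments would then be handled by its own ISL module, with the shared first layer $\theta_1^Y$ tying the modules together.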
We use the following objective function as the DAG loss $R_{DAG}(X, \hat{\theta})$:

$$R_{DAG}(X, \hat{\theta}) = L_{rec}(X, \hat{\theta}) + \frac{\rho}{2}\,|h(W)|^2 + \alpha\, h(W) + \beta L_{sparse}(\theta), \tag{15.2}$$

where $L_{rec}(\theta) = \frac{1}{2N}\|\hat{X} - \theta(\hat{X})\|_F^2$ and $h(W) = \mathrm{Tr}(e^{W \odot W}) - d$, with $\|\cdot\|_F$ the Frobenius norm; $L_{DAG} = \frac{\rho}{2}|h(W)|^2 + \alpha h(W)$ denotes the DAG constraint. $N$ denotes the number of samples, $\mathrm{Tr}(\cdot)$ is the trace operator, and $W \in \mathbb{R}^{(d+1)\times(d+1)}$ is an adjacency matrix that represents the connection strength between variables; the final DAG is summarized from $W$. [660] proves that $W$ is a DAG if and only if $h(W) = 0$. $W$ is summarized from the first layer $\theta_1$ of $\theta$: specifically, $[W]_{k,j}$ is the L2 norm of the $k$-th row of the parameter matrix $\theta_1^j$, where $\theta^j$ denotes the DNN parameters used to reconstruct variable $j$, which we decompose into the first layer $\theta_1^j$ and the remaining layers $\theta_r^j$ (Figure 15.2). $L_{sparse}(\theta) = \beta_1\|\theta_1^Y\|_1 + \beta_2\|\theta_r^Y\|_2 + \beta_3\|\theta^X\|_1 + \beta_4\|\theta^X\|_2$, where $\|\cdot\|_1$ and $\|\cdot\|_2$ denote $\ell_1$ and $\ell_2$ regularization, respectively, and the $\beta_i$ are hyperparameters that can be optimized on a validation set. Equation 15.2 follows the Augmented Lagrangian [149] approach, where $\alpha > 0$ and $\rho > 0$ are gradually increased to find solutions that minimize $h(W)$.

To learn the invariant structure across different environments, we use a shared layer $\theta_1^Y$ in all modules. It learns a representation of $Pa(Y)$ given the input features $X$ (i.e., $\theta_1^Y(X) \approx Pa(Y)$), which acts as a constraint to learn an invariant structure for $Y$ prediction across environments. We simplify the training of Equation 15.1, and the overall training objective of the proposed ISL is defined as:

$$\min_{\theta, h} \sum_{e \in \epsilon_{all}} \Big( R^{e}(h \circ \theta_1^Y(X)) + \gamma\, R^{e}_{DAG}(X, \hat{\theta}_1^Y, \theta_r^Y, \theta^X) \Big), \tag{15.3}$$

where $\gamma$ is a trade-off parameter and $R^{e}_{DAG}(X, \hat{\theta}_1^Y, \theta_r^Y, \theta^X)$ is the invariant structure constraint. We solve this problem with the quasi-Newton method L-BFGS-B [669]. Algorithm 6 summarizes the training procedure. After DAG learning converges in all environments, we obtain the invariant structure of the $Y$ prediction by first computing the $Y$-related columns of the adjacency matrix $W$ from the shared $\theta_1^Y$, and then using a threshold to select the learned $Pa(Y)$ (the $Y$-related DAG). For the final target, we fix all parameters in Equation 15.3 except $h(\cdot)$, and fine-tune $h(\cdot)$. To obtain the overall DAG for the entire dataset, we aggregate the DAGs across different environments by keeping only the overlapping edges across all environments. Please note that $h(\cdot)$ always takes the output of $g(\cdot)$ as input and yields the prediction for $Y$. In practice, the parameters of $h(\cdot)$ are updated twice: initially, $g(\cdot)$ and $h(\cdot)$ are combined to discover the causal parents of the target $Y$ through the learning of $g(\cdot)$, with $h(\cdot)$ merely assisting the learning of $g(\cdot)$ through the DAG constraint and ERM; $h(\cdot)$ may be suboptimal for $Y$ prediction at this stage.
Then, after determining the causal parents of the target variable $Y$ by applying a threshold to the weight matrix $W$ and reconstructing the DAG, we fix $g(\cdot)$. We then fine-tune $h(\cdot)$, using only the causal parent variables discovered by the (fixed) $g(\cdot)$ to predict the target variable $Y$.

Algorithm 6: Supervised Invariant Structure Learning
Input: Dataset D. Output: DAG, Y predictor $f(X) = h \circ \theta_1^Y(X)$.
1. Build $n$ environments $\epsilon_{all}$; for each $e \in \epsilon_{all}$, set $\rho^e = 1$, $\alpha^e = 0$, $h^e(W) = \infty$.
2. Set termination conditions $h(W)_{tol} = 10^{-8}$, $\rho_{max} = 10^{16}$, maximum number of iterations $N_{MAX} = 100$, and $i = 0$.
3. While $i < N_{MAX}$ and $\max_{e \in \epsilon_{all}}(h^e(W)) > h(W)_{tol}$ and $\min_{e \in \epsilon_{all}}(\rho^e) < \rho_{max}$: set $i \mathrel{+}= 1$; for $e = 1$ to $k$: compute $R^e(h \circ \theta_1^Y(X))$ and $R^e_{DAG}(X, \hat{\theta}_1^Y, \theta_r^Y, \theta^X)$ in Equation 15.2; update $h$, $\theta_1^Y$, $\theta_r^Y$, $\theta^X$ with L-BFGS-B [669]; compute $W$ from $\theta_1^Y$; update $h^e(W)$; update $\rho^e$ and $\alpha^e$.
4. Summarize the DAG from $\theta_1$.
5. Fix all trainable parameters in Equation 15.3 except $h$; fine-tune $h$ and obtain the final $f(X) = h \circ \theta_1^Y(X)$.

15.3.3 Generalizing to the self-supervised setting

In many scenarios, target labels aren't available, rendering self-supervised causal graph discovery a paramount problem. Conventional functional causal models, such as NOTEARS, aim to find a trade-off between three objectives: the learned $W$ should be optimal for the SEM $X = XW$, while $W$ should both represent a DAG and be sparse (see Equation 15.2). As all variables are treated with equal importance and are not distinguishable a priori, there is no prior knowledge about which nodes should be source or target nodes. Due to the large variance among node distributions caused by the variables' semantic meanings, reconstruction-accuracy-driven learning is unstable and sensitive to the variable distribution – some variables can be described using a simple distribution, while others may be hard to estimate due to differences in data source. This can lead to local minima, causing the learned DAG to deviate from the real causal structure [277]. We propose a two-step DAG learning approach, as shown in Figure 15.3.

Figure 15.3: ISL in the self-supervised setting. (a) Step 1: iteratively set each variable as the target and propose an invariant structure for each variable. (b) Step 2: aggregate all proposed causal parents for each variable into a single graph (which may not be a DAG), then adjust the activation ability of the DNN by de-activating all parameters related to non-proposed edges, and lastly optimize Equation 15.2 to obtain the final DAG.

Step 1. Invariant causality proposal: We first build multiple environments. Then, we iteratively set each node as $Y$ (Figure 15.3) and run ISL to propose the invariant structure for $Y$ as candidate causal parents. ISL keeps the invariant variables that are important for the prediction of $Y$ under the overall DAG constraint. As such, the learned invariant structure corresponds to either true causal parents of $Y$ or variables that have a strong correlation with $Y$, which are thus treated as candidate causal parents of $Y$.

Step 2. Constrained graph discovery: We aggregate the candidate causal parents of each variable from Step 1 into a single aggregated graph (Figure 15.3), which may contain bi-directional edges that aren't allowed in a DAG. We build a $(d+1)\times(d+1)$ binary adjacency matrix $W$ to represent the graph, where $d+1$ is the number of nodes; the $j$-th column of $W$ represents the potential causal parents of node $j$. As described in Section 15.3.2, during DAG learning, for each variable $X_j$ we use a DNN $\theta^j$ to reconstruct $X_j$ given the other variables. There is a corresponding mapping between the $j$-th column of $W$ and the first layer $\theta_1^j$ of the DNN $\theta^j$: the $k$-th row in the parameter matrix of $\theta_1^j$ encodes the contribution of node $k$ to node $j$, which is associated with the value of $[W]_{k,j}$.
To narrow down the search space and improve DAG learning, if $[W]_{k,j}$ is 0 (node $k$ is not a potential causal parent of node $j$, as summarized from Step 1), we deactivate the corresponding parameters by fixing the $k$-th row of $\theta_1^j$ to 0; if $[W]_{k,j}$ is 1, we don't add a weight constraint to the first layer $\theta_1^j$. We use this parameter modification as a constraint on DAG learning and run a constrained version of DAG learning (Equation 15.2) to obtain the final DAG.

Algorithm 7: Self-Supervised Invariant Structure Learning
Input: Dataset D. Output: DAG.
Step 1 (Invariant causality proposal):
1. Build $n$ environments $\epsilon_{all}$.
2. For $X_i = X_1$ to $X_d$: set $Y = X_i$, run Algorithm 6, and keep only $Pa(X_i)$.
Step 2 (Constrained graph discovery):
3. Aggregate $Pa(X_i)$ for each variable to form the initial graph $G$ (which may not be a DAG).
4. Summarize an initial adjacency matrix $W'$.
5. Add weight constraints on $\theta$ based on $W'$.
6. Run DAG learning (Equation 15.2) to optimize $\theta$ and yield a DAG that describes the approximated causal structure.

15.4 Experiments

In this section, we evaluate the proposed ISL framework for causal explainability and better generalization. We conduct extensive experiments in two settings based on the availability of target labels: supervised learning tasks in Section 15.4.1 and self-supervised learning tasks in Section 15.4.2.

Baselines: For causal explainability, we choose NOTEARS-MLP [661], GOLEM [401], and NoFear [594] as the baselines for learning the SCM represented as a DAG. For target prediction, we choose a standard MLP and CASTLE [304] as the baseline methods.

Metrics: We evaluate the estimated Y-related DAG and whole DAG structure using the Structural Hamming Distance (SHD): the number of missing, falsely detected, or reversed edges (lower is better). We evaluate the target (Y) prediction accuracy with the Mean Squared Error (MSE). We compute SHD and the errors multiple times and report the mean values.

15.4.1 Supervised learning tasks

15.4.1.1 Synthetic data

We first examine the performance of ISL in accurately discovering the causal structure, as well as its target prediction performance, using synthetic tabular datasets with known causal structure and target labels. We aim to mimic challenging scenarios encountered in real-world data generation and collection processes, where the data may consist of samples from different environment sources while the target-related causal structure is consistent across the entire dataset (e.g., Figure 15.1). We construct the synthetic datasets in the following way:

• Step 1: We randomly sample an initial DAG G′ following the Erdos-Renyi or Scale-Free schema with different edge densities. We randomly select one node (which isn't a source node) as the target Y. We compute the set of causal parent nodes C of Y and its size c. If c < c_min, we randomly add c_min − c nodes into C as causal parents of Y.

• Step 2: To simulate spurious correlations, we create s ∈ [1, ..., k] new nodes S, and these nodes act as causal descendants of Y. After defining the causal parents and descendants of Y, we obtain the GT DAG G. For all nodes X except Y and S, we define an ANM X = F(X) + ϵ to generate data on top of G, where F is a two-layer MLP whose parameters are uniformly sampled from (−2, −0.5) ∪ (0.5, 2), and ϵ is external noise randomly sampled from Gaussian, Exponential, or Uniform distributions. Y is generated from its causal parents C as Y ∼ P(Y = 1 | sigmoid(G(C) + ϵ)).
G can be either linear (a uniformly random weight matrix) or non-linear (same initialization method as F).

• Step 3: We randomly select the number of environments e from the uniform distribution over [2, 5]. For each environment, all nodes (except the s added spurious-correlation nodes S) in the GT DAG G follow the ANM defined in Step 2, but with a different random seed and noise term. For S, the correlation with Y isn't invariant across environments and is controlled by a continuous variable r ∈ [0, 1]. Specifically, for each node S_i in S, P(S_i = 1 | Y = 1) = r = P(S_i = 0 | Y = 0).

• Step 4: We generate two different kinds of test sets: in-distribution (ID) and out-of-distribution (OOD). Both have the same number of environments as the training set. The ID test set uses the same value of r as the training set, while the OOD test set uses a uniformly random sampled r, representing unknown test environments: P(S = ĝ(Y) | Y) = r, and with probability 1 − r, S is a random variable.

We generate graphs of different sizes with 5 combinations of c and s (see Table 15.1), with 10 datasets of 1000 samples per environment. Table 15.1 shows that ISL significantly outperforms the others for both Y prediction and Y-related DAG learning. Particularly in OOD scenarios, the outperformance is more prominent, with up to an 83% decrease in MSE compared to the black-box MLP and a 96% decrease compared to CASTLE. This is attributed to more accurate causal graph discovery (evident from the lower SHD), which allows capturing dynamics that are consistent between ID and OOD data, and hence the model generalizes better.

Table 15.1: Synthetic tabular data experiments in the supervised learning setting. Note that black-box MLP and CASTLE can't provide DAGs. ISL yields lower MSE for ID and OOD, and lower SHD.

Number of nodes | Metrics | MLP | NOTEARS-MLP | CASTLE | GOLEM | NoFear | ISL (Ours)
3 (c=2, s=1) | ID MSE | 0.008 | 0.090 | 0.012 | 0.171 | 0.422 | 0.005
3 (c=2, s=1) | OOD MSE | 0.016 | 0.191 | 0.020 | 0.394 | 0.451 | 0.010
3 (c=2, s=1) | Average SHD | - | 2 | - | 3 | 2 | 0
4 (c=2, s=2) | ID MSE | 0.006 | 0.082 | 0.019 | 0.250 | 0.441 | 0.006
4 (c=2, s=2) | OOD MSE | 0.014 | 0.152 | 0.032 | 0.250 | 0.411 | 0.009
4 (c=2, s=2) | Average SHD | - | 2 | - | 4 | 2 | 0
5 (c=3, s=2) | ID MSE | 0.004 | 0.093 | 0.020 | 0.250 | 0.427 | 0.004
5 (c=3, s=2) | OOD MSE | 0.004 | 0.060 | 0.016 | 0.250 | 0.419 | 0.004
5 (c=3, s=2) | Average SHD | - | 3 | - | 5 | 3 | 0
9 (c=4, s=5) | ID MSE | 0.006 | 0.061 | 0.025 | 0.250 | 0.416 | 0.005
9 (c=4, s=5) | OOD MSE | 0.031 | 0.174 | 0.160 | 0.250 | 0.407 | 0.005
9 (c=4, s=5) | Average SHD | - | 4 | - | 10 | 4 | 0
20 (c=10, s=10) | ID MSE | 0.008 | 0.051 | 0.137 | 0.250 | 0.403 | 0.008
20 (c=10, s=10) | OOD MSE | 0.018 | 0.251 | 0.252 | 0.250 | 0.434 | 0.007
20 (c=10, s=10) | Average SHD | - | 9 | - | 19 | 9 | 1

Counterfactual simulations: Besides accurate predictions for test data, causal structure discovery is also notable for its capability of accurately modeling counterfactual outcomes, i.e., predicting how the output would change with certain input changes. To demonstrate this, we design experiments by modifying the dataset used above. We randomly select a node X_i from the causal parent set C or the spurious correlation set S and change the value of X_i while keeping the other nodes unmodified. Then, we test the prediction accuracy on this dataset. Table 15.2 shows that ISL yields more accurate counterfactual outcomes compared to the alternatives. Particularly when the counterfactual source is a spurious-correlation variable, baseline methods are much worse at outcome prediction.

Table 15.2: Synthetic tabular data counterfactual simulation experiments. MSE is shown for various counterfactual outcomes, obtained by modifying the 'counterfactual source' variables.
Counterfactual source | MLP | NOTEARS-MLP | CASTLE | ISL (Ours)
Causal parent X1 | 0.021 | 0.280 | 0.034 | 0.016
Causal parent X2 | 0.043 | 0.301 | 0.064 | 0.012
Spurious correlation S1 | 0.184 | 30.962 | 0.471 | 0.012

15.4.1.2 Real-world data

We perform supervised learning experiments on real-world datasets with GT causal structure: the Boston Housing [44, 1] and Insurance [44, 2] datasets. For each, we randomly split train/validation/test with the proportions 0.8/0.1/0.1. We conduct three experiments and report the average performance. We consider the accuracy of Y prediction and of target-related DAG (causal parents of Y) learning. Specifically, Boston Housing contains information collected by the U.S. Census Service concerning housing in Boston. There are 14 attributes (including 1 binary variable) and 506 samples, in which the median value of homes (MED) is to be predicted. For ISL, we first calculate the Within-Cluster Sum of Squares (WSS) error for different values of k, and choose the k at which the decrease in WSS starts to diminish. Based on this, we build k = 2 environments. We obtain the Y-related GT DAG of Boston Housing from [595, 645]. The Insurance dataset is based on a network for car insurance risk estimation. The network has 27 nodes and 52 edges, with 20,000 samples. The Insurance dataset provides the GT causal structure as a DAG. Three of the observable nodes ('PropCost', 'LiabilityCost' and 'MedCost') are designated as 'outputs'. Besides the designated outputs, we also add the variable 'CarValue' (based on its importance for the task) as a target. For ISL, we similarly use K-means clustering to build k = 3 different environments. Table 15.3 summarizes the results for Y prediction and Y-related causal structure learning. We observe that ISL significantly outperforms the black-box MLP in all cases (up to 74% MSE reduction), as well as NOTEARS-MLP and CASTLE.

Table 15.3: Supervised learning experiments on real-world data. Note that MLP and CASTLE cannot provide DAGs (and thus have no SHD).

Dataset | Target | Metrics | MLP | NOTEARS-MLP | CASTLE | GOLEM | NoFear | ISL (Ours)
Boston Housing | MED | MSE (↓) | 0.16 | 0.12 | 0.10 | 0.53 | 0.53 | 0.05
Boston Housing | MED | SHD (↓) | - | 2 | - | 3 | 3 | 1
Insurance | 'PropCost' | MSE (↓) | 0.40 | 0.99 | 0.36 | 0.68 | 1.03 | 0.34
Insurance | 'PropCost' | SHD (↓) | - | 2 | - | 1 | 1 | 0
Insurance | 'MedCost' | MSE (↓) | 0.69 | 1.03 | 0.55 | 0.86 | 0.99 | 0.52
Insurance | 'MedCost' | SHD (↓) | - | 2 | - | 4 | 4 | 0
Insurance | 'LiabilityCost' | MSE (↓) | 0.94 | 0.39 | 0.38 | 0.50 | 1.03 | 0.25
Insurance | 'LiabilityCost' | SHD (↓) | - | 1 | - | 2 | 3 | 0
Insurance | 'CarValue' | MSE (↓) | 0.23 | 0.60 | 0.23 | 0.97 | 0.97 | 0.23
Insurance | 'CarValue' | SHD (↓) | - | 2 | - | 6 | 4 | 1

Impact of the number of environments: The default number of environments used in all experiments is 3. We investigate the impact of the number of environments on Boston Housing and synthetic data.
Table 15.4 shows the clear value of having multiple environments – with only one, the invariant constraint is not effective, yielding worse results. Increasing the number of environments has diminishing returns.

Table 15.4: The impact of the number of environments for ISL in the supervised learning setting.

Dataset | Target | Metrics | ISL (e=1) | ISL (e=2) | ISL (e=7)
Boston Housing | MED | MSE | 0.067 | 0.051 | 0.051
Boston Housing | MED | SHD | 5 | 1 | 1
Synthetic data (Figure 15.1) | Y | MSE | 0.110 | 0.022 | 0.020
Synthetic data (Figure 15.1) | Y | SHD | 2 | 0 | 0

Figure 15.4: Visualization of the discovered causal structure: (a) Y-related DAGs (supervised) for the Boston Housing and Insurance datasets; (b) discovered DAG for the Insurance dataset (self-supervised). Blue solid arrows are edges shared between our results and the GT, red solid arrows denote edges that we identify but with the wrong direction, green solid arrows denote proposed edges that the GT does not contain, and yellow dashed arrows denote edges in the GT that we miss.

15.4.2 Self-supervised learning

For self-supervised learning tasks, there is no target variable, and the goal is to learn an accurate SCM, represented as a DAG, that captures the underlying causal structure of a given dataset. We conduct experiments on two real-world datasets: the Sachs [482, 3] and Insurance [44, 2] datasets. The Sachs dataset is for the discovery of the protein signaling network based on expression levels of different proteins and phospholipids in human cells [482], and is a popular benchmark for causal graph discovery, containing both observational and interventional data. The true causal graph from [482] contains 11 nodes and 17 edges. We conduct our two-stage DAG learning based on ISL by building 3 environments and compare the DAG results with different baselines. Table 15.5 shows that ISL outperforms all other methods in correct discovery of the GT DAG on both Sachs and Insurance. On the challenging Insurance data, the number of correct edges is 72% higher for ISL compared to NOTEARS-MLP.

Table 15.5: Self-supervised causal graph discovery on the Sachs and Insurance datasets.

Dataset | Method | Total Edges | Correct Edges (↑) | SHD (↓)
Sachs | RL-BIC [674] | 10 | 7 | 11
Sachs | GraN-DAG [305] | 10 | 5 | 13
Sachs | NOTEARS-MLP [661] | 11 | 6 | 11
Sachs | DAG-GNN [631] | 15 | 6 | 16
Sachs | GOLEM [401] | 11 | 6 | 14
Sachs | NOTEARS [660] | 20 | 6 | 19
Sachs | ICA-LiNGAM [507] | 8 | 4 | 14
Sachs | CAM [193] | 10 | 6 | 12
Sachs | DARING [223] | 15 | 7 | 11
Sachs | ISL (Ours) | 12 | 8 | 8
Insurance | NOTEARS-MLP [661] | 35 | 18 | 39
Insurance | NOTEARS [660] | 24 | 10 | 46
Insurance | GOLEM [401] | 36 | 28 | 61
Insurance | NoFear [594] | 15 | 10 | 49
Insurance | ISL (Ours) | 46 | 31 | 27

15.5 Conclusions

We propose a novel method, ISL, for accurate causal structure discovery. The ISL framework is based on splitting the training data into different environments and learning the structure that is invariant with respect to the selected target.
We demonstrate the effectiveness of ISL in both supervised and self-supervised learning settings. On synthetic and real-world datasets, we show that ISL yields more accurate causal structure discovery compared to alternatives, which also results in superior generalization, especially against severe distribution shifts.

15.6 Limitations and Future Work

Our approach has proven effective in uncovering the causal structure by leveraging the assumption of its invariance across multiple environments. However, this method relies heavily on partitioning the dataset into distinct environments via clustering, a process that may present challenges in cases where data is characterized by high dimensionality or fluctuating densities. Therefore, a potential limitation of our current method is its dependency on successful clustering, and a shortcoming may occur when clustering algorithms struggle due to data complexity. Moving forward, we aim to focus on developing enhanced algorithms with an emphasis on more evenly distributing the dataset. By improving the partitioning of the data, we hope to increase the robustness and applicability of our approach in diverse data scenarios. In doing so, we anticipate further optimizing our method's capacity to accurately discern causal structures, thereby advancing the field's understanding and modeling of complex systems.

Chapter 16
Lightweight Learner for Shared Knowledge Lifelong Learning

In Lifelong Learning (LL), agents continually learn as they encounter new conditions and tasks. Most current LL is limited to a single agent that learns tasks sequentially. Dedicated LL machinery is then deployed to mitigate the forgetting of old tasks as new tasks are learned. This is inherently slow. We propose a new Shared Knowledge Lifelong Learning (SKILL) challenge, which deploys a decentralized population of LL agents that each sequentially learn different tasks, with all agents operating independently and in parallel. After learning their respective tasks, agents share and consolidate their knowledge over a decentralized communication network, so that, in the end, all agents can master all tasks. We present one solution to SKILL which uses Lightweight Lifelong Learning (LLL) agents, where the goal is to facilitate efficient sharing by minimizing the fraction of the agent that is specialized for any given task. Each LLL agent thus consists of a common task-agnostic immutable part, where most parameters are, and individual task-specific modules that contain fewer parameters but are adapted to each task. Agents share their task-specific modules, plus summary information ("task anchors") representing their tasks in the common task-agnostic latent space of all agents. Receiving agents register each received task-specific module using the corresponding anchor. Thus, every agent improves its ability to solve new tasks each time new task-specific modules and anchors are received. If all agents can communicate with all others, eventually all agents become identical and can solve all tasks. On a new, very challenging SKILL-102 dataset with 102 image classification tasks (5,033 classes in total, 2,041,225 training, 243,464 validation, and 243,464 test images), we achieve much higher (and SOTA) accuracy over 8 LL baselines, while also achieving near-perfect parallelization.
Code and data can be found at https://github.com/gyhandy/Shared-Knowledge-Lifelong-Learning

16.1 Introduction

Lifelong Learning (LL) is a relatively new area of machine learning (ML) research, in which agents continually learn as they encounter new tasks, acquiring novel task knowledge while avoiding forgetting of previous tasks [413]. This differs from standard train-then-deploy ML, which cannot incrementally learn without catastrophic interference across successive tasks [151]. Most current LL research assumes a single agent that sequentially learns from its own actions and surroundings, which, by design, is not parallelizable over time and/or physical locations. In the real world, tasks may happen in different places; for instance, we may need agents that can operate in deserts, forests, and snow, as well as recognize birds in the sky and fish in the deep ocean. The possibility of parallel task learning and sharing among multiple agents to speed up lifelong learning has traditionally been overlooked. To solve the above challenges, we propose a new Lifelong Learning challenge scenario, Shared Knowledge Lifelong Learning (SKILL): a population of originally identical LL agents is deployed to a number of distinct physical locations. Each agent learns a sequence of tasks in its location. Agents share knowledge over a decentralized network, so that, in the end, all agents can master all tasks. SKILL promises the following benefits: speedup of learning through parallelization; ability to simultaneously learn from distinct locations; resilience to failures as no central server is used; possible synergies among agents, whereby what is learned by one agent may facilitate future learning by other agents.

Application scenarios for SKILL include: 1) Users each take pictures of landmark places and buildings in their own city, then provide annotations for those. After learning and sharing, all users can identify all landmarks while traveling to any city. This could also apply to recognizing products in stores or markets in various countries, or foods at restaurants worldwide. Thus, even though each teacher only learns at one or a few locations (or tasks), eventually all users may be interested in knowledge from all locations, as it will be useful during travel. 2) Agents in remote outposts worldwide with limited connectivity are taught to recognize symptoms of new emerging diseases, then share their knowledge to allow any agent to quickly identify all diseases. 3) Explorers are dispatched to various remote locations and each learns about plant or animal species they encounter, then later shares with other agents who may encounter similar or different species. 4) Each time a criminal of some sort is apprehended (e.g., shoplifter, insurgent, spy, robber, sex offender, etc.), the local authorities take several hundred pictures to learn to identify that person. Then all local authorities share their knowledge so that any criminal can later be identified anywhere.

However, to solve SKILL, one must address the following challenges:

Chal-1. Distributed, decentralized learning of multiple tasks: A solution to SKILL should support a population of agents deployed over several physical locations, each learning one or more sequential tasks. For resilience reasons, the population should not rely on a single central server.
Chal-2. Lifelong learning ability: Each agent must be capable of lifelong learning, i.e., learning a sequence of tasks with minimal interference and no access to previous data as each new task is learned.

Chal-3. Shareable knowledge representation: The knowledge representation should easily be shared and understood among agents. Agents must be able to consolidate knowledge from other agents in a decentralized, distributed fashion.

Chal-4. Speedup through parallelization: Shared knowledge should be sufficiently compact, so that the benefits from using multiple parallel agents are not erased by communications costs. Adding more agents should result in greater speedup compared to a single agent. We measure speedup as the ratio of the time it takes for one agent to learn all tasks compared to N agents (larger is better). As a goal for our work, we strive for a speedup of at least 0.5×N with N agents, where perfect speedup would be 1.0×N if there were no parallelization and communications overhead.

Chal-5. Ability to harness possible synergies among tasks: When possible, learning some tasks may improve learning speed or performance at other, related tasks.

To address the SKILL challenge, we take inspiration from neuroscience. Many approaches to LL involve at least partially retraining the core network that performs tasks (feature extraction backbone plus classification head) as every new task is learned. But transmitting and then merging these networks across multiple agents would incur very high communications and computation costs. With the exception of perceptual learning, where human visual cortex may indeed be altered when learning specific visual discrimination tasks for days or weeks [194, 124], there is little evidence that our entire visual cortex — from early-stage filters in primary visual cortex to complex features in inferotemporal cortex — is significantly altered when we just learn, e.g., about a new class of flowers from a few exemplars. Instead, the perirhinal cortex (and more generally the medial temporal lobe) may be learning new representations for new objects by drawing upon and combining existing visual features and representations from visual cortex [116]. This may give rise to specialized "grandmother cells" [51] (or Jennifer Aniston neurons; [431, 432]) that can be trained on top of an otherwise rather immutable visual cortex backbone. While the grandmother cell hypothesis remains debated in neuroscience (vs. distributed representations; [560]), here it motivates us to explore the viability of a new lightweight lifelong learning scheme, where the feature extraction backbone and the latent representation are fixed, and each new object class learned is represented by a single new neuron that draws from this representation. From this inspiration, we propose a simple but effective solution to SKILL based on new lightweight lifelong learning (LLL) agents. Each LLL agent uses a common frozen backbone built in at initialization, so that only the last layer (head) plus some small adjustments to the backbone (beneficial biases) are learned for each task. To eliminate the need for a task oracle, LLL agents also learn and share summary statistics about their training datasets, or share a few training images, to help other agents assign test samples to the correct head (task mapper).
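As a rough sketch of this lightweight scheme (not the exact architecture detailed in Section 16.5, which also adds beneficial biases and GMMC/Mahalanobis task mappers), the snippet below shows the core idea: a frozen shared backbone, a bank of small per-task heads that are the only parts trained and shared, and test-time routing through a task-mapper callback. All class and function names here are our own illustrative choices.

```python
import torch
import torch.nn as nn

class LightweightAgent(nn.Module):
    """Frozen shared backbone plus a bank of small per-task heads (illustrative sketch only)."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone.eval()            # common representation, never updated
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.heads = nn.ModuleDict()               # task name -> linear classification head

    def add_task_head(self, task: str, num_classes: int, feat_dim: int = 2048):
        # Only this small head is trained locally and later broadcast to other agents.
        self.heads[task] = nn.Linear(feat_dim, num_classes)

    def receive_head(self, task: str, head: nn.Linear):
        # Knowledge received from another agent is simply added to the bank.
        self.heads[task] = head

    @torch.no_grad()
    def predict(self, x: torch.Tensor, task_mapper) -> torch.Tensor:
        feats = self.backbone(x)                   # fixed latent features (e.g., 2048-D)
        task = task_mapper(feats)                  # e.g., GMMC/Mahalanobis anchor lookup
        return self.heads[task](feats).argmax(dim=-1)
```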
On a new, very challenging dataset with 102 image classification tasks (5,033 classes in total, 2,041,225 training, 243,464 validation, and 243,464 test images), we achieve higher accuracy compared to 8 LL baselines, and also near-perfect parallelization speedup. Our main contributions are:

(1) We formulate a new lifelong learning challenge, Shared Knowledge Lifelong Learning (SKILL), which focuses on parallel (sped-up) task learning and knowledge sharing among agents. We frame SKILL and contrast it with multi-task learning, sequential LL, and federated learning (Section 16.3).

(2) A new LL benchmark dataset: SKILL-102, with 102 complex image classification tasks. To the best of our knowledge, it is the most challenging benchmark to evaluate LL and SKILL algorithms in the image classification domain, with the largest number of tasks, classes, and inter-task variance (Section 16.4).

(3) A solution to the SKILL problem: Lightweight Lifelong Learning (LLL) for efficient knowledge sharing among agents, using a fixed shared core plus task-specific shareable modules. The need for a task oracle is eliminated by using a task mapper, which can automatically determine the task at inference time from just an input image (Section 16.5).

(4) Our SKILL algorithm achieves SOTA performance on three main metrics: high LL task accuracy (less catastrophic forgetting), low shared (communication) resources, and high speedup ratio (Section 16.6).

(5) The proposed Lightweight Lifelong Learner shows promising forward knowledge transfer, which reuses the accumulated knowledge for faster and more accurate learning of new tasks.

Figure 16.1: SKILL vs. related learning paradigms. a) Multi-task learning [70]: one agent learns all tasks at the same time in the same physical location. b) Sequential Lifelong Learning (S-LL) [338]: one agent learns all tasks sequentially in one location, deploying LL-specific machinery to avoid task interference. c) Federated learning [380]: multiple agents learn the same task in different physical locations, then share learned knowledge (parameters) with a central agent. d) Our SKILL: different S-LL agents in different physical regions each learn tasks, and learned knowledge is shared among all agents, such that finally all agents can solve all tasks. The bottom-right table summarizes the strengths and weaknesses of each approach:

Comparison | Parallel learning | Solve multiple tasks | Obtain agent(s) that solve all tasks | Allow tasks in different physical locations | Communicate between agents
a) Multi-task Learning | ✓ | ✓ | ✓ | ✕ | ✕
b) Sequential Lifelong Learning | ✕ | ✓ | ✓ | ✕ | ✕
c) Federated Learning | ✕ | ✕ | ✕ | ✓ | ✓
d) Shared Knowledge Lifelong Learning (SKILL) | ✓ | ✓ | ✓ | ✓ | ✓
16.2 Related Works

16.2.1 Lifelong Learning

Lifelong Learning (LL) aims to develop AI systems that can continuously learn to address new tasks from new data, while preserving knowledge learned from previously learned tasks [378]. It also refers to the ability to continually learn over time by accommodating new knowledge while retaining previously learned experiences [413]. LL is challenging because it is usually assumed that the training data from previous tasks is no longer available while learning new tasks; hence one cannot just accumulate training data over time and then learn from all the data collected so far. Instead, new approaches have been proposed, which fall under three main branches [105]: (1) Regularization methods add an auxiliary loss term to the primary task objective to constrain weight updates, so as to minimally disturb previously learned knowledge while learning new tasks. The extra loss can be a penalty on the parameters (EWC [297], MAS [12] and SI [634]) or on the feature space (FDR [38]), such as using Knowledge Distillation (LwF [338], DMC [644]). (2) Parameter-isolation methods assign a fixed set of model parameters to a task and avoid over-writing them when new tasks are learned (SUPSUP [604], PSP [89] and BPN [598]). (3) Rehearsal methods use a buffer containing sampled training data from previous tasks as an auxiliary to a new task's training set. The buffer can be used either at the end of the task training (iCaRL, ER [446, 462]) or during training (GSS, AGEM, AGEM-R, DER, DERPP [363, 78, 14, 61]). However, most traditional LL algorithms cannot satisfy the requirements of SKILL: parallel learning for speedup, and knowledge sharing among agents.

16.2.2 Multi-task Learning

Multi-Task Learning (MTL) aims to leverage useful information contained in multiple related tasks to help improve the generalization performance of all the tasks [653, 97, 474]. The main difference between MTL and SKILL is that MTL assumes that all tasks are located in the same physical region, and that one can access the datasets of all tasks at the same time [653]. While MTL learns multiple tasks together, SKILL assumes that different knowledge sources are separated in different physical regions and that different agents should learn them in parallel.

16.2.3 Federated Learning

Federated learning (FL) is a machine learning setting where many clients (e.g., mobile devices, networked computers, or even whole organizations) collaboratively train a model under the orchestration of a central server, while keeping the training data decentralized [276, 332, 47]. As shown in Figure 16.1, compared with SKILL: (1) FL agents usually learn the same task from multiple partial datasets in different locations, relying on the central server to accumulate and consolidate the partial knowledge provided by each agent. In contrast, SKILL agents solve different tasks, and share knowledge over a decentralized network. (2) Each SKILL agent may learn multiple tasks in sequence, and hence must solve the LL problem of accumulating new knowledge while not forgetting old knowledge. Sequences of tasks and the LL problem are usually not a primary focus of FL, with a few exceptions [630] which still do not directly apply to SKILL, as they focus on a single task for all agents and use a central server.
Because federated learning relies on a central server, it is susceptible to complete failure if that server is destroyed; in contrast, in SKILL, as long as not all agents are destroyed, the surviving agents can still share and master some of the tasks.

16.2.4 Other methods that may help solve SKILL

One related direction is to share a compact representation dataset: dataset distillation [587] combines all training exemplars into a small number of super-exemplars which, when learned from using gradient descent, would generate the same gradients as the larger, original training set. However, the distillation cost is very high, which cannot satisfy the 0.5×N speedup requirement. Another related direction is to reuse shared parameters for different tasks: adversarial reprogramming [132] computes a single noise pattern for each task. This pattern is then added to inputs for a new task and fed through the original network. The original network processes the combined input + noise and generates an output, which is then remapped onto the desired output domain. However, the cost of the reprogramming training process is high, which cannot satisfy the 0.5×N speedup requirement. Another related direction is to use a generative adversarial network (GAN [198]) to learn the distribution of different datasets and generate data for replay. Closed-loop GAN (CloGAN [459]) can be continuously trained with new data, while at the same time generating data from previously learned tasks for interleaved training. However, the GAN needs more parameters to transmit, and its high training time does not satisfy the 0.5×N speedup requirement.

16.3 Shared knowledge in lifelong learning (SKILL)

The chief motivation for SKILL is to enable the next generation of highly-efficient, parallelizable, and resilient lifelong learning.

Assumptions: (1) A population of N agents wants to learn a total of T different tasks separated into N physical regions. (2) Each agent i asynchronously learns 1 ≤ Ti ≤ T tasks, in sequence, from the distinct inputs and operating conditions it encounters. As in standard LL, training data from previous tasks is not available anymore while learning the next task. (3) Each agent performs as a "teacher" for its Ti tasks, by sharing what it has learned with the other N − 1 agents; at the same time, each agent also performs as a "student" by receiving knowledge from the other N − 1 agents. In the end, every agent has the knowledge to solve all T tasks. Figure 16.1 contrasts SKILL with other learning paradigms. Note how here we use "teacher" and "student" to distinguish the two roles that every agent will perform; this is different from and not to be confused with other uses of student/teacher terminology, for example in knowledge distillation. (4) There is a perfect task oracle at training time, i.e., each agent is told which tasks it should learn. (5) There is a clear separation between tasks, and between training and test phases.

Evaluation metrics: (1) CPU/computation expenditure. This metric is important to gauge the efficacy of an approach and its ability to scale up with more agents operating in parallel. Wall-clock time is the main metric of interest, so that speedup can be achieved through parallelism. Thus, if N agents learn for 1 unit of time, wall-clock time would be 1, which is an N-fold speedup over a single sequential agent. In practice, speedup < N is expected because of overhead for sharing, communications, and knowledge consolidation.
Because wall-clock time assumes a given CPU or GPU speed, we instead report the number of multiply-accumulate (MAC) operations. (2) Network/communication expenditure. Sharing knowledge over a network is costly and hence should be minimized. To relate communications to computation, and hence allow trade-offs, we assume a factor α = 1,000 MACs per byte transmitted. It is a hyperparameter in our results that can easily be changed to adapt to different network types (e.g., wired vs. wireless); a small accounting sketch is given after the SKILL-102 dataset description below. (3) Performance: After the population of N agents has collectively learned all T tasks, we report aggregated (averaged) performance over all T tasks (correct classification rate over all tasks). Note how here we assume that there is no task oracle at test time. After training, agents should be able to handle any input from any task without being told which task that input corresponds to. SKILL does not assume a free task oracle because transmitting training data across agents is potentially very expensive. Thus, agents must also share information that will allow receiving agents to know when a new test input relates to each received task.

Open questions: What knowledge should be shared? SKILL agents must share knowledge that is useful to other agents and avoid sharing local or specialized knowledge that may be misleading, in conflict with, or inappropriate to other agents. The shared knowledge may include model parameters, model structure, generalizations/specializations, input data, specific contextual information, etc. There are also size/memory/communication constraints for the shared knowledge. When and how to share? Different communication network topologies and sharing frequencies would likely lead to different results. Here, we sidestep this problem and assume a fully connected communication network, with broadcast sharing from each agent to all others each time a new task has been learned.

Figure 16.2: (a) SKILL-102 dataset visualization. Task difficulty (y-axis) was estimated as the error rate of a ResNet-18 trained from scratch on each task for a fixed number of epochs. Circle size reflects dataset size (number of images). (b) Comparison with other benchmark datasets, including Visual Domain Decathlon [445], Cifar-100 [301], F-CelebA [283], and Fine-grained 6 tasks [479, 575, 404, 299, 484, 131]. (c) Qualitative visualization of other datasets, using the same legend and format as in (a).

16.4 SKILL-102 dataset

We use image classification as the basic task framework and propose a novel LL benchmark dataset: SKILL-102 (Figure 16.2). SKILL-102 consists of 102 image classification datasets. Each one supports one complex classification task, and the corresponding dataset was obtained from previously published sources (e.g., task 1: classify flowers into 102 classes, such as lily, rose, petunia, etc., using 8,185 train/val/test images [403]; task 2: classify 67 types of scenes, such as kitchen, bedroom, gas station, library, etc., using 15,524 images [430]). In total, SKILL-102 is a subset of all datasets/tasks and images in DCT, and comprises 102 tasks, 5,033 classes and 2,041,225 training images. After training, the algorithm is presented with 243,464 test images and decides, for each image, which of the 5,033 classes it belongs to (no task oracle).
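To make the computation-plus-communication accounting from the evaluation-metrics paragraph above concrete, here is a small illustrative calculation (the α = 1,000 MACs/byte factor is the one stated in the text; the numeric workloads below are hypothetical placeholders, not measured values):

```python
ALPHA_MACS_PER_BYTE = 1_000  # conversion factor stated in the text (tunable per network type)

def agent_cost(train_macs, bytes_shared):
    """Total cost of one agent, expressed in MAC-equivalents (compute + communication)."""
    return train_macs + ALPHA_MACS_PER_BYTE * bytes_shared

def speedup(single_agent_macs, per_agent_costs):
    """Parallel agents run concurrently, so the slowest (most expensive) agent dominates."""
    return single_agent_macs / max(per_agent_costs)

# Hypothetical example: 8 agents, each doing ~1/8 of the single-agent compute
# plus sharing ~200 MB of heads and task anchors.
single = 8.0e15
per_agent = [agent_cost(1.0e15, 2.0e8) for _ in range(8)]
print("speedup = %.2fx (goal: at least 0.5 x N = 4.0x for N = 8)" % speedup(single, per_agent))
```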
To the best of our knowledge, SKILL-102 is the most challenging completely real (not synthesized or permuted) image classification benchmark for LL and SKILL algorithms, with the largest number of tasks, number of classes, and inter-task variance.

16.5 Lightweight Lifelong Learner for SKILL

To satisfy the requirements of SKILL (see Introduction), we design Lightweight Lifelong Learning (LLL) agents. The design motivation is as follows: we propose to decompose agents into a generic, pretrained, common representation backbone endowed to all agents at manufacturing time, and small task-specific decision modules. This enables distributed, decentralized learning, as agents can learn their own tasks independently (Chal-1). It also enables lifelong learning (Chal-2) in each agent by creating a new task-specific module for each new task. Because the shared modules all operate in the common representation of the backbone, this approach also satisfies Chal-3. Using compact task-specific modules also aims to maximize speedup through parallelization (Chal-4). Finally, we show a few examples where knowledge from previously learned tasks may both accelerate the learning and improve the performance on new tasks (Chal-5).

Figure 16.3 shows the overall pipeline and the 4 roles of each agent. Agents use a common frozen backbone, and only a compact task-dependent "head" module is trained per agent and task, and then shared among agents. This makes the cost of both training and sharing very low. Head modules simply consist of (1) a classification layer that operates on top of the frozen backbone, and (2) a set of beneficial biases that provide lightweight task-specific re-tuning of the backbone, to address potentially large domain gaps between the task-agnostic backbone and the data distribution of each new task. To eliminate the need for a task oracle, LLL agents also learn and share task anchors, in the form of summary statistics about their training datasets, or share a few training images, to help other agents assign test samples to the correct head at test time (task mapper). Two representations for task anchors, and the corresponding task mapping mechanisms, are explored: a Gaussian Mixture Model Classifier (GMMC) and a Mahalanobis distance classifier (MAHA). Receiving agents simply accumulate received heads and task anchors in banks, and the anchors for all tasks received so far by an agent are combined to form a task mapper within that agent. We currently assume a fully connected communication network among all agents, and every agent, after learning a new task, broadcasts its head and task anchor to all other agents. Hence, all agents become identical after all tasks have been learned and shared, and they all can master all tasks. At test time, using one of the (all identical) agents, we first run the input data through the task mapper to recover the task, and then invoke the corresponding head to obtain the final system output. The task mapper eliminates the need for a task oracle at test time. The combination of a pre-trained backbone, task-specific head and BB, and task mapper enables lifelong learning in every agent with minimal forgetting as each agent learns a sequence of tasks (see results).
Figure 16.3: Algorithm design. Top: overall pipeline, where agents are deployed in different regions to learn their own tasks (e.g., Scenes, Sketches, Flowers, SVHN, Birds, Cars); learned knowledge is subsequently shared among all agents. Bottom: zoom into the details of each agent, with 4 main roles. 1) Training: agents use a common pre-trained and frozen backbone, stored in ROM memory at manufacturing time (gray trapezoid with lock symbol). The backbone allows the agent to extract compact representations from inputs (e.g., with an xception backbone, the representation is a latent vector of 2048 dimensions, and inputs are 299 × 299 RGB images). Each agent learns a task-specific head (red triangle) for each new task. A head consists of the last fully-connected layer of the network plus our proposed LL beneficial biasing units (BB) that provide task-dependent tuning biases to all neurons in the network (one float number per neuron). During training, each agent also learns a GMMC or Mahalanobis task anchor which will form a task mapper. 2) Share knowledge with other agents: each agent shares the learned task-specific head, Beneficial Bias (BB), and GMMC module (or training images for Mahalanobis) with all other agents. 3) Receive knowledge from other agents: each agent receives different heads and GMMC/Mahalanobis task mapper anchors from other agents. All heads are stored in a head bank and all task anchors are consolidated to form a task mapper. 4) Testing: at test time, an input is first processed through the task mapper. This outputs a task ID, used to load up the corresponding head (last layer + beneficial biases) from the bank. The network is then equipped with the correct head and is run on the input to produce an output.
Pretrained backbone: We use the xception [95] pretrained on ImageNet [112], as it provides a good balance between model complexity and expressivity of the embedding. The backbone is embedded in every agent at manufacturing time and is frozen. It processes 299 × 299 RGB input images, and outputs a 2048D feature vector. Any other backbone could be used, depending on available resources.
Beneficial Biases: To address potentially large domain shifts between ImageNet and future tasks (e.g., line-drawing datasets, medical imaging datasets, astronomy datasets), we designed beneficial biases (BB). Inspired by the Beneficial Perturbation Network (BPN) of [598], BB provides a set of task-dependent, out-of-network bias units which are activated per task. These units take no input. Their constant outputs add to the biases of the neurons already present in the backbone network; thus, they provide one bias value per neuron in the core network. This is quite lightweight, as there are far fewer neurons than weights in the backbone (22.9M parameters but only 22k neurons in xception). Different from BPN, which works best in conjunction with an LL method like EWC [297] or PSP [89] and only works on fully-connected layers, BB does not require EWC or PSP and can act as an add-on module on both convolutional (Conv) and fully-connected (FC) layers. Specifically, for each Conv layer we have
y = Conv(x) + b + B    (16.1)
with input feature x ∈ R^{w×h×c} and output feature y ∈ R^{w'×h'×c'}, where b ∈ R^{c'} is the original frozen bias of the backbone and B ∈ R^{c'} is our learnable beneficial bias. The size of B equals the number of kernels (c') in this Conv layer (w, h, c and w', h', c' denote the width, height and channels of the input and output feature maps, respectively). For FC layers,
y = FC(x) + b + B    (16.2)
with x ∈ R^{l}, y ∈ R^{l'}, b ∈ R^{l'} and B ∈ R^{l'}. The size of B (beneficial bias) equals the number of hidden units (l') in this FC layer.
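To make Eqs. 16.1 and 16.2 concrete, the following is a minimal PyTorch sketch of wrapping frozen backbone layers with beneficial biases. The class names (BBConv2d, BBLinear) and the zero initialization of B are illustrative assumptions and are not taken from the released implementation.

```python
import torch
import torch.nn as nn

class BBConv2d(nn.Module):
    """Eq. 16.1: y = Conv(x) + b + B, with one extra learnable bias per kernel."""
    def __init__(self, frozen_conv: nn.Conv2d):
        super().__init__()
        self.conv = frozen_conv
        for p in self.conv.parameters():           # backbone weights (and b) stay frozen
            p.requires_grad_(False)
        self.B = nn.Parameter(torch.zeros(frozen_conv.out_channels))  # beneficial bias

    def forward(self, x):
        y = self.conv(x)                            # Conv(x) + b (frozen, if present)
        return y + self.B.view(1, -1, 1, 1)         # + B, broadcast over height and width

class BBLinear(nn.Module):
    """Eq. 16.2: y = FC(x) + b + B, with one extra learnable bias per hidden unit."""
    def __init__(self, frozen_fc: nn.Linear):
        super().__init__()
        self.fc = frozen_fc
        for p in self.fc.parameters():
            p.requires_grad_(False)
        self.B = nn.Parameter(torch.zeros(frozen_fc.out_features))

    def forward(self, x):
        return self.fc(x) + self.B
```

Only the B vectors (one float per kernel or hidden unit) and the task-specific head are trainable; gradients still flow through the frozen backbone to reach them, which is why BB increases training cost even though it adds very few parameters.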
GMMC task mapper: To recover the task at test time, each agent also learns Gaussian Mixture clusters (GMMC) [460] that best encompass each of its tasks' data, and shares the cluster parameters (means + diagonal covariances). This is also very fast to learn and very compact to share. As shown in Figure 16.3 (bottom right), during training, each agent clusters its entire training set into k Gaussian clusters:
f(x) = Σ_{i=1}^{k} φ_i N(x | μ_i, Σ_i),  with  Σ_{i=1}^{k} φ_i = 1    (16.3)
We use k = 25 clusters for every task. When sharing knowledge, each agent performs a "teacher" role for its learned task and shares the mean and diagonal covariance of its clusters with all other agents (students). When receiving knowledge, each agent performs a "student" role and simply aggregates all received clusters in a bank to form a task mapper with kT clusters, keeping track of which task any given cluster comes from: D_map = {(N_1, φ_1): 1, ..., (N_{kT}, φ_{kT}): T}. At test time, an image x_i is evaluated against all clusters received so far, and the task associated with the cluster closest to the test image is chosen: Task = D_map((N_m, φ_m)), where m = argmax_m P(m, x_i). The probability of image x_i belonging to the m-th Gaussian cluster is given by
P(m, x_i) = φ_m N(x_i | μ_m, Σ_m) / Σ_{n=1}^{kT} φ_n N(x_i | μ_n, Σ_n)    (16.4)
Mahalanobis task mapper: To act as a task mapper, the Mahalanobis distance (MAHA) method [317] learns C class-conditional Gaussian distributions N(x | μ_c, Σ̂), c = 1, 2, ..., C, where C is the total number of classes over all T tasks and Σ̂ is a tied covariance computed from samples of all classes. The class mean vectors and covariance matrix of MAHA are estimated as μ_c = (1/N_c) Σ_{i: y_i=c} x_i (N_c: number of images in each class) and Σ̂ = (1/N) Σ_{c=1}^{C} Σ_{i: y_i=c} (x_i − μ_c)(x_i − μ_c)^T (N: total number of images shared with the student agent). In training, each teacher agent computes the mean of each class within its task and randomly samples a number m of images per class. In our experiments, we use m = 5 images/class for every task. When sharing knowledge, each agent shares the sample class means along with the saved images with all other agents. The shared images received by the student agents are used to compute the tied covariance. Similar to GMMC, the student agents also maintain a task mapper to keep track of which task any given class comes from. For a test image x, MAHA computes the Mahalanobis distance to all classes received so far and assigns the test image to the task associated with the smallest Mahalanobis distance, defined as
argmin_c (x − μ_c)^T Σ̂^{−1} (x − μ_c)    (16.5)
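As a concrete illustration of Eqs. 16.3 and 16.4, here is a minimal sketch of the teacher-side GMMC fit and the student-side task mapper. The use of scikit-learn and the class/function names are assumptions for illustration; the real system operates on 2048D xception features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

K = 25  # Gaussian clusters per task

def fit_task_anchor(features):
    """Teacher side: features is an (n_images, 2048) array of backbone outputs."""
    gmm = GaussianMixture(n_components=K, covariance_type="diag").fit(features)
    return gmm.means_, gmm.covariances_, gmm.weights_     # the shared anchor

class GMMCTaskMapper:
    """Student side: accumulate clusters from all tasks, then map a feature to a task."""
    def __init__(self):
        self.means, self.vars, self.weights, self.task_of = [], [], [], []

    def add_task(self, task_id, anchor):
        mu, var, w = anchor
        self.means.append(mu); self.vars.append(var); self.weights.append(w)
        self.task_of += [task_id] * len(w)

    def predict_task(self, x):
        mu = np.concatenate(self.means)
        var = np.concatenate(self.vars)
        w = np.concatenate(self.weights)
        # log( phi_m * N(x | mu_m, Sigma_m) ) for every stored cluster m (Eq. 16.4 numerator;
        # the denominator is common to all clusters, so argmax over the numerator suffices)
        log_p = (np.log(w)
                 - 0.5 * np.sum(np.log(2.0 * np.pi * var), axis=1)
                 - 0.5 * np.sum((x - mu) ** 2 / var, axis=1))
        return self.task_of[int(np.argmax(log_p))]
```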
System implementation details: (1) Frozen xception backbone [95], with a 2048D latent representation. (2) Each agent learns one "head" per task, which consists of one fully-connected layer with 2048 inputs from the backbone and c outputs for a classification task with c classes (e.g., task 1 is to classify c = 102 types of flowers), plus BB biases that allow us to fine-tune the backbone without changing its weights, to mitigate large domain shifts. (3) Each agent also fits k = 25 Gaussian clusters in the 2048D latent space to its training data. (4) At test time, a test image is presented and processed forward through the xception backbone. The GMMC classifier then determines the task from the nearest Gaussian cluster. The corresponding head is loaded and produces the final classification result: which image class (among 5,033 total) the image belongs to. (5) The workflow is slightly different with the Mahalanobis task mapper: while GMMC clusters are learned separately at each teacher for each task as the task is learned, the Mahalanobis classifier is trained by students after sharing, using the 5 images/class shared among agents. (6) Agents are implemented in PyTorch and run on desktop-grade GPUs (e.g., NVIDIA 3090, NVIDIA 1080).
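The test-time flow of steps (4) and (5) can be summarized by a short sketch such as the following, assuming a backbone, a task mapper as sketched above, and a bank of per-task heads; loading the selected task's BB biases into the backbone is omitted for brevity.

```python
import torch

@torch.no_grad()
def classify(image, backbone, task_mapper, head_bank):
    """image: preprocessed 3x299x299 tensor; returns (task_id, within-task class index)."""
    feat = backbone(image.unsqueeze(0))                      # (1, 2048) latent vector
    task_id = task_mapper.predict_task(feat.squeeze(0).cpu().numpy())
    head = head_bank[task_id]                                # per-task last layer from the bank
    logits = head(feat)                                      # (1, c_t) class scores for that task
    return task_id, int(logits.argmax(dim=1))
```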
16.6 Experiments and results
Each LLL agent in our approach is a sequential lifelong learner, capable of learning several tasks in its physical region, one after the other. Hence, before we show full results on the SKILL challenge, we first compare how well LLL can learn multiple tasks sequentially in a single agent, compared to baseline LL algorithms. This is the standard LL scenario where tasks are learned one after the other and data from previous tasks is not available while learning new tasks.
Baselines: We implemented 8 baselines from the literature. For those that require a task oracle, we (unfairly to us) grant them a perfect task oracle (while our approach uses imperfect GMMC or Mahalanobis task mappers). When possible, we re-implement the baselines to use the same pretrained xception backbone as our approach. This ensures a fair comparison where every approach is granted the same amount of pre-training knowledge and the same feature processing ability. The two exceptions are PSP [89], which uses ResNet-18, and SUPSUP [604], which uses ResNet-50. Our baselines fall into the following 3 categories [105]: (1) Regularization methods add an auxiliary loss term to the primary task objective to constrain weight updates. The extra loss can be a penalty on the parameters (EWC [297], MAS [12] and SI [634]) or on the feature space (FDR [38]), such as using Knowledge Distillation (DMC [644]). We use EWC as the representative of this category: one agent learns all 102 tasks in sequence, using the EWC machinery to constrain the weights when a new task is learned, in an attempt not to destroy performance on previously learned tasks. We also use SI, MAS, LwF, and Online-EWC as baselines of this type. (2) Parameter-isolation methods assign a fixed set of model parameters to a task and avoid over-writing them when new tasks are learned (SUPSUP [604], PSP [89]). We use PSP as the representative of this category: one agent learns all 102 tasks in sequence, generating a new PSP key for each task. The keys help segregate the tasks within the network in an attempt to minimize interference. We used the original PSP implementation, which uses a different backbone than ours. PSP accuracy overall may hence be lower because of this, and thus we focus on trends (decline in accuracy as more tasks are added) as opposed to only absolute accuracy figures. We also used SUPSUP as a baseline of this type. (3) Rehearsal methods use a buffer containing sampled training data from previous tasks, as an auxiliary to a new task's training set. The buffer can be used either at the end of the task training (iCaRL, ER [446, 462]) or during training (GSS, AGEM, AGEM-R, DER, DERPP [363, 78, 14, 61]). We use ER as the representative of this category: one agent learns all 102 tasks in sequence. After learning each task, it keeps a memory buffer with 10 images/class (the buffer size hence keeps increasing as new tasks are learned) that will later be used to rehearse old tasks. When learning a new task, the agent learns from all the data for that task, plus rehearses old tasks using the memory buffer.
Accuracy on first task: To gauge how well our approach is achieving lifelong learning, we plot the accuracy on the first task as we learn from 1 to 102 tasks, in Figure 16.4. There is nothing special in our dataset about the first task, except that it is the first one. A good LL system is expected to maintain its accuracy on task 1 even as more subsequent tasks are learned; conversely, catastrophic interference across tasks would rapidly decrease task 1 accuracy with more learned tasks. Overall, our approach maintains the highest accuracy on task 1 over time, and virtually all of the accuracy degradation over time is due to increasing confusion in the task mapper (e.g., curves for the Mahalanobis task mapper alone and LLL w/BB w/MAHA are nearly shifted versions of each other). Indeed, once the task is guessed correctly, the corresponding head always performs exactly the same, no matter how many tasks have been learned.
Normalized accuracy on first 10 tasks: We compare our method to the baselines on the first 10 tasks, when up to 20 subsequent tasks are learned. A good LL system should be able to maintain accuracy on the first 10 tasks, while at the same time learning new tasks. Because in SKILL-102 different tasks have different levels of difficulty, we normalize accuracy here to focus on degradation with an increasing number of new tasks. For example, the accuracy of our method (LLL w/o BB) when learning a single task is 92.02% for task 1, but only 52.64% for task 6, which is much harder.
Figure 16.4: Accuracy on task 1 (learning to classify 102 types of flowers) as a function of the number of tasks learned. a) Comparison between our methods. b) Comparison between our best and other baselines. Our approach is able to maintain accuracy on task 1 much better than the baselines as more and more tasks are learned: while our approach does suffer some interference, task 1 accuracy remains within 90% of its initial best even after learning 101 new tasks (for the 4 LLL variants; BB = beneficial biases, MAHA = Mahalanobis distance task mapper, GMMC = GMMC task mapper). In contrast, the accuracy of EWC, PSP, and several other baselines on task 1 catastrophically degrades to nearly zero after learning just 10 new tasks, even though we granted these methods a perfect task oracle. The best performing baseline, ER, is of the episodic buffer type (a fraction of the training set of each task is retained for later rehearsing while learning new tasks), with an unbounded buffer that grows by 10 images/class. This method does incur higher (and increasing) training costs because of the rehearsing. Note how SUPSUP does not experience any degradation on task 1, which is a desirable feature of this approach. However, a drawback is that SUPSUP is not able, even from the beginning, to learn task 1 as well as other methods (50.64% accuracy vs. over 90% for most other approaches). We attribute this to SUPSUP's limited expressivity and capacity to learn using masks over a random backbone, especially for tasks with many classes. Indeed, SUPSUP can perform very well on some other tasks, usually with a smaller number of classes (e.g., 91.93% correct on SVHN, 93.18% on Brazilian Coins, 99.11% on UMNIST Face Dataset).
Here, we define normalized accuracy as the accuracy divided by the initial accuracy just after a given task was learned (which is also the best accuracy ever obtained for that task). This way, normalized accuracy starts at 100% for all tasks. If it remains near 100% as subsequent tasks are learned, then the approach is doing a good job at minimizing interference across tasks. Conversely, a rapidly dropping normalized accuracy with an increasing number of subsequent tasks learned indicates that catastrophic interference is happening. Our results in Figure 16.5 show that, although not perfect, our approach largely surpasses the baselines in its ability to maintain the accuracy of previously learned tasks, with the exception of SUPSUP, which suffers no degradation (see the caption of Figure 16.5 for why).
Figure 16.5: Normalized accuracy on the first 10 tasks (one per curve color) as up to 20 additional tasks are learned. Our LLL approach is able to maintain high normalized accuracy on the first 10 tasks, while all other baselines except SUPSUP suffer much stronger catastrophic interference. SUPSUP is a special case as there is no interference among successive tasks when a perfect task oracle is available; hence its normalized accuracy for all tasks remains at 100%. However, we will see below that the absolute accuracy of SUPSUP is not as good.
Task mapper accuracy after learning 1 to 102 tasks: To investigate how well our approach is expected to scale with more tasks, we computed task mapper accuracy on all tasks learned so far, after learning 1, 2, 3, ..., 102 tasks. This allows us to evaluate the degradation with more tasks that is due to increasing confusion in the task mapper, as opposed to being due to the classification difficulty of newly added tasks. Results are shown in Figure 16.6: task mapping accuracy starts at 100% after learning 1 task (all test samples are correctly assigned to that task by Mahalanobis or GMMC), then decreases as more tasks are learned, eventually still achieving 87.1% correct after 102 tasks for MAHA, and 84.94% correct for GMMC. It is important to note that in our approach, any loss in accuracy with more tasks only comes from the task mapper: once the correct head is selected for a test sample, the accuracy of that head remains the same no matter how many heads have been added to the system. In contrast, other baseline methods may suffer from catastrophic forgetting in both the task mapper and the classification model when more tasks are learned, as further examined below.
Figure 16.6: Task mapper accuracy on all tasks learned so far, as a function of the number of tasks learned, when using Mahalanobis (left) or GMMC (right) task mappers. Our approach is able to maintain good task mapping accuracy as the number of tasks increases. fig:avgnormalized When using GMMC task mapping, the regression line is y = −0.0012x+ 0.952, which intercepts zero for T = 800 tasks. Thus, with the distribution of tasks in our dataset, we extrapolate that T = 500 is realistic as is. Since task interference in our system only comes from GMMC, pushing beyond T = 500 might require more than k = 25 GMMC clusters per task, which would increase CPU and communications expenditure. When using Mahalanobis task mapping, the results are similar with an intercept at T = 978, though this approach incurs a slightly higher communications cost (discussed below). Absolute accuracy: The normalized accuracy figures reported so far were designed to factor out variations in individual task difficulty, so as to focus on degradation due to interference among tasks. However, they also factor out the potential benefits of BB in raising absolute task accuracy, and they obfuscate the absolute performance of baselines. Hence, we here also study absolute task accuracy. 307 Figure 16.7: Absolute accuracy per task after learning 102 tasks. (Top) Absolute accuracy of the GMMC and Mahalanobis task mappers alone shows quite a bit of variability, indicating various degrees of overlap among tasks. (Bottom) Absolute accuracy of the main xception+head network alone (with or without BB, assuming perfect task mapper) also shows significant variability, indicating various degrees of difficulty per task. The accuracy with BB is overall slightly higher than without BB (orange bars higher than corresponding blue bars in the bottom panel), as further explored in the next figure. c16-fig:absacc We first plot the absolute accuracy of our LLL approach, separately for each task, in Figure 16.7, separating whether BB is used or not, and which task mapper is used. This shows that our SKILL-102 dataset provides a range of difficulty levels for the various tasks, and is quite hard overall. BB improves accuracy on nearly all datasets, at an extra computation cost, detailed below. As promised, BB improves accuracy quite dramatically on some datasets which have a large domain gap compared to ImageNet used to pretrain the backbone (e.g., 31.98 percent point improvement with BB on deepvp that contains dashcam images, 24.92 percent point improvement on CLEVR, 24.5 percent point improvement on Aircraft, 20.75 percent point improvement on SVHN). We then plot the absolute accuracy averaged over all tasks learned so far in Figure 16.8. The absolute accuracy for GMMC and Mahalanobis is the same as before. However, now the absolute accuracies for the full LLL models and for the baselines conflate two components: 1) how much interference exists among tasks and 2) the absolute difficulty of each of the tasks learned so far. 308 Figure 16.8: Average absolute accuracy on all tasks learned so far, as a function of the number of tasks learned. Our LLL approach is able to maintain higher average accuracy than all baselines. BB provides a small but reliable performance boost (LLL w/BB vs. LLL w/o BB). The sharp decrease in early tasks carries no special meaning except for the fact that tasks 4,8,10 are significantly harder than the other tasks in the 0-10 range, given the particular numbering of tasks in SKILL-102. 
Note how again SUPSUP has a low accuracy for the very first task. This is because of the nature of its design; indeed, SUPSUP is able to learn some other tasks in our sequence with high accuracy.
Computation and communication costs, SKILL metrics: The baselines are sequential in nature, so trying to implement them using multiple agents does not make sense, as it would only add communication costs but not alleviate the sequential nature of these LL approaches. For example, for the EWC baseline, one could learn task 1 on agent A, then communicate the whole set of xception weights to agent B (22.9M parameters = 91.6 MBytes) plus the diagonal of the Fisher Information matrix (another 22.9M parameters); then agent B would learn task 2 and communicate its resulting weights and Fisher matrix to agent C, etc. Agent B cannot start learning task 2 before it has received the trained weights and Fisher matrix from agent A, because EWC does not provide a mechanism to consolidate across agents. Thus, we first consider one agent that learns all 102 tasks sequentially, with no communication costs.
Table 16.1: Analysis of computation expenditures and accuracy for our approach and the baselines, to learn all 102 tasks (with a total of 5,033 classes and 2,041,225 training images) in a single agent. Here we select LLL, no BB, MAHA as the reference (1x CPU usage) since it is the fastest approach, yet still has higher accuracy than all baselines. For our approach, MAHA leads to slightly higher accuracy than GMMC, at roughly the same computation cost. All baselines perform worse than our approach, even though they also require more computation than our approaches that do not use BB. BB adds significantly to our computation cost, but also leads to the best accuracy when used with MAHA.
Method | Training (MACs) | CPU usage vs. ours (no BB, MAHA) | Avg. accuracy after 102 tasks
LLL (ours), single agent, no BB, GMMC | 1.73E+16 | ~1x | 67.43%
LLL (ours), single agent, BB, GMMC | 1.56E+18 | ~90.7x | 70.58%
LLL (ours), single agent, no BB, Mahalanobis | 1.73E+16 | 1x (reference) | 68.87%
LLL (ours), single agent, BB, Mahalanobis | 1.56E+18 | ~90.7x | 72.1%
EWC | 1.75E+18 | ~101.3x | 8.86%
PSP | 6.28E+17 | ~36.4x | 25.49%
ER | 4.53E+18 | ~262.8x | 35.32%
SUPSUP | 1.01E+18 | ~58.6x | 56.22%
EWC-Online | 1.55E+18 | ~90.1x | 7.77%
LwF | 1.56E+18 | ~90.5x | 8.41%
SI | 2.07E+18 | ~120.1x | 13.89%
MAS | 2.06E+18 | ~119.6x | 20.54%
Table 16.1 shows the computation expenditures (training time in terms of the number of multiply-accumulate (MAC) operations needed to learn all 102 datasets) for our approach and the baselines. Our approach overall has by far the lowest computation burden when BB is not used, yet all 4 variants of our approach perform better than all baselines. BB increases accuracy but at a significant computation cost: this is because, to compute the BB biases, one needs to compute gradients through the entire frozen backbone, even though those gradients are only used to update the biases while the weights remain frozen.
Our approach presents the advantage that it can also be parallelized over multiple agents that each learn their own tasks in their own physical region. All agents then learn their assigned tasks in parallel. Each agent is the "teacher" for its assigned tasks, and a "student" for the other tasks. Then all agents broadcast their shared knowledge to all other agents. As they receive shared knowledge, the students simply accumulate it in banks and update their task mapper. After sharing, all agents know all tasks (and are all identical).
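A minimal sketch of this broadcast-and-consolidate step, and of the conversion from communication volume to the MAC budget used in the tables below, is given here; the package fields and function names are illustrative assumptions.

```python
ALPHA = 1_000  # assumed MACs charged per byte transmitted (see SKILL metrics)

def make_package(task_id, head_state, bb_state, gmmc_anchor):
    """Teacher side: everything another agent needs in order to master this task."""
    return {"task": task_id, "head": head_state, "bb": bb_state, "anchor": gmmc_anchor}

def receive_package(pkg, head_bank, bb_bank, task_mapper):
    """Student side: accumulate only; nothing previously received is overwritten."""
    head_bank[pkg["task"]] = pkg["head"]
    bb_bank[pkg["task"]] = pkg["bb"]
    task_mapper.add_task(pkg["task"], pkg["anchor"])

def comm_cost_in_macs(package_bytes):
    # e.g., ~813 KB/task for GMMC without BB translates to roughly 8.1e8 MAC-equivalents
    return ALPHA * package_bytes
```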
As mentioned above, the main source of performance degradation in our approach is in the task mapper, which gets increasingly confused at T increases. 310 For our baselines, we are not aware of a way to parallelize their operation, except that we were able to create a modified version of SUPSUP that works on several parallel processors. In our modified SUPSUP, each agent learns a mask for each of its tasks, then communicates its masks to all other agents. At test time, we (unfairly to us) grant it a perfect task oracle, as our GPUs did not have enough memory to use the built-in task mapping approach of SUPSUP, given our 102 tasks and 5,033 classes (this would theoretically require 1.02 TB of GPU memory). Table 16.2 shows the computation and networking expenditures for our approach and our modified SUPSUP to learn all tasks in the SKILL-102 dataset. Because some algorithms run on GPU (e.g., xception backbone) but others on CPU (e.g., GMMC training), and because our tasks use datasets of different sizes, we measure everything in terms of MACs (multiply-accumulate operations, which are implemented as one atomic instruction on most hardware). To measure MACs for each component of our framework, we used a combination of empirically measured, framework-provided (e.g., pyTorch can compute MACs from the specification of all layers in a network), or sniffed (installing a hook in some algorithm that increments a counter each time a MAC is executed). To translate communication costs to MACs, we assume a nominal cost of α = 1, 000 MACs to transmit one byte of data. This is a hyperparameter in our results that can be changed based on deployment characteristics (e.g., wireless vs. wired network). The amount of data shared per task for our approach is quite small: 813 KBytes/task for LLL with GMMC, no BB; 883 KBytes/task for LLL with GMMC, BB; 1.74 MBytes for LLL with MAHA, no BB; and 1.81 MBytes for LLL with MAHA, BB (on average, given that our tasks have 49.34 classes each on average). Our results in Table 16.2 show: 1. Our approach has very low parallelization overhead, which leads to almost perfect speedup > 0.99N for all variants. Indeed, teachers just learn their task normally, plus a small overhead to train GMMC on their own tasks, when GMMC is used. Communications are less than 2 MBytes per task. Students either do nothing (just accumulate received knowledge in a bank) or update their Mahalanobis task mapper. 2. The baselines have comparatively much higher training cost, yet their performance is poor. Performance of episodic buffer / rehearsing methods might be improved further by increasing buffer size, but note that in the limit (keeping all training data for future rehearsing), this gives rise to a > 5, 000× increase in training time. 311 Table 16.2: Analysis of computation and network expenditures for our parallelized LLL approach and our parallelized SUPSUP, to learn all T = 102 tasks. Our approach supports any number of agents N such that 1 ≤ N ≤ T. Maximum speedup is expected when N = T and each agent learns one task, then shares with all others. Here, we report numbers for T = 102, N = 51, and each agent learns 2 tasks in sequence. Note that in our approach, accuracy is not affected by N, only the amount of parallelization speedup increases with N. Note how in this table we still report MACs but taking parallelization into account (e.g., teacher CPU for N agents is single-agent CPU divided by N). 
Teacher CPU: time to learn tasks from their training datasets, plus possibly to prepare data for sharing (e.g., compute GMMC clusters). Communications: our LLL agents communicate either GMMC clusters or Mahalanobis training images, while our modified SUPSUP communicates masks. Here we assume that there is a communication bottleneck at the receiver (student): the shared data from 100 tasks needs to be received serially, over a single networking interface for each student. Hence our communication figures are for all the shared data from all other tasks apart from those an agent learned itself. We convert communication costs to equivalent MACs by assuming 1,000 MACs per byte transmitted. BB adds a small extra communication cost, to transmit the biases. Student CPU: for GMMC, students do not do any extra work (hence, student CPU is 0); for Mahalanobis, students compute a covariance matrix for all 102 tasks. Speedup factor: total MACs for a single agent divided by total MACs for parallel agents and by N. All approaches here achieve near-perfect parallelization (> 0.99N, where 1.0N is perfect). Accuracy: in addition to being faster when BB is not used, our LLL variants still all outperform the parallel SUPSUP in accuracy, by a large margin (> 10%).
Method | Teacher CPU (MACs) | Communications (bytes) | Student CPU (MACs) | Total (MACs) | Parallelization efficiency (×N) | CPU usage vs. ours (no BB, MAHA) | Avg. accuracy after 102 tasks
LLL (ours), multiple agents, no BB, GMMC | 1.69E+14 | 8.22E+07 | 0.00E+00 | 1.69E+14 | 0.99999519 | ~0.96x | 67.43%
LLL (ours), multiple agents, BB, GMMC | 1.53E+16 | 1.03E+08 | 0.00E+00 | 1.53E+16 | 0.999999934 | ~87.2x | 70.58%
LLL (ours), multiple agents, no BB, Mahalanobis | 1.69E+14 | 6.72E+09 | 5.00E+09 | 1.76E+14 | 0.996630551 | 1x (reference) | 68.87%
LLL (ours), multiple agents, BB, Mahalanobis | 1.53E+16 | 6.74E+09 | 5.00E+09 | 1.53E+16 | 0.999962712 | ~87.3x | 72.1%
Parallel SUPSUP, perfect task oracle | 9.91E+15 | 3.03E+08 | 0.00E+00 | 9.91E+15 | 0.999999697 | ~56.4x | 56.22%
16.7 Shared Knowledge Accumulation, Reuse and Boost
As our system learns many tasks, it may occur that some tasks overlap with others, i.e., they may share similar images and/or class labels. Here, we explore two approaches to handle such overlap.
16.7.1 Corrective approach to task overlap/synergy
Since our LLL learners can learn a large number of tasks while solving SKILL problems, some synergy can occur across tasks if one is able to detect when two classes from two different tasks are equivalent, as shown in Figure 16.9. We implemented a method to compare the semantic distance between the predicted class name and the actual class name. Originally, after GMMC infers the task that a test image may come from, we would immediately consider the image as misclassified if the predicted task is wrong. With the consideration of semantic similarity between class names, we now always load the prediction head corresponding to the task predicted by GMMC and use it to infer the class name. If the class was guessed incorrectly but, in the end, the final class name is equivalent to the correct one, then we can declare success on that particular test image (Figure 16.9). To obtain the pairwise similarity, we constructed a similarity matrix that stores the semantic distance, measured by the cosine similarity of word embeddings, for all class names. Those embeddings were obtained from CLIP's [434] text encoder based on GPT-2. If the similarity between the predicted class name and the actual class name is greater than a threshold (empirically chosen for now), then we declare it a correct prediction.
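Below is a minimal sketch of building such a class-name similarity matrix with the open-source CLIP package; the specific CLIP variant (ViT-B/32) is an assumption for illustration, and the 0.95 threshold in the usage note is the one shown in the right panel of Figure 16.9.

```python
import torch
import clip  # https://github.com/openai/CLIP

def class_similarity_matrix(class_names, device="cpu"):
    model, _ = clip.load("ViT-B/32", device=device)   # model variant is an assumption
    tokens = clip.tokenize(class_names).to(device)
    with torch.no_grad():
        emb = model.encode_text(tokens).float()
    emb = emb / emb.norm(dim=-1, keepdim=True)        # unit-normalize text embeddings
    return emb @ emb.t()                              # (n_classes, n_classes) cosine similarities

# Usage: treat a prediction as correct when the similarity between the predicted
# and ground-truth class names exceeds the chosen threshold (e.g., 0.95).
```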
Figure 16.9: Left: similar classes with a cosine similarity in the CLIP embedding greater than 0.90. Right: similar classes with a cosine similarity greater than 0.95. This can help correct spurious errors where, for example, a test image from class "bike" from the Stanford_Online_Products dataset could also be considered correctly classified if the system output were "bicycle" from the Sketches dataset.
As our 102-task dataset contains 5,033 object classes, the full similarity matrix is a triangular 5,033 × 5,033 matrix (too large to display here). The approach yields a small but consistent improvement in accuracy (Figure 16.10). This is one way in which we can handle possible overlap between tasks, which may inevitably arise when large numbers of tasks are learned.
Figure 16.10: Correcting spurious errors by realizing when two distinct classes from two tasks actually are the same thing. The approach provides a small but consistent improvement in accuracy over the baseline (which declares failure as soon as task mapping failed), here shown on 15 datasets that have some overlap.
16.7.2 Learning approach to task overlap/synergy
The ability to reuse learned knowledge from old tasks to boost the learning speed and accuracy on new tasks is a potentially desirable feature of LL algorithms. Here, this might be achieved if, when learning a new task, an LLL agent could "borrow" the knowledge of old tasks, from not only itself but also the shared knowledge of any other agent. One important design feature of our LLL agents is that they can share partial heads across tasks: our heads are a single layer with 2,048 inputs (from the xception backbone) and c outputs for a task with c classes. Thus, each of the c output neurons is connected to the 2,048 inputs, but there are no lateral connections. This means that we can consider each of the c output neurons individually as an evidence provider for its associated class (remember the analogy to "grandmother cells" in the Introduction). We can then cherry-pick sets of 2,048 weights corresponding to individual classes of previously learned tasks, and use them as initialization for similar classes in new tasks to be learned. As we show below, this greatly accelerates learning of the similar new classes, compared to starting with randomized initial weights, and also yields higher accuracy on these new classes.
1) New task is a combination of old tasks: To validate our idea, a new learning paradigm is proposed that reuses previously learned weights when a new task contains classes that were already learned previously.
Figure 16.11: Learning speed for a given object class when the corresponding weights are initialized randomly (orange) vs. from previously learned weights of a similar object class found in one of the previously learned tasks (blue), averaged over 190 combinations of two previously learned tasks. In this example, best accuracy is already reached after just 1 to 2 training epochs in the blue curve. In contrast, it takes up to 30 epochs to train from random initialization, and the final accuracy of the orange curve is still lower than that of the blue curve. This approach hence leads to a significant learning speedup when tasks contain some similar classes.
This experiment considers two datasets and two sets of weights representing the old knowledge, and a new dataset that contains all classes from both datasets. Simply normalizing and concatenating the linear weights leads to poor performance. Hence, instead, we normalize the weights during training by their p-norm, and concatenate the normalized weights as the new task's weights. The experiment was conducted over 190 combinations of 2 datasets chosen from 20 datasets, and the average results show that there is only a very small accuracy loss initially (epoch 0). After a few extra training epochs, we reach a higher accuracy than training new weights from scratch (random initialization; Figure 16.11).
The mathematical version: during training, a constraint is added. Let W_1 = [w^1_1, ..., w^1_n] be the full set of 2,048 × c weights of the first dataset's head, and W_2 = [w^2_1, ..., w^2_n] be the weights of the second dataset's head. A normal linear layer's forward path during training is ŷ = Wx. Define W' = [w'_1, ..., w'_n], where w'_i = w_i / ||w_i||_p. Hence, during training, the linear layer's forward path becomes ŷ = W'x, and concat(W_1', W_2') is used as the weights for the new combined task. Since the weights are normalized class-wise and not task-wise, this method can be used on any combination of previously learned classes. For example, for 10 tasks containing 100 classes c_1 ... c_100 and a new task containing c_1, c_3, c_10, c_20, we can simply find the corresponding w'_1, w'_3, w'_10, w'_20 and concatenate them together.
Choice of p: We find that using the 2-norm causes the classifier to converge to a state where all weight vectors have the same magnitude, which causes an accuracy drop for the old task. Hence, we choose the infinity norm, which is still modulated by the weight magnitudes and is still easy to transfer.
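The sketch below illustrates the class-wise normalization and concatenation described above, using the infinity norm. The helper names are illustrative; normalization is shown here when assembling the new head, and applying the same w'_i = w_i / ||w_i||_p rule inside the forward pass during training follows the same pattern.

```python
import torch
import torch.nn as nn

def normalize_classwise(W):
    """W: (c, 2048) head weights, one row per class; returns w'_i = w_i / ||w_i||_inf."""
    return W / W.abs().max(dim=1, keepdim=True).values

def assemble_combined_head(old_heads, picks, in_dim=2048):
    """picks: list of (task_id, class_index) pairs selecting previously learned classes."""
    rows = [normalize_classwise(old_heads[t].weight.data)[c] for t, c in picks]
    head = nn.Linear(in_dim, len(picks), bias=False)
    head.weight.data.copy_(torch.stack(rows))     # warm start; then fine-tune for a few epochs
    return head
```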
2) New task is different from, but similar to, old tasks: In the previous setting, we assumed the new task's classes are a combination of different old task classes. In a more general situation, the new task's classes are all new classes that no other agent has ever learned, but we could still borrow some learned knowledge from similar learned classes. For instance, as shown in Figure 16.9, the knowledge of the classes shown at the top of the figure may be helpful to learn the new classes shown at the bottom. We conduct four experiments (for 4 pairs of datasets that share some related classes) to show the knowledge boost when we learn a new task. We first check whether a learned old task shares similar knowledge with the new one. For instance, before we learn the MIT Indoor Scenes dataset, we find that the House Room Image Dataset contains classes that are similar to the new classes, in the CLIP embedding space. So we match each class from the MIT Indoor Scenes dataset to the previously learned classes, which in this case come from the House Room Image Dataset. If the class similarity is larger than a threshold, we treat it as a matched class, and we use the similar old class weights to initialize the weights of the new class. If a new class was not matched to any old class, we use random initialization. We also conduct a corresponding controlled experiment using random initialization for all new classes. The results of all 4 experiments are shown in Table 16.3.
Table 16.3: Boosted LLL learning when previously learned weights from similar classes can be used to initialize learning of new classes. We repeat the experiment with either learning from all images in the training set of the new task, or only 10, 5, or 3 images per class. Overall, re-using previously learned weights of similar classes boosts accuracy, usually (but not always) more so when the new task is only learned from a few exemplars (which is much faster than learning from scratch from the whole dataset).
Old task: House Room Image Dataset → New task: MIT Indoor Scenes
  random init | All: 0.86 | 10-shot: 0.77 | 5-shot: 0.73 | 3-shot: 0.52
  ours        | All: 0.89 | 10-shot: 0.83 | 5-shot: 0.80 | 3-shot: 0.71
Old task: Stanford Online Products → New task: Office Home Product
  random init | All: 0.78 | 10-shot: 0.61 | 5-shot: 0.61 | 3-shot: 0.60
  ours        | All: 0.80 | 10-shot: 0.62 | 5-shot: 0.59 | 3-shot: 0.60
Old task: 100 Sports → New task: UIUC Sports Event Dataset
  random init | All: 1.00 | 10-shot: 0.98 | 5-shot: 0.92 | 3-shot: 0.92
  ours        | All: 1.00 | 10-shot: 0.99 | 5-shot: 0.97 | 3-shot: 0.97
Old task: iFood2019 → New task: Food-101
  random init | All: 0.61 | 10-shot: 0.42 | 5-shot: 0.35 | 3-shot: 0.30
  ours        | All: 0.64 | 10-shot: 0.50 | 5-shot: 0.46 | 3-shot: 0.43
In a more general learning scenario, the new task's classes may correspond to similar but not necessarily identical classes in different old tasks. For example, having learned about SUVs may help learn about vans more quickly. Here we conduct two new experiments (Figure 16.12). In EXP-1, the new task is sketch image classification, and the classification weights of each class are initialized from a learned old class that is harvested from different learned tasks with the help of CLIP-based word matching. For instance, the classification weights of "van" are initialized with the weights of "SUV" from the previously learned Stanford-Cars dataset; the classification weights of "Topwear" are initialized with the weights of "t-shirt" from the learned Fashion-Product dataset, etc. (total: 5 pairs of classes). The results show that our initialization leads to better performance (0.98 vs. 0.93) and faster convergence (3 vs. 4 epochs) compared with random initialization. This shows that our method can reuse the shared knowledge from other agents to improve the learning of new tasks. Similar performance is obtained in EXP-2 (0.95 vs. 0.90 accuracy, 3 vs. 4 epochs to converge), which is identical to EXP-1 except that it uses 5 different pairs of classes.
Figure 16.12: Two experiments where the weights from previously learned similar but not identical classes successfully boost learning of new classes. Left: pairs of similar classes (according to CLIP). Right: accuracy achieved with weight transfer vs. random initialization.
16.7.3 Further boost with Head2Toe
A possibly complementary approach to BB for addressing domain gaps is Head2Toe [136], where the last layer can directly draw from potentially any of the previous layers in the backbone. This has been shown to help alleviate domain gaps, as some of the lower-level features in the backbone may be useful to solve tasks with a big gap, even though the top-level backbone features may not. However, Head2Toe has a very high computation cost to select which layers should connect to the head, which is why we have not used it in our main results. Here, we explore how that cost of selecting the most appropriate layers to connect to the head for a given task can be eliminated by re-using the computations already expended for BB: intuitively, layers which have large absolute BB magnitude may also be the most useful to connect to the head. Compared to the conventional Head2Toe [136] with two-stage training (first, select which layers will connect to the head, then train those connections), our new BB+H2T uses the biases that have been previously trained and stored in the BB network for feature selection. Specifically, we first concatenate all the biases in the BB network and select the top 1% largest biases. Then, we pick the feature maps corresponding to the selected indices, average-pool them, and flatten them to 8,192-dimensional vectors. After that, we concatenate all flattened feature vectors along with the logits of the last layer (after the pooling layer, before softmax) in the BB network. Finally, we train a classifier on the concatenated vector with the Adam optimizer, a 0.001 learning rate, and 100 epochs. This approach, when combined with BB and MAHA, improved performance averaged over all tasks by 0.78% (when a perfect task mapper is available), or by 0.56% when using MAHA.
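A rough sketch of the BB+H2T selection step follows. The exact pooling and flattening that produce the 8,192-dimensional vectors are not fully specified above, so adaptive average pooling to one value per selected channel is used as a simplifying assumption, and the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def select_channels(bb_biases, keep_frac=0.01):
    """bb_biases: list of per-layer BB vectors; keep indices of the top 1% |B| overall."""
    all_b = torch.cat([b.detach().abs().flatten() for b in bb_biases])
    k = max(1, int(keep_frac * all_b.numel()))
    thresh = all_b.topk(k).values.min()
    return [(b.detach().abs() >= thresh).nonzero().flatten() for b in bb_biases]

def build_h2t_feature(feature_maps, selected, final_feat):
    """Pool each selected channel, then concatenate with the final backbone feature."""
    pooled = [F.adaptive_avg_pool2d(fm[:, idx], 1).flatten(1)
              for fm, idx in zip(feature_maps, selected) if idx.numel() > 0]
    return torch.cat(pooled + [final_feat], dim=1)   # a linear head is then trained on this
```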
16.8 Discussion and Future Works
We have proposed a new lightweight approach to lifelong learning that outperforms all baselines tested, and can also benefit almost perfectly from parallelization. We tested the approach on the new SKILL-102 benchmark dataset, which we believe is the largest non-synthesized lifelong learning challenge dataset to date. While many previous efforts have employed many tasks, those were usually synthesized, e.g., as permutations over a single base dataset (e.g., 50 permuted-MNIST tasks in [89]). SKILL-102 contains novel real data in each task, with large inter-task variance, which is more representative of realistic test scenarios. Our proposed lightweight LL points to a new direction in LL research, as we find that one can simply use lightweight task-specific weights (the head) combined with maximizing the leverage obtained from task-agnostic knowledge, rapidly adapted by a compact BB module to handle each new task. Our results show how this lightweight design is better able to handle large-scale lifelong learning, and also solves our SKILL challenge very well.
We credit our good performance on both sequential lifelong learning and the SKILL challenge to our particular lightweight lifelong learner design: a fixed backbone, which represents task-agnostic knowledge shared among agents and minimizes the complexity of the task-specific knowledge parameters (the head); Beneficial Biases, which on demand shift the backbone to bridge possibly large domain gaps for each new task, with very compact parameters; and a GMMC/MAHA global task anchor for learned tasks, representing the tasks in the common task-agnostic latent space of all agents, which is easy to share and consolidate, and which eliminates the need for a task oracle at test time. Our results show that the combination of these three components makes our LLL work well.
Our approach uses a pretrained backbone to represent task-agnostic knowledge, which our results show is a very effective strategy. For a fair comparison, we also use the same pretrained backbone as the initialization for the baselines (except PSP and SUPSUP; see above). However, our fixed backbone design often cannot handle large domain gaps between new tasks and ImageNet. This is why we proposed BB, which mitigates the domain gap by shifting the fixed parameters towards each new task with compact biases. Similar to other parameter-isolation methods, our model structure grows on demand (though slowly) with the number of tasks (we add a new head per task, while they add new masks, keys, etc.). Rehearsal-based baselines (e.g., ER) also grow, by accumulating more rehearsing exemplars over time. While some baselines do not grow (e.g., EWC), they also perform very poorly on our challenging SKILL-102 dataset. To further speed up our method with BB, we could use BB on only some of the layers. For instance, if we use BB on only the last half of the layers in the backbone, training time would be roughly halved. In future experiments, we will test whether this still gives rise to a significant accuracy benefit.
Currently, we use the CLIP embedding space to match a new class with learned old classes, which uses only language knowledge (class labels). As future work, we will use GMMC as a class matching mechanism to exploit visual semantic information for matching. Specifically, when an agent learns a new class, the agent will collect a few shots (e.g., 10 images) of the new class and then use the GMMC mapper (trained on all previous tasks) to decide, with a threshold, whether these images belong to a learned task. If most of the images are matched to a learned task, we can then summon the shared head of that task to classify them, now to obtain a possible match with individual previously learned classes. If most of the images are classified into one specific previously learned class, we can use the weights of that class to initialize the new class weights, similar to what we have done in Section 16.7.
A good task mapper is essential in our approach if one is to forego the use of a task oracle. Thankfully, our results show that task mapping can be achieved with high accuracy even with a simple model like GMMC (over 84% correct for 102 tasks). Indeed, in our SKILL challenge, the task mapper is only solving a 102-way classification problem at the granularity of tasks, vs. the 5,033-way full SKILL classification challenge. Here, we focused on GMMC and MAHA, but many other alternatives could be investigated in future work. Our choice of GMMC was based on previous research that compared it to several other techniques, including self-organizing maps and multilayer perceptrons [460].
16.9 Conclusions
We have proposed a new framework for shared-knowledge, parallelized LL. On a new, very challenging SKILL-102 dataset, we find that this approach works much better than previous SOTA baselines, and is much faster. Scaling to > 500 difficult tasks like the ones in our new SKILL-102 dataset seems achievable with the current implementation.
Broader impacts statement: We believe that LLL will spur a new generation of distributed LL systems, as it makes LL more accessible to edge systems and more parallelizable. Thus, broader impacts are expected to be positive in enabling more lightweight devices to learn at the edge and to share what they have learned.
320 Chapter 17 CLR: Channel-wise Lightweight Reprogramming for Continual Learning chapter-17 Continual learning aims to emulate the human ability to continually accumulate knowledge over sequential tasks. The main challenge is to maintain performance on previously learned tasks after learning new tasks, i.e., to avoid catastrophic forgetting. We propose a Channel-wise Lightweight Reprogramming (CLR) approach that helps convolutional neural networks (CNNs) overcome catastrophic forgetting during continual learning. We show that a CNN model trained on an old task (or self-supervised proxy task) could be “reprogrammed" to solve a new task by using our proposed lightweight (very cheap) reprogramming parameter. With the help of CLR, we have a better stability-plasticity trade-off to solve continual learning problems: To maintain stability and retain previous task ability, we use a common task-agnostic immutable part as the shared “anchor" parameter set. We then add task-specific lightweight reprogramming parameters to reinterpret the outputs of the immutable parts, to enable plasticity and integrate new knowledge. To learn sequential tasks, we only train the lightweight reprogramming parameters to learn each new task. Reprogramming parameters are task-specific and exclusive to each task, which makes our method immune to catastrophic forgetting. To minimize the parameter requirement of reprogramming to learn new tasks, we make reprogramming lightweight by only adjusting essential kernels and learning channel-wise linear mappings from anchor parameters to task-specific domain knowledge. We show that, for general CNNs, the CLR parameter increase is less than 0.6% for any new task. Our method outperforms 13 state-of-the-art continual learning baselines on a new challenging sequence of 53 image classification datasets. Code and data: https://github.com/gyhandy/Channel-wise-Lightweight-Reprogramming. 321 Figure 17.1: Equipped with a task-agnostic immutable CNN model, our approach "reprogram" the CNN layers to each new task with lightweight task-specific parameters (less than 0.6% of the original model) to learn sequences of disjoint tasks, assuming data from previous tasks is no longer available while learning new tasks. c17-fig:teaser 17.1 Introduction c17-sec:introduction Continual Learning (CL) focuses on the problem of learning from a stream of data, where agents continually extend their acquired knowledge by sequentially learning new tasks or skills, while avoiding forgetting of previous tasks [413]. In the literature, CL is also referred to as lifelong learning [78, 13, 413] and sequential learning [15]. This differs from standard train-and-deploy approaches, which cannot incrementally learn without catastrophic interference across tasks [151]. How to avoid catastrophic forgetting is the main challenge of continual learning, which requires that the performance on previous learned tasks should not degrade significantly over time when new tasks are learned. This is also related to a general problem in neural network design, the stability-plasticity trade-off [204], where plasticity represents the ability to integrate new knowledge, and stability refers to the ability to retain previous knowledge [105]. 
Dynamic network methods have been shown to be among the most successful approaches to continual learning: they usually show great stability on previous tasks and alleviate catastrophic forgetting by dynamically modifying the network to solve new tasks, usually through network expansion [413, 598, 89, 604]. For stability, a desirable approach would be to fix the backbone and learn extra task-specific parameters on top of it, which incurs no catastrophic forgetting. However, the number of parameters in such methods can quickly become very large. How to reduce the amount of required extra parameters is still a challenging problem.
To solve the above issues, our approach is based on three motivations: (1) Reuse instead of re-learn. Adversarial Reprogramming [132] is a method to "reprogram" an already trained and frozen network from its original task to solve new tasks by perturbing the input space, without re-learning the network parameters. It computes a single noise pattern for each new task. This pattern is then added to inputs for the new task and fed through the original network. The original network processes the combined input + noise and generates an output, which is then remapped onto the desired new task output domain. For example, one pattern may be computed to reprogram a network originally trained on ImageNet [479] to now solve MNIST [113]. The same pattern, when added to an image of an MNIST digit, would trigger different outputs from the ImageNet-trained network for the different MNIST classes (e.g., digit 0 + pattern may yield "elephant"; digit 1 + pattern, "giraffe", etc.). These outputs can then be re-interpreted according to the new task (elephant means digit 0, etc.). Although its computation cost is prohibitively high compared to baseline lifelong learning approaches, here we borrow the reprogramming idea, but we conduct more lightweight yet also more powerful reprogramming in the parameter space of the original model, instead of in the input space. (2) Channel-wise transformations may link two different kernels. GhostNet [214] generates more feature maps from cheap operations applied to existing feature maps, thereby allowing embedded devices with small memory to run effectively larger networks. This approach is motivated by near redundancy in standard networks: after training, several learned features are quite similar. As such, Han et al. [214] generate some features as linear transformations of other features. Inspired by this, our approach augments a network with new, linearly transformed feature maps, which can cheaply be tailored to individual new tasks. (3) Lightweight parameters can shift the model distribution. BPN [598] adds beneficial perturbation biases in the fully connected layers to shift the network parameter distribution from one task to another, which helps solve continual learning. This is cheaper than fine-tuning all the weights in the network for each task, instead tuning only one bias per neuron. Yet, this approach provided good lifelong learning results. However, the method can only handle fully connected layers, and its performance is bounded by the limited ability of the bias parameters to change the network (only 1 scalar bias per neuron). Our method instead designs more powerful reprogramming patterns (kernels) for the CNN layers, which can lead to better performance on each new task. Drawing from these three ideas, we propose channel-wise lightweight reprogramming (CLR).
We start with the task-agnostic immutable parameters of a CNN model pretrained on a relatively diverse dataset (e.g., ImageNet-1k, Pascal VOC, ...) if possible, or on a self-supervised proxy task, which requires no semantic labels. We then "reprogram" the immutable task-agnostic CNN layers to solve new tasks by adding a lightweight channel-wise linear transformation on each channel of a selected subset of convolutional layers (Figure 17.2). The added reprogramming parameters are 3×3 2D convolutional kernels, each operating on a separate channel of the feature maps produced by the original convolutional layer. CLR is very cheap but still powerful, with the intuition that different kernel filters can be reprogrammed with a task-dependent linear transformation.
The main contributions of this work are: We propose a novel continual learning solution for CNNs, which involves reprogramming the CNN layers trained on old tasks to solve new tasks by adding lightweight task-specific reprogramming parameters. This allows a single network to learn potentially unlimited input-to-output mappings, and to switch on the fly between them at runtime. Our method achieves a better stability-plasticity balance compared to other dynamic-network continual learning methods: it does not suffer from catastrophic forgetting and requires limited extra parameters during continual learning, less than 0.6% of the original parameter size for each new task. Our method achieves state-of-the-art performance on task-incremental continual learning on a new challenging sequence of 53 image classification datasets.
17.2 Related Work
Continual learning. Continual learning methods can be broadly categorized into three approaches [105]: replay-based, regularization-based, and dynamic-network-based.
When no constraints apply to architecture size, one can grow new branches for new tasks while freezing previous task parameters [481, 617], or dedicate a model copy to each task [13]. To save memory, recent methods keep the architecture static, with fixed parts allocated to each task. Previous task parts are masked out during new-task training, either at the parameter level [146, 372] or at the unit level [497]. SUPSUP [604] and PSP [89] assign a fixed set of model parameters to a task and avoid over-writing them when new tasks are learned. Dynamic network methods have been shown to be among the most successful ones in solving continual learning. For stability, a desirable approach would be to fix the backbone and learn extra task-specific parameters on top of it, which produces no catastrophic forgetting. SUPSUP [604] uses a randomly initialized fixed backbone and learns a task-specific supermask for each new task, but its performance is bounded by the capacity of the fixed backbone. CCLL [516] and EFTs [567] use fixed backbones trained on the first task and learn task-specific group-convolution parameters; their performance is sensitive to the first task, and the extra group-convolution hyper-parameters are not straightforward to tune.

Figure 17.2: Proposed continual learning model with channel-wise lightweight reprogramming (CLR) layers. All gray blocks are fixed parameters. (top) General network architecture. (bottom) Details of the CLR reprogramming layer: for each channel k ∈ [1..c] of an original w × h × c feature map (blue), a 3x3 kernel is learned to reprogram the feature towards the new task (green), without modifying the original conv parameters (grey).

Our channel-wise lightweight reprogramming method also belongs to the dynamic-network family of continual learning methods. Inspired by GhostNet, BPN, and adversarial reprogramming, we reprogram the immutable task-agnostic parameters and apply lightweight (cheap in computation and memory) extra parameters to learn new tasks.

Meta learning. Meta learning aims to improve the learning algorithm itself, given the experience of multiple learning episodes [247]. In contrast, conventional ML improves model predictions over multiple data instances. During meta-learning, an outer (or upper/meta) algorithm updates and improves the inner learning algorithm (e.g., for one image classification task). For instance, the outer objective could be the generalization performance or learning speed of the inner algorithm. Meta-learning has one main difference from CL: similar to transfer learning, meta-learning focuses on finding a better learning strategy or initialization, and cares about the performance on a new or generalized task, while performance on previous tasks is not of interest. Thus, meta-learning could be used as a preliminary step before CLR, to optimize our shared initial parameters θ. Meta-learning can learn from a single task in multiple domains [549, 147] or from different tasks [150, 352, 19], which requires access to some data from all base tasks or domains at the same time, whereas our method and CL assume sequential learning and cannot access the data from previous tasks.
Also, meta-learning assumes that what is learned will be applied to tasks similar to the base tasks, while CL does not assume that previous tasks are similar to future tasks.

17.3 Proposed Method

Problem setting. In this paper, we focus on the task-incremental setting of continual learning, where data arrives sequentially in batches, and one batch corresponds to one task, such as a new set of classes or a new dataset to be learned. For a given continual learning algorithm, the goal is to obtain a high average performance on all previously learned tasks after learning the final task. At test time, just like PSP [89], CCLL [516], EFTs [567], EWC [297], ER [462], etc., we assume that a task oracle is provided during inference, indicating which task a given sample belongs to.

Structure. We first introduce how our proposed Channel-wise Lightweight Reprogramming parameters reprogram the immutable task-agnostic backbone: a channel-wise linear transformation is applied after each original convolutional layer, turning each original Conv block into a CLR-Conv block and yielding a CLR-reprogrammed network for a new task (Section 17.3.1). Then, we introduce how CLR-reprogrammed networks can be used to solve continual learning tasks (Section 17.3.2).

17.3.1 Channel-wise Lightweight Reprogramming

Our proposed Channel-wise Lightweight Reprogramming method is equipped with an immutable task-agnostic backbone and creates task-specific lightweight reprogramming parameters to solve new tasks. Here, we use a fixed backbone as a task-shared immutable structure. This differs from SUPSUP [604], which uses a randomly initialized fixed backbone, and from CCLL [516] and EFTs [567], which use fixed backbones trained on the first task. We obtain the backbone in a more general and less demanding way: the fixed backbone can be pretrained with supervised learning on a relatively diverse dataset (e.g., ImageNet-1k [479] or Pascal VOC [137]), or with self-supervised learning on proxy tasks, such as DINO [69] and SwAV [68], which require no semantic labels (we will see in the Section 17.4.5 experiments that our approach is robust to the choice of pretraining dataset). This fixed backbone can provide a diverse set of visual features. However, those features need to be reprogrammed later for individual tasks, which is achieved by our CLR layers. Specifically, we use a channel-wise linear transformation to reprogram the feature map generated by an original convolutional kernel into a new task-specific feature map for the new task. Figure 17.2 shows the structure of the proposed CLR. CLR is compatible with any convolutional neural network (CNN). A CNN usually consists of several Conv blocks (e.g., residual blocks; Figure 17.2 top), each containing a convolutional layer (conv), a normalization layer (e.g., batch normalization), and an activation layer (e.g., ReLU). Our method treats a pretrained CNN as a general, task-agnostic immutable parameter backbone for all future tasks, so we fix its parameters. (More details on the choice of the pretraining backbone are given in the Section 17.4.5 experiments.)

Algorithm 8 CLR Layers
Input: feature map X′ of size w × h × c (the output of the original conv layer F); c 3 × 3 2D CLR reprogramming kernels
Output: reprogrammed feature map X̂′ of size w × h × c
  p ← ⌊3/2⌋
  X′ ← zero-padding(X′, p)
  for k ∈ [1..c] do
    X̂′[k] ← CLRk(X′[k])        ▷ X′[k] is the k-th channel
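As a concrete companion to Algorithm 8, below is a minimal PyTorch sketch of a single CLR layer, under the assumption that the c channel-wise 3x3 kernels are implemented as one depthwise convolution (groups = c) and identity-initialized as described in the text; the module name CLRLayer and the example shapes are illustrative only, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class CLRLayer(nn.Module):
    """Channel-wise 3x3 reprogramming applied after a frozen conv layer (sketch).

    Each of the c channels of the incoming feature map gets its own 3x3 kernel,
    which is exactly a depthwise convolution with groups = c (padding 1 matches
    the zero-padding step of Algorithm 8).
    """
    def __init__(self, channels: int):
        super().__init__()
        self.reprogram = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels, bias=False)
        # Identity initialization: center tap = 1, all other taps = 0,
        # so the layer starts as a no-op on the frozen features.
        nn.init.zeros_(self.reprogram.weight)
        with torch.no_grad():
            self.reprogram.weight[:, 0, 1, 1] = 1.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.reprogram(x)

# Usage sketch: reprogram the output of one frozen conv layer.
frozen_conv = nn.Conv2d(64, 128, kernel_size=3, padding=1).requires_grad_(False)
clr = CLRLayer(128)                          # only these 128 * 9 weights are trainable
features = clr(frozen_conv(torch.randn(1, 64, 56, 56)))   # shape: (1, 128, 56, 56)
```

Because of the identity initialization, a freshly added CLR layer leaves the frozen backbone's features unchanged until training on the new task begins.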
To reprogram the CNN to solve a new task in continual learning, we add lightweight trainable parameters by changing each of the original Conv blocks into a CLR-Conv block, thereby creating a CLR-reprogrammed CNN (Figure 17.2 top). Specifically, as shown in Figure 17.2 (top), a CLR-Conv block is obtained by adding a channel-wise lightweight reprogramming layer (CLR layer) after each of the original fixed conv layers. (To save parameters and prevent overfitting, 1×1 conv layers are excluded.) Each CLR layer applies a linear transformation to each channel of the feature map produced by the fixed conv layer, to reprogram the features. Here, the linear transformation is implemented as a 3x3 2D convolutional kernel applied to a single-channel feature map. Figure 17.2 (bottom) illustrates the details of the channel-wise linear transformation for reprogramming. For each convolutional kernel f_k(·), given the input feature X, we obtain one channel of features x′_k = f_k(X); all channel-wise features together form the output feature map X′. Our channel-wise reprogramming linear transformation is applied to each channel x′_k of the output feature map X′. For each kernel f_k(·), we have a corresponding reprogramming 3x3 2D kernel CLR_k, which takes the single-channel output x′_k as input and applies a linear transformation to obtain the reprogrammed feature x̂′_k:

x̂′_k = CLR_k(x′_k) = CLR_k(f_k(X))    (17.1)

Algorithm 8 gives the pseudocode for computing the features before and after a CLR layer. We initialize each CLR layer as an identity transformation kernel (i.e., in the 3x3 2D kernel, the center parameter is one and all others are zero). This setting is crucial for training efficiency and performance, as it favors keeping the general feature extractors of the original fixed model, while at the same time allowing adaptive reprogramming, based on the loss function for the new task. For CLR-reprogrammed CNNs, the original conv layers in the backbone are fixed; the trainable parameters include the CLR layer after each fixed conv layer (training the normalization layers is optional) and the last fully-connected layer. For a CLR-reprogrammed ResNet-50 [220], the trainable CLR layers amount to only 0.59% of the parameters of the original fixed ResNet-50 backbone. This parameter efficiency makes CLR-reprogrammed networks easy to deploy for continual tasks.

Figure 17.3: CLR-reprogrammed CNNs for continual learning. (a) At learning time, a CNN model can be reprogrammed by channel-wise lightweight reprogramming parameters to solve new tasks continually; only the CLR layers need to be trained in each reprogramming. (b) At test time, the task oracle selects which task-specific CLR parameters to use to make the final decision.

17.3.2 CLR for Continual Learning

In task-incremental continual learning, the model faces a sequence of tasks. As shown in Figure 17.3(a), during continual learning, a CLR-reprogrammed model learns one task at a time, and all tasks share the same fixed pretrained backbone. Each task has a trainable task-specific CLR parameter set, consisting of the CLR layers after each original conv layer and the last linear layer. During testing (Figure 17.3(b)), we assume a perfect task oracle (as assumed by our baselines shown in the next section), which tells the model which task the test image belongs to. The fixed backbone, equipped with the corresponding task-specific CLR parameters, makes the final decision. Due to the absolute parameter isolation in CLR (i.e., CLR layers are completely specific and separate for each task, and the shared backbone is not changed at all), our method's performance on every task is not influenced by an increasing number of tasks (similar to SUPSUP [604], CCLL [516], and EFT [567]). Theoretically, the CLR-reprogrammed model can learn as many tasks as needed in a continual learning setting and keep the optimal performance on each task, with no accuracy decrease as the number of tasks increases, but with a 0.59% increase in parameters for each new task.
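To make the per-task training and oracle-based test-time dispatch of Figure 17.3 concrete, here is one possible PyTorch wiring, assuming a torchvision ResNet-50 backbone; the CLRConv wrapper, the task_stream variable, the optimizer choice, and the learning rate are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50  # assumes torchvision is installed

class CLRConv(nn.Module):
    """A frozen conv layer followed by a pool of per-task CLR layers (sketch)."""
    def __init__(self, frozen_conv: nn.Conv2d):
        super().__init__()
        self.conv = frozen_conv.requires_grad_(False)
        self.clr_pool = nn.ModuleDict()   # task id (string) -> depthwise 3x3 CLR layer
        self.active_task = None

    def add_task(self, task_id: str):
        c = self.conv.out_channels
        clr = nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c, bias=False)
        nn.init.zeros_(clr.weight)
        with torch.no_grad():
            clr.weight[:, 0, 1, 1] = 1.0          # identity initialization
        self.clr_pool[task_id] = clr

    def forward(self, x):
        out = self.conv(x)
        return self.clr_pool[self.active_task](out) if self.active_task else out

def convert_to_clr(module: nn.Module):
    """Recursively wrap every non-1x1 conv (1x1 convs are excluded, as in the text)."""
    for name, child in module.named_children():
        if isinstance(child, nn.Conv2d) and child.kernel_size != (1, 1):
            setattr(module, name, CLRConv(child))
        else:
            convert_to_clr(child)

def set_task(model: nn.Module, task_id: str):
    for m in model.modules():
        if isinstance(m, CLRConv):
            if task_id not in m.clr_pool:
                m.add_task(task_id)
            m.active_task = task_id

backbone = resnet50(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()                       # expose 2048-d features; per-task heads below
for p in backbone.parameters():
    p.requires_grad_(False)                       # the backbone stays immutable
convert_to_clr(backbone)
backbone.eval()                                   # keep frozen batch-norm statistics fixed

heads = nn.ModuleDict()
for task_id, (loader, num_classes) in task_stream.items():   # `task_stream` is hypothetical
    set_task(backbone, task_id)
    heads[task_id] = nn.Linear(2048, num_classes)
    clr_params = [p for m in backbone.modules() if isinstance(m, CLRConv)
                  for p in m.clr_pool[task_id].parameters()]
    opt = torch.optim.Adam(clr_params + list(heads[task_id].parameters()), lr=1e-3)
    for images, labels in loader:
        loss = nn.functional.cross_entropy(heads[task_id](backbone(images)), labels)
        opt.zero_grad(); loss.backward(); opt.step()

def predict(image: torch.Tensor, task_id: str) -> torch.Tensor:
    """Test time: the task oracle supplies task_id; the matching CLR set and head decide."""
    set_task(backbone, task_id)
    with torch.no_grad():
        return heads[task_id](backbone(image.unsqueeze(0))).argmax(dim=1)
```

Because learning task k only ever writes to that task's entries in the CLR pool and its own head, predictions for earlier tasks are left untouched, which is the parameter-isolation property discussed above.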
17.4 Experiments and results

In this section, we compare our CLR-reprogrammed model to baselines on a challenging 53-dataset with large variance (Section 17.4.1). We evaluate task accuracy fluctuations after learning new tasks to assess forgetting (Section 17.4.2). Then, we evaluate the average accuracy on all tasks learned so far during continual learning (Section 17.4.3). We analyze the network parameters and computation cost during continual learning (Section 17.4.4). Finally, we conduct an ablation study to analyze the influence of different immutable backbones (Section 17.4.5).

17.4.1 Dataset and Baselines

Datasets. We use image classification as the basic task framework. We extend the conventional benchmark 8-dataset [12, 13] to a more challenging 53-dataset by collecting additional challenging classification tasks. The 53-dataset consists of 53 image classification datasets. Each one supports one complex classification task, and the corresponding dataset is obtained from previously published sources, e.g., task 1 [430]: classify scenes into 67 classes, such as kitchen, bedroom, gas station, etc. (scene dataset with 15,523 images); task 2 [574]: classify 200 types of birds, such as Brewer Blackbird, Cardinal, Chuck-will's-widow, etc. (birds dataset with 11,787 images). A full list and more details on each dataset are in the Appendix. The 53-dataset is a subset of the SKILL-102 [169] lifelong learning benchmark dataset (http://ilab.usc.edu/andy/skill102); more details about dataset creation and an extended version (107 tasks) are in USC-DCT (Diverse Classification Tasks), a broader effort in our laboratory. DCT can be used for many machine vision purposes, not limited to lifelong learning. We use 53 datasets with > 1.8M images from 1,584 classes over 53 tasks for our experiments. Table 17.1 shows the details of our 53-dataset in comparison to other benchmark datasets.

Comparison                          | 53-dataset (ours) | 8-dataset [12, 13] | ImageNet [479] | Fine-grained 6 tasks [479] [318] | Cifar100 [301] | F-CelebA [445]
# tasks                             | 53                | 8                  | 20             | 6                                | 20             | 10
# Classes                           | 1,584             | 738                | 1000           | 1943                             | 100            | 20
# Images                            | 1,811,028         | 166,360            | 1,300,000      | 1,440,086                        | 60,000         | 1,189
# different classification targets  | 5                 | 1                  | 1              | 2                                | 1              | 1
Mix image style (nature/artifact)   | ✓                 | ✓                  | ✗              | ✓                                | ✗              | ✗
Mix super/fine-class classification | ✓                 | ✓                  | ✓              | ✗                                | ✗              | ✗

Table 17.1: Comparison of the 53-dataset with other benchmark datasets including Cifar-100 [301], F-CelebA [445], Fine-grained 6 tasks [479] [318], [404], [299], [484], [131]. Note that our 53-dataset covers the 8-dataset, F-CelebA, and part of the Fine-grained 6 tasks.
Specifically, we compare the number of different classification targets among different benchmarks, which represents the diversity and difference among the different continual tasks. For instance, our 53-dataset contains 5 different classification targets: object recognition, style classification (e.g., painting style), scene classification, counting (e.g., the CLEVR dataset [270]), and medical diagnosis (e.g., the Breast Ultrasound Dataset). To date, our 53-dataset is one of the most challenging image classification benchmarks for continual learning algorithms, with a large number of tasks and high inter-task variance. For the experiments below, we subsampled the dataset to allow some of the sequential baselines to converge: we capped the number of classes per task at 300 (this only affected 1 task), and used either around 5,120 training images for tasks with c ≥ 60 classes, or around 2,560 for tasks with c < 60, where c represents the number of classes. Thus, we used 53 tasks, with a total of 1,583 classes, 132,625 training images, and 13,863 test images. We also conduct experiments on the CIFAR-100 dataset.

Baselines. As discussed in Section 17.1, we grant each baseline a perfect task oracle during inference. We implemented 13 baselines, which can be roughly categorized into the following 3 categories [105]: (1) Dynamic Network methods contain most of the baseline methods, because our method also belongs to this category: they dynamically modify the network to solve new tasks, usually by network expansion. We use PSP [89], Supermask in Superposition (SUPSUP) [604], CCLL [516], Confit [269], and EFTs [567] as the representative methods of this category. For PSP, the model learns all 53 tasks in sequence, generating a new PSP key for each task. The keys help segregate the tasks within the network in an attempt to minimize interference. For SUPSUP, the model uses randomly initialized parameters as a fixed backbone and learns task-specific supermasks for each task, which help alleviate catastrophic forgetting. During inference, different tasks use different supermasks to make the decision. (2) Regularization methods add an auxiliary loss term to the primary task objective to constrain weight updates. The extra loss can be a penalty on the parameters (EWC [297], LwF [338], MAS [12], and SI [634]) or on the feature space (FDR [38]), such as using knowledge distillation (DMC [644]). We use EWC, online-EWC, SI, and LwF as the representatives of this category: for EWC, one agent learns all 53 tasks in sequence, using the EWC machinery to constrain the weights when a new task is learned, to attempt not to destroy performance on previously learned tasks. (3) Replay methods use a buffer containing sampled training data from previous tasks, as an auxiliary to a new task's training set. The buffer can be used either at the end of the task training (iCaRL, ER [446, 462]) or during training (GSS, AGEM, AGEM-R, DER, DERPP [363, 78, 14, 61]). We use Episodic Memory Rehearsal (ER) as the representative baseline of this category: one agent learns all 53 tasks in sequence. After learning each task, it adds 10 images/class of that task to a growing replay buffer that will later be used to rehearse old tasks. When learning a new task, the agent learns from all the data for that task, plus rehearses all old tasks using the memory buffer. Additionally, SGD is a naive baseline that just fine-tunes the entire network for each task, with no attempt at minimizing interference.
SGD-LL is a variant that uses a fixed backbone plus a single learnable shared last layer for all tasks, with a length equal to the largest number of classes over all tasks (500 classes in our setting). SGD uses standard stochastic gradient descent for optimization, which may suffer from the catastrophic forgetting problem. For a fair comparison, all 13 baseline methods use an ImageNet-pretrained ResNet-50 [220] backbone, except for PSP and Confit (which require ResNet-18) and SUPSUP (which requires a randomly initialized ResNet-50 backbone).

Figure 17.4: Accuracy on task 1 as a function of the number of tasks learned. Our approach maintains the highest accuracy on task 1 over time, and importantly, it totally avoids catastrophic forgetting and maintains the same accuracy as the original training, no matter how many new tasks are learned. As discussed in the approach section, this is because we explicitly isolate the task-specific parameters for all tasks and avoid parameter interference. This is also the case for the baselines SUPSUP [604], CCLL [516], and EFT [567]. Other baseline methods suffer different degrees of catastrophic forgetting. EWC [497], PSP [89], LwF [338], SI [634], and SGD suffer severe catastrophic forgetting with this challenging dataset. The rehearsal-based method ER performs relatively well because it has an unlimited, large replay buffer, saving 10 images/class from the previous tasks. Yet, the overall accuracy of ER is still lower than our CLR-reprogrammed model. Rehearsal methods also incur higher (and increasing) training costs because of the rehearsing. We noticed similar performance on the second task.

17.4.2 Accuracy on the first tasks

To evaluate the performance of all methods in overcoming catastrophic forgetting, we track the accuracy on each task after learning new tasks. If a method suffers from catastrophic forgetting, then the accuracy on the same task will decrease after learning new tasks. A great continual learning algorithm is expected to maintain the accuracy of the original learning performance after learning new tasks, which means that old tasks should be minimally influenced by an increasing number of new tasks. Figure 17.4 shows the accuracy on the first and second tasks as we learn from 1 to 53 tasks using the 53-dataset, to gauge the amount of forgetting on early tasks as many more tasks are learned. Overall, our CLR-reprogrammed model maintains the highest accuracy on these early tasks over time, and importantly, our method (similar to CCLL [516], EFT [567], and SUPSUP [604]) avoids forgetting and maintains the same accuracy as the original training, no matter how many new tasks are learned. SUPSUP is not able, even from the beginning, to learn task 1 as well as other methods. We attribute this to SUPSUP's limited expressivity and capacity to learn using masks over a random backbone, especially for tasks with many classes.
Indeed, SUPSUP can perform very well on some other tasks, usually those with a smaller number of classes (e.g., SVHN, UMNIST Face Dataset). Baseline methods suffer different degrees of forgetting: EWC [497], PSP [89], ER [462], SI [634], and LwF [338] suffer severe catastrophic forgetting on this challenging dataset. We noticed similar performance on the second task.

17.4.3 Average accuracy after learning all 53 tasks

We computed the average accuracy on all tasks learned so far after learning 1, 2, 3, ..., 53 tasks. We plot the accuracy averaged over all tasks learned so far, as a function of the number of tasks learned, in Figure 17.5. Note that the level of difficulty of each of the 53 tasks is quite variable, as shown in Figure 17.6. Hence, the average accuracy over all tasks so far may go up or down when a new task is added, depending on whether an easier or harder task is added.

Figure 17.5: Average accuracy on all tasks learned so far, as a function of the number of tasks learned. Our CLR-reprogrammed approach is able to maintain higher average accuracy than all baselines. The average accuracy increases because some of the later tasks are easier than earlier tasks (i.e., later tasks have higher accuracy).

The average accuracy represents the overall performance of a continual learning method in learning and then performing sequential tasks. Overall, our CLR-reprogrammed model achieves the best average accuracy compared with all baselines. For replay-based methods, the overall performance is lower than ours even though they have a large buffer, and their training time increases with the number of tasks. EWC, PSP, LwF, and SI suffer severe catastrophic forgetting on the challenging 53-dataset. To show more details, we plot the accuracy of each task after learning all 53 tasks with our CLR-reprogrammed method in Figure 17.6. This shows that the 53-dataset provides a range of difficulty levels for the various tasks, and is quite hard overall.

Figure 17.6: Absolute accuracy per task after learning 53 tasks with our CLR-reprogrammed CNN.

17.4.4 Parameter and computation cost

Achieving higher overall average accuracy during learning of sequential tasks is very important. A great continual learning algorithm is also expected to minimize the required extra network parameters and the computation cost. Table 17.2 shows the required extra parameters and computation cost for our approach and the baselines. The "extra parameters to add one new task" column reports the percentage relative to the original backbone (e.g., ResNet-50) size. For instance, EWC needs no extra parameters, but its computation cost is relatively high (to compute the Fisher information matrix that guides the constraining of parameters while learning a new task), and its accuracy is not the best. Our CLR method requires only 0.59% extra parameters to learn each new task, and the increase in computation cost is small compared with the baseline SGD (normal training). Importantly, our method achieves the best average accuracy.

Method           | Extra parameters to add 1 new task | Computation cost | Average Acc (53-dataset)
SGD              | 0%      | 1     | 16.71 %
PSP [89]         | 5.02%   | 0.828 | 34.91 %
EWC [297]        | 0%      | 1.160 | 9.36 %
ONLINE EWC [494] | 0%      | 1.011 | 9.47* %
SGD-LL           | 0%      | 0.333 | 12.49 %
ER [462]         | 189.14% | 3.99  | 53.13 %
SI [634]         | 0%      | 1.680 | 7.28 %
LwF [338]        | 0%      | 1.333 | 8.23 %
SUPSUP [604]     | 3.06%   | 1.334 | 56.69 %
EFT [567]        | 3.17%   | 1.078 | 67.8 %
CCLL [516]       | 0.62%   | 1.006 | 78.3 %
CLR (Ours)       | 0.59%   | 1.003 | 86.05 %

Table 17.2: Extra parameter expenditures and computation cost analysis. We treat the computation cost of SGD as the unit; the computation costs of the other methods are normalized by the cost of SGD. PSP's low computation cost comes from using a ResNet-18 backbone (its original form) instead of ResNet-50.
For EWC, though the final model size does not increase, the performance is poor and N Fisher matrices are needed during training; online-EWC changes how the Fisher matrix is updated and only requires one Fisher matrix during training. ER maintains a memory buffer that includes five images per class from the tasks that have already been seen; we spread the size in bytes of the image buffer over the 53 tasks to obtain the amount of extra parameters per task. SUPSUP requires a 3MB mask for each task.

17.4.5 Influence of different immutable backbones

Our method obtains the task-agnostic immutable parameters by training the CNN model on a relatively diverse dataset with supervised learning, or on proxy tasks with self-supervised learning. To investigate the influence of different pretraining methods on the performance of continual learning, we choose four different kinds of task-agnostic immutable parameters trained with different datasets and tasks. For supervised learning, besides ImageNet-1k, we also conduct experiments with a backbone pretrained on the Pascal-VOC image classification task (a relatively smaller dataset). For self-supervised learning, which needs no semantic labels, we conduct experiments with backbones trained with DINO [69] and SwAV [68]. DINO [69] is a self-distillation framework that requires no labels; it feeds multiple crops (patches) of the same image to the model and updates a teacher network's parameters as an exponential moving average of the student. SwAV [68] simultaneously clusters the data and enforces consistency between cluster assignments produced for different augmentations of the same image. The results in Table 17.3 show that both supervised and self-supervised learning can provide good immutable parameters, and that our method is robust to different backbone pretraining. Note how Pascal-VOC is a much smaller dataset, which may explain the lower overall accuracy; with any of the other (larger) datasets, accuracy is almost the same, suggesting that our method is not highly dependent on a specific set of backbone features.

Learning paradigm        | Method/dataset    | Average Acc (53-dataset)
Supervised Learning      | ImageNet-1k [479] | 86.05 %
Supervised Learning      | Pascal VOC [137]  | 82.49 %
Self-supervised Learning | SwAV [68]         | 85.12 %
Self-supervised Learning | DINO [69]         | 85.77 %

Table 17.3: Influence of different task-agnostic immutable parameters. Both supervised learning and self-supervised learning can provide relatively good immutable parameters for our method, which shows that our method is robust to different backbones.

17.5 Conclusion

We propose Channel-wise Lightweight Reprogramming (CLR), a parameter-efficient add-on continual learning method that allows a single network to learn potentially unlimited parallel input-to-output mappings and to switch on the fly between them at runtime. CLR adds channel-wise lightweight linear reprogramming to shift the original pretrained fixed parameters to each task, which is simple and generalizable to any CNN-based model. The experiments on continually learning 53 different and challenging tasks show that the CLR method achieves state-of-the-art performance on task-incremental continual learning. Besides high performance, CLR is also parameter-efficient, requiring only 0.59% extra parameters to learn a new task.

Appendix

17.6 Details of our 53-dataset for continual learning and performance

Figure 17.7 shows a summary of the 53 datasets we used as the continual learning benchmark in our main paper.
The figure also shows the detailed per-task accuracy of our method and baselines after learning all 53 tasks in the task-incremental continual learning setting.

Figure 17.7: Statistics of the datasets and per-task accuracy of our method and baselines after learning all 53 tasks in the continual learning setting. Ablation columns indicate our method with different initialization weights.

17.7 Channel-wise linear reprogramming ability

To further understand the performance of channel-wise lightweight reprogramming achieved by channel-wise linear transformation, we conduct qualitative experiments to explore the ability of the CLR layer to transfer a feature map from a pretrained immutable parameter set (starting point) to a target parameter set (goal). Usually, the pretrained weights are not good enough due to the domain gap between the pretraining dataset/learning paradigm and the target dataset, and relatively good performance can be achieved either by finetuning the whole backbone on the target dataset (FINETUNE) or by learning from scratch (randomly initialized backbone) on the target task dataset (SCRATCH). We will show that, with the help of a very cheap CLR layer, a feature map from a pretrained (non-optimal) model can be reprogrammed towards a "relatively optimal" feature map obtained by either finetuning the whole backbone (FINETUNE) or training from scratch (SCRATCH). We choose two datasets: the CLEVR dataset and the Kannada-MNIST dataset. Model performance on the CLEVR dataset reaches 46.09% with a pretrained ResNet-50 backbone + linear head, 97.66% with FINETUNE, and 91.41% with SCRATCH. In this scenario, the pretrained model has a large accuracy gap with FINETUNE and SCRATCH. It is therefore interesting to see whether the CLR layer can reprogram a feature map obtained from the pretrained model towards a feature map obtained by FINETUNE or SCRATCH, which would show the ability of the CLR layer to bridge a large domain gap. Model performance on the Kannada-MNIST dataset reaches 95.77% with a pretrained backbone + linear head, 99.62% with FINETUNE, and 100% with SCRATCH. Here, SCRATCH performs better than FINETUNE, which shows that the pretrained weights may have no benefit (or may even be harmful) for target task learning. Here we want to show that the CLR layer can reprogram a feature map obtained from the pretrained model towards the feature map obtained by SCRATCH. We use the feature map after the first convolutional layer of the different models (pretrained, FINETUNE, and SCRATCH). Taking the feature map from the pretrained model as input and the feature map from FINETUNE (or SCRATCH) as output, we use a CLR layer (3x3 2D depthwise convolutional kernels) to learn the mapping, i.e., the channel-wise linear transformation between them. The qualitative results are shown in Figure 17.8. Specifically, in Figure 17.8, we visualize feature maps that initially have a large gap between the pretrained model and FINETUNE (or SCRATCH). The results show that, after the channel-wise linear transformation, the features from the pretrained model can be reprogrammed towards the goal features (FINETUNE or SCRATCH).

Figure 17.8: Qualitative results of the CLR transformation ability on the CLEVR and Kannada-MNIST datasets. We visualize feature maps in the first residual group of ResNet-50 that initially have a large gap between pre-train and FINETUNE (or SCRATCH). The results show that after the channel-wise linear transformation, the features after pre-train can be reprogrammed towards the goal features (FINETUNE or SCRATCH). Pretrained indicates the frozen ImageNet-pretrained ResNet-50 backbone; Finetune is a finetuned ResNet-50 backbone with ImageNet-pretrained initialization; Scratch is a ResNet-50 backbone trained from random initialization.
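As a minimal illustration of this fitting procedure, the sketch below regresses a depthwise 3x3 "CLR" convolution from one stack of feature maps onto another with an MSE loss; the tensors feats_pretrain and feats_target are hypothetical stand-ins for the feature maps extracted from the pretrained and FINETUNE/SCRATCH models, and the optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for feature maps extracted after the same layer of the
# pretrained model (input) and of the FINETUNE or SCRATCH model (target).
N, C, H, W = 32, 64, 56, 56
feats_pretrain = torch.randn(N, C, H, W)
feats_target = torch.randn(N, C, H, W)

# One 3x3 kernel per channel, i.e. a depthwise convolution, identity-initialized.
clr = nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C, bias=False)
nn.init.zeros_(clr.weight)
with torch.no_grad():
    clr.weight[:, 0, 1, 1] = 1.0

opt = torch.optim.Adam(clr.parameters(), lr=1e-3)
for step in range(2000):
    loss = nn.functional.mse_loss(clr(feats_pretrain), feats_target)
    opt.zero_grad()
    loss.backward()
    opt.step()
# After fitting, clr(feats_pretrain) approximates the target features channel by channel.
```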
17.8 Bootstrapping results

Fig. 5 in the main paper shows the average accuracy as more tasks are learned. However, the slope of this curve is also influenced by the order of the tasks (i.e., hard tasks placed early in the sequence make the average accuracy tend to increase as more tasks are added, while easy tasks placed early make it tend to decrease), which is entangled with the effect of catastrophic forgetting. We use bootstrapping to show the tendency of the average accuracy as more tasks are learned. Specifically, for any number of tasks (t ∈ {1, ..., 53}) that we want to consider in one continual learning setting, we randomly sample t tasks from the 53 tasks 50,000 times with replacement and compute the mean accuracy (mean) and standard deviation (std). Figure 17.9 shows the bootstrapping statistics, i.e., how the mean and std change as we increase the total number of tasks. The X-axis represents the task number t we want to consider. For instance, if the continual learning task number is t=10, then we randomly sample 10 tasks from the 53-dataset and calculate the mean accuracy; we repeat the sampling 50,000 times to obtain the std. The Y-axis shows the mean accuracy (solid blue line) on the sampled tasks (with replacement), with the std as the shaded light blue range. Since in our CLR method the order of tasks does not matter (we have the same performance on a specific task no matter the sequence), this allows us to simulate what would happen if we learned a different sample of tasks for a given task number t. We observe that the mean accuracy is stable and not influenced by t when the number of samples is large. For the std, when the task number t is small, the std is relatively large, and the std decreases as the task number t increases; when t=53, the std becomes zero.

Figure 17.9: Bootstrapping statistics. The X-axis represents the number of tasks t in a specific continual learning setting. The Y-axis shows the mean accuracy (solid blue line) on the sampled tasks (with replacement), with the std as the shaded light blue range.

Figure 17.10 shows the bootstrapping statistics with detailed maximum and minimum accuracy logs. The X-axis represents the number of tasks t in a specific continual learning setting. The Y-axis shows the mean accuracy (solid blue line) on the sampled tasks (with replacement). The shaded light blue range shows the min and max range for the given task number t among the 50,000 task samples. We use the solid red line to represent our reported results from Fig. 5 of the main paper, which fall within the shaded light blue range.

Figure 17.10: Bootstrapping statistics with detailed accuracy log. The X-axis represents the number of tasks t in a specific continual learning setting. The Y-axis shows the mean accuracy (solid blue line) on the sampled tasks (with replacement). The shaded light blue range shows the min and max range for the given task number t. The solid red line represents our reported results from Fig. 5 of the main paper, which fall within the shaded light blue range.
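A minimal sketch of this bootstrapping procedure is shown below, following the with-replacement sampling described above; the list task_accuracies is a hypothetical placeholder for the 53 final per-task accuracies.

```python
import numpy as np

rng = np.random.default_rng(0)
task_accuracies = rng.uniform(0.5, 1.0, size=53)   # placeholder for the 53 per-task accuracies

n_resamples = 50_000
means, stds = [], []
for t in range(1, 54):
    # Sample t tasks with replacement, n_resamples times, and average their accuracies.
    samples = rng.choice(task_accuracies, size=(n_resamples, t), replace=True)
    avg_acc = samples.mean(axis=1)      # average accuracy of each simulated t-task curriculum
    means.append(avg_acc.mean())
    stds.append(avg_acc.std())
# means[t-1] and stds[t-1] give the solid line and shaded band of Figure 17.9 for task count t.
```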
17.9 More experiments to explore the trade-off between parameters and performance

Several other versions of our method trade extra parameters for accuracy. Our main method, CLR (the one in the paper), adds a CLR layer after every original convolutional kernel except for the 1×1 kernels, which saves many parameters. The CLR-Full version applies the CLR layer to all convolutional kernels in the pretrained model; it reaches 85.85% accuracy and costs 1.69× the parameters of our main method (CLR). The CLR-Reduced version adds a smaller CLR layer with 1×1 2D reprogramming kernels after all 1×1 original conv kernels, and the normal CLR layer with 3×3 2D reprogramming kernels after the remaining conv kernels; it reaches 85.7% accuracy and costs 1.08× the parameters of our main method (CLR). The CLR-mixed version learns a weighted combination of the original and our reprogrammed feature maps. The intuition is that we keep some proportion of the original features and add the new features learned after reprogramming. Specifically, a trainable parameter A decides the weights of the summation of the reprogrammed feature map and the original feature map. Equation 17.2 shows the details of the weighted combination:

x̂′_k = A · CLR_k(x′_k) + (1 − A) · x′_k    (17.2)

where x′_k is the k-th channel of the feature map from the original convolutional layer and CLR_k is the corresponding linear transformation. CLR-mixed reaches 86.25% accuracy and costs 1.79× the parameters of our main method (CLR). The results are shown in Figure 17.11. In principle, more trainable parameters could lead to better performance, and this holds for the CLR-mixed version, which achieves +0.2% over our main method. Interestingly, the CLR-Full version achieves lower average accuracy than the main method, even though most of its per-task accuracies are higher (43 out of 53 tasks).

Figure 17.11: Per-task accuracy of our main method and other versions of our method after learning all 53 tasks in the continual learning setting.

17.10 Transfer learning with a CLR-based model

We apply our CLR method to the transfer learning problem, where we only care about accuracy on the target dataset and do not need to maintain performance on previous datasets. Datasets. We use the same 53-dataset to evaluate transfer learning performance. Specifically, we use the ImageNet-pretrained ResNet-50 model as initialization and apply our method and 4 baseline transfer learning methods 53 times, on 53 different classification tasks. Baselines. We have four baseline methods: 1) learning from scratch (SCRATCH), where the ResNet-50 backbone is randomly initialized with no prior knowledge, and the whole network is then trained from scratch on the training set of each task; 2) finetuning the whole backbone and last layer (FINETUNE); 3) finetuning only the last layer (LINEAR); 4) the Head2Toe method [136], which uses the fixed backbone and needs two steps: (i) feature selection: train the model by adding a large fully-connected layer between all intermediate features and the last layer, and select the important connections by adding regularization; (ii) keep the important skip connections and retrain the added layers. Figure 17.12 shows the average accuracy on all 53 classification tasks and the details of each task, and Figure 17.13 shows the detailed results for transfer learning.
Our CLR achieves the best average accuracy on the 53-dataset compared with all baselines. Specifically, CLR achieves an almost 5% average improvement over Head2Toe on the 53-dataset, and an even larger improvement over LINEAR, FINETUNE, and SCRATCH. This shows the effectiveness of the CLR-based model for transfer learning problems.

Figure 17.12: Bar plot of transfer learning performance on the 53-dataset.

17.11 CIFAR-100 Result

We also report our method's results on the incremental CIFAR-100 dataset, together with previous baselines, in Table 17.4.

Figure 17.13: Transfer learning results on the 53-dataset for our method and other baselines (LINEAR, SCRATCH, FINETUNE, and Head2Toe).

Method     | Average Acc
LwF        | 24%
iCARL      | 49%
RPS        | 57%
CCLL       | 85%
EWC        | 41%
SI         | 52%
CLR (Ours) | 94.2%

Table 17.4: We applied our method to the CIFAR-100 dataset with 10 tasks, each containing 10 classes, with comparisons to baselines from CCLL, using ResNet-18 as the backbone.

Chapter 18
Conclusions

Modern AI models are becoming increasingly powerful, characterized by ever-larger parameter sizes. Correspondingly, the size and quality of datasets needed to train these advanced models have also seen a significant increase. In the context of scalable model training, data has emerged as a crucial factor, often acting as a bottleneck. This is because, from a knowledge transfer perspective, a model is essentially a concentrated form of the knowledge contained within the data. Therefore, the creation of datasets that are not only larger in size but also higher in quality is becoming increasingly vital. However, the use of real-world datasets is often hampered by challenges such as the high costs of data annotation, privacy concerns, and difficulties in scaling. This thesis has thoroughly investigated controllable AI-generated data and its critical role in scalable model training, introducing a range of innovative methods and insights. We have shown that generative models can be precisely directed in various ways, including adjusting single or multiple fine-grained attributes, managing high-level categories and compositions, controlling physical properties in 3D, and influencing overall data distribution. AI-generated data, serving either as a supplement or a replacement for real data, proves invaluable for training and evaluating downstream models. A key development discussed in the second part (Chapters 8-9) is the shift from human-led to model-driven data generation. This transition represents a major step forward, highlighting the benefits of on-demand data generation for improving the efficiency and independence of model training. The third part of this thesis (Chapters 10-15) introduces the concept of model explainability and generalization, serving as a vital feedback mechanism in two key areas: (1) It offers insights into model failures, reasoning processes, and knowledge limitations, providing signals to enhance model performance. (2) It utilizes explainability and robustness as feedback tools to guide data generation, further improving model efficacy. The innovative approaches explored here not only contribute to better model performance but also significantly aid in extending the models' applicability to a diverse range of new tasks (generalization). In the last part (Chapters 16-17), this thesis takes an innovative approach by redefining model parameters as a unique form of "Data" for efficient knowledge sharing, particularly in the context of lifelong learning.
This novel perspective paves the way for more effective transfer and use of knowledge among various AI models. It addresses critical challenges in knowledge retention and adaptability within neural networks, marking a significant advancement in the field. In summary, this thesis highlights the significant role of AI-generated data in revolutionizing model training. It showcases how this data can effectively substitute real-world data, leading to more efficient, flexible, and self-reliant sources for training and evaluating AI models. The methodologies and insights developed here, particularly in controllable data generation, provide innovative tools and viewpoints to influence large-scale AI model training. The progression from controlled data generation to understanding model explainability, and reimagining model parameters as data, represents a comprehensive approach in AI development. The author hopes that this thesis will inspire readers and the broader community, sparking new ideas and encouraging further exploration in the ever-evolving field of AI.

Bibliography

[1] http://lib.stat.cmu.edu/datasets/boston. [2] https://link.springer.com/article/10.1023/A:1007421730016. [3] https://www.science.org/doi/full/10.1126/science.1105809. [4] Hassan Abu Alhaija, Siva Karthik Mustikovela, Lars Mescheder, Andreas Geiger, and Carsten Rother. “Augmented reality meets computer vision: Efficient data generation for urban driving scenes”. In: International Journal of Computer Vision 126 (2018), pp. 961–972. [5] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. “SLIC superpixels compared to state-of-the-art superpixel methods”. In: IEEE transactions on pattern analysis and machine intelligence 34.11 (2012), pp. 2274–2282. [6] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. “Learning representations and generative models for 3d point clouds”. In: International conference on machine learning. PMLR. 2018, pp. 40–49. [7] Amina Adadi and Mohammed Berrada. “Peeking inside the black-box: A survey on Explainable Artificial Intelligence (XAI)”. In: IEEE Access 6 (2018), pp. 52138–52160. [8] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. “Sanity checks for saliency maps”. In: Advances in neural information processing systems 31 (2018). [9] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. “Don’t just assume; look and answer: Overcoming priors for visual question answering”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 4971–4980. [10] Jiwoon Ahn, Sunghyun Cho, and Suha Kwak. “Weakly supervised learning of instance segmentation with inter-pixel relations”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019, pp. 2209–2218. [11] Ali Jahanian, Lucy Chai, and Phillip Isola. “On the "steerability" of generative adversarial networks”. In: CoRR (2019). [12] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. “Memory aware synapses: Learning what (not) to forget”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 139–154. [13] Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. “Expert gate: Lifelong learning with a network of experts”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 3366–3375. [14] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio.
“Gradient based sample selection for online continual learning”. In: Advances in neural information processing systems 32 (2019). [15] Rahaf Aljundi, Marcus Rohrbach, and Tinne Tuytelaars. “Selfless sequential learning”. In: arXiv preprint arXiv:1806.05421 (2018). [16] Yonah Amir, Michal Harel, and Rafael Malach. “Cortical hierarchy reflected in the organization of intrinsic connections in macaque monkey visual cortex”. In: Journal of Comparative Neurology 334.1 (1993), pp. 19–46. [17] Phil Ammirato, Patrick Poirson, Eunbyung Park, Jana Košecká, and Alexander C Berg. “A dataset for developing and benchmarking active vision”. In: 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE. 2017, pp. 1378–1385. [18] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. “Bottom-up and top-down attention for image captioning and visual question answering”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018, pp. 6077–6086. [19] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. “Learning to learn by gradient descent by gradient descent”. In: Advances in Neural Information Processing Systems. 2016, pp. 3981–3989. [20] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. “Vqa: Visual question answering”. In: Proceedings of the IEEE international conference on computer vision. 2015, pp. 2425–2433. [21] Pablo Arbeláez, Jordi Pont-Tuset, Jonathan T Barron, Ferran Marques, and Jitendra Malik. “Multiscale combinatorial grouping”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2014, pp. 328–335. [22] Sercan O. Arik and Tomas Pfister. TabNet: Attentive Interpretable Tabular Learning. 2019. [23] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. “Invariant risk minimization”. In: arXiv preprint arXiv:1907.02893 (2019). [24] Martin Arjovsky, Soumith Chintala, and Léon Bottou. “Wasserstein generative adversarial networks”. In: International conference on machine learning. PMLR. 2017, pp. 214–223. [25] Devansh Arpit, Stanisław Jastrz˛ebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A Closer Look at Memorization in Deep Networks. 2017. [26] Aditya Arun, CV Jawahar, and M Pawan Kumar. “Weakly supervised instance segmentation by learning annotation consistent instances”. In: European Conference on Computer Vision. Springer. 2020, pp. 254–270. 348 [27] Yuval Atzmon and Gal Chechik. “Probabilistic AND-OR Attribute Grouping for Zero-Shot Learning”. In: Uncertainty in Artificial Intelligence. 2018. [28] Frank Cole Babbitt et al. Plutarch’s moralia. Vol. 197. W. Heinemann, 1927. [29] Mohammad Taha Bahadori, Krzysztof Chalupka, Edward Choi, Robert Chen, Walter F Stewart, and Jimeng Sun. “Causal regularization”. In: arXiv preprint arXiv:1702.02604 (2017). [30] Amr Bakry and Ahmed Elgammal. “Untangling object-view manifold for multiview recognition and pose estimation”. In: European Conference on Computer Vision. Springer. 2014, pp. 434–449. [31] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers. 2022. 
DOI: 10.48550/ARXIV.2211.01324. [32] Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. “ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models”. In: Advances in Neural Information Processing Systems. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett. Vol. 32. Curran Associates, Inc., 2019. URL: https://proceedings.neurips.cc/paper/2019/file/97af07a14cacba681feacf3012730892- Paper.pdf. [33] Peter L. Bartlett, Andrea Montanari, and Alexander Rakhlin. “Deep learning: a statistical viewpoint”. In: Acta Numerica 30 (2021), pp. 87–201. DOI: 10.1017/S0962492921000027. [34] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. “ARKitScenes - A Diverse Real-World Dataset for 3D Indoor Scene Understanding Using Mobile RGB-D Data”. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1). 2021. URL: https://openreview.net/forum?id=tjZjv_qh_CE. [35] Harkirat Singh Behl, Atilim Güne¸s Baydin, Ran Gal, Philip HS Torr, and Vibhav Vineet. “Autosimulate:(quickly) learning synthetic data generation”. In: European Conference on Computer Vision. Springer. 2020, pp. 255–271. [36] Abhijit Bendale and Terrance Boult. “Towards open world recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, pp. 1893–1902. [37] Yoshua Bengio, Aaron Courville, and Pascal Vincent. “Representation learning: A review and new perspectives”. In: IEEE transactions on pattern analysis and machine intelligence 35.8 (2013), pp. 1798–1828. [38] Ari S Benjamin, David Rolnick, and Konrad Kording. “Measuring and regularizing networks in function space”. In: arXiv preprint arXiv:1805.08289 (2018). [39] Lucas Beyer, Olivier J Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. “Are we done with imagenet?” In: arXiv preprint arXiv:2006.07159 (2020). 349 [40] Avishek Bhattacharjee, Samik Banerjee, and Sukhendu Das. “PosIX-GAN: Generating multiple poses using GAN for Pose-Invariant Face Recognition”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 0–0. [41] Sai Bi, Zexiang Xu, Pratul Srinivasan, Ben Mildenhall, Kalyan Sunkavalli, Miloš Hašan, Yannick Hold-Geoffroy, David Kriegman, and Ravi Ramamoorthi. “Neural reflectance fields for appearance acquisition”. In: arXiv preprint arXiv:2008.03824 (2020). [42] Irving Biederman. “Recognition-by-components: a theory of human image understanding.” In: Psychological review 94.2 (1987), p. 115. [43] Alsallakh Bilal, Amin Jourabloo, Mao Ye, Xiaoming Liu, and Liu Ren. “Do convolutional neural networks learn class hierarchy?” In: IEEE transactions on visualization and computer graphics 24.1 (2017), pp. 152–162. [44] John Binder, Daphne Koller, Stuart Russell, and Keiji Kanazawa. “Adaptive probabilistic networks with hidden variables”. In: Machine Learning 29.2 (1997), pp. 213–244. [45] Julian Bitterwolf, Alexander Meinke, Maximilian Augustin, and Matthias Hein. Revisiting Out-of-Distribution Detection: A Simple Baseline is Surprisingly Effective. 2022. URL: https://openreview.net/forum?id=-BTmxCddppP. [46] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. 
“End to end learning for self-driving cars”. In: arXiv preprint arXiv:1604.07316 (2016). [47] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konecnˇ y, Stefano Mazzocchi, Brendan McMahan, et al. ` “Towards federated learning at scale: System design”. In: Proceedings of machine learning and systems 1 (2019), pp. 374–388. [48] Ali Borji, Saeed Izadi, and Laurent Itti. “ilab-20m: A large-scale controlled object dataset to investigate deep learning”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 2221–2230. [49] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. “Food-101 – Mining Discriminative Components with Random Forests”. In: European Conference on Computer Vision. 2014. [50] Nick Bostrom and Eliezer Yudkowsky. “The ethics of artificial intelligence”. In: Artificial intelligence safety and security. Chapman and Hall/CRC, 2018, pp. 57–69. [51] Jeffrey S Bowers. Grandmother cells and localist representations: a review of current thinking. 2017. [52] Christopher Bowles, Liang Chen, Ricardo Guerrero, Paul Bentley, Roger Gunn, Alexander Hammers, David Alexander Dickie, Maria Valdés Hernández, Joanna Wardlaw, and Daniel Rueckert. “Gan augmentation: Augmenting training data using generative adversarial networks”. In: arXiv preprint arXiv:1810.10863 (2018). 350 [53] Garrick Brazil and Xiaoming Liu. “M3d-rpn: Monocular 3d region proposal network for object detection”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 9287–9296. [54] Wieland Brendel and Matthias Bethge. “Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet”. In: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL: https://openreview.net/forum?id=SkfMWhAqYQ. [55] Andrew Brock, Jeff Donahue, and Karen Simonyan. “Large Scale GAN Training for High Fidelity Natural Image Synthesis”. In: International Conference on Learning Representations. 2019. URL: https://openreview.net/forum?id=B1xsqj09Fm. [56] Tim Brooks, Aleksander Holynski, and Alexei A Efros. “Instructpix2pix: Learning to follow image editing instructions”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023, pp. 18392–18402. [57] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. “Language models are few-shot learners”. In: Advances in neural information processing systems 33 (2020), pp. 1877–1901. [58] Anna L Buczak and Erhan Guven. “A survey of data mining and machine learning methods for cyber security intrusion detection”. In: IEEE Communications surveys & tutorials 18.2 (2015), pp. 1153–1176. [59] Joy Buolamwini and Timnit Gebru. “Gender shades: Intersectional accuracy disparities in commercial gender classification”. In: Conference on fairness, accountability and transparency. PMLR. 2018, pp. 77–91. [60] Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. “Understanding disentangling in beta-VAE”. In: arXiv preprint arXiv:1804.03599 (2018). [61] Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. “Dark experience for general continual learning: a strong, simple baseline”. In: Advances in neural information processing systems 33 (2020), pp. 15920–15930. 
[62] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. “Activitynet: A large-scale video benchmark for human activity understanding”. In: Proceedings of the ieee conference on computer vision and pattern recognition. 2015, pp. 961–970. [63] Charles F Cadieu, Ha Hong, Daniel LK Yamins, Nicolas Pinto, Diego Ardila, Ethan A Solomon, Najib J Majaj, and James J DiCarlo. “Deep neural networks rival the representation of primate IT cortex for core visual object recognition”. In: PLoS computational biology 10.12 (2014), e1003963. [64] Berk Calli, Arjun Singh, Aaron Walsman, Siddhartha Srinivasa, Pieter Abbeel, and Aaron M Dollar. “The ycb object and model set: Towards common benchmarks for manipulation research”. In: 2015 international conference on advanced robotics (ICAR). IEEE. 2015, pp. 510–517. 351 [65] Berk Calli, Aaron Walsman, Arjun Singh, Siddhartha Srinivasa, Pieter Abbeel, and Aaron M Dollar. “Benchmarking in manipulation research: The YCB object and model set and benchmarking protocols”. In: arXiv preprint arXiv:1502.03143 (2015). [66] Jonathan S Cant and Melvyn A Goodale. “Attention to form or surface properties modulates different regions of human occipitotemporal cortex”. In: Cerebral Cortex 17.3 (2007), pp. 713–731. [67] Jonathan S Cant, Mary-Ellen Large, Lindsay McCall, and Melvyn A Goodale. “Independent processing of form, colour, and texture in object perception”. In: Perception 37.1 (2008), pp. 57–78. [68] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. “Unsupervised learning of visual features by contrasting cluster assignments”. In: Advances in neural information processing systems 33 (2020), pp. 9912–9924. [69] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. “Emerging properties in self-supervised vision transformers”. In: Proceedings of the IEEE/CVF international conference on computer vision. 2021, pp. 9650–9660. [70] Rich Caruana. “Multitask learning”. In: Machine learning 28.1 (1997), pp. 41–75. [71] Paola Cascante-Bonilla, Khaled Shehada, James Seale Smith, Sivan Doveh, Donghyun Kim, Rameswar Panda, Gul Varol, Aude Oliva, Vicente Ordonez, Rogerio Feris, et al. “Going beyond nouns with vision & language models using synthetic data”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023, pp. 20155–20165. [72] Raghavendra Chalapathy, Aditya Krishna Menon, and Sanjay Chawla. “Anomaly detection using one-class neural networks”. In: arXiv preprint arXiv:1802.06360 (2018). [73] Raghavendra Chalapathy, Aditya Krishna Menon, and Sanjay Chawla. “Robust, deep and inductive anomaly detection”. In: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2017, Skopje, Macedonia, September 18–22, 2017, Proceedings, Part I 10. Springer. 2017, pp. 36–51. [74] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. “Matterport3D: Learning from RGB-D Data in Indoor Environments”. In: International Conference on 3D Vision (3DV) (2017). [75] Huiwen Chang, Jingwan Lu, Fisher Yu, and Adam Finkelstein. “Pairedcyclegan: Asymmetric style transfer for applying and removing makeup”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 40–48. [76] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T. 
Freeman, Michael Rubinstein, Yuanzhen Li, and Dilip Krishnan. Muse: Text-To-Image Generation via Masked Generative Transformers. 2023. DOI: 10.48550/ARXIV.2301.00704.
[77] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N. Balasubramanian. “Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks”. In: WACV. 2018.
[78] Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. “Efficient lifelong learning with a-gem”. In: arXiv preprint arXiv:1812.00420 (2018).
[79] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc’Aurelio Ranzato. “On tiny episodic memories in continual learning”. In: arXiv preprint arXiv:1902.10486 (2019).
[80] Guowei Chen, Yi Liu, Jian Wang, Juncai Peng, Yuying Hao, Lutao Chu, Shiyu Tang, Zewu Wu, Zeyu Chen, Zhiliang Yu, et al. “PP-Matting: High-Accuracy Natural Image Matting”. In: arXiv preprint arXiv:2204.09433 (2022).
[81] Hongming Chen, Ola Engkvist, Yinhai Wang, Marcus Olivecrona, and Thomas Blaschke. “The rise of deep learning in drug discovery”. In: Drug discovery today 23.6 (2018), pp. 1241–1250.
[82] Ricky T. Q. Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. “Isolating Sources of Disentanglement in Variational Autoencoders”. In: Advances in Neural Information Processing Systems. 2018.
[83] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. “Infogan: Interpretable representation learning by information maximizing generative adversarial nets”. In: Advances in neural information processing systems. 2016, pp. 2172–2180.
[84] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew G Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun. “3d object proposals for accurate object class detection”. In: Advances in neural information processing systems 28 (2015).
[85] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. “Multi-view 3d object detection network for autonomous driving”. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2017, pp. 1907–1915.
[86] Yongqiang Chen, Yatao Bian, Kaiwen Zhou, Binghui Xie, Bo Han, and James Cheng. “Rethinking Invariant Graph Representation Learning without Environment Partitions”. In: (2023).
[87] Yun Chen, Frieda Rong, Shivam Duggal, Shenlong Wang, Xinchen Yan, Sivabalan Manivasagam, Shangjie Xue, Ersin Yumer, and Raquel Urtasun. “Geosim: Realistic video simulation via geometry-aware composition for self-driving”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021, pp. 7230–7240.
[88] Hao Cheng, Yufei Wang, Haoliang Li, Alex C. Kot, and Bihan Wen. “Disentangled Feature Representation for Few-shot Image Classification”. In: CoRR abs/2109.12548 (2021). arXiv: 2109.12548. URL: https://arxiv.org/abs/2109.12548.
[89] Brian Cheung, Alex Terekhov, Yubei Chen, Pulkit Agrawal, and Bruno Olshausen. “Superposition of many models into one”. In: arXiv preprint arXiv:1902.05522 (2019).
[90] David Maxwell Chickering. “Optimal structure identification with greedy search”. In: Journal of machine learning research 3.Nov (2002), pp. 507–554.
[91] Hyunsun Choi, Eric Jang, and Alexander A Alemi. “WAIC, but why? generative ensembles for robust anomaly detection”. In: arXiv preprint arXiv:1810.01392 (2018).
[92] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation”.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 8789–8797.
[93] Hisham Cholakkal, Guolei Sun, Fahad Shahbaz Khan, and Ling Shao. “Object counting and instance segmentation with image-level supervision”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 12397–12405.
[94] F. Chollet. “Xception: Deep Learning with Depthwise Separable Convolutions”. In: CVPR. 2017, pp. 1800–1807.
[95] François Chollet. “Xception: Deep learning with depthwise separable convolutions”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 1251–1258.
[96] Benoît Colson, Patrice Marcotte, and Gilles Savard. “An overview of bilevel optimization”. In: Annals of operations research 153.1 (2007), pp. 235–256.
[97] Michael Crawshaw. “Multi-task learning with deep neural networks: A survey”. In: arXiv preprint arXiv:2009.09796 (2020).
[98] Elliot Creager, Jörn-Henrik Jacobsen, and Richard Zemel. “Environment inference for invariant learning”. In: International Conference on Machine Learning. PMLR. 2021, pp. 2189–2200.
[99] Alexander D’Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yian Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Natarajan, Christopher Nielson, Thomas F. Osborne, Rajiv Raman, Kim Ramasamy, Rory Sayres, Jessica Schrouff, Martin Seneviratne, Shannon Sequeira, Harini Suresh, Victor Veitch, Max Vladymyrov, Xuezhi Wang, Kellie Webster, Steve Yadlowsky, Taedong Yun, Xiaohua Zhai, and D. Sculley. Underspecification Presents Challenges for Credibility in Modern Machine Learning. 2020.
[100] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. “ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes”. In: Proc. Computer Vision and Pattern Recognition (CVPR), IEEE. 2017.
[101] Danilo Jimenez Rezende and Shakir Mohamed. “Variational Inference with Normalizing Flows”. In: ICML. 2015.
[102] Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. “Embodied question answering”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1–10.
[103] Thomas Davenport and Ravi Kalakota. “The potential for artificial intelligence in healthcare”. In: Future healthcare journal 6.2 (2019), p. 94.
[104] Henry T. Davis and Michael L. Feldstein. “The generalized Pareto law as a model for progressively censored survival data”. In: Biometrika 66.2 (1979), pp. 299–306. DOI: 10.1093/biomet/66.2.299.
[105] Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. “A continual learning survey: Defying forgetting in classification tasks”. In: IEEE transactions on pattern analysis and machine intelligence 44.7 (2021), pp. 3366–3385.
[106] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. “Objaverse-XL: A Universe of 10M+ 3D Objects”. In: arXiv preprint arXiv:2307.05663 (2023).
[107] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. “Objaverse: A Universe of Annotated 3D Objects”. In: arXiv preprint arXiv:2212.08051 (2022).
[108] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. “Objaverse: A Universe of Annotated 3D Objects”. In: arXiv preprint arXiv:2212.08051 (2022).
[109] Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. “ProcTHOR: Large-Scale Embodied AI Using Procedural Generation”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 5982–5994.
[110] Dominik Dellermann, Adrian Calma, Nikolaus Lipusch, Thorsten Weber, Sascha Weigel, and Philipp Ebel. The future of human-AI collaboration: a taxonomy of design knowledge for hybrid intelligence systems. 2021. arXiv: 2105.03354 [cs.AI].
[111] Jia Deng, Nan Ding, Yangqing Jia, Andrea Frome, Kevin Murphy, Samy Bengio, Yuan Li, Hartmut Neven, and Hartwig Adam. “Large-scale object classification using label relation graphs”. In: ECCV. 2014, pp. 48–64.
[112] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “Imagenet: A large-scale hierarchical image database”. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE. 2009, pp. 248–255.
[113] Li Deng. “The mnist database of handwritten digit images for machine learning research”. In: IEEE Signal Processing Magazine 29.6 (2012), pp. 141–142.
[114] Maximilian Denninger, Martin Sundermeyer, Dominik Winkelbauer, Youssef Zidan, Dmitry Olefir, Mohamad Elbadrawy, Ahsan Lodhi, and Harinandan Katam. “Blenderproc”. In: arXiv preprint arXiv:1911.01911 (2019).
[115] Maximilian Denninger, Dominik Winkelbauer, Martin Sundermeyer, Wout Boerdijk, Markus Knauer, Klaus H. Strobl, Matthias Humt, and Rudolph Triebel. “BlenderProc2: A Procedural Pipeline for Photorealistic Rendering”. In: Journal of Open Source Software 8.82 (2023), p. 4901. DOI: 10.21105/joss.04901.
[116] Sachin S Deshmukh, Jeremy L Johnson, and James J Knierim. “Perirhinal cortex represents nonspatial, but not spatial, information in rats foraging in the presence of objects: comparison with lateral entorhinal cortex”. In: Hippocampus 22.10 (2012), pp. 2045–2058.
[117] Jeevan Devaranjan, Amlan Kar, and Sanja Fidler. “Meta-sim2: Unsupervised learning of scene structure for synthetic data generation”. In: European Conference on Computer Vision. Springer. 2020, pp. 715–733.
[118] Edgar A DeYoe, George J Carman, Peter Bandettini, Seth Glickman, Jon Wieser, Robert Cox, David Miller, and Jay Neitz. “Mapping striate and extrastriate visual areas in human cerebral cortex”. In: Proceedings of the National Academy of Sciences 93.6 (1996), pp. 2382–2386.
[119] Prafulla Dhariwal and Alexander Nichol. “Diffusion models beat gans on image synthesis”. In: Advances in neural information processing systems 34 (2021), pp. 8780–8794.
[120] Diederik Kingma and Max Welling. “Auto-Encoding Variational Bayes”. In: ICLR. 2014.
[121] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. “Cogview: Mastering text-to-image generation via transformers”. In: Advances in Neural Information Processing Systems 34 (2021).
[122] Santosh K Divvala, Derek Hoiem, James H Hays, Alexei A Efros, and Martial Hebert.
“An empirical study of context in object detection”. In: 2009 IEEE Conference on computer vision and Pattern Recognition. IEEE. 2009, pp. 1271–1278.
[123] Carl Doersch and Andrew Zisserman. “Sim2real transfer learning for 3D human pose estimation: motion to the rescue”. In: NeurIPS. 2019.
[124] Barbara Dosher and Zhong-Lin Lu. “Visual perceptual learning and models”. In: Annual review of vision science 3 (2017), pp. 343–363.
[125] Finale Doshi-Velez and Been Kim. “Towards a rigorous science of interpretable machine learning”. In: arXiv preprint arXiv:1702.08608 (2017).
[126] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL: https://openreview.net/forum?id=YicbFdNTTy.
[127] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. “FlowNet: Learning Optical Flow With Convolutional Networks”. In: ICCV. Dec. 2015.
[128] Nikita Dvornik, Julien Mairal, and Cordelia Schmid. “Modeling visual context is key to augmenting object detection datasets”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 364–380.
[129] Debidatta Dwibedi, Ishan Misra, and Martial Hebert. “Cut, paste and learn: Surprisingly easy synthesis for instance detection”. In: Proceedings of the IEEE international conference on computer vision. 2017, pp. 1301–1310.
[130] Christopher L Edwards, Perrine M Ruby, Josie E Malinowski, Paul D Bennett, and Mark T Blagrove. “Dreaming and insight”. In: Frontiers in Psychology 4 (2013), p. 979.
[131] Mathias Eitz, James Hays, and Marc Alexa. “How do humans sketch objects?” In: ACM Tra