NETWORKED COOPERATIVE PERCEPTION: TOWARDS ROBUST AND EFFICIENT AUTONOMOUS DRIVING

by

Hang Qiu

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Electrical Engineering)

May 2021

Copyright 2021 Hang Qiu

To my parents

Acknowledgements

From reading my first National Science Foundation (NSF) grant to writing an NSF proposal that was granted, my Ph.D. study has been a wonderful and fruitful journey of my life. I am so fortunate to have Prof. Ramesh Govindan as my advisor guiding me through. "My goal is to help all of you achieve your goals", said Ramesh when it was his turn to share semester goals at one of our kickoff group meetings, a simple yet elegant sentence that summarizes years of dedication to his students. Throughout this journey, Ramesh has always been mind-blowingly inspirational, with accurate and objective judgments, and strong, unwavering trust and support. He is my advisor, my mentor, my role model, and my precious friend. He would answer my calls day and night no matter how small my trouble was. I am forever grateful and indebted to Ramesh and will continue the dedication as I pursue my academic career.

I am also blessed with exciting collaborations with Prof. Konstantinos Psounis, Prof. Marco Gruteser, Dr. Krishna Chintalapudi, and Dr. Fan Bai, without whom much of my research would not have been possible [177, 116, 186, 182, 184, 181, 126, 125]. Konstantinos taught me one of my first graduate level classes and treated me with tons of fun playing with software-defined radios (WARPs) in the anechoic chamber at USC, which resulted in a publication in Mobihoc'16 [186]. He continues to offer candid advice on both research and the academic career, which has been tremendously helpful. Marco and Fan were on the first NSF grant that I worked on. We held weekly meetings and only got to meet each other online until years later. I remember vividly the first time meeting Marco in person at a conference in Germany, and Fan when he visited our lab. When we met, our collaboration had already gone fruitful and we were close colleagues. Yet finally meeting in person brought us even closer. Likewise memorable are the two fantastic summers I spent in Seattle working with Krishna as his intern at Microsoft Research. Much credit for my Ph.D. journey goes to the great advice, both on my research and on the academic career, from these mentors, as well as from my collaborators Prof. Giuseppe Caire, Prof. Keith Chugg, Prof. Gaurav Sukhatme, Prof. Tarek Abdelzaher, Prof. B. S. Manjunath, Dr. Rahul Urgaonkar, and Dr. Swati Rallapalli. I am so grateful to have had the opportunity to work with them. I would also like to thank my thesis committee members, Prof. Konstantinos Psounis and Prof. Joseph Lim, who have been immensely helpful in shaping this dissertation.

Throughout the journey, I am extremely lucky to have had talented student collaborators along the way. Fawad Ahmad, Pohan Huang, Xiaochen Liu, and Matthew McCartney have been my phenomenal partners for years, who have spent countless days and nights working together with me, building and debugging various systems, playing with LiDARs and stereo cameras, reading sensors from vehicle on-board diagnostics (OBD) buses, and taking test drives around campus using their personal cars as well as our lab's treasure, an antique Buick Lucerne. It is also a fortune to cross my Ph.D.
trajectory with that of a group of amazing labmates at the Networked Systems Lab (NSL): Zahaib, Yitao, Rui, Jianfeng, Mingyang, Jane, Haonan, Xing, Yurong, Bin, Shuai, Pradipta, Christina, Yi-Ching, Tobias, Luis, Masoud, Omid, Yuliang, Xuemei, Matt (Calder), Kyriakos, Sucha, Weiwu, Xiao, Rajrup, Pooria, Sulagna, and Aastha, who have accompanied my journey, tuned in to practice talks, offered valuable feedback, volunteered to participate in endless tests and experiments, and shared each other's ups and downs.

I also owe many thanks to the department advisors Diane (ECE) and Lizsl (CS), and lab admins Michael and Jack. They helped me focus on my research while taking care of everything else from scheduling to equipment, travel, food, etc.

My gratitude also goes to my close friends who came to the US in the same year as me pursuing their graduate degrees: Yuke, Xinyu, Zhengping, Huijun, Jiaxiao, Qingkai, Ye, Longfeng, Qi, and Wei. Besides texts and calls, we share wonderful memories in San Diego at Christmas, in Seattle under July 4th fireworks, and in the Grand Canyon counting stars. The friendship has been a constant source of joy and happiness, a strong support and motivation, a treasure in my heart.

Finally, I am much obliged to my parents, grandparents, aunts, uncle, and cousins, for their unconditional support, sacrifices, patience, guidance, and love. Words are insufficient to express my gratitude. I love you all.

Table of Contents

Dedication
Acknowledgements
List Of Tables
List Of Figures
Abstract

Chapter 1: Introduction
    1.1 Autonomous Driving Perception
    1.2 Cooperative Perception for Autonomous Vehicles
    1.3 Systems towards Cooperative Perception
    1.4 Vision beyond Cooperative Perception

Chapter 2: Related Work

Chapter 3: Augmented Vehicular Reality
    3.1 Background and Motivation
    3.2 AVR Design
        3.2.1 Relative Localization
        3.2.2 Extending Vehicular Vision
        3.2.3 Detecting and Isolating Dynamic Objects
        3.2.4 Extracting Object Motion and Reconstruction
        3.2.5 Adaptive Frame Transmission
        3.2.6 Cooperative AVR
    3.3 AVR Optimizations
    3.4 AVR Evaluation
        3.4.1 The Benefits of AVR for ADAS and Autonomous Driving
        3.4.2 AVR End-to-End Performance
        3.4.3 Accuracy Results
        3.4.4 Throughput and Latency
    3.5 Limitation and Future Work
    3.6 Related Work
Chapter 4: Scalable Cooperative Perception
    4.1 AUTOCAST Motivation and Overview
    4.2 Control Plane
        4.2.1 Metadata Exchange
        4.2.2 AUTOCAST Scheduler
            4.2.2.1 Preliminaries - Notation and PHY layer
            4.2.2.2 Problem Formulation - Markov Decision Process
            4.2.2.3 Scheduling Algorithms
    4.3 Data Plane
        4.3.1 Spatial Reasoning
        4.3.2 Trajectory Planning
    4.4 Performance Evaluation
        4.4.1 Methodology
        4.4.2 End-to-end Scenario Evaluation
        4.4.3 Experiments with V2V radios
        4.4.4 Scheduling Algorithms
    4.5 Discussion
    4.6 Related Work

Chapter 5: Automated Groundtruth Collection and Quality Control for Machine Vision
    5.1 Satyam Overview
        5.1.1 Design Goals
        5.1.2 Satyam Abstractions
        5.1.3 Satyam Architecture
    5.2 Job Rendition
    5.3 Quality Control
        5.3.1 Dominant Compact Cluster
        5.3.2 Fusion Details
        5.3.3 Result Evaluation
    5.4 HIT Management
    5.5 Evaluation
        5.5.1 ML Benchmark Datasets
        5.5.2 Quality of Satyam Groundtruth
        5.5.3 Re-training Models using Satyam
        5.5.4 Satyam In Real-World Deployments
        5.5.5 Time-to-Completion and Cost
        5.5.6 Price Adaptation
        5.5.7 Worker Filtering
        5.5.8 Parameter Sensitivity
    5.6 Related Work

Chapter 6: Hybrid Human-Machine Labeling to Reduce Annotation Cost
    6.1 Problem Formulation
    6.2 Cost Prediction Models
        6.2.1 Predicting |S*(D,B)| as a function of |B|
        6.2.2 Modeling Active Learning Training Costs
    6.3 The MCAL Algorithm
    6.4 Evaluation
        6.4.1 Reduction in Labeling Costs using MCAL
        6.4.2 Effect of cheaper labeling costs
        6.4.3 Gains from Active Learning
        6.4.4 Effect of Relaxing Accuracy Requirement
        6.4.5 Efficacy of Truncated Power Law
    6.5 Related Work

Chapter 7: Conclusion
    7.1 Future Directions

Reference List

List Of Tables

6.1 Training Costs (USD/Image) for various DNNs
6.2 Summary of Results
6.3 Oracle Assisted Active Learning
6.4 Values of optimal choices discovered by MCAL as a fraction of |X| for ε = 10%
6.5 Values of optimal choices discovered by MCAL as a fraction of |X| for ε = 10%

List Of Figures

1.1 3D Sensors used in Autonomous Driving
1.2 LiDAR Point Cloud and Stereo Camera Images
1.3 AVR extends vision beyond line-of-sight.
1.4 AutoCast coordinates V2X communications in a busy intersection.
1.5 Satyam collects high quality labels for the KITTI [93] and Waymo [232] datasets in just 2 days, with 10x lower cost than public labeling services. We have used Satyam to build FourSeasons [9], a trans-seasonal detection and tracking dataset.
3.1 AVR allows a follower vehicle to see objects that it cannot otherwise see because it is obstructed by a leader vehicle. The pictures show a bird-eye view of a leader, a follower and a merged point cloud with different sensing modalities. The top (a) is generated by stereo cameras, while the one on the bottom (b) is obtained from LiDARs.
3.2 Mockup of a heads-up display with AVR's extended vision.
3.3 AVR sender and receiver side components.
3.4 The green dots represent static features that are used to construct the sparse HD map, while features from the moving vehicle are filtered out.
3.5 An illustration of a homography transformation.
The frame on the right is taken immediately before the frame on the left, and the green box represents the part of the left frame visible on the right. The lines match features in the two frames.
3.6 Estimating the motion of a vehicle. The lines show the motion of features (represented by circles) across two successive frames.
3.7 Pipelining the different components carefully enables components to be executed in parallel, resulting in higher frame rates.
3.8 Path Planning and Road Detection in Action: with extended vision, the oncoming vehicle and more road surface area is detected, and the path planner decides to not attempt overtaking.
3.9 AVR helps the follower detect twice the road area that it might have otherwise.
3.10 With AVR, the follower would avoid the overtake maneuver.
3.11 End-to-End results demonstrating AVR over a 60GHz wireless link between two cars in motion.
3.12 Two AVR use cases with different delay requirements: a stationary obstruction, and the overtaking scenario.
3.13 The maximum permissible end-to-end delay for the two use cases, for two different sensor types, as a function of speed.
3.14 Experiment setup for evaluating end-to-end reconstruction accuracy.
3.15 End-to-End reconstruction error, and its main components: camera pose estimation, motion compensation, and camera calibration.
3.16 At lower bandwidths, reconstruction accuracy can increase by half a meter because of errors in speed estimation.
3.17 At higher delays, dead reckoning can help contain reconstruction errors significantly, motivating our use of velocity vectors.
3.18 LiDAR can improve speed estimates by an order of magnitude.
3.19 Pipelining enables AVR to process 30fps with an end-to-end delay of under 100ms.
4.1 AUTOCAST enables multi-vehicle cooperative perception into occlusion and beyond sensing range.
4.2 AUTOCAST System Architecture
4.3 AUTOCAST Scheduler Domain.
4.4 An illustration of H^n_{i,k}.
4.5 Empty occupancy grids indicate occluded area.
4.6 Overtaking
4.7 Unprotected Left Turn
4.8 Red Light Violation
4.9 Crash, Deadlocks and Near Miss.
4.10 Reaction Time, Near-miss and Crash Details
4.11 Scheduled vs. DSRC Transmission: Overtaking
4.12 Scheduled vs. DSRC Transmission: Unprotected Left Turn
4.13 Scheduled vs. DSRC Transmission: Red Light Violation
4.14 Total Rewards vs. Number of Vehicles
4.15 Computation Time vs. Number of Vehicles
4.16 Latency vs. Normalized Rewards
4.17 Distributed AUTOCAST Sharing Region.
5.1 Examples of Satyam Results on Detection, Segmentation, and Tracking.
5.2 Satyam's jobs, tasks and HITs
5.3 Satyam Job Categories
5.4 Overview of Satyam's components
5.5 Image Classification Task Page
5.6 Object Counting Task Page
5.7 Object Detection and Localization Task Page
5.8 Object Segmentation Task Page
5.9 Multi-Object Tracking Task Page
5.10 Amazon MTurk HITs Web Portal
5.11 Quality Control Loop in Satyam
5.12 Example groundtruth fusion in Multi-object Detection
5.13 Groundtruth fusion in Multi-object Tracking is a 3-D extension of Multi-object detection
5.14 Satyam Accuracy, Latency, and Cost
5.15 Example Results of Satyam Detection from KITTI
5.16 Example Results of Satyam Segmentation from PASCAL
5.17 Example Results of Satyam Tracking from KITTI
5.18 Linguistic confusion between van and truck
5.19 Confusion Matrix of Satyam Result on ImageNet-10
5.20 Confusion Matrix of Satyam Result on JHMDB
5.21 Example of counting error resulting from partially visible cars
5.22 Example of missing segmentation labels from PASCAL. From left to right: raw image, PASCAL label, Satyam label. Satyam segments a small truck on the top left corner which was not present in the ground truth.
5.23 Training Performance of Satyam
5.24 End to End Training using Satyam Labels
5.25 CDF of Time Spent Per Task of All Job Categories
5.26 Improvement of performance with fine-tuned YOLO.
5.27 Histogram of Time Per Counting Task over Different Datasets
5.28 Adaptive Pricing on Counting Task
5.29 Approval Rates for various Satyam templates
5.30 Accuracy, Latency and Cost: Image Classification
5.31 Accuracy, Latency and Cost: Detection
5.32 Accuracy, Latency and Cost: Tracking
6.1 Differences between MCAL and Active Learning. Active learning outputs an ML model using few samples from the data set. MCAL completely annotates and outputs the dataset. MCAL must also use the ML model to annotate samples reliably (red arrow), unlike active learning schemes.
6.2 Relationship between training set size and generalization error (S_θ(D(B))), fit to a power law and a truncated power law, for CIFAR-10 using RESNET18 for various θ.
6.3 Dependence of (S_θ(D(B))) on δ is "small", especially towards the end of active learning. Here, |B| = 16,000 for CIFAR-10 using RESNET18.
6.4 Error prediction improves with increasing number of error estimates for CIFAR-10 using RESNET18.
6.5 Total cost of labeling for various data sets, for i) Human labeling, ii) MCAL and iii) Oracle assisted AL, for various DNN architectures.
6.6 Performance of MCAL compared to active learning with different batch sizes δ on Fashion using Amazon labeling.
6.7 Performance of MCAL compared to active learning with different batch sizes δ on CIFAR-10 using Amazon labeling.
6.8 Performance of MCAL compared to active learning with different batch sizes δ on CIFAR-100 using Amazon labeling.
6.9 AL training cost (in $) as a function of batch size (δ) for the Fashion Data Set using CNN18, RESNET18 and RESNET50.
6.10 AL training cost (in $) as a function of batch size (δ) for the CIFAR-10 Data Set using CNN18, RESNET18 and RESNET50.
6.11 AL training cost (in $) as a function of batch size (δ) for the CIFAR-100 Data Set using CNN18, RESNET18 and RESNET50.
6.12 Fraction of machine labeled images (|S*(D(B))| / |X|) using Naive AL for different fixed δ values.
6.13 Performance of MCAL compared to active learning with different batch sizes δ on Fashion using Satyam labeling
6.14 Performance of MCAL compared to active learning with different batch sizes δ on CIFAR-10 using Satyam labeling
6.15 Performance of MCAL compared to active learning with different batch sizes δ on CIFAR-100 using Satyam labeling
6.16 Gains due to use of active learning in MCAL using Amazon Labeling for ε = 0.05
6.17 Gains due to use of active learning in MCAL using Satyam Labeling for ε = 0.05
6.18 Cost savings obtained by relaxing quality requirement from ε = 0.05 to ε = 0.1 using Amazon Labeling
6.19 Cost savings obtained by relaxing quality requirement from ε = 0.05 to ε = 0.1 using Satyam Labeling
6.20 Power-law and Truncated Power-law fits on CIFAR-100 using CNN18
6.21 Power-law and Truncated Power-law fits on CIFAR-100 using RESNET18
6.22 Power-law and Truncated Power-law fits on CIFAR-100 using RESNET50

Abstract

Using advanced 3D sensors and sophisticated deep learning models, autonomous cars are already transforming how people commute daily. However, anticipating and managing corner-cases is a significant remaining challenge for the further advancement of autonomous cars. Human drivers, on the other hand, are extremely good at handling corner-cases, so if autonomous vehicles are to be widely accepted, they must achieve human-level reliability. In this dissertation, we show that wireless connectivity has the potential to fundamentally re-architect transportation systems beyond human-driver performance. The dissertation presents algorithms and systems that enable cooperative perception among networked vehicles and infrastructure sensors, which can substantially augment perception and driving capabilities.

The dissertation makes several fundamental contributions. Augmented Vehicular Reality (AVR) is the first system in which vehicles can exchange 3D sensor information to help each other see through obstacles, and thereby make more reliable and efficient driving decisions. To scale cooperative perception to multiple autonomous vehicles, the dissertation presents AutoCast, a scalable cooperative perception and coordination framework that enables efficient and cooperative exchange of sensor data and metadata in practical dense traffic scenarios. To effectively leverage shared sensors to improve the robustness of perception and driving, autonomous cars need large, high quality datasets. Satyam is an open-source annotation platform that automates human annotation collection and deploys quality control techniques to deal with untrained workers, human errors, and spammers. Finally, the dissertation includes a hybrid human-machine labeling framework, active labeling, that effectively leverages machine labeling to significantly reduce the total annotation cost. These annotations can be used to build novel large scale datasets that are tailored to new applications, such as training agents using extended vision.

Chapter 1: Introduction

Autonomous driving vehicles are rapidly clocking up test miles across the world. However, a significant remaining challenge to their widespread deployment is the ability to anticipate and manage unusual events and corner-cases. Human drivers, on the other hand, are extremely good at handling corner-cases. Current autonomous driving solutions [231, 63, 220, 42, 245, 170, 109, 35] imitate human driving behaviors by mounting advanced 3D sensors (e.g., LiDAR, RADAR, stereo cameras) to "see" the environment from the perspective of a single vehicle. Yet even with these sophisticated sensors, these prototypes find it hard to achieve even close to human-level reliability (100M miles between fatalities [169]).
This dissertation explores networked cooperative perception, an alternative approach to vehicular per- ception that leverages autonomous driving vehicles’ capability to communicate. This approach overcomes the traditional shortcomings of the line-of-sight visibility from individual 3D sensors, providing multiple vantage points for vehicles to collaboratively sense the environment beyond occlusion and sensing range. The dissertation also contributes to methods of acquiring high-quality training data in a cost-effective manner for AI-enabled perception pipelines. Specifically, this dissertation shows how to 1) extract, share, and merge critical 3D sensor data among vehicles so that vehicles can make more informed driving decisions, 2) architect vehicle coordination and communication mechanism to scale sensor sharing up to dense traffic scenarios with limited wireless bandwidth, 3) automate human annotations to build massive scale application-specific machine vision 1 datasets for novel deep learning models, 4) reduce annotation cost by incorporating machine labeling using active learning. The dissertation presents algorithmic innovations and system designs that shift the design paradigm from single-vehicle-based approach to collaborative perception that leverages shared sensors. Beyond sensing, the dissertation also paves the way towards a collaborative autonomous driving system, where vehicles can share a common awareness of the situation and collaboratively drive without traffic regulation. 1.1 Autonomous Driving Perception Autonomous vehicles use advanced sensors for depth perception, such as radar, LiDAR, and stereo cameras (Figure 1.1). These 3D sensors permit vehicles to precisely position themselves with respect to the surrounding environment, and to recognize objects and other hazards in the environment. These 3D sensors periodically capture, at rates of several 3D frames per second, representations of the environment called point clouds (Figure 1.2). A point cloud captures the 3D view, from the perspective of the vehicle, of static (e.g., buildings) and dynamic (e.g., other vehicles, pedestrians) objects in the environment. Because points in a point cloud are associated with three dimensional position information, vehicles can use pre-computed 3D maps of the environment, together with the outputs of these 3D sensors, to precisely position themselves. Figure 1.1: 3D Sensors used in Au- tonomous Driving Figure 1.2: LiDAR Point Cloud and Stereo Camera Images However, all of these 3D sensors suffer from two significant limitations. They have limited range which can further be impaired by extreme weather conditions, lighting conditions, sensor failures, etc. They 2 primarily 1 have line-of-sight visibility, so cannot perceive occluded objects. These limitations can impact the reliability of autonomous driving or ADAS systems in several situations: a car waiting to turn left at an unprotected left turn has limited visibility due to a truck waiting to turn left in the opposite direction; or, a car is forced to react quickly when the car ahead brakes suddenly due to an obstruction on the road that is not visible to vehicles behind it. 1.2 Cooperative Perception for Autonomous Vehicles By communicating and sharing sensors among vehicles, cooperative perception opens an entirely new perspective to approach to perception for autonomous driving. 
1.2 Cooperative Perception for Autonomous Vehicles

By communicating and sharing sensors among vehicles, cooperative perception opens an entirely new approach to perception for autonomous driving. Unlike the human driving perspective, which relies on a fusion of views observed from the windshield and rear-view mirrors, the extended vision given by shared sensors provides visibility into occluded areas and areas beyond the sensing range. This augmented visibility can give autonomous driving vehicles more reaction time to make better-informed decisions.

Shifting the design paradigm from single-vehicle-based perception to cooperative perception that leverages shared sensors raises several key challenges. 1) The limited bandwidth of vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communication requires a careful selection of compressed data to be shared. 2) Vehicles make safety-critical driving decisions in less than a second, which requires optimized coordination to deliver the shared data to each and every targeted recipient with the minimum latency possible. 3) In order to make sense of the shared data, training neural agents needs yet another round of human annotation and supervision, which is laborious and prohibitively expensive [7].

Tackling the challenges above requires novel algorithmic innovation as well as careful system design. For example, to bridge the gap between wireless channel capacity and sensor data bandwidth, we may need to design algorithms that aggressively identify the relevance and significance of data for sharing and carefully optimize transmissions on the wireless channel. To minimize the labor of human data annotation, we need to design a massive-scale annotation collection system that can produce high quality machine vision datasets in the face of untrained workers, human errors, and spammers.

This dissertation details the systems designed and the algorithms implemented to enable networked cooperative perception for autonomous driving. The algorithms and system designs are implemented using commercially available off-the-shelf LiDAR, stereo cameras, and wireless radios, and their feasibility, accuracy, and efficacy are demonstrated via thorough empirical evaluations.

1.3 Systems towards Cooperative Perception

This section gives an overview of the dissertation's work on cooperative perception for autonomous driving. To start with, Augmented Vehicular Reality (AVR) demonstrates the feasibility of cooperative perception, enabling real-time extended 3D vision beyond occlusion and sensing range between a pair of vehicles. Next, AutoCast illustrates the techniques needed to coordinate between vehicles in dense traffic scenarios. Then, the second part of the dissertation focuses on the annotation of these shared visual sensors for the training of autonomous driving agents. The dissertation presents the Satyam system, an open-source annotation platform that lets researchers collect massive-scale, high quality machine vision annotations within just a few days. These annotations can be used to build novel large scale datasets tailored to new applications, such as training agents using extended vision. Finally, the dissertation includes a hybrid human-machine labeling framework that effectively leverages machine labeling to significantly reduce the total annotation cost.

Augmented Vehicular Reality. Autonomous vehicles have 3D sensors such as multi-beam LiDAR, RADAR, and stereo cameras. These sensors can be used to detect, track, and localize moving objects in an instantaneous 3D view of the environment (e.g., a point cloud). However, these 3D sensors only provide line-of-sight perception, and obstacles can often block a vehicle's sensing range. To circumvent these limitations, Augmented Vehicular Reality (AVR) [177] builds an augmented vision (Figure 1.3) by wirelessly sharing visual information between vehicles, effectively extending the visual horizon. AVR devises novel localization techniques to position a recipient of a point cloud with respect to the sender, and to efficiently transform and render the point cloud from the receiver's perspective in real time. With limited wireless capacity, AVR detects and isolates objects in the environment using a homography transformation to significantly reduce the bandwidth requirement.
However, these 3D sensors only provide line-of-sight perception and obstacles can often block a vehicle’s sensing range. To circumvent these limitations, Augmented Vehicular Reality (A VR) [177] builds an augmented vision (Figure 1.3) by wirelessly sharing visual information between vehicles, effectively extending the visual horizon. A VR devises novel localization techniques to position a recipient of a point cloud with respect to the sender, efficiently transform and render the point cloud in the receiver’s perspective in real-time. With limited wireless capacity, A VR detects and isolates objects in the environment using a homography 4 Figure 1.3: A VR extends vision beyond line-of-sight. transformation to significantly reduce the bandwidth requirement. Further, A VR extracts motion vectors that enable reconstruction by dead-reckoning at the receiver and uses an adaptive transmission strategy that sends motion vectors instead of point clouds to cope with channel variability. A VR involves careful design of a multi-threaded pipeline that processes and transmits visual information at 30 fps with a <100ms end-to-end delay. The extended vision, when used as input to path planning algorithms, can avoid dangerous overtake attempts resulting from limited visibility. This work was awarded the best paper runner-up [2] at Mobisys’18, and highlighted in the GetMobile Magazine [179]. Together with my work CarMap [37], A VR was adopted by General Motors with two global patents granted [185, 38]. Cooperative Perception at Scale. A VR demonstrated the benefits of cooperative perception for au- tonomous vehicles. To scale A VR up to multiple vehicles, AutoCast, a cooperative perception and coor- dination architecture, enables efficient and cooperative exchange of sensor data and metadata in practical dense traffic scenarios. Take a busy intersection as an example (Figure 1.4), a left-turning car (H) needs information about cars rushing past yellow lights (A) in the opposite direction, which could be blocked by another left-turning truck (C). A right-turning car (I) needs pedestrian information that can be shared by other participants (F) or a road-side unit (RSU). Coordinating and scheduling these communications over the vehicular network before the data becomes irrelevant is challenging due to bandwidth limitations 5 A B C D F G H I J K M O N L E RSU C H: Vehicle A F C: Pedestrians F I: Vehicle C & Pedestrians H C: Vehicle I & peds Figure 1.4: AutoCast coordinates V2X communications in a busy intersection. and latency constraints. AutoCast enables vehicles to efficiently and cooperatively exchange sensor data by introducing a control plane for interaction, clustering, and scheduling. In AutoCast, vehicles extract objects and assess the relevance with respect to each receiver. AutoCast prioritizes important objects in the schedule while eliminating duplicates and redundancies. AutoCast scales gracefully with vehicle density, enabling transmissions of 2 to 3x more useful information than sharing data in a random order. Automated Groundtruth Collection and Quality Control for Machine Vision. To effectively process the cooperative perception result, autonomous driving agents often need retrained or fine-tuned models to actuate accordingly. Autonomous cars can use these models to make cooperative driving decisions. 
Automated Groundtruth Collection and Quality Control for Machine Vision. To effectively process the cooperative perception result, autonomous driving agents often need retrained or fine-tuned models to actuate accordingly. Autonomous cars can use these models to make cooperative driving decisions. To pave the way for correct actuation and enable novel cooperation schemes, my research looked at how to collect groundtruth annotations at scale (Satyam) and build huge datasets (FourSeasons) for a custom application with minimum cost (Active Labeling).

Groundtruth is crucial for testing and training machine learning (ML) based systems. Crowdtasking platforms, such as Amazon Mechanical Turk (AMT), can be used to acquire this groundtruth by obtaining results from multiple workers and fusing these results. Automating this fusion for various ML tasks is important to reduce the burden on the researcher, but is complicated by the need to ensure high quality in the face of untrained workers, human errors, and spammers. To abstract this laborious process, Satyam [182] introduces a unified framework for automated quality control for complex vision tasks such as object detection, tracking, and segmentation. Satyam also provides customizable UI templates for popular vision tasks, automates AMT task management (launching, payments, pricing, etc.), and filters inefficient workers. We validate Satyam's quality control techniques using several popular benchmarks (KITTI, Waymo, PascalVOC, etc.), achieving over 98% precision and recall and discovering up to 3% additional missing labels. Satyam is open-sourced for researchers to build their application-specific datasets.

Figure 1.5: Satyam collects high quality labels for the KITTI [93] and Waymo [232] datasets in just 2 days, with 10x lower cost than public labeling services. We have used Satyam to build FourSeasons [9], a trans-seasonal detection and tracking dataset.

We have used Satyam to collect and build a trans-seasonal detection and tracking benchmark, FourSeasons [9]. FourSeasons captures the long-term seasonal and diurnal variations of traffic surveillance cameras by providing annotations of videos spanning over a year. The FourSeasons benchmark can be used to evaluate the generalizability of machine vision models to lighting conditions, weather, and seasonal effects, which are critical to the robustness of autonomous driving perception. This line of research was adopted by Microsoft Azure ML and was featured in Microsoft Ignite 2019 [16].

Hybrid Human-Machine Active Labeling to Reduce Annotation Cost. Human annotations are prohibitively expensive. As an extension to Satyam, Active Labeling is an annotation framework that explores minimum-cost hybrid human-machine labeling. By bringing machine learning models into the labeling loop, Active Labeling uses active learning to train a classifier to accurately label part of the data set. We propose an iterative approach that minimizes total overall cost by, at each step, jointly determining which samples to label using humans and which to label using the trained classifier. We validate our approach on well-known public data sets such as Fashion, CIFAR-10 and CIFAR-100. In some cases, our approach has 6× lower overall cost relative to human labeling, and is always cheaper than the cheapest active learning strategy.
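To make the idea concrete, here is a minimal, hypothetical sketch of such a hybrid loop. It is not the MCAL algorithm of Chapter 6; the callables human_label, train_model, and confident_labels are placeholder hooks, and items are assumed to be hashable (e.g., file paths). Humans label a growing training set, and after each round the trained model labels whatever remaining samples it can predict confidently enough, with both labeling and training contributing to the running cost.

    def hybrid_label(unlabeled, human_label, train_model, confident_labels,
                     human_cost_per_item, train_cost_per_item, batch_size):
        """Illustrative hybrid human/machine labeling loop that tracks total cost."""
        labeled, total_cost = {}, 0.0
        pool = list(unlabeled)
        while pool:
            # Humans label the next batch.
            batch, pool = pool[:batch_size], pool[batch_size:]
            labeled.update({x: human_label(x) for x in batch})
            total_cost += human_cost_per_item * len(batch)
            # Retrain on everything labeled so far; training also costs money.
            model = train_model(labeled)
            total_cost += train_cost_per_item * len(labeled)
            # The model labels whatever it is confident about; the rest stays in the pool.
            machine = confident_labels(model, pool)   # dict: item -> predicted label
            labeled.update(machine)
            pool = [x for x in pool if x not in machine]
        return labeled, total_cost

The cost-minimization problem in Chapter 6 is essentially about choosing the batch sizes (and the stopping point) in a loop of this shape so that the sum of human labeling and iterative training costs is minimized.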
1.4 Vision beyond Cooperative Perception

This dissertation explores the first few key steps towards enabling cooperative perception. The architecture design, the algorithmic frameworks, and the specific techniques generalize along several dimensions. For example, the cooperative perception architecture is 1) not dependent on sensing modalities, 2) capable of detecting and sharing any form of object, with or without communication capability, 3) not dependent on any pre-defined vocabulary, and 4) agnostic to the specific machine vision tasks for which the shared sensor data is used.

The collaborative perspective extends beyond the sensing and perception problems addressed in this dissertation. The fact that vehicles can "see" the world from several vantage points inspires many brand new approaches (§7.1) to several aspects of autonomous driving (e.g., robotics, computer vision, machine learning, transportation infrastructure, security, etc.). For example, cooperative perception 1) enables the transition from designing standalone autonomous driving agents to developing collaborative driving systems, 2) motivates the revolution from infrastructure for human drivers to infrastructure for autonomous agents, and 3) brings new challenges to autonomous driving security in building trustworthy interactions among swarms of autonomous vehicles.

Fundamentally, this dissertation lays the foundation of cooperative perception, which provides an alternative to designing autonomous driving systems solely from the human driver's perspective. The dissertation contributes to the ongoing joint effort towards more robust and efficient autonomous driving and intelligent transportation systems.

Chapter 2: Related Work

Autonomous driving technologies have the potential to revolutionize intelligent transportation systems and are likely to have a huge impact on human lives. With the emergence of advanced wireless communication technologies such as 5G and Cellular-V2X [15], connected autonomous vehicles are becoming one of the key components of the smart city ecosystem. Building such technologies is a joint effort involving a wide coalition of several communities (e.g., mobile networking, robotics, computer vision, machine learning, security, etc.). The research presented in this thesis builds on recent literature on vehicle communication and networking, 3D sensing, autonomous vehicle perception, localization, and machine learning for vision.

• Connected Vehicles and Infrastructure. Connected vehicles promise a great opportunity to improve the safety and reliability of self-driving cars. Vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2X) communications both play an important role in sharing surrounding information among vehicles. Communication technologies, e.g., DSRC [133] and LTE-Direct [87, 187], provide capabilities to exchange information among cars using different types of transmission, i.e., multicasting, broadcasting, and unicasting. Work on connected vehicles [192] explored technical challenges in inter-vehicle communications and laid out a research agenda for future efforts in this space [74]. The networking community has studied content-centric [148] or cloud-mediated [147] inter-vehicle communication methods for exchanging sensor information between vehicles. Prior research has also explored using connected vehicle capabilities for Collaborative Adaptive Cruise Control (CACC) [40]. Automakers are deploying V2V/V2X communications in their upcoming models [27, 31]. The academic community has also started to build city-scale advanced wireless research platforms (COSMOS [5]), as well as large connected vehicle testbeds in the U.S.
(MCity [34]) and Europe (DRIVE C2X [209]), which give an opportunity to explore the application feasibility of connected vehicles via V2V communications in practice. Cooperative perception depends upon these technologies, and is a compelling use of them.

• Vehicular Sensing and Perception. Collecting visual information from sensors (e.g., LiDAR, stereo cameras, etc.) is a major part of autonomous driving systems. These systems rely on such information to make proper on-road decisions, based on detection [58, 243, 204], segmentation [113], tracking [157], and motion forecasting [158, 54]. In addition, there is a large body of work that explores vehicle context and behavior sensing [89, 195, 181] to enhance vehicular situational awareness. Thanks to advanced computer vision and mobile sensing technologies, all of this sensing capability can already be leveraged efficiently in a single-car setting [156]. This thesis takes the next step in designing how to share this information among nearby vehicles.

• Localization. Localization of vehicles is crucial for autonomous driving [215, 30, 221], and this line of research has explored two avenues. GPS enhancements focus on improving the accuracy of absolute GPS-based positioning using differential GPS [99], inertial sensors on a smartphone [50, 103], onboard vehicle sensors, digital maps, crowd-sourcing [126], and WiFi and cellular signals [214]. These enhancements increase the accuracy to a meter. AVR builds upon a line of work in the robotics community on simultaneous localization and mapping [76]. Visual SLAM techniques have used monocular cameras [68], stereo cameras [78, 235], infrared (Kinect [168]), as well as LiDAR [110]. Cooperative perception builds on existing work to develop an accurate relative localization capability using sparse 3D features detected in the environment. These features can be used to localize one vehicle with respect to another.

• Vehicle Sensor Sharing. Past research has attempted to realize some form of context sharing among vehicles. Rybicki et al. [196] discuss challenges of VANET-based approaches, and propose to leverage infrastructure to build distributed, cooperative Traffic Information Systems. Other work has explored robust inter-vehicle communication [192, 146, 67], an automotive ontology for intelligent vehicles [82], principles of community sensing that offer mechanisms for sharing data from privately held sensors [141], and sensing architectures [153] for semantic services. Motivated by the "see-through" system [95], several prior works on streaming video for enhanced visibility have been proposed [51, 190, 83, 128, 151]. This line of work has focused on scenarios with two cars, where a leader car delivers its whole video to a follower vehicle. The approach does not use 3D sensors, so it lacks the depth perception that is crucial for automated safety applications in autonomous driving or ADAS systems. This thesis enables clusters of vehicles to share 3D sensor information at scale.

• Visual Sensor Annotation. Large scale data annotation is one of the key enablers of machine learning for vision. ImageNet [69] is the first large scale computer vision dataset with millions of images annotated by human beings. This massive scale of annotation was enabled by using a crowdtasking platform, Amazon Mechanical Turk (AMT [33]). One key challenge in large scale annotation is to ensure output quality in the face of untrained workers, human errors, and spammers.
ImageNet collects classification labels and uses majority voting for consensus [69]. Prior work [211] has also shown crowdtasking to be successful for detection, where quality control is achieved by using workers to rate other workers, and by majority voting to pick the best bounding box. Other approaches have developed systems built on top of AMT for feature generation for sub-class labeling [70], and for sentence-level textual descriptions [143, 173]. This thesis presents a large scale annotation system for various complex machine vision tasks (detection, tracking, segmentation, etc.) and a generic consensus algorithm for quality control.

A second challenge in large scale annotation is cost. Prior work has explored the cost-quality tradeoff [91] in different crowd-tasking settings: de-aliasing entity descriptions [226, 134], or determining answers to a set of questions [135]. This thesis presents a hybrid human-machine labeling framework that uses active learning to train a model for labeling, to lower the total annotation cost. This work builds on two lines of machine learning literature: active learning and model generalization error estimation. Active learning [202] aims to reduce labeling cost in training a model by iteratively selecting the most informative samples for labeling. Early work focused on designing metrics for sample selection based on margin sampling [199, 124, 202], region-based sampling [62], max entropy [65] and least confidence [64]. Recent work has focused on developing metrics tailored to specific tasks, such as classification [201], detection [52, 131], and segmentation [138, 237]. However, these approaches only target model performance using the minimum samples possible, while ignoring the iterative training cost. The annotation framework presented in this thesis balances the training cost against the human annotation cost, aiming at a minimum total cost to annotate an entire dataset. The technique developed builds on empirical work that has observed a power-law relationship between generalization error and training set size [46, 111, 127, 197, 84, 59] across a wide variety of tasks and models.

Finally, the work presented in this thesis has been followed up by remarkable effort in the networking and mobile computing research community on edge-assisted vehicle cooperative perception [57, 43, 66, 219], directional vehicular connectivity using millimeter wave V2X [230], vehicular fog computing [244, 92], and vehicular networking security [90] and privacy [234]. The 3D object detection and sharing technique also led to the exploration of multi-user augmented reality systems [154, 175, 241, 41, 105].

Chapter 3: Augmented Vehicular Reality

Currently, autonomous driving vehicles use advanced 3D sensors to build 3D point cloud models for environment perception. As discussed in Chapter 1, these 3D sensors suffer from limitations of sensing range and line-of-sight visibility. These limitations can impact the reliability of autonomous driving or ADAS systems in situations where important details that can affect driving decisions are missed due to occlusion or limited sensing range (§1.1). In these situations, vehicles can benefit from wirelessly sharing visual information with each other, effectively extending the visual horizon of the vehicle and circumventing line-of-sight limitations.
This would augment vehicular visibility into hazards, and enable autonomous vehicles to improve perception under challenging scenarios (Figure 3.1), or ADAS technologies to guide human users in making proper driving decisions.

Figure 3.1: AVR allows a follower vehicle to see objects that it cannot otherwise see because it is obstructed by a leader vehicle. The pictures show a bird-eye view of a leader, a follower and a merged point cloud with different sensing modalities. The top (a) is generated by stereo cameras, while the one on the bottom (b) is obtained from LiDARs.
It also adapts gracefully to channel variability. A VR’s extended vision, when used as input to path planning algorithms, can avoid dangerous overtake attempts resulting from limited visibility. Using extensive traces of stereo camera data, we show that A VR can achieve 20-30 cm positioning accuracy, which corresponds to 2-10% of lane widths or vehicle lengths. A VR’s processing pipeline is optimized to process frames at 30fps. Its major source of error, speed estimation, can be dramatically improved by the use of LiDAR; we are currently working on incorporating LiDAR into A VR. 3.1 Background and Motivation 3D Sensors in Cars. Today, a few high-end vehicles have a 3D sensing capability that provides depth perception of the car’s surroundings. This capability is likely to become pervasive in future vehicles and can be achieved using 3D sensors such as advanced multi-beam LiDAR, RADAR, long-range ultrasonic and forward-facing or surrounding-view camera sensors. These 3D sensors can be used to detect and track moving objects, and to produce a high-definition (HD) map for localization [215, 222]. This HD map makes the car aware of its surroundings: i.e., where is the curb, what is the height of the traffic light, etc., and is able to provide meter-level mapping and absolute positioning accuracy. Recent research in autonomous driving [97, 109] has leveraged some of these advanced sensors to improve perception. All of these 3D sensors have one common feature: they generate periodic 3D frames, where each frame represents the instantaneous 3D view of the environment. A 2D image frame is represented by an array of pixels, but a 3D frame is represented by a point cloud. Each point in the point cloud frame contains the three-dimensional position (which enables depth perception) and (optionally) the color of the point. These 3D sensors can differ in the rate at which they generate point clouds, and their field of view. For example, 15 the 64-beam Velodyne LiDAR [24] can collect point clouds at 10 Hz containing a total of 2.2 million points each second encompassing a 360° view with an effective sensing range of 120 m. We primarily demonstrate and evaluate A VR using stereo cameras, which can collect point clouds at 60 Hz with over 55 million points per second, but with a limited field of view (110°) and an effective sensing range of 20 m [25]. The Problem. These 3D sensors only provide line-of-sight perception and obstacles can often block a vehicle’s sensing range. Moreover, the effective sensing range of these sensors is often limited by weather conditions (e.g., fog, rain, snow, etc.) or lighting conditions [207, 94]. These limitations can impact the efficacy of autonomous driving or advanced driver assistance systems (ADAS). Consider the following examples. A car is following a slow-moving truck on a single-lane highway. The car would like to overtake the truck, but its 3D sensor is obstructed by the truck so it cannot see oncoming cars in the opposite lane. Similarly, two cars, waiting to turn left at an unprotected intersection, can each be “blinded” by the other. As a third example, consider a platoon of two cars, a leader and a follower. The leader’s driver, distracted, brakes suddenly upon noticing a pedestrian entering a crosswalk. The follower, unable to see the pedestrian, cannot brake in time to prevent rear-ending the leader. In each of these scenarios, even if the 3D sensors are not obstructed, they may not have sufficient range to view oncoming vehicles or pedestrians. 
Augmented Vehicular Reality. In this chapter, we explore the feasibility of a simple idea: extending the visual range of vehicles by wirelessly communicating visual information between vehicles. We use the term Augmented Vehicular Reality (AVR) to denote the capability that embodies this idea. Specifically, in the one-lane highway scenario, if the truck were to communicate visual information from its 3D sensors to the car using some form of V2V technology, the car's autonomous driving or ADAS software could determine the safest time and speed at which to overtake the truck. Similarly, in the left-turn or platoon scenarios, the transmission of visual information can help cars turn or stop safely. In each of these cases, transmitting visual information can compensate for either the line-of-sight limitations of 3D sensors or their limited range.

Figure 3.2: Mockup of a heads-up display with AVR's extended vision.

AVR can also extend vehicular vision over larger spatial regions, or over time. It can be used to detect and warn users of hazards such as parked cars on the shoulder, or objects on the road. It can also be used to augment HD maps to include environment transients like lane closures or construction activity.

What Visual Information to Transmit. Instead of transmitting point clouds, AVR could choose to transmit object labels and associated metadata (e.g., the label "car" together with the car's position). This is similar to the labels exchanged over DSRC, except that in current DSRC protocols vehicles only share their own position, not the positions of other vehicles observed by their sensors. Labels lack contextual information present in point clouds, such as lane markers, traffic regulators, and unexpected objects not defined in the label dictionary. Such contextual information can improve the quality of driving decisions.

Moreover, point cloud transmission is necessary in some cases. The first is when the extended view is displayed using a heads-up display. Figure 3.2 shows a mock-up of a heads-up display in which a car can visualize a vehicle in the opposite lane. The second is when the sender and receiver have different autonomous driving algorithms and processing capabilities. Autonomous driving algorithms use images and point clouds to map, localize, detect, and control. Some of those algorithms, such as object detection [191] and path planning and steering control [55], are composed of deep and wide neural networks that require substantial computing resources, such as expensive GPUs. In general, the accuracy and speed of autonomous driving algorithms is proportional to the computing resources available to them. Higher-end (e.g., luxury) vehicles are likely to have more computing resources, and would be able to make more accurate control decisions if they were to receive raw point clouds from other cars than if they were to receive labels generated by cars with fewer computing resources.

Figure 3.3: AVR sender- and receiver-side components. The sender extracts features, isolates dynamic objects, compresses their point clouds, and estimates motion vectors; both sides localize against a shared sparse HD map, and the receiver applies a perspective transformation and reconstruction before feeding the result to autonomous driving or ADAS software.

3.2 AVR Design

AVR consists of two logically distinct sets of components (Figure 3.3).
One runs on a sender and contains the 3D frame processing algorithms that generate visual descriptions to be sent to one or more receivers. Receivers can either feed the received visual descriptions to a heads-up display, or reconstruct an extended view that combines the visual descriptions received from the sender with their own 3D sensor outputs. (Each vehicle includes both sender-side and receiver-side processing capabilities; we describe them separately for ease of exposition.) This extended view can be fed into ADAS or autonomous driving software.

AVR poses several challenges. First, each vehicle needs to transform the received visual information into its own view. To do this, AVR must estimate its position and orientation with respect to the sender of the raw sensor data, and then perform a perspective transformation that re-orients the received point cloud. To address this challenge, both the sender and the receiver share a sparse 3D map of the road and the road-side structures. This map is analogous to the 3D maps that autonomous driving systems use to position themselves with respect to the environment, but with one important difference: it is sparse, in that it only contains features (the green squares in Figure 3.4) extracted from the denser 3D maps used by these systems. For AVR, such a sparse map suffices for relative positioning. As we describe later, the sparse 3D map can be constructed offline, and potentially crowd-sourced. With this sparse map, the sender processes 3D frames from its camera, extracts features within those frames, and uses some of these features to position its own camera relative to the sparse 3D map. The sender sends its position and a compressed (see below) representation of the 3D frame to the receiver. The receiver uses the sender's camera coordinates, features extracted from its own 3D sensor, and its own copy of the sparse 3D map to estimate its position relative to the sender. After decompressing the received point clouds of dynamic objects, the receiver applies a perspective transformation to these objects to position them within its own coordinate frame of reference.

Second, if AVR were to transmit 3D frames at full frame rates, the bandwidth requirement could overwhelm current and future wireless technologies. Fortunately, successive 3D frames contain significant redundancy: static objects in the environment, in most cases, need not be communicated between vehicles because they may already be available in precomputed HD maps. For this reason, instead of sending full frames, an AVR sender can transmit point clouds representing dynamic objects (e.g., cars, pedestrians) within its field of view, together with the motion vectors of these objects. The receiver uses an object's motion vectors to reconstruct the object's position, and superimposes the received object's point cloud onto its own 3D frame.

Third, many 3D sensor processing algorithms are resource-intensive, and this impacts AVR in two ways. It can limit the rate at which frames are processed (the throughput), and lower frame rates can impact the accuracy of algorithms that detect and track objects or that estimate position. It can also increase the latency between when a 3D frame is captured and when the corresponding point cloud is received at another vehicle. AVR selects, where possible, lightweight sensor processing algorithms, and also optimizes its processing pipelines to permit high throughput and low end-to-end latency.
Its motion vectors permit receivers to hide latency, as described later. A complete realization of AVR must include mechanisms that prevent or detect sensor tampering and that protect the privacy of participants who share sensor data (§3.6); we have left such mechanisms to future work. Our initial design of AVR is based on relatively inexpensive off-the-shelf stereo cameras (two orders of magnitude cheaper than high-end LiDARs), but we also present some evaluations with a LiDAR device.

3.2.1 Relative Localization

In AVR, a receiver needs to be able to estimate its position relative to the sender. Absolute positions from GPS would suffice for this, but GPS is known to exhibit high error, especially in urban environments, even with positioning enhancements [126]. Specialized ranging hardware can estimate distance and relative orientation between vehicles, but would add to the overall cost of the vehicle. The depth perception from the 3D sensors can estimate the relative position between the sender and receiver, but in AVR the sender and receiver need not necessarily be within line-of-sight of each other.

Key idea. Instead, AVR uses prior work in stereo-vision based simultaneous localization and mapping (SLAM [166]). This work generates sparse 3D features of the environment (called ORB features [166]), where each feature is associated with a precise position relative to a well-defined coordinate frame of reference. AVR uses this capability as follows. Suppose car A drives through a street and computes the 3D features of the environment using its stereo vision camera. These 3D features contribute to a sparse 3D map of the environment, and each feature has a position relative to the position of A's camera at the beginning of A's scan of the street (we call this A's coordinate frame). Another car, B, if it has this static map, can use this idea in reverse: it can determine 3D features using its stereo camera, then position those features in car A's coordinate frame by matching them against the map, thereby enabling it to track its own camera's position. A third car, C, which also shares A's map, can likewise position itself in A's coordinate frame of reference. Thus, B and C can each position themselves in a consistent frame of reference, so that B can correctly determine its position relative to C, and correctly overlay C's shared view over its own. To our knowledge, this is a novel use of stereo-vision based SLAM.

Figure 3.4: The green dots represent static features that are used to construct the sparse HD map, while features from the moving vehicle are filtered out.

Generating the sparse 3D map. As a car traverses a street, all stable features on the street (buildings, traffic signs, sidewalks, etc.) are recorded, together with their coordinates, as if the camera were doing a 3D scan of the street. A feature is considered stable only when its absolute world 3D coordinates remain at the same position (within a noise threshold) across a series of consecutive frames. Figure 3.4 shows the features detected in an example frame: each green dot represents a stable feature. Features from moving objects, such as the passing car on the left in Figure 3.4, are not matched or recorded.
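To make the stability test concrete, the C++ sketch below shows one way to flag features whose estimated world position stays within a noise threshold over several consecutive frames. This is an illustrative sketch rather than AVR's actual implementation: the class name, the feature-id keying, and the specific threshold and frame-count parameters are assumptions made here for clarity.

#include <cmath>
#include <unordered_map>
#include <vector>

struct Point3 { double x, y, z; };

static double Dist(const Point3& a, const Point3& b) {
  return std::sqrt((a.x - b.x) * (a.x - b.x) +
                   (a.y - b.y) * (a.y - b.y) +
                   (a.z - b.z) * (a.z - b.z));
}

// Tracks per-feature observations across frames and reports features whose
// world position stays within a noise threshold for min_frames consecutive
// frames; only these are added to the sparse map, everything else is dropped.
class StableFeatureFilter {
 public:
  StableFeatureFilter(double noise_threshold_m, int min_frames)
      : thresh_(noise_threshold_m), min_frames_(min_frames) {}

  // Call once per frame with (feature id -> estimated world position);
  // returns the ids that have just become stable.
  std::vector<long> Update(const std::unordered_map<long, Point3>& obs) {
    std::vector<long> newly_stable;
    for (const auto& [id, pos] : obs) {
      auto it = tracks_.find(id);
      if (it == tracks_.end()) {
        tracks_[id] = {pos, 1};
        continue;
      }
      Track& t = it->second;
      if (Dist(t.last_pos, pos) <= thresh_) {
        if (++t.consecutive == min_frames_) newly_stable.push_back(id);
      } else {
        t.consecutive = 1;  // moving (or noisy) feature: restart the streak
      }
      t.last_pos = pos;
    }
    return newly_stable;
  }

 private:
  struct Track { Point3 last_pos; int consecutive; };
  double thresh_;
  int min_frames_;
  std::unordered_map<long, Track> tracks_;
};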
Each car can then crowd-source its collected map. Map crowd-sourcing is a more scalable way of collecting 3D maps than today's 3D map collection mechanisms, which use dedicated fleets of vehicles that require significant capital investment. We have developed a technique to stitch together crowd-sourced sparse 3D maps. Suppose car A traverses one segment of street X. If car B traverses even a small part of that same segment of X, and then traverses a perpendicular street Y, B can upload its sparse map to a cloud service. As long as B's traversal overlaps even a little with A's sparse map, the cloud service can combine the two sparse 3D maps by translating B's map into A's coordinate frame of reference, generating a composite sparse 3D map. This can be extended to multiple participants, and we have used this technique to generate a static map of our campus. The details of map stitching are described in a companion paper [39, 37].

In practice, we have found that AVR needs only one traversal of a road segment to collect features sufficient for a map, since these features represent static objects in the environment. The amount of data needed for each road segment depends on the complexity of the environment. As an example, AVR creates a 97 MB sparse HD map for a 0.1 mile stretch of road on our campus. If necessary, this can be compressed using standard compression techniques [200].

Using the sparse 3D map for relative positioning. The sender processes each 3D frame, extracts ORB features, then matches these against the sparse 3D map. AVR uses the ORB-SLAM2 software [166], which matches up to 2000 features and uses these to triangulate the sender's camera position at that frame. This process runs continuously, computing, at each frame, the sender's camera position relative to the sparse 3D map's coordinate frame. The sender continuously transmits these position estimates to the receiver. The receiver uses the same technique to estimate the position of its own camera at every 3D frame with respect to the sparse 3D map's coordinate frame. The sender and receiver can be synchronized to within inter-frame granularity using network time synchronization methods [162, 159], and the receiver can then use the sender's position estimates to determine its position relative to the sender.

3.2.2 Extending Vehicular Vision

With the help of the sparse 3D map's coordinate frame, vehicles are able to precisely localize themselves (more precisely, their cameras), in both 3D position and orientation, with respect to other vehicles. However, vehicles can have very different perspectives of the world depending on the location and orientation of their sensors. So, if car A (the sender) wants to share its 3D frame with car B (the receiver), AVR needs to transform the frame's point cloud from the sender's local view to that of the receiver. This perspective transformation is performed at the receiver (Figure 3.3).

Key Idea. The receiver knows the pose (position and orientation) of the sender's camera, as well as of its own camera, in the sparse map's coordinate frame. Using these, it can compute a transformation matrix T_cw, shown in Equation 3.1, which contains a 3x3 rotation matrix and a 3-element translation vector. This transformation matrix describes how to transform a position from one camera's coordinate frame to the sparse 3D map's coordinate frame. Specifically, to transform a point V = [x, y, z, 1]^T in the camera (c) domain to a point V' = [x', y', z', 1]^T in the world (w) domain, we use V' = T_cw * V, where

    T_cw = [ RotX.x  RotY.x  RotZ.x  Translation.x
             RotX.y  RotY.y  RotZ.y  Translation.y
             RotX.z  RotY.z  RotZ.z  Translation.z
               0       0       0          1        ]        (3.1)

The perspective transformation. Now, suppose that the sender's transformation matrix is T_aw and the receiver's is T_bw. Then the receiver can compute the perspective transformation of a point V_a in the sender's view to a point V_b in the receiver's view as V_b = T_bw^-1 * T_aw * V_a.
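The following C++ sketch (using the Eigen library) illustrates this transformation: the receiver composes T_bw^-1 * T_aw once per frame and applies it to every received point. The function and variable names here are illustrative assumptions, not AVR's actual interfaces.

#include <Eigen/Dense>
#include <vector>

// Re-orients a sender's points into the receiver's camera frame using the
// camera-to-world poses T_aw (sender) and T_bw (receiver) estimated against
// the shared sparse map, i.e., V_b = T_bw^-1 * T_aw * V_a.
std::vector<Eigen::Vector3d> TransformToReceiverFrame(
    const std::vector<Eigen::Vector3d>& sender_points,
    const Eigen::Matrix4d& T_aw,    // sender camera -> world (sparse map frame)
    const Eigen::Matrix4d& T_bw) {  // receiver camera -> world
  const Eigen::Matrix4d T_ba = T_bw.inverse() * T_aw;  // sender -> receiver
  std::vector<Eigen::Vector3d> out;
  out.reserve(sender_points.size());
  for (const auto& p : sender_points) {
    Eigen::Vector4d v(p.x(), p.y(), p.z(), 1.0);  // homogeneous coordinates
    Eigen::Vector4d vb = T_ba * v;
    out.emplace_back(vb.head<3>());
  }
  return out;
}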
Now, suppose that the sender’s transformation matrix isT aw and the receiver’s isT bw , then, the receiver can compute the perspective transformation of a pointV a in the sender’s view to a pointV b in the receiver’s view as follows:V b =T −1 bw ∗T aw ∗V a . 3.2.3 Detecting and Isolating Dynamic Objects Transferring full 3D frames between a sender and a receiver stresses the capabilities of today’s wireless technologies: VGA 3D frames at 10fps require 400 Mbps, while 1080p frames need 4 Gbps. So, A VR will require several techniques to reduce the raw sensor data to be transmitted between vehicles. Key idea. A VR extracts and transmits point clouds, at the sender, only for dynamic objects (moving vehicles, pedestrians etc.). To identify a moving object, A VR can analyze the motion of each point in successive frames. Unfortunately, it is a non-trivial task to match two points among consecutive frames. Prior point cloud matching techniques [115, 149] involve heavy computation unsuitable for real-time applications. A VR exploits the fact that its cameras capture video, and matches pixels between successive frames instead of points. It can do this because, for stereo vision cameras, a pixel and its corresponding point share the same 2D coordinate. The homography transformation. A VR uses the ORB features extracted in every frame to find the homography transformation matrix between two successive frames. This matrix determines the position of 23 Figure 3.5: An illustration of a homography transformation. The frame on the right is taken immediately before the frame on the left and the green box represents the part of the left frame visible on the right. The line match features in the two frames. the current frame in the last frame. Because vehicles usually move forward, the last frame often captures more of the scene than the current frame (Figure 3.5). A VR matches ORB features between successive frames to compute the homography matrixH, a 2D matrix that can transform one pixel (P = [x,y] T ) to the same pixel (P 0 = [x 0 y 0 ] T ) in the previous frame with a different location (P 0 =H∗P ). A VR’s sender computes this transformation continuously for every successive pair of frames. Detecting dynamic objects. Once pixels can be matched between frames, A VR can match the correspond- ing points because of the correspondence between the two. It computes the Euclidean distance between matching points from consecutive frames. Before calculating this point displacement, A VR transforms the matching points into a common coordinate frame (e.g., the current frame, using perspective transformation, discussed above). It then applies a threshold to determine the points belonging to dynamic objects. This threshold is necessary because sensor noise can result in small displacements even for points belonging to static objects: with a vehicle cruising at 20mph, and a stereo camera recording at 30fps, the average displacement of the stationary points is< 5cm per frame. To further reduce bandwidth requirements, A VR uses point cloud compression techniques [200]. 3.2.4 Extracting Object Motion and Reconstruction Even after extracting dynamic objects and compressing them, the bandwidth requirements for A VR can still be significant. 24 Figure 3.6: Estimating the motion of a vehicle. The lines show the motion of features (represented by circles) across two successive frames. Key idea. 
3.2.4 Extracting Object Motion and Reconstruction

Even after extracting dynamic objects and compressing them, the bandwidth requirements for AVR can still be significant.

Figure 3.6: Estimating the motion of a vehicle. The lines show the motion of features (represented by circles) across two successive frames.

Key idea. To further reduce bandwidth requirements, AVR estimates, at the sender side, the motion vector of the dynamic objects detected in the previous step. It then transmits the compressed point cloud for a dynamic object from one frame, and only the motion vector for, say, the k frames following that frame (AVR adapts k dynamically, as discussed below). Since the motion vector can be represented compactly, this provides an additional bandwidth saving of nearly a factor of k.

Estimating the motion vector. AVR leverages computations performed in the previous step to estimate the motion vector. It determines which ORB features belong to dynamic objects using the homography matrix (§3.2.3), then computes the average motion vector of those feature points as an estimate of the object's motion (Figure 3.6). When there are multiple moving objects, AVR uses optical flow segmentation [227] to separate the objects and compute separate motion vectors. Finally, based on the frame timestamps, AVR converts each motion vector into speed and direction estimates (velocity vectors) and transmits these to the receiver. To get smoother velocity vector estimates, AVR tracks the same set of features over m frames, and computes the motion vector for the current frame using features from m frames in the past, rather than from the immediately previous frame. Features may not persist after a few frames, and new features can appear in a frame, so AVR must use a consistent set of features while permitting feature entry and removal; we omit the details for brevity.

Reconstruction. On the receiver, AVR uses the point cloud and subsequent velocity vectors to "dead-reckon" the position of the received object and thereby reconstruct it at the correct position. The AVR receiver applies a perspective transformation to the received object point cloud, then applies each received velocity vector to each point in the point cloud to obtain an estimated position for the point cloud. AVR can then superimpose this point cloud onto its own 3D frame and feed the resulting composite frame into an ADAS or autonomous driving algorithm.

Latency hiding. In practice, as we discuss below, AVR's processing pipeline can incur some delay. Thus, an object in a frame captured at the sender at time T might arrive at the receiver at T + δ, after processing and transmission delays. AVR can use dead-reckoning to hide this latency: it can calculate the expected position of the object at T + δ using the last known velocity vector for the object. Thus, object motion estimation serves two purposes: it reduces bandwidth usage and hides processing latency.
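A minimal sketch of the receiver-side dead-reckoning step follows: the received object's point cloud (already perspective-transformed into the receiver's frame) is shifted by the object's last known velocity over the elapsed delay. The constant-velocity assumption over a delay of a few hundred milliseconds, and the function signature, are simplifications made here for illustration.

#include <Eigen/Dense>
#include <vector>

// Dead-reckons a received dynamic object's point cloud to the current time by
// applying the object's last known velocity vector over the elapsed delay
// (processing + transmission), i.e., predicting its position at T + delta.
// The same routine is applied again for each motion-vector-only update that
// follows a key (full) point cloud.
std::vector<Eigen::Vector3d> DeadReckon(
    const std::vector<Eigen::Vector3d>& object_points,  // receiver's frame
    const Eigen::Vector3d& velocity_mps,                 // object velocity
    double delay_seconds) {
  const Eigen::Vector3d offset = velocity_mps * delay_seconds;
  std::vector<Eigen::Vector3d> predicted;
  predicted.reserve(object_points.size());
  for (const auto& p : object_points) predicted.push_back(p + offset);
  return predicted;
}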
3.2.5 Adaptive Frame Transmission

In AVR, we consider situations where a leading car transmits point clouds to a follower. The sender has a choice of transmitting full frames, only dynamic objects, or motion vectors. Dynamics in the environment can cause fluctuations in channel capacity, or in the number of dynamic objects in the scene. To adapt to these variations, AVR uses an adaptive strategy to decide which of these to transmit. If the previous transmission of a full frame completes before the next frame has been generated, AVR transmits the next frame in its entirety. If not, AVR transmits only the dynamic objects in the next frame. When channel capacity is insufficient even to transmit dynamic objects, AVR reverts to transmitting only velocity vectors. (Velocity vectors are also transmitted alongside full frames and dynamic objects, since they are needed for latency hiding.) This technique, inspired by adaptive bitrate techniques for video, naturally adapts to capacity and scene changes, and we demonstrate this experimentally (§3.4).

3.2.6 Cooperative AVR

A final technique to reduce bandwidth requirements, and one that is easy to realize in our AVR design, is cooperative AVR. Consider two vehicles A and B traveling in the same direction in adjacent lanes, followed by another car C. Cars A and B have overlapping views, and can eliminate redundancy in transmitting objects to each other and to C: B can avoid transmitting objects in its view that are also in A's view (which B can determine when it receives the camera coordinates and pose from A). To do this, B first performs a reverse perspective transformation on each dynamic object it detects, to determine whether the object falls within the field of view (FOV) of A. Specifically, let the horizontal FOV angle be α and the vertical FOV angle be β, and let the x axis be horizontal, y vertical, and z the depth. Then the 3D coordinates (x, y, z) of the transformed object must satisfy both -z * tan(α/2) ≤ x ≤ z * tan(α/2) and -z * tan(β/2) ≤ y ≤ z * tan(β/2) to be within the FOV of A. In other words, B only transmits objects that are outside the rectangular pyramid defined by A's FOV. For objects that are in the FOV of A, B double-checks the coordinates of the dynamic objects sent by A to verify the redundancy before removing them from its own transmission queue. For the cooperation to work, B must determine that C can also receive the objects sent by A, which can be achieved by exchanging neighbor sets between neighboring nodes.
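The FOV containment test above reduces to two inequalities per object. The sketch below expresses it directly in C++; representing an object by its centroid (rather than testing every point) and the parameter names are simplifying assumptions of this illustration.

#include <cmath>

// Tests whether a dynamic object (represented here by its centroid, already
// transformed into vehicle A's camera frame) lies inside A's field-of-view
// pyramid, using the inequalities from §3.2.6.
bool InsideFov(double x, double y, double z,       // object centroid in A's frame
               double h_fov_rad, double v_fov_rad) {
  if (z <= 0.0) return false;                      // behind A's camera
  const double x_bound = z * std::tan(h_fov_rad / 2.0);
  const double y_bound = z * std::tan(v_fov_rad / 2.0);
  return (-x_bound <= x && x <= x_bound) &&
         (-y_bound <= y && y <= y_bound);
}

// B would skip enqueueing an object for transmission when InsideFov(...) is
// true and A has reported the same object, eliminating the redundancy.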
3.3 AVR Optimizations

Implementation. Our AVR implementation builds upon a publicly available implementation of ORB-SLAM2 [172], which is designed for stereo vision and is currently the highest ranked open-source visual odometry algorithm on the popular KITTI vehicle vision benchmark [93]. We use ORB-SLAM2 for relative localization, the PCL library [194] for lossless point cloud compression, and OpenCV for homography transformations and feature tracking, but have built the other modules from scratch, including perspective transformation, dynamic object isolation, motion vector estimation, real-time bandwidth-adaptive streaming, and 3D frame reconstruction. Our AVR prototype is 11,786 lines of C++ code. Our current implementation uses the ZED stereo vision cameras [238], which have a built-in GPU pipeline designed to synthesize 3D frames.

Optimizations. Many of our modules perform complex processing on large 3D frames, and can incur significant processing latency. Our initial, unoptimized prototype could only process 1 frame every 3 seconds on a relatively powerful desktop machine, so we implemented several optimizations to increase the frame rate and reduce the end-to-end delay.

Figure 3.7: Pipelining the different components carefully (Stage 1: Preprocess, Stage 2: Localization, Stage 3: Postprocess) enables components to be executed in parallel, resulting in higher frame rates.

The primary bottleneck in AVR is the underlying SLAM algorithm. From each 3D frame, SLAM computes the 3D position of each feature, and uses a non-linear pose optimization algorithm for each feature. The complexity of this step is a linear function of the number of features that ORB-SLAM2 computes. The software, by default, detects up to 2000 features, but we have been able to re-configure ORB-SLAM to detect up to 500 features without noticeable loss of accuracy, reducing processing latency by 4x.

Several AVR algorithms, including optical flow, homography transformation, and SLAM, use features, and feature extraction can increase processing latency. AVR computes ORB features once, and reuses them for localization, dynamic object isolation, and motion vector estimation (Figure 3.7).

Loading a 3D frame from GPU memory to CPU memory incurs latency. To interleave I/O and computation, and to leverage parallelism where possible, we employ pipelining. While pipelining does not affect the end-to-end latency experienced by a frame, it can significantly improve the throughput of AVR (in terms of frame rate). We carefully designed AVR's 3-stage pipeline (Figure 3.7) so that each stage has comparable latency. The pre-processing stage transfers the 3D frame from the GPU and performs ORB feature extraction and feature pose optimization. When the pre-processing stage processes frame k+1, the localization stage processes frame k, computing 3D positions and matching features. In parallel, the post-processing stage processes the (k-1)-th frame to first finish camera pose estimation, then extract dynamic objects, estimate their motion vectors, and compress the point cloud.
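The following sketch illustrates the structure of such a three-stage pipeline, with one thread per stage connected by thread-safe queues, in the spirit of Figure 3.7. The stage bodies are stubs and the Channel and Frame types are illustrative; this is a sketch of the design, not the prototype's code.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// While stage 2 localizes frame k, stage 1 preprocesses frame k+1 and
// stage 3 post-processes frame k-1.
struct Frame { int id = 0; /* point cloud, features, pose, ... */ };

template <typename T>
class Channel {  // unbounded thread-safe queue
 public:
  void Push(T v) {
    { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
    cv_.notify_one();
  }
  T Pop() {
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [&] { return !q_.empty(); });
    T v = std::move(q_.front());
    q_.pop();
    return v;
  }
 private:
  std::mutex m_;
  std::condition_variable cv_;
  std::queue<T> q_;
};

Frame Preprocess(Frame f)  { /* GPU->CPU copy, ORB feature extraction */ return f; }
Frame Localize(Frame f)    { /* 3D positions, map matching, camera pose */ return f; }
void  Postprocess(Frame f) { /* dynamic objects, motion vectors, compression */ }

void RunPipeline(Channel<Frame>& camera_in, int num_frames) {
  Channel<Frame> to_localize, to_post;
  std::thread t1([&] { for (int i = 0; i < num_frames; ++i) to_localize.Push(Preprocess(camera_in.Pop())); });
  std::thread t2([&] { for (int i = 0; i < num_frames; ++i) to_post.Push(Localize(to_localize.Pop())); });
  std::thread t3([&] { for (int i = 0; i < num_frames; ++i) Postprocess(to_post.Pop()); });
  t1.join(); t2.join(); t3.join();
}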
3.4 AVR Evaluation

Methodology. Our evaluation uses end-to-end experiments as well as traces of stereo camera data collected on our campus. Our end-to-end experiments use two Alienware laptops, each with an Intel 7th-generation quad-core i7 CPU clocked at 4.4 GHz, 16 GB of DDR4 RAM, and an NVIDIA 1080 GPU with 2560 CUDA cores. We place one laptop in a leader vehicle and the other in a follower vehicle. Each laptop is connected to a ZED [238] camera and to a TP-Link Talon AD7200 802.11ad wireless router [217] on its vehicle. The routers communicate using the wireless distribution system (WDS) mode in the 60 GHz band. While 802.11ad has several drawbacks with respect to vehicle-to-vehicle communication, including the fact that it requires line-of-sight, we use it as a proof of concept. We are not aware of any off-the-shelf high-bandwidth wireless radios designed specifically for vehicle-to-vehicle communication. One possibility is to incorporate AVR into future versions of DSRC technology: the US FCC has reserved 75 MHz for DSRC over seven 10 MHz channels; some of these channels have been reserved (e.g., for Basic Safety Messages that announce a vehicle's position and trajectory), but the usage of several service channels is still under discussion.

In our setup, ZED can create a real-time point cloud at a frame rate of up to 60 fps at a resolution of 720p (1280 x 720), and up to 100 fps at VGA (640 x 480). Our traces record at 30 fps, which is the rate AVR's pipeline is able to achieve. We evaluate AVR on traces of stereo camera data collected while driving these vehicles around campus. Our traces span 123,000 stereo frames, or about 0.5 TB of point cloud data. In addition, we have generated a sparse 3D map of our campus by driving a vehicle with a ZED camera around the campus; we use this sparse 3D map for relative localization in our trace-driven evaluations. In one of our evaluations, we use an HDL-32E LiDAR sensor with a range of 80-100 m to demonstrate improvements to AVR that might be possible with LiDAR. We have left a complete integration of LiDAR (which can, in general, perform better than stereo cameras under poor visibility, e.g., at night) into the AVR pipeline, and the evaluation of AVR's performance under those conditions, for future work (§3.5).

Metrics. In our end-to-end experiments, we investigate end-to-end delay, bandwidth requirements, and the performance of adaptive transmission. To measure end-to-end delay, we synchronize the clocks on the laptops before the experiment. For our trace-driven studies, our primary metric is the accuracy of the position of the reconstructed object in the receiver's view. This accuracy is a function of the bandwidth used to transmit the dynamic objects. Contributing to this accuracy is the error induced by relative localization, and by using motion vectors to predict position; we also quantify these. We also evaluate two other metrics, the throughput and latency of the AVR pipeline: lower throughput can impact accuracy, as can high latency.

3.4.1 The Benefits of AVR for ADAS and Autonomous Driving

Autonomous driving and ADAS systems use several building blocks, including localization, object detection, drivable space detection, path planning, and so on. Many of these could benefit from AVR. We have implemented two of these, road surface detection and path planning, to demonstrate the benefits of AVR for the overtaking scenario, in which a follower would like to overtake a leader car but its view is obstructed by the leader.

Our road surface detection algorithm uses the point cloud library (PCL), and applies several optimizations to extract the points on the road plane from our stereo camera data. Next, we convert the point cloud into an occupancy grid, where each cell (0.5 m x 0.5 m) is either drivable, occupied, or undefined. Finally, we use the A* search algorithm to find a viable path that permits a box corresponding to the vehicle's dimensions to pass through without hitting any occupied or undefined cell.
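To make the planning step concrete, the sketch below runs A* over an occupancy grid with the three cell states described above, treating occupied and undefined cells as blocking and approximating the vehicle's dimensions with a square footprint. The grid resolution, 4-way connectivity, and footprint check are illustrative assumptions rather than the exact parameters of our planner.

#include <cstdint>
#include <cstdlib>
#include <queue>
#include <vector>

enum class Cell : uint8_t { Drivable, Occupied, Undefined };

struct Grid {
  int w, h;
  std::vector<Cell> cells;  // row-major, size w*h
  Cell at(int x, int y) const { return cells[y * w + x]; }
};

// True if a vehicle footprint of (2*half_w+1) x (2*half_w+1) cells centered
// at (x, y) touches only drivable cells.
static bool FootprintClear(const Grid& g, int x, int y, int half_w) {
  for (int dy = -half_w; dy <= half_w; ++dy)
    for (int dx = -half_w; dx <= half_w; ++dx) {
      int nx = x + dx, ny = y + dy;
      if (nx < 0 || ny < 0 || nx >= g.w || ny >= g.h) return false;
      if (g.at(nx, ny) != Cell::Drivable) return false;
    }
  return true;
}

// Returns the planned path as grid indices (start..goal), or empty if none.
std::vector<int> PlanPath(const Grid& g, int sx, int sy, int gx, int gy, int half_w) {
  auto idx = [&](int x, int y) { return y * g.w + x; };
  auto heur = [&](int x, int y) { return std::abs(x - gx) + std::abs(y - gy); };
  const int N = g.w * g.h;
  std::vector<int> parent(N, -1), cost(N, -1);
  using Node = std::pair<int, int>;  // (f = cost + heuristic, cell index)
  std::priority_queue<Node, std::vector<Node>, std::greater<Node>> open;
  cost[idx(sx, sy)] = 0;
  open.push({heur(sx, sy), idx(sx, sy)});
  const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
  while (!open.empty()) {
    int cur = open.top().second;
    open.pop();
    int cx = cur % g.w, cy = cur / g.w;
    if (cx == gx && cy == gy) {  // reconstruct the path back to the start
      std::vector<int> path;
      for (int n = cur; n != -1; n = parent[n]) path.insert(path.begin(), n);
      return path;
    }
    for (int k = 0; k < 4; ++k) {
      int nx = cx + dx[k], ny = cy + dy[k];
      if (nx < 0 || ny < 0 || nx >= g.w || ny >= g.h) continue;
      if (!FootprintClear(g, nx, ny, half_w)) continue;  // occupied/undefined blocks
      int ncost = cost[cur] + 1;
      int ni = idx(nx, ny);
      if (cost[ni] == -1 || ncost < cost[ni]) {
        cost[ni] = ncost;
        parent[ni] = cur;
        open.push({ncost + heur(nx, ny), ni});
      }
    }
  }
  return {};
}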
We then collected a trace with two vehicles, a leader and a follower, driving along a road, with a third oncoming vehicle in the opposite lane; the follower would like to overtake the leader. We then fed this trace into our path planner. Figure 3.8 shows the result of both road detection and path planning with and without AVR, where the point cloud of the detected road is marked in blue and the planned path is marked in connected green crosses.

Figure 3.8: Path planning and road detection in action: with extended vision, the oncoming vehicle and more road surface area are detected, and the path planner decides not to attempt overtaking.

In the first case, without AVR, the follower could only see the leader's trunk and detect the road surface up to its sensing range limit, with occlusion. The planner could find a clear path to overtake the leader by switching to the left lane. With AVR, the follower can detect not only more of the road surface but also the oncoming vehicle, so the path planning algorithm does not attempt the overtaking maneuver, instead choosing to follow the leader. This example demonstrates that extended vision can help autonomous driving algorithms avoid potential hazards. Figure 3.9 shows that, with AVR, the follower is able to detect twice as much visible road surface as without AVR, thanks to the extended vision. Figure 3.10 shows the planned path angle over the entire trace. Without AVR, the planner switches to the left lane to overtake the leader until it can sense the oncoming vehicle, whereas with AVR, the planner decides not to switch lanes but to follow the leader.

Figure 3.9: AVR helps the follower detect twice the road area that it might have otherwise.

Figure 3.10: With AVR, the follower would avoid the overtake maneuver.

3.4.2 AVR End-to-End Performance

In this experiment, we quantify the performance of our adaptive transmission strategy by running our AVR prototype live on two moving vehicles (driven within our campus) connected by a 60 GHz 802.11ad link. Specifically, we demonstrate two adaptive transmission strategies: one in which full frames are transmitted, else motion vectors, and another in which dynamic objects are transmitted, else motion vectors. We are interested in several aspects of performance: the average bandwidth achieved over these radios, the end-to-end delay between when a frame was generated at the transmitter and when it was received, the transmission delay between the received frame and the receiver's own frame at reception, the fraction of frames for which motion vectors are transmitted, and the average run-length of motion vectors.

Figure 3.11: End-to-end results demonstrating AVR over a 60 GHz wireless link between two cars in motion.

Figure 3.11 shows the mean of these quantities averaged over 3 minutes for the two scenarios, full and dynamic. In the full frame case, AVR transmits on average 1.17 velocity vectors per frame to adapt to the bandwidth, achieving an effective throughput of 367.02 Mbps. For comparison, the maximum throughput we have been able to achieve between these two radios is 700-900 Mbps in the lab. The frequency of the velocity vectors indicates that the radios do not have enough capacity to sustain transmission of full point clouds every frame. Transmitting only the dynamic point clouds reduces the bandwidth requirement by an order of magnitude, improving the end-to-end delay and transmission delay by 41% and 34%, respectively. One interesting quantity is the velocity vector streak (the average number of consecutive velocity vectors), to which accuracy is very sensitive: the longer the velocity streak, the longer AVR performs dead reckoning with only velocity estimates. In the full frame scenario, AVR's bandwidth-adaptive scheme transmits an average of 2.8 consecutive velocity vectors, whereas in the dynamic case a velocity vector is almost always followed by a dynamic object, and velocity vectors are very rare (about 3 in 1000 frames).

How much delay is enough. Our current end-to-end delays are about 220 ms for transmitting full frames, and about 130 ms when dynamic objects are transmitted.
To understand whether these delays are sufficient, we consider simple models of two AVR use cases (Figure 3.12): the obstruction use case and the overtaking use case. In both cases, we assume that a human is making driving decisions.

Figure 3.12: Two AVR use cases with different delay requirements: a stationary obstruction, and the overtaking scenario.

For the obstruction example (Figure 3.12, top), consider car A following car B, while a stationary object L (e.g., a parked car) is in the same lane ahead of B. Assume for simplicity that all cars are traveling at the same speed v, and that car A follows the two-second rule [20] for safe following distances between cars. Car B sees L when the latter is within sensing range R of B. For this information to be useful to A, it must reach A ahead of the safe stopping distance for a car. Most drivers can decelerate at about 6 m/s^2 and have a reaction time of 1 s [22]. Then, the maximum permissible delay for AVR to be useful in this scenario is 0.84 + R/v.

In the overtaking case (Figure 3.12, bottom), consider car A following car B, while car C is an oncoming car in the opposite lane. Assume again for simplicity that all cars are traveling at the same speed v, and that car A follows the two-second rule [20] for safe following distances between cars. Car B sees car C when the latter is within sensor (stereo camera or LiDAR) range of B. For this information to be useful to A, it must reach A before A itself can see C. Then, the maximum permissible delay for AVR to be useful in this setting is R/(2v).

Figure 3.13: The maximum permissible end-to-end delay for the two use cases, for two different sensor types, as a function of speed.

Figure 3.13 plots the permissible delay as a function of vehicle speed for the two scenarios using these equations, assuming nominal ranges for stereo cameras (20 m) and LiDARs (100 m). From Figure 3.11, the end-to-end delay for AVR is about 200 ms. As the figure shows, AVR can use either sensor for the obstruction scenario: even at high speeds, a stereo camera is sufficient to be useful. Interestingly, however, the overtaking scenario is much more stringent: at higher speeds, AVR latencies start to approach the maximum permissible delay, suggesting that AVR will need LiDAR for this scenario.

Figure 3.14: Experiment setup for evaluating end-to-end reconstruction accuracy.

3.4.3 Accuracy Results

In this section, we evaluate the accuracy of the extended vision, namely the position of the reconstructed point cloud. We first evaluate the end-to-end reconstruction accuracy of both static objects and dynamic objects at different relative speeds. Next, we conduct a detailed analysis of the tradeoff between bandwidth and reconstruction accuracy, and of the sensitivity to processing and transmission delay. Finally, we investigate whether using a LiDAR device could improve speed estimation accuracy.

Reconstruction. The primary measure of accuracy is the positional error of AVR reconstruction. That is, if p is the known position of the object in the sender's view, and p' is the derived position at the receiver, the position error is given by |p - p'|.
More precisely, we estimate the position error for a given ORB feature corresponding to the object, and average these position errors across multiple ORB features. The positioning accuracy of ORB-SLAM2 has been benchmarked on the KITTI visual odometry dataset [136]. Setup. For this experiment (Figure 3.14) we mounted cameras A and B on two moving vehicles, one in front of the other, with a third vehicle C moving in the opposite direction. L is stationary object on the side of the road. We measure the relative positional error of L and C as viewed from A without extended vision, and their reconstructed view with extended vision at A, using visual information received from B. For this experiment, we collect stereo camera traces from the two cameras A and B while varying their speeds with respect to the third moving vehicle C. The leading vehicle B transmits the full point cloud it 35 observes along with the velocity estimate of C at every frame. At the following vehicle A, A VR reconstructs the received point cloud from B using perspective transformation. To evaluate the positional accuracy of L, A compares the estimated position of L in its own point cloud with the estimated position in the reconstructed point cloud it received from B. Similarly, A predicts the positional accuracy of C by compensating the delay with respect to its current frame for the reconstructed point cloud from B, and compares it with the position of C in its current point cloud. For this, A uses the velocity vectors received from B while receiving the last full object point cloud. We collected traces of different relative speeds between objects and cameras of 10, 20, 30 mph, while keeping A, B, C at the same speed. When evaluating the accuracy, we assume an average transmission delay of 60 ms (Figure 3.11), and compare the reconstructed dynamic point cloud versus the receiver’s own point cloud frame by frame. For static objects, since there are no motion estimates, we randomly sample frames from the two footages and compute the reconstruction error. Intuitively, the only sources of error during the reconstruction of static objects, are the camera pose estimation and camera calibration, while the major source of error for the dynamic object is motion compensation. To understand the impact of camera calibration, we also run two A VR instances using one footage from the same camera, not only for static objects, but also dynamics, which emulates perfect calibration. Results. Figure 3.15 shows the reconstruction error for both the static object L and dynamic object C for different speeds of the vehicles A and B. At low speeds, with perfect calibration, static objects can be localized to within 1.6 cm at the median. At 30 mph, the errors are still low, about 7 cm in the 90th percentile. These errors are entirely due to camera pose estimation, and accuracy decreases at higher speeds for static objects because of the noise in camera pose due to car motion. The more challenging case for reconstruction accuracy is when the object is also moving (Figure 3.15). In this case, with perfect calibration, 90th percentile reconstruction errors are within 20 cm for speeds up to 20 mph, going up to 40 cm for the 30 mph case. The difference in accuracy between static and dynamic objects is entirely attributable to errors in speed estimation: these errors add about 15 cm to the reconstruction error. Below, we explore if using LiDAR can enable more accurate speed estimation. 
Figure 3.15: End-to-end reconstruction error and its main components: camera pose estimation, motion compensation, and camera calibration.

To put these numbers in perspective, the lane width in the US highway system is about 3.7 m, while the average car length is 4.3 m [225]. Thus, the reconstruction error for a static object is about 1% of these quantities, and for a dynamic object 5-10%. Because many autonomous driving algorithms make decisions at the scale of cars or lane widths, we believe these reconstruction accuracies will be acceptable for future ADAS and autonomous driving systems. The errors described above are intrinsic to our system. There is also an extrinsic source of error, camera calibration: realistic calibration adds 5-30 cm of error to reconstruction, likely due to the relatively cheap stereo camera we use. Production vehicles are likely to have more finely calibrated cameras.

The Bandwidth / Accuracy Tradeoff. Reconstruction error can depend on bandwidth, so in this set of experiments we throttle the wireless link bandwidth to evaluate the trade-off between bandwidth and accuracy while using adaptive transmissions. When the streaming rate exceeds the link capacity, AVR sends a full point cloud, the key cloud, followed by a series of velocity vectors. The data size of the velocity vectors is negligible compared to the volume of the point cloud. Thus, throttling the bandwidth directly triggers AVR to send fewer full point clouds and more motion estimates, which impacts reconstruction accuracy. In this experiment, we throttle the bandwidth at different thresholds and evaluate the degradation in reconstruction accuracy, as well as the ratio of the number of velocity estimates sent to full point clouds.

Figure 3.16: At lower bandwidths, reconstruction error can increase by half a meter because of errors in speed estimation.

Figure 3.17: At higher delays, dead reckoning can help contain reconstruction errors significantly, motivating our use of velocity vectors.

Figure 3.18: LiDAR can improve speed estimates by an order of magnitude.

Results. Figure 3.16 shows that the average reconstruction error degradation increases significantly, up to nearly half a meter, once bandwidth is throttled below 20 Mbps. In this regime, AVR transmits, on average, 6-8 motion vectors per full frame. Speed estimation using motion vectors obtained from stereo cameras can be error prone, and can impact reconstruction accuracy. We show below that, when we use LiDAR, speed estimation is significantly more accurate.

The impact of delays. Between when a frame is captured at a sender and when an object is reconstructed at the receiver, there are two sources of delay: the AVR pipeline's processing latency and the vehicle-to-vehicle transmission latency. AVR can hide these latencies by dead-reckoning the current position of the dynamic object using motion vectors. In this section, we quantify the degradation in reconstruction accuracy when dead reckoning over various delays.

Results. Reconstruction error increases by 30-40 cm when the delay increases from 100 ms to 500 ms (Figure 3.17).
The figure also quantifies the accuracy degradation without latency hiding: if AVR simply used the last key cloud as an estimate of the current position of the vehicle, its error can be nearly 2 m at high latencies. Such a high error would significantly reduce the usefulness of AVR for safety-based driving assistance, and motivates the use of dead-reckoning using speed estimates.

Better Speed Estimation. The main source of reconstruction error is motion estimation across consecutive frames. To understand whether speed estimation would be significantly better with LiDAR, we obtained a Velodyne 32-beam LiDAR. (We have been unable to incorporate LiDAR into AVR because we have not found a suitable off-the-shelf SLAM implementation; this is left to future work.) We mounted both a stereo camera and the LiDAR on the side of the road, and drove a car at 4 different fixed speeds (20, 25, 30, and 35 mph; cruise control does not work reliably at lower speeds, hence this choice of speeds). For the stereo camera, we use AVR to estimate the motion vector, then derive speed estimates from the motion vector. For LiDAR, we measure the front and rear positions of the car in each frame and calculate the speed estimates. All speed estimates are averaged over a sliding time window. As Figure 3.18 shows, LiDAR is several times more accurate than a stereo camera at estimating speed. While stereo cameras can estimate speeds to within 10-15% error, LiDAR speed estimates are in the 1-2% range across all the speeds, almost an order of magnitude lower.

Cooperative AVR. We collected traces from two cars, driven side by side in adjacent lanes, on city main streets with normal traffic. Over 12,000 frames, on average 61.63% (standard deviation 24.9%) of the objects appeared in both cars' views; removing this redundancy reduces the total bandwidth of transmitting the combined view of the two vehicles by almost another 30%.

3.4.4 Throughput and Latency

The throughput and latency of the processing pipeline also significantly impact the performance of AVR. These are a function of (a) our algorithms, (b) our performance optimizations, and (c) the hardware platforms on which we have evaluated AVR. The compute capabilities of on-board platforms have evolved significantly in the past couple of years: the current NVIDIA Drive PX 2 platform has 2 Denver 2.0 CPUs, four ARM Cortex A57 CPUs, and a 256-core Pascal GPU [10]. We do not have access to this device, so our evaluations are conducted on a desktop machine with an Intel Core i7-4770 CPU @ 3.40 GHz, 12 GB DDR3 RAM @ 1.6 GHz, and a 144-core GeForce GTX 635M GPU with 1 GB of memory.

Figure 3.19: Pipelining enables AVR to process 30 fps with an end-to-end delay of under 100 ms. Measured per-stage latencies: preprocess 21.3 ms (GPU-to-CPU download 8.8 ms, feature processing 10.9 ms), localization 33.3 ms (ORB feature detection and matching 21.3 ms, computing 3D position and tracking 10.3 ms), and postprocess 29.8 ms (camera pose 5.4 ms, motion analysis and compression 23.3 ms).

Results. Figure 3.19 shows the average processing latency of each module in the pipeline. With all the optimizations (§3.7), AVR finishes the whole pipeline in 33 ms, enabling a throughput of 30 frames per second while finishing all the computations described above. The pipeline is well balanced: the localization stage (thread 2) takes 33 ms, the preprocessing stage requires 22 ms, and post-processing is at 30 ms. Within each stage, the boxes in Figure 3.19 indicate the primary processing bottlenecks: GPU-to-CPU data transfer, feature detection and matching, and point cloud compression. In the future, we plan to focus on these bottlenecks.

The end-to-end processing latency for a given frame (e.g., the blue frame in Figure 3.19) through the three stages of the pipeline is below 100 ms. This is about twice the network transmission latency for dynamic objects (Figure 3.11). Processing latency can be further reduced through careful engineering of the pipeline and choice of platform. Moreover, higher-end vehicles are likely to have lower processing latency since they can afford more on-board compute power than lower-end vehicles. These numbers explain why latency hiding is an important component of AVR. Finally, AVR consumes on average 1.4 GB of RAM and 537 MB of GPU memory, well within the capabilities of today's platforms. Since on-board power sources are plentiful, we have not optimized for, and do not quantify, energy usage.

3.5 Limitation and Future Work

AVR currently uses ZED [238], a short-range stereo camera, which limits the system in two ways. First, we have experimentally observed that a shorter range depth sensor may be able to detect fewer 3D features
Cooperative AVR (§3.2.6) is a step towards scaling A VR out to clusters of vehicles, but future work can explore point cloud stream compression [129], as well as intermediate representations such as 3D features. Finally, future work can explore lightweight representations of the 3D map. Currently, A VR stores all the stationary features and metadata for localization in the sparse 3D map, which is storage and communication- intensive. In future work, we plan to investigate lightweight representations and their impact on localization accuracy. 3.6 Related Work Connected Vehicles. Work on connected vehicles [192] explored technical challenges in inter-vehicle communications and laid out a research agenda for future efforts in this space [74]. Other research has studied content-centric [148] or cloud-mediated [147] inter-vehicle communication methods for exchanging sensor information between vehicles. Prior research has explored using connected vehicle capabilities for Collaborative Adaptive Cruise Control (CACC) [40]. A VR is an extension of our earlier work [180] and introduces methods to isolate and track dynamic objects, designs cooperative A VR, analyzes the redundancy 41 to save bandwidth, and conducts an extensive evaluation aimed at exploring feasibility. Prior work on streaming video from leader to a follower for enhanced visibility [96, 171] does not use 3D sensors, so lacks the depth perception that is crucial for automated safety applications in autonomous driving or ADAS systems. Finally, automakers have started to deploy V2V/V2X communication in their upcoming high-end models e.g., Mercedes E-class [27] and Cadillac [32]. A large connected vehicle testbed in the US aims to explore the feasibility and applications of inter-vehicle communication at scale [6, 21], while pilot projects in Europe (simTD [117] and C2X [210]) are evaluating these technologies. A VR depends upon these technologies, and is a compelling use of this technology. Vehicle Sensor Processing. Prior work on sensor information processing has ranged from detection and tracking of vehicles using 360 degree cameras mounted on vehicles [89], to recognizing roadside landmarks [195], to using OBD sensors [126, 176] and phone sensors [56, 176] to detect dangerous driving behaviour, to systems that monitor blind spots using a stereo panoramic camera [161] and pre-crash vehicle detection systems [88]. Other work has proposed infrastructures to collect and exfiltrate vehicle sensor data to the cloud for analytics [118]. A VR is qualitatively different, focusing on sharing raw 3D sensor information between vehicles. Localization. Localization of vehicles is crucial for autonomous driving [215, 30, 221] and this line of research has explored two avenues. GPS enhancements focus on improving the accuracy of absolute GPS- based positioning using differential GPS [99], inertial sensors on a smartphone [50, 103], onboard vehicle sensors, digital maps and crowd-sourcing [126], and WiFi and cellular signals [214]. These enhancements increase the accuracy to a meter. A VR builds upon a line of work in the robotics community on simultaneous localization and mapping [76]. Visual SLAM techniques have used monocular cameras [68], stereo camera [78, 235], and LiDAR [110]. Kinect [168] can also produce high-quality 3D scan of an indoor environment using infrared. 
AVR adds a relative localization capability to ORB-SLAM2 [166], using the observation that its sparse 3D feature map can be used to localize one vehicle's camera with respect to another.

Chapter 4

Scalable Cooperative Perception

In this chapter, we explore the problem of scalable cooperative perception. To motivate the problem, consider a busy intersection (Figure 4.1) with complex traffic dynamics: people and bicyclists crossing the street, traffic waiting to turn right or left, together with traffic flowing in the direction of the green light. In such a scenario, there may be tens to hundreds of traffic participants. If all participants were radio-equipped, safety can be increased by cooperative awareness [79], in which each participant announces its location and potentially its trajectory. However, in more complex settings with pedestrians and bicyclists, cooperative perception, in which each vehicle shares what it sees, is necessary.

But there is a significant mismatch between the rates at which 3D sensors generate data and what V2V standards can sustain (§4.1). A 64-beam LiDAR can generate 10 point clouds per second, totaling 2.2 million points per second [23]. However, V2V standards like DSRC and LTE-V can likely support only a few Mbps to a few tens of Mbps. To address this mismatch, it is necessary to share only essential information from the 3D sensor at each vehicle. At the same time, we take the position that it is important to transmit raw sensor readings instead of more processed detections (e.g., bounding boxes), since the former have more contextual information (e.g., the shape of an object) that may be useful to the receiver.

To address this tension, we observe that, with LiDAR sensors, point clouds of roadway objects can be compact enough to fit within the wireless capacity constraint, but still need careful transmission scheduling. Further, we observe that traffic demand can be reduced by analyzing object visibility (if a vehicle can see an object, another vehicle does not need to send it that object) and relevance (a vehicle turning right has no interest in vehicles going straight in the opposite lane).

Figure 4.1: AUTOCAST enables cooperative perception among multiple vehicles into occlusions and beyond sensing range (example shares at an RSU-equipped intersection: C-H: Vehicle A; F-C: pedestrians; F-I: Vehicle C and pedestrians; H-C: Vehicle I and pedestrians).

Contributions. We embody these ideas in a system called AUTOCAST (video demo: https://autocastnet.wordpress.com/). AUTOCAST makes three contributions.

First, AUTOCAST develops a suite of fast spatial reasoning algorithms that analyze point clouds to determine visibility and relevance. At a high level, these rely on metadata exchanges in which vehicles share the positions and trajectories of observed objects, and each vehicle uses fast geometric algorithms to analyze these object relationships.

Second, AUTOCAST develops an efficient scheduling algorithm leveraging an MDP formulation. The MDP can reason about the optimal ordering of objects across different radio frames, which is useful because AUTOCAST schedules sensor data every 100 ms (the rate at which LiDAR frames are generated), an interval that can in general encompass more than one radio frame. Because the optimal MDP solution requires dynamic programming, which can be too slow in our setting, we also develop a greedy heuristic that we show yields a near-optimal solution for the experiments we consider.
1 AUTOCAST Vidoe Demo: https://autocastnet.wordpress.com/ 44 Third, to demonstrate the benefits of AUTOCAST’s cooperative perception end-to-end, we needed a trajectory planner for vehicles (so we could demonstrate crash-avoidance, for example). However, existing planners are all designed based on sensor data collected from a single vehicle. We developed a planner capable of reasoning cooperative perception; this planner extendsA? search to find collision free trajectories. We implement AUTOCAST (the protocol operations and scheduler) in Carla [73], a state-of-the-art autonomous vehicle simulator. Our evaluation results (§4.4) shows that AUTOCAST can achieve a 200% gain in data sharing utility compared to a system which attempts to share the full visual information of each car in a round robin fashion. AUTOCAST enables clusters of vehicle to achieve nearly 300% sensing area extension, and scales gracefully with the vehicle density. 4.1 AUTOCAST Motivation and Overview Goal. Prior work [178, 95] has proposed cooperative perception to improve vehicular safety. In this approach, vehicles cooperatively exchange sensor readings from (stereo) cameras to extend their visual horizon. The benefits of extended perception for driver assist and autonomous driving systems are clear: a vehicle can make decisions much earlier than it otherwise might have been able to. Other prior work has considered safety enhancements by having vehicles actively broadcast their location continuously over short range DSRC radios [133]. These approaches cannot capture passive participants (pedestrians and bicyclists); extended perception can. Prior work [178] has demonstrated cooperative depth perception (where vehicles exchange information from a 3D sensor) only for two vehicles. AUTOCAST’s goal is to develop a practical extended depth perception technique that scales to more complex vehicular settings with tens or hundreds of participants. An example of such a setting is a busy intersection with several tens of pedestrians, bicyclists, together with tens or hundreds of vehicles attempting to cross the intersection. Key challenge. Unlike [178], AUTOCAST focuses on vehicles that use LiDAR for depth perception. LiDAR enables the vehicle to precisely pinpoint the position of other participants. A Velodyne LIDAR with 64 beams can generate up to 2.3 M points per frame. At 10 fps, this translates to about 2.2 Gbps. 45 At least for the foreseeable future, vehicle communication technologies cannot support these data rates. Theoretically, DSRC/802.11p [26] can achieve 3-27 Mbps, depending on modulation and error correction coding (ECC) rate, and current commercial products tend to achieve up to 6 Mbps. 4G and Wi-Fi, which many modern vehicles have, increasingly support direct modes and wider channel bandwidths. Both WiFi-direct (802.11n/ac) and LTE/LTE-direct [45, 14, 18, 19] can achieve nominal bandwidths of up to 300 Mbps. That said, WiFi-direct via 802.11n/ac cannot adapt to the highly variable wireless channel between fast moving vehicles, so the only 802.11 version used today for vehicular communications is 802.11p. LTE-direct can support 10-20 simultaneous transmissions at 20 Mbps between vehicles within a 1 km 2 area per 10 MHz channel, if communicating vehicles are within 100m [15] and existing chips that implement LTE-direct for vehicular applications, tend to achieve around 10 Mbps. 
Finally, 5G extensions for vehicles (5GAA [1]) can alleviate some of these problems, but are unlikely to remove the network bottleneck at high participant densities. Approach. To overcome this network bottleneck, AUTOCAST must aggressively reduce the amount of information transmitted over the wireless channel. To do this, it relies on three observations. (1) While a full LiDAR frame captures a significant amount of information including buildings, trees etc., only objects (either moving, or static like a parked car) on the roadway are important for making dynamic driving decisions. (2) If car A wants to transmit roadway objects to car B, some of those objects may already be visible to B; A need only transmit roadway objects occluded at B. (3) Finally, B may not need all occluded roadway objects, since some of them may not be relevant to its driving decision. For example, if it is currently driving straight, it may not need occluded objects on the opposite lanes on a divided highway. Transmitting object point clouds. A LiDAR detects reflections from surfaces (“returns”) of its beams. So, in the LiDAR output, each surface in the environment consists of points at which returns occur. Thus, each pedestrian or vehicle is represented by a set of points (a point cloud). Because LiDAR beams are radial, points are denser for nearby objects than for objects further away. Moreover, point clouds for faraway objects are smaller than those for nearby objects. 46 When possible, AUTOCAST should transmit point clouds of the object. Other representations, like a bounding box, while being potentially more compact, are less desirable. A bounding box can require significant computation: 3-D object detection neural networks require 60− 100 ms on a cutting edge GPU. Moreover, bounding boxes can be loose, covering more than just the object; this inaccuracy can potentially affect the correctness of control decisions. Finally, a bounding box has less information than a point cloud; e.g., from the point cloud it may be possible to infer the shape of the vehicle. Object sizes and decision timescales. From photorealistic simulations of different driving scenarios (§4.4), we have found that LiDAR point clouds for vehicles have up to 200 points, requiring 38.4 kbits. Whether this fits within the network channel depends on how often each vehicle would transmits point clouds to other vehicles. Ideally, to enable a vehicle to track participants (especially other fast-moving vehicles) precisely, each vehicle must receive (and make trajectory planning decisions on) points clouds at sub-second timescales. The finest decision interval, denoted byT d is 100 ms, which is the interval at which the Velodyne LiDAR generates a frame [23]. In 100 ms, a vehicle traveling at 40 mph moves about 1.8 m. Thus, by the time a vehicle receives object information, that object may have moved by that distance. We discuss later (§4.3.2) how AUTOCAST accounts for this delay in its trajectory planning. EveryT d , then, each vehicle transmits point clouds to other vehicles. At a nominal channel capacity of 10 Mbps, AUTOCAST can transmit about 25 objects. The number of participants at a busy intersection can potentially be larger than that, so AUTOCAST must carefully schedule point cloud transmissions on the channel. AUTOCAST Architecture. This discussion motivates the following research questions: (a) What should be the objective of the transmission scheduler, and does there exist a fast, near optimal scheduler for this? 
(b) What information does the scheduler base its decisions on? (c) How can vehicles extract this information from LiDAR data? (d) How can vehicles use the transmitted point clouds to plan collision free trajectories? AUTOCAST addresses these questions in the context of an end-to-end architecture for extended depth perception (Figure 4.2). The data-plane (§4.3) is responsible for processing, transmitting, and using point 47 Vehicle 1 Vehicle N Control Channel Metadata Exchange Sensor Exchange Spatial Reasoning Trajectory Planning Autonomous Driving or ADAS 3D Frames ... Road-side Unit Metadata Exchange AutoCast Scheduler Data Channel Figure 4.2: AUTOCAST System Architecture clouds to make trajectory planning decisions. The control-plane (§4.2) makes transmission scheduling decisions. The data plane. This component, which runs exclusively on vehicles, has two subcomponents. Spatial reasoning (§4.3.1) extracts moving objects from LiDAR sensors. For each object, it determines which vehicles cannot see this object; those are the vehicles to whom this object should potentially be sent. Each vehicle runs a trajectory planning component (§4.3.2), which, after it receives objects from other vehicles, adapts its current trajectory based on the received information. The control plane. Two subcomponents constitute the control plane. The metadata exchange component (§4.2.1) runs both on the vehicle and the RSU, and implements a protocol to exchange metadata (needed for scheduling) between them. It obtains this metadata from the data plane. The scheduler (§4.2.2) uses this information to compute a transmission schedule, which it relays back to the vehicles using the metadata exchanger. The data plane executes the transmission schedule. 48 This decoupling of data and control ensures that bandwidth intensive point cloud data is directly transmitted between vehicles, while at the same time a centralized entity is able to make near optimal scheduling decisions. 4.2 Control Plane We now describe AUTOCAST’s metadata exchange and scheduling components (Figure 4.2). Both the control plane and the data plane assume that each vehicle is able to accurately position itself using a 3-D map [37] and Simultaneous Mapping and Localization (SLAM [240, 239]) algorithms, so that each vehicle knows its own position precisely at all times. This kind of positioning technology is mature enough that all autonomous vehicles on the road today have it. 4.2.1 Metadata Exchange Deployment setting. Vehicles exchange metadata between themselves and also with a road-side unit (RSU). Because the specifics of the metadata exchange can depend upon the relationship between radio range and road geometry, we describe metadata exchange for a concrete deployment setting, an intersection. Intersections are also where cooperative perception can help most [101], because they are among the most hazardous parts of the road network. Control messages. Assume that there is a radio-equipped RSU at an intersection (Figure 4.3). AUTOCAST participants (the RSU and vehicles) periodically broadcast control messages everyT d , the timescale at which the scheduler makes decisions (§4.1). These control messages are highly likely to reach each vehicle that is close to or at the intersection, as well as the RSU. This is because lane widths are on the order of 3-4 m [34, 209], so intersections of major streets with 3 lanes in each direction can be on the order of 30 m×30 m. 
On the other hand, the nominal radio range of LTE-directR is 500 m [187], so even vehicles far away from the intersection can hear control messages sent by the RSU at the intersection. Information exchanged in control messages. In AUTOCAST, participants exchange three types of in- formation in these messages. Standardization efforts have defined vehicle-to-vehicle messaging formats 49 Figure 4.3: AUTOCAST Scheduler Domain. that exchange similar cooperative awareness messages [79]; we have left it to future work to design a standard-compliant AUTOCAST message exchange. Trajectory. Each vehicle transmits its current trajectory to other participants; the vehicle’s planner (§4.3.2) generates and updates the trajectory every T d . A vehicle’s trajectory consists of a series of waypoints and their associated timestamps. Each waypoint indicates the position a vehicle expects to be at the corresponding timestamp. In practice, to limit control overhead, AUTOCAST compresses trajectories. The first waypoint in the trajectory is the current vehicle pose. Denote byt i the trajectory of thei-th vehicle. The Object Map. Using its 3D sensor, each vehicle can extract point clouds of roadway objects; these are stationary or moving objects (vehicles, pedestrians, bicyclists) on the road surface. Denote byo i,k the k-th object in vehiclei’s view. Now, vehiclei receives broadcasted trajectories from other vehicles. Using spatial reasoning techniques described in §4.3.1, vehiclei computes the following two quantities for eacho i,k in its view: (1)v (i,k),j is a boolean value that indicates whethero i,k is visible to vehiclej. (2)r (i,k),j is a value that indicates whether o i,k is relevant toj’s current trajectoryt j . We make the notion of relevance precise in §4.3.1. Vehiclei then broadcasts an object map that contains: (a) an ID for each objecto i,k , (b) the size ofo i,k in bytes, (c)v (i,k),j , and (d)r (i,k),j . In §4.2.2, we explain how the scheduler uses these values to compute a transmission schedule. 50 The transmission schedule. The RSU at the intersection receives trajectories and object maps from vehicles within radio range. It identifies all those vehicles within a circle of radiusR/2 2 , whereR is the radio range; these are the vehicles that can plausibly hear each other and they constitute the scheduler’s domain (Figure 4.3). For these vehicles, using the received object maps, the RSU can compute a transmission schedule that determines which vehiclei should transmit which objectk to which vehiclej. The RSU then broadcasts this transmission schedule to all vehicles. The precise form of the transmission schedule and the mechanism for transmitting this depends on the underlying radio technology (e.g., DSRC, LTE- V); we discuss this, together with how vehicles within a sharing region multiplex their transmissions to cooperatively execute the centralized schedule, in §4.2.2 and §4.5. Other details. We have described AUTOCAST in an intersection setting. In §4.5 we discuss how to deploy a distributed AUTOCAST for any setting, e.g., in a highway setting. Control messages can be lost; In case of a loss, AUTOCAST will use the most recent control message available and extrapolate, see §4.4 for more details. 4.2.2 AUTOCAST Scheduler In this section, we discuss AUTOCAST’s scheduler. Depending on the network bandwidth and the number and size of objectso i,k relevant to other vehicles, it may not be feasible to transmit all objects before the next decision intervalT d . 
The scheduler decides which objects to transmit in each decision interval, and in what order. For example, an object that is likely to cause an imminent collision must receive higher priority in the transmission schedule. The underlying PHY layer may be able to transmit multiple PHY frames during one decision interval; the scheduler must decide which objects to transmit in which frames. The scheduler communicates the transmission schedule to all vehicles in its domain (Figure 4.3); each vehicle then broadcasts the specified object in its assigned PHY frame.

2 More precisely, AUTOCAST associates a small guard band around the circle to allow for vehicle movement. This guard band (Figure 4.3) can be calculated from the posted speed limits: at 40 mph, a vehicle can move about 1.8 m in one $T_d$ (100 ms) interval.

4.2.2.1 Preliminaries - Notation and PHY layer

Let $\mathcal{C} = \{1,\ldots,C\}$ be the set of vehicles and $\mathcal{K} = \{1,\ldots,K\}$ be the set of objects across all vehicles. Let $n$, $n = 1,\ldots,N$, index the $n$-th frame of the current decision interval, which has a total of $N$ frames. Let $x^n_{i,k} \in \{0,1\}$ be a decision variable indicating whether vehicle $i$ transmits its object $k$ in frame $n$, let $S^n_f$ be the size of frame $n$, and let $T^n$ be the duration of frame $n$; thus $\sum_n T^n \leq T_d$. Finally, let $B$ be the operational network bandwidth, so $T^n = S^n_f / B$.

$T^n$ and $B$ depend on the PHY technology. There are two technologies available today, DSRC and LTE-V (a variant of LTE-direct). $B$ varies between 5 and 10 Mbps and $T^n$ varies between 10 and 100 ms in current standards proposals and commercial products. DSRC is based on TDMA, while LTE-V offers both an OFDM option and a simpler TDMA option. With OFDM, a frame multiplexes transmissions from multiple vehicles, similar to uplink frames in cellular networks. With TDMA, a frame consists of transmissions from multiple vehicles concatenated in time. We discuss these technologies further in §4.5.

We assume the PHY layer uses QPSK, as is common practice in vehicle communication systems due to the challenging channel [28, 29]. Thus, the system can deliver $L = B \times \log(1 + \gamma_{QPSK})$ bits per unit time, where $\gamma_{QPSK}$ is the SINR value required by QPSK. We model the PHY layer by the probability of successfully delivering an object between two vehicles. We define a $C \times C$ channel matrix comprising these probabilities, $P = [p_{i,j}],\ i,j = 1,\ldots,C$, where $p_{i,j} \in [0,1]$ is the probability of successful delivery from vehicle $i$ to vehicle $j$, $\forall i,j \in \mathcal{C}$, and we further assume that the $p_{i,j}$ are independent.

4.2.2.2 Problem Formulation - Markov Decision Process

Because a decision interval may contain multiple frames, we formulate the scheduling problem as a Markov Decision Process (MDP) so that the scheduler optimizes its decisions across all frames.

State. Let $h^n_{(i,k),j}$ indicate whether vehicle $j$ has received object $k$ from vehicle $i$ by frame $n$, and let $q^n_j = \{h^n_{(i,k),j}, \forall i,k\}$. We define the state of the system at frame $n$ by $S^n = \{q^n_1, q^n_2, \ldots, q^n_C\}$, $\forall n \in \{1,\ldots,N\}$, where $S^n \in \mathcal{S}$, with $\mathcal{S}$ denoting the state space. Note that since the MDP state changes at each frame, $n$ represents the discrete time steps over which the MDP operates.

Action. Let $s_{i,k}$ denote the size of object $o_{i,k}$. Then let

$$A^n = \Big\{ x^n_{i,k} \in \{0,1\},\ \forall i \in \mathcal{C}, \forall k \in \mathcal{K} \ \Big|\ \sum_{\forall i \in \mathcal{C}} \sum_{\forall k \in \mathcal{K}} s_{i,k} \times x^n_{i,k} \leq S^n_f \Big\}$$

denote the action taken at frame/time step $n$, where $A^n \in \mathcal{A}$, with $\mathcal{A}$ denoting the action space.
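To make this notation concrete, the following minimal Python sketch shows one plausible way to represent the delivery state $h^n_{(i,k),j}$ and to check that a candidate action respects the frame-size constraint. It is illustrative only; the variable names (delivered, sizes, frame_size, receivers) are our own and not part of AUTOCAST's implementation, and the actual delivery outcome is random, governed by the link probabilities $p_{i,j}$.

# Delivery state: delivered[((i, k), j)] is True iff vehicle j has received
# object k from vehicle i by the current frame (h^n_{(i,k),j} in the text).
delivered = {}  # dict mapping ((i, k), j) -> bool; missing entries mean False

def is_feasible(action, sizes, frame_size):
    """An action is a set of (i, k) pairs scheduled in this frame.
    It is feasible if the scheduled objects fit into the frame:
    sum of s_{i,k} over the selected pairs <= S^n_f."""
    return sum(sizes[(i, k)] for (i, k) in action) <= frame_size

def apply_deliveries(action, receivers, delivered):
    """Mark objects as delivered after a frame has been transmitted.
    receivers[(i, k)] lists the vehicles j that successfully decoded (i, k);
    which vehicles appear here is a random outcome of the channel."""
    for (i, k) in action:
        for j in receivers.get((i, k), []):
            delivered[((i, k), j)] = True
    return delivered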
Reward Function. To maximize the total reward, the system must carefully choose the action $A^n$ based on the current state $S^n$. Once the action is decided, the action reward is

$$R^n(S^n, S^{n+1}, A^n) = \sum_{\forall j \in \mathcal{C}} \sum_{\forall i \in \mathcal{C}} \sum_{\forall k \in \mathcal{K}} x^n_{i,k} \times \big(h^{n+1}_{(i,k),j} - h^n_{(i,k),j}\big) \times y^n_{(i,k),j}, \qquad (4.1)$$

where $y^n_{(i,k),j}$ is the reward when object $k$ is transmitted from vehicle $i$ to vehicle $j$, which we define by $y^n_{(i,k),j} = (1 - v_{(i,k),j}) \times r_{(i,k),j}$. The rationale for this definition is that there is a reward for vehicle $j$ receiving object $k$ from vehicle $i$ only if that object is both invisible and relevant to vehicle $j$ (see §4.2.1).

Transition Probability. We compute the transition probability from one state to another under action $A^n$ as

$$P^{A^n}_{S^n, S^{n+1}} = \prod_{\forall i:\, x^n_{i,k}=1} \Big( \prod_{\forall j \in V_1} p_{i,j} \times \prod_{\forall j \in V_0} (1 - p_{i,j}) \Big), \qquad (4.2)$$

where $V_1 = \{j \mid h^{n+1}_{(i,k),j} = 1,\ h^n_{(i,k),j} = 0,\ x^n_{i,k} = 1\}$ corresponds to all vehicles $j$ scheduled to receive a relevant object during frame/time step $n+1$, and $V_0 = \{j \in \mathcal{C} \mid h^{n+1}_{(i,k),j} = 0,\ h^n_{(i,k),j} = 0,\ x^n_{i,k} = 1\}$ corresponds to vehicles that are not scheduled to receive a relevant object during that frame.

Markov Decision Process. We define a finite-horizon MDP by the tuple $\mathcal{M}(\mathcal{S}, \mathcal{A}, P^{A^n}_{S^n,S^{n+1}}, R^n)$. To solve the MDP, we need value functions that measure the goodness of a particular state under a policy, and optimal value functions that measure the best achievable goodness of states under an optimal policy. To this end, we first define a policy (decision rule) $\pi(S^n)$ at frame/time step $n \in \{1,\ldots,N\}$ to be a mapping from states to actions. Then, the value function of every state, denoted $U^{\pi}(S^n)$, can be expressed as

$$U^{\pi(S^n)}(S^n) = \sum_{\forall S^{n+1} \in \mathcal{S}} P^{\pi(S^n)}_{S^n,S^{n+1}} \cdot \Big( R^n(S^n, S^{n+1}, \pi(S^n)) + U^{\pi(S^{n+1})}(S^{n+1}) \Big), \qquad (4.3)$$

where $\pi(S^n) \in \mathcal{A}$ is the action chosen by policy $\pi$. The rationale behind this equation is that an optimal policy can be constructed by going backwards in time: we first construct an optimal policy for the tail subproblem at the last stage, time step $n = N$, then for the tail subproblem comprising the last two stages, time step $n = N-1$, and continue in this manner until an optimal policy for the entire problem is formed. The goal of the MDP is to obtain an optimal policy $\pi^*(S^n) \in \mathcal{A}$ that maximizes the value function in Eq. (4.3) using the Bellman optimality equation [49]. In theory, one may use dynamic programming (DP) to do this. However, the time complexity of DP is exponential in the number of states [174]. This motivates us to seek scalable approaches that find a good policy.

4.2.2.3 Scheduling Algorithms.

We start by defining the weight of an object at frame/time step $n$, denoted $H^n_{i,k}$, to be the total reward gained by the system if the object is successfully delivered to all interested vehicles during frame $n$:

$$H^n_{(i,k)} = \sum_{\forall j \in \mathcal{C},\, i \neq j} y^n_{(i,k),j} \times \big(1 - h^{n-1}_{(i,k),j}\big). \qquad (4.4)$$

Note that the term $(1 - h^{n-1}_{(i,k),j})$ indicates that the object has not been received during previous frames.

Figure 4.4: An illustration of $H^n_{i,k}$ (a bipartite graph with destination vehicles on the left and per-vehicle objects on the right; each edge carries weight $y^n_{(i,k),j} \times (1 - h^{n-1}_{(i,k),j})$).

Consider the bipartite graph shown in Figure 4.4, which has destination vehicles on the left and objects on vehicles on the right. The weight of each edge between a vehicle and an object corresponds to the reward if that vehicle receives that object. Then, $H^n_{i,k}$ can be computed by adding the weights of the edges incident on the $(i,k)$ node.

Greedy Max-Weight Scheduler. Motivated by this representation, we use greedy solutions to the maximum weight matching problem on a bipartite graph [206] to quickly find a good solution. Specifically, at every frame/time step, the scheduler selects transmission pairs in decreasing order of $H^n_{i,k}$, yielding the highest possible total weight/reward among the available transmissions for each frame, until there is nothing left to deliver or the decision interval is over (i.e., $n = N$). However, this weight does not take the size of an object, $s_{i,k}$, into account. The scheduler may therefore schedule an object that has a large weight but occupies a large portion of the frame, instead of scheduling many smaller objects whose weights sum to more than the weight of the single large object. One way to address this is to divide the weight of an object by its size and use the modified weight $H^n_{i,k} / s_{i,k}$ instead, corresponding to the reward $y^n_{(i,k),j} / s_{i,k}$ normalized by object size. A minimal sketch of this greedy rule in code appears below.
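The following Python sketch renders the size-normalized greedy rule, computing the per-object weight of Eq. (4.4) and filling one frame in decreasing order of $H^n_{i,k}/s_{i,k}$. It is a simplified illustration under our own naming assumptions (objects, vehicles, reward, delivered, sizes, frame_size), not the AUTOCAST implementation; the full algorithm is summarized as Algorithm 1 afterwards.

def object_weight(i, k, vehicles, reward, delivered):
    # H^n_{i,k}: total reward if object (i, k) reaches every interested
    # vehicle j that has not already received it (Eq. 4.4).
    return sum(reward[(i, k, j)] * (1 - delivered.get(((i, k), j), 0))
               for j in vehicles if j != i)

def greedy_max_weight(objects, vehicles, reward, delivered, sizes, frame_size):
    """Select objects for one frame in decreasing order of H^n_{i,k} / s_{i,k}
    until the frame is full. `objects` is a list of (i, k) pairs."""
    remaining = frame_size
    schedule = []
    ranked = sorted(objects,
                    key=lambda ik: object_weight(ik[0], ik[1], vehicles,
                                                 reward, delivered) / sizes[ik],
                    reverse=True)
    for (i, k) in ranked:
        if sizes[(i, k)] <= remaining:
            schedule.append((i, k))        # corresponds to x^n_{i,k} = 1
            remaining -= sizes[(i, k)]
    return schedule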
We summarize the proposed greedy Max-Weight algorithm in pseudo code (Algorithm 1). The time complexity of the scheduler is easily shown to be $O(NCK\log(CK))$. CPU experiments show that the Greedy Max-Weight scheduler runs fast and achieves near-optimal performance for the scenarios of interest that we have studied (§4.4).

Algorithm 1 Greedy Max-Weight Scheduler
Input: $y^n_{(i,k),j}$, $h^n_{(i,k),j}$, $s_{i,k}$, and $S^n_f$
Output: $\pi(S^n) = \{x^n_{i,k}\}$
1: for $n \in \{1,\ldots,N\}$ do
2:   Calculate $H^n_{i,k}$ from Eq. (4.4).
3:   while $\sum_{\forall i \in \mathcal{C}} \sum_{\forall k \in \mathcal{K}} s_{i,k} \times x^n_{i,k} \leq S^n_f$ do
4:     Select a TX pair $(i,k)$ with the largest value of $H^n_{i,k}/s_{i,k}$.
5:     Set $x^n_{i,k} = 1$.
6:   end while
7:   Update $h^n_{(i,k),j}$ based on the vehicle environment.
8: end for

FPTAS-based scheduler. We also propose to use a well-known fully-polynomial time approximation scheme (FPTAS) [75] to solve the selection problem at every frame/time step. Before introducing the FPTAS, we briefly describe the dynamic programming framework, which solves the following equation recursively:

$$DP(\mathcal{C}\times\mathcal{K},\, S^n_f) = \max\big\{ DP(\mathcal{C}\times\mathcal{K}\setminus(i,k),\, S^n_f),\ \ H^n_{i,k} + DP(\mathcal{C}\times\mathcal{K}\setminus(i,k),\, S^n_f - s_{i,k}) \big\}. \qquad (4.5)$$

To solve the problem we use the same approach as in [75]. The main idea is to formulate the scheduling problem at each time step as a standard binary Knapsack problem, and to use the FPTAS for the binary Knapsack problem. We omit the details of the algorithm for space. While more efficient than dynamic programming, the FPTAS still has high computational complexity (§4.4).

4.3 Data Plane

Autonomous vehicles use 3D sensors for perception and a planning algorithm to determine the vehicle's trajectory. AUTOCAST proposes to extend today's autonomous driving with cooperative perception. Its data plane achieves cooperative perception using spatial reasoning algorithms that generate object maps (§4.2.1), and a planner that relies on cooperative perception to improve driving safety.

Figure 4.5: Empty occupancy grids indicate occluded area.

4.3.1 Spatial Reasoning

This component runs on each vehicle, processes each frame of its LiDAR output, and generates object maps. Specifically, for each object in its view, spatial reasoning determines the visibility of that object with respect to another vehicle, $v_{(i,k),j}$, and the relevance of that object to that vehicle, $r_{(i,k),j}$ (§4.2.1). To do this, it must (a) detect roadway objects within its view, and (b) assess their geometric and temporal relationships.

Extracting roadway objects. Several deep learning networks exist that can detect objects in LiDAR frames [204].
However, these can be slow, require significant compute resources, are sometimes inaccurate, and generate information (e.g., identify object classes and bounding boxes) that AUTOCAST does not need. For AUTOCAST, we simply need point clouds of stationary or moving objects on the road. To extract these roadway objects, we impose a fine 2-D occupancy grid from the birds-eye-view perspective of the LiDAR point cloud (Figure 4.5). More precisely, each 2-D grid element is a rectangular tube extending vertically on the z-axis. Each point in the LiDAR frame falls into exactly one grid element. Each grid element may contain: (a) no points, (b) only points on the ground, or (c) points above ground. AUTOCAST can determine if a point lies on the ground or above the ground because it knows the coordinates of the point, and the height of the LiDAR above the ground. Furthermore, AUTOCAST assumes that each 57 2-D grid is labeled as either on the roadway surface or not. This information can be obtained by running a segmentation algorithm for drivable space detection on the 3D map [160, 205]. Thus, all points in a 2-D grid element of type (c) (above the ground) constitute points belonging to a roadway object. The object itself consists of all contiguous type (c) grid elements (as discussed later in §4.4, we use sub-meter grid dimensions so it is unlikely that points belonging to two different vehicles would fall into the same grid). Visibility determination. Having determined all objects in its view, to determinev (i,k),j , vehiclei simply traces a ray from vehiclej’s current position around every objecto i,k 0 in its own view (o i,k 0 includes the vehiclei itself, if visible). Ifo i,k falls into the shadow ofo i,k 0, then the latter object occludes the former andv (i,k),j is false. If no sucho i,k 0 exists, and ifo i,k is within the LiDAR range ofj, thenv (i,k),j is true. Relevance determination. The intuition behind relevance determination is that some objects may not be of relevance to other vehicles, even if those vehicles cannot see this object. For instance, if a vehicle is turning right at an intersection, it is unlikely to need information about vehicles driving straight on the opposite lane. AUTOCAST’s current implementation uses a simple notion of relevance: an object is relevant to another vehicle if the trajectories of those two objects could potentially collide at some point in the future. More precisely, r (i,k),j is a value that assesses whether vehicle j’s trajectory can collide with o i,k . Vehiclei getsj’s trajectory from control messages. It obtainso i,k ’s trajectory by estimating this objects’ heading and velocity continuously over successive frames. By extrapolating these trajectories, AUTOCAST can determine if the two trajectories collide at some point in time. Given this, one can definer (i,k),j in two ways: (a) as a boolean value that is true whenj can collide witho i,k , or (b) as the reciprocal of the time to collision (a value between 0 and 1, assuming time is in milliseconds). The intuition for the latter choice is clear: objects thatj is likely to encounter sooner are more relevant. 4.3.2 Trajectory Planning Autonomous vehicles use sensor inputs to make driving decisions. 
These driving decisions occur at three different timescales: route planning occurs at the granularity of a trip, path and trajectory planning occurs at the granularity of a road segment (a few tens to hundreds of meters), and low-level control ensures that 58 the vehicle follows the planned trajectory by effecting steering and speed control. In AUTOCAST, vehicles must make these decisions, by incorporating received point-clouds into its own LiDAR output. In AUTOCAST, in order to quantify the end benefits of cooperative perception, we develop a path and trajectory planning algorithm that incorporates objects received from other vehicles. The large, existing literature on this topic (see, for example, [102, 150, 122]) does not take cooperative perception into account. Recent research has incorporated partial visibility into trajectory planning [77, 188]; we adapt the planning algorithm from one of these [77] for AUTOCAST. Perspective transformation. Before it can plan a trajectory, AUTOCAST must re-position the received point clouds into its own LiDAR output. It uses the 3-D map for this. The sending vehicle positions the point cloud in its own coordinate frame of reference. To re-position it to the receiver, letT s be the transformation matrix from the sender’s coordinate frame of reference to that of the 3-D map andT r be the transformation matrix for the receiver. To transform a pointV s in the sender’s view to a pointV r in the receiver’s, AUTOCAST uses:V r =T −1 r ∗T s ∗V s . Having done this, it updates each occupancy grid (§4.3.1) with the received point cloud, then uses the occupancy grid to determine a path and then a trajectory. Path Planning. This step determines a viable and safe path through drivable space that avoids all objects. It uses the occupancy grid defined above (§4.3.1) after augmenting it with received objects. To understand path viability, recall that each grid element can either have one or more points belonging to an object, or be unoccupied. Moreover, using the 3-D map, we can annotate whether a grid element belongs to drivable space or not, and also whether a vehicle can traverse the grid element in both directions or uni-directionally. The input to path planning is a source grid element and a target grid element. The output of path planning is a path in the 2-d grid, where the width of the path is the width of the car, and every grid element that intersects with the path must (a) be unoccupied and (b) be drivable in the direction from the source to the target. AUTOCAST usesA? heuristic search [106] to determine a valid path. We constrain the search so that the resulting path is smooth: i.e., it does not have sharp turns that could not be safely executed at the current speed. 59 Trajectory Planning and Collision Avoidance. On the resulting path, AUTOCAST picks equally spaced waypoints; a trajectory is a collection of waypoints and associated times at which the vehicle reaches those waypoints. To determine those times, the trajectory planner must determine a collision-free trajectory; when the vehicle is at a particular waypoint, all other vehicles must be far enough from the that waypoint. To determine this, AUTOCAST uses the estimated trajectory of received objects, as well as estimates of the trajectory of vehicles within its own sensor’s view. AUTOCAST also calculates the time of arrival to and departure from this waypoint based on estimated speed and vehicle dimension. 
When a predicted collision is far enough, AUTOCAST follows the planned trajectory until within stopping distance (based on current speed an brake deceleration) of that waypoint of collision. 4.4 Performance Evaluation In this section, we first evaluate AUTOCAST end-to-end: we show that cooperative perception can improve driving safety on three autonomous driving benchmarks. We then demonstrate that our AUTOCAST implementation can schedule transmissions on vehicular radios. We conclude with an evaluation of AUTOCAST performance at scale. 4.4.1 Methodology The Carla simulator. Evaluating autonomous driving in general, and AUTOCAST in particular, poses significant challenges. To address this, industry and academia have recently developed photo-realistic autonomous driving simulation platforms like Carla [73]. Carla uses a game engine to simulate the behavior of realistic environments, and contains built in models of freeways, suburban road, and downtown streets. Users can create vehicles that traverse these environments and that attach advanced sensors such as LiDAR, Camera, Depth Sensor to them. As these vehicles move through the environment, Carla simulates environment capture using these advanced sensors. Users can design planning and control algorithms using the captured environment to validate autonomous driving. 60 Bird-eye View Single AutoCast Figure 4.6: Overtaking Bird-eye View Single AutoCast Figure 4.7: Unprotected Left Turn Bird-eye View Single AutoCast Figure 4.8: Red Light Violation Implementation. We have implemented the scheduler, spatial reasoning and trajectory planning in Carla. The total AUTOCAST implementation is 27,124 lines of code. In addition, to configure the scenarios, we have developed on top the Carla autonomous driving challenge [4] evaluation code. In our implementation, all vehicles use Carla’s default longitudinal and lateral PID controller as the lower-level control to steer the vehicle along the planned trajectory. To simulate metadata exchange between vehicles, we have extended Carla to incorporate V2V . Specifi- cally, our implementation models LTE-Direct QPSK with 10 MHz bandwidth [87], which translates to a peak rate of∼ 7.2 Mbps. We implement LTE-Direct TDMA Mode 4 (see §4.5) and simulate V2V channel loss in all scenarios using models described in 3GPP standards [28, 29]. End-to-end evaluation scenarios. To demonstrate the benefits of AUTOCAST end-to-end, we have imple- mented three scenarios from the US National Highway Transportation Safety Administration (NHTSA) Precrash typology [167]. In these, occlusions can impact driving decisions significantly. Overtaking A stopped truck (Figure 4.6) on a single-lane road forces a car to move to the lane with on-coming traffic. The truck occludes the car’s view of the opposite lane. Unprotected left turn A car (Figure 4.7) and a truck wait to turn left in opposite directions at an intersec- tion. The truck blocks the car’s view oncoming traffic. Red-light violation A truck waits to turn left (Figure 4.8) at an intersection, and a car drives straight towards the intersection. Another car jumps the red-light in the perpendicular direction; the truck 61 can see the violator and stops to avoid collision, but the car crossing the intersection cannot see the violator. Experiments with real radios. To demonstrate that an implementation of AUTOCAST can plausibly work over real radios, we run AUTOCAST on a small-scale testbed using three iSmartWays DSRC radios [12]. 
In these experiments, we record trace data from all scenarios; for each frame (every 100 ms), the trace includes all exchanged metadata, the computed schedule, and the object point clouds. We then play back the trace over DSRC radios to validate whether the scheduled transmissions complete in time.

Comparison. In end-to-end evaluations, we compare AUTOCAST against an approach in which each car makes trajectory planning decisions based on its own sensors alone. To quantify the efficacy of our greedy scheduler, we compare it with: 1) FPTAS, 2) Optimal, using dynamic programming, and 3) Agnostic, where cars within range deliver objects in a round-robin fashion.

Metrics. We use several metrics to evaluate AUTOCAST. In end-to-end experiments, we quantify scenario outcomes (e.g., a crash or a near miss), the reaction time (between when a vehicle detects a potential collision and the time needed for it to avoid the collision), and the closest distance between the two vehicles at any point during the scenario. To evaluate the scheduler's efficacy, we quantify rewards and the reward ratio (the percentage of the scheduled object reward that is actually received), the time to make a scheduling decision, and the staleness of objects with different rewards.

4.4.2 End-to-end Scenario Evaluation

Goal. The NHTSA pre-crash typology defines a set of challenging scenarios. In this section, we seek to understand whether cooperative perception can result in safer driving outcomes than a system without this capability. We evaluate these scenarios in Carla: for each scenario, we explore different points of the scenario parameter space (described below) and record the metrics described above.

Terminology and Experiment Setting. In each of our scenarios (Figure 4.6, Figure 4.7, Figure 4.8), there are three entities: the black sedan is the ego vehicle on which AUTOCAST runs, the red sedan is a collider which uses only its own sensors to plan its trajectory, and the orange truck is an occluder.

Figure 4.9: Crash, Deadlocks and Near Miss (stacked outcome fractions for Single and AUTOCAST in the Overtaking, Left Turn, and Red Light scenarios).

In each scenario, the paths of the ego and the collider intersect. We set up their speeds such that their trajectories almost collide (i.e., both vehicles would come very close to each other if neither saw the other at all). Specifically, we generate several configurations as follows. We set the base speed of the collider to 3 different values (20 km/h, 30 km/h, and 40 km/h). At a given base speed, the collider's trajectory would (in the absence of avoidance) intersect with that of the ego. A second dimension of the configuration is an intersection delta, ranging from -2 s to +2 s (in steps of 0.25 s); a value of δ means the collider arrives at the intersection point δ s before (or after) the ego vehicle. This parameter controls how closely the two cars approach each other. This gives us a total of 24 different configurations for each scenario.

Outcomes. With this setting, there are four possible outcomes: safe passage, near-miss, crash, and deadlock. The first three are self-explanatory. In the fourth, which occurs only in Overtaking, both vehicles stop without colliding but neither can make forward progress.
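The outcome of a run can be determined mechanically from the simulation log. The Python sketch below shows one plausible classification rule, assuming per-run records of collision events, the minimum inter-vehicle distance, and whether both vehicles eventually complete their routes; the 2 m near-miss threshold follows the definition used in the results below, and the field names are our own, not Carla's or AUTOCAST's.

NEAR_MISS_DISTANCE_M = 2.0  # near-miss threshold used in the results

def classify_outcome(collided, min_distance_m, both_made_progress):
    """Map one simulation run to an outcome label.
    collided: True if the ego and collider touched at any point.
    min_distance_m: closest distance between the two vehicles during the run.
    both_made_progress: False if both vehicles stopped without completing
    their routes (the deadlock case, seen only in Overtaking)."""
    if collided:
        return "crash"
    if not both_made_progress:
        return "deadlock"
    if min_distance_m < NEAR_MISS_DISTANCE_M:
        return "near-miss"
    return "safe passage"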
Beyond these outcomes, we are also interested in quantifying the closest distance between the vehicles at any point in the simulation, and the reaction time (the time from when the ego becomes aware of the collider to the last possible moment at which it can start braking). Figure 4.9 shows the outcomes across all scenarios, aggregated across all configurations.

Figure 4.10: Reaction Time, Near-miss and Crash Details (for each scenario — Overtaking, Left Turn, Red Light — the distributions of reaction time (s), closest distance (m), and outcome fractions for Single and AUTOCAST at collider speeds of 20, 30, and 40 km/h).

Results: Overtaking. Figure 4.9 shows, for AUTOCAST and for a system without cooperative perception (called Single), a stacked bar that counts the number of outcomes of each type. Without cooperative perception, safe passage occurs in only 20% of the configurations, and crashes occur in about 5% of the configurations. Crashes occur because the ego vehicle moves into the oncoming lane, and neither vehicle has enough time to stop. About 20% of the configurations result in near misses, which we define as the vehicles coming within 2 m of each other. The remaining configurations result in deadlock: the ego vehicle moves into the oncoming lane, both vehicles notice each other and are able to stop, but neither can make forward progress because the truck occupies the other lane. While this situation is not inherently unsafe, it does represent an undesirable driving outcome in which participants must coordinate to resolve the deadlock. By contrast, AUTOCAST ensures safe passage in all configurations, but incurs near misses in about 5% of them.

To understand these results better, the first column of Figure 4.10 quantifies the distribution of reaction times and closest distances across all configurations at each collider speed. With AUTOCAST, cooperative perception enables much longer reaction times (13.3 seconds on average at 20 km/h) than without it (0.31 s on average at 20 km/h, and zero at higher speeds). In this scenario, when the reaction time is zero, either a crash or a deadlock must happen. This is consistent with the third panel, which shows the distribution of outcomes by speed; safe outcomes occur only at 20 km/h. Finally, we also investigated why AUTOCAST incurs some boundary-case near misses (with a minimum closest distance of 1.74 m) in this scenario. These happen at low collider speeds: the ego is aware of the oncoming vehicle for 13 seconds and plans to stop as close as is safe, so that it can start the lane change right after the collider passes. In this scenario, the closest distance between the vehicles is always small, because the two vehicles have to pass each other.

Results: Unprotected left-turn. This scenario is more benign than overtaking, because the ego is obstructed to a lesser extent. Without AUTOCAST, Figure 4.9 shows that 16.7% of the configurations resulted in a crash and 16.7% in a near miss. For this scenario, AUTOCAST ensures safe passage in all configurations. Because this scenario is benign, reaction times are generally higher both with and without AUTOCAST.
Without AUTOCAST, crashes occur when the collider arrives in the shadow of the truck just as the ego vehicle starts to make a left turn. When the collider's speed is high, it cannot brake fast enough to avoid a collision or near miss. With AUTOCAST, reaction times and closest distances are generally quite high.

Results: Red-light violation. The red-light violator scenario is another very dangerous scenario (see the middle column of Figure 4.8 for the occluded and visible areas). Without AUTOCAST, the ego vehicle is not aware of the red-light violator until it is very close to the last truck in the left lane. Figure 4.9 shows that 100% of the configurations result in an undesirable situation (75% crash, 25% near miss). The occlusion angle is so wide that the ego vehicle has no time to react whatsoever (0 s reaction time in the third column of Figure 4.10); the moment the ego vehicle sees the collider is the moment it must brake. The result breakdown in the bottom-right plot of Figure 4.10 also shows that a near miss happens only in the low-speed case (20 km/h). When human drivers encounter such situations, they have no option but to slow down and edge forward carefully; it is this kind of interpretation of the environment, based on high-level reasoning and extensive experience (data), that is very challenging to embed in autonomous driving algorithms. However, even with a simple trajectory planner like the one proposed, AUTOCAST can substantially eliminate this risk. AUTOCAST provides 7.2 to 13.3 seconds of awareness to track the collider and estimate its time of arrival, so the ego can decide when to brake, or whether braking is necessary at all. AUTOCAST eliminated all crash cases and limited the closest near-miss distance to the 2-meter boundary. This is a design choice governing how aggressively the controller follows the trajectory; given the shared vision, a more conservative controller could brake even earlier and avoid coming within the 2-meter vicinity altogether.

Figure 4.11: Scheduled vs. DSRC Transmission: Overtaking. Figure 4.12: Scheduled vs. DSRC Transmission: Unprotected Left Turn. Figure 4.13: Scheduled vs. DSRC Transmission: Red Light Violation.

4.4.3 Experiments with V2V radios

Methodology. In this section, we replay the transmission schedules from one configuration of each of our scenarios over a testbed with three DSRC radios, one each for the ego, the collider, and the occluder. We programmed the DSRC radios to use LTE-Direct TDMA Mode 4 [87] (see the discussion in §4.5), a listen-before-transmission mode, to follow the schedule. To coordinate the radios, we designed simple handshake messages and a timeout mechanism to maintain synchronization among all radios. A synchronization mechanism that incurs minimum overhead is itself an open research topic beyond the scope of AUTOCAST. For each scenario, we record the point cloud data to transmit and the computed transmission schedule. In the simulator, the schedule is based on a theoretical model of the channel with a fixed data rate. The goal of this evaluation is to validate whether the DSRC radios can finish the scheduled transmissions in time, and to evaluate the significance of packet loss and its impact on the transmissions.

Results. Figure 4.11, Figure 4.12, and Figure 4.13 show the scheduled transmissions in the upper subplot and the actual DSRC transmissions in the lower subplot. Each shared object is represented by a line with a color unique among the objects in the same decision interval. Each colored line starts at the beginning of the transmission and ends when the transmission completes. The x-axis represents time in ms; the y-axis represents the time within each decision interval (100 ms). If an object is lost due to channel variability and dropped packets, the solid line becomes a dashed line. In Figure 4.11, the overtaking scenario, yellow is the collider's point cloud as observed by the truck, blue is the truck's point cloud as observed by the collider, red is the ego's point cloud as observed by the truck, and purple is the truck's point cloud as observed by the ego. Notice that the red lines are longer, since the truck is always close to the ego vehicle, while the collider's point cloud is much smaller (shorter yellow lines). Blue and purple transmissions are also scheduled: the ego and collider broadcasting the truck's (occluder's) point cloud. In theory, these need not be transmitted because the truck is not relevant to any participant; however, AUTOCAST does transmit objects with lower relevance when capacity permits. As the collider approaches, its point cloud becomes much larger, and because it is relevant to the ego vehicle, AUTOCAST prioritizes its transmission.

AUTOCAST was configured to generate transmission schedules based on a 7.2 Mbps transmission rate. Surprisingly, the DSRC radios were able to finish the transmissions in almost half that time, indicating that the radios achieved about 15 Mbps (not counting synchronization overhead). The loss rate was small: we observed an object loss rate of 1.67%. Most of the losses were bursty (e.g., around time 2000 ms); the low loss rate and bursty losses suggest that trajectory planners can use short-term dead reckoning to mitigate the effect of object loss. Finally, the height of each line indicates the total duration of transmissions in each slot; when this is less than 100 ms, the total volume of objects shared is significantly less than the radio capacity. In the left-turn scenario (Figure 4.12), all objects are small, so there is far less demand for the channel and DSRC easily completes all transmissions. However, in the red-light violator scenario (Figure 4.13), objects are much larger because the collider becomes visible to the occluder only when it gets close to the intersection. In this case, the AUTOCAST-generated transmission schedule is "full", while the DSRC schedule occupies only about half of many intervals because of the higher radio speed. DSRC exhibits similar bursty losses in these cases (at roughly the same object loss rate). Also, in both of these cases, DSRC delays some transmissions to the next time slot (see the long blue line around 3700 ms in Figure 4.12 and the two red lines around 1300 ms in Figure 4.13). We conjecture these stragglers are caused by backoffs, but have yet to confirm this.
Figure 4.14: Total Rewards vs. Number of Vehicles (total rewards and reward ratio (%) for Optimal, FPTAS, Greedy, and Agnostic with 5, 10, 20, and 40 vehicles). Figure 4.15: Computation Time vs. Number of Vehicles (computation time in seconds, log scale, for the same four schedulers). Figure 4.16: Latency vs. Normalized Rewards (latency (ms) vs. normalized rewards for Greedy and Agnostic with C = 5, 10, 20, and 40 vehicles).

4.4.4 Scheduling Algorithms

We evaluate the optimality of the different scheduling algorithms (Optimal, FPTAS, Greedy, Agnostic) in terms of total rewards and reward ratio, algorithm complexity and scalability with respect to the number of vehicles, and transmission delay for objects with different rewards. We conduct this set of experiments by setting the vehicles to autopilot mode, entering and exiting an intersection from all directions.

Optimality and Scalability. We first evaluate optimality. The upper plot of Figure 4.14 shows the total scheduled rewards as the number of vehicles varies from 5 to 40. Greedy is within 2% of Optimal, while Agnostic is up to 40% worse than Greedy. Figure 4.15 shows the computation time of the schedule for different numbers of vehicles. Although the total rewards are very similar, Greedy is two orders of magnitude faster than FPTAS, and the running time of Optimal (dynamic programming) quickly becomes prohibitive. Vehicular environments can be highly dynamic, which requires the schedule to be computed frequently; only the proposed greedy algorithm can finish within the 100 ms decision interval with up to 40 vehicles. Finally, we study how much of the scheduled reward is actually received (the reward ratio). The lower plot of Figure 4.14 shows that when bandwidth is sufficient (with 5 or 10 vehicles), all four algorithms deliver 100% of the scheduled objects. Interestingly, the reward ratio starts to fall below 100% when 20 vehicles are present. When bandwidth is scarce, the optimality of AUTOCAST is evident: AUTOCAST achieves up to 25% more reward while transmitting on average nearly 20% more objects.

Figure 4.17: Distributed AUTOCAST Sharing Region (alternating channels assigned to neighboring grids, with a guarding margin in the forward direction).

Scheduled Latency and Staleness. The scheduled latency is measured as the duration from the start of each decision interval (every 100 ms) to the time a particular object is received. Figure 4.16 shows the scheduled latency of each object in the 5-, 10-, 20-, and 40-car scenarios. It provides more detail behind the reward ratio difference by revealing transmission priority: AUTOCAST's optimization always puts the objects with the highest normalized rewards at the top of the schedule, which results in lower scheduled latency, whereas the latency of objects scheduled by Agnostic is essentially random. With 5 vehicles, the average transmission delay of any object is less than 50 ms. With 20 or more vehicles, AUTOCAST consistently schedules and sends high-reward objects, which require less transmission time, sooner than others.

4.5 Discussion

Distributed AUTOCAST. To design a distributed protocol that operates at any location without RSU support, AUTOCAST needs to carefully and dynamically allocate different channels to sharing regions to avoid co-channel interference. Specifically, AUTOCAST marks the map with grids of size R×R; neighboring grids are assigned alternating channels (Figure 4.17). Since autonomous vehicles are equipped with advanced 3D sensors (e.g., LiDAR, stereo cameras) and can therefore localize precisely, with an error of around 10 cm, a vehicle can choose which channel to transmit on based on its location (Figure 4.17). When a cluster leader is selected, the sharing region is centered at the leader's location at the time it is selected.
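One simple channel assignment consistent with this alternating-grid idea is a parity rule over grid coordinates: a vehicle quantizes its map position to an R×R cell and picks the channel from the cell's parity, so that cells adjacent along x or y always use different channels. The Python sketch below is illustrative only; the function and channel names are our own, and an actual deployment would also have to account for the forward guarding margin shown in Figure 4.17.

import math

def sharing_region_channel(x_m, y_m, region_size_m, channels=("CH1", "CH2")):
    """Pick a channel for the sharing region containing map position (x, y).
    Cells that are adjacent along x or y get different channels, so two
    neighboring sharing regions never transmit on the same channel."""
    gx = math.floor(x_m / region_size_m)
    gy = math.floor(y_m / region_size_m)
    return channels[(gx + gy) % 2]

# Example: with 500 m regions, positions one cell apart use different channels.
print(sharing_region_channel(120.0, 80.0, 500.0))   # CH1
print(sharing_region_channel(620.0, 80.0, 500.0))   # CH2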
PHY Technologies When a schedule is broadcasted, the cooperative execution of that schedule among vehicles has differences depending on which V2V technology is used. DSRC uses TDMA among vehicles. LTE-V may use TDMA (mode 4) or SC-FDMA, an OFDM variant, (mode 3) where the RSU or cluster head can assign frequency-time slots to vehicles based on the schedule [29]. In an intersection setting any LTE mode may be used, whereas under distributed setting it is easier to use TDMA-based LTE-V to avoid the need to coordinate different frequency channels across nearby regions. 4.6 Related Work Connected Vehicles and Infrastructure: Connected vehicles promise great opportunity to improve the safety and reliability of self-driving cars. Vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2X) communications both play an important role to share surrounding information among vehicles. Communica- tion technologies, e.g.,, DSRC [133] and LTE-Direct [87, 187], provide capabilities to exchange information among cars by different types of transmission, i.e.,, multicasting, broadcasting, and unicasting. Automakers are deploying V2V/V2X communications in their upcoming models [27, 31]). The academic community has started to build city-scale advanced wireless research platforms (COSMOS [5]), and large connected vehicle testbed in the U.S. (MCity [34]) and Europe (DRIVE C2X [209]), which gives an opportunity to explore the application feasibility of connected vehicles via V2V communications in practice. Sensor/Visual Information: Collecting visual information from sensors (e.g., LiDAR, stereo cameras, etc.) is a major part of autonomous driving systems. These systems rely on such information to make a proper on-road decisions for detection [58], tracking [157], and motion forecasting [158]. In addition, there is a large body of work that explores vehicle context and behavior sensing [89, 195, 181] to enhance vehicular situational awareness. Thanks to advanced computer vision and mobile sensing technologies, all the sensing capability can already be leveraged efficiently in a single car setting [156]. This work, and 70 several related works discussed below, take the next step of designing how to share this information among nearby vehicles. Vehicle Sensor Sharing: Past research has attempted to realize some form of context sharing among vehicles. Rybicki et al. [196] discuss challenges of V ANET-based approaches, and propose to leverage infrastructure to build distributed, cooperative Traffic Information Systems. Other work has explored robust inter-vehicle communication [192, 146, 67], an automotive ontology for intelligent vehicles [82], principles of community sensing that offer mechanisms for sharing data from privately held sensors [141], and sensing architectures [153] for semantic services. Motivated by the “see-through" system [95], several prior works on streaming video for enhanced visibility have been proposed [51, 190, 83, 128, 151]. The above line of work has focused on scenarios with two cars, where a leader car delivers its whole video to a follower vehicle, which in many scenarios not sufficient. In this work, we focus on enabling clusters of vehicles to share sensor information at scale. Moreover, we design a near-optimal policy to compute when, which views, and to which cars to deliver views over time. 
71 Chapter 5 Automated Groundtruth Collection and Quality Control for Machine Vision The accuracy of deep neural network based machine vision systems depends on the groundtruth data used to train them. Practically deployed systems rely heavily on being trained and tested on groundtruth data from images/videos obtained from actual deployments. Often, practitioners start with a model trained on public data sets and then fine-tune the model by re-training the last few layers using groundtruth data on images/videos from the actual deployment [72, 193] in order to improve accuracy in the field. Obtaining groundtruth data, however, can present a significant barrier, as annotating images/videos often requires significant human labor and expertise. Today, practitioners use three different approaches to groundtruth collection. Large companies employ their own trained workforce to annotate groundtruth. Third parties such as [208] and [85] recruit, train and make available a trained workforce for annotation. Finally, crowdtasking platforms such as Amazon Mechanical Turk (AMT [33]) provide rapid access to a pool of untrained workers that can be leveraged to generate groundtruth annotations. While the first two approaches have the advantage of generating high quality groundtruth by using a trained workforce, they incur significant cost in recruitment and training, and are therefore often limited to well-funded companies. Consequently, employing a crowdtasking platform like AMT is often a preferred alternative for a large number of ML practitioners. Using AMT for obtaining groundtruth, however, presents several challenges that deter its widespread use. First, requesters may not always have the expertise needed 72 Satyam KITTI (c) Tracking (a) Detection Satyam KITTI Image (b) Segmentation Satyam PASCAL Figure 5.1: Examples of Satyam Results on Detection, Segmentation, and Tracking. to create user-friendly web user-interfaces to present to workers for annotation tasks. Second, worker quality varies widely in AMT, and results can be corrupted by spammers and bots, so requesters must curate results manually to obtain good groundtruth. Third, machine vision often requires groundtruth for hundreds or thousands of images and videos, and generate AMT Human Intelligence Tasks (HITs) manually is intractable, as is determining which workers need to be paid, or which workers to recruit. Goal, Approach and Contributions. In this chapter, we ask: Is it possible to design a groundtruth collection system that is accessible to non-experts while also being both cost-effective and accurate? To this end, we discuss the design, implementation, and evaluation of Satyam 1 (§5.1) which allows non-expert users to launch large groundtruth collection tasks in under a minute. Satyam users first place images/video clips at a cloud storage location. They then specify groundtruth collection declaratively using a web-portal. After a few hours to a few days (depending on the size and nature of the job), Satyam generates the groundtruth in a consumable format and notifies the user. Behind the scenes, in order to avoid the challenges of recruiting and managing trained annotators, Satyam leverages AMT workers, but automates generation of customized web-UIs, quality control and HIT management. 1 The Satyam portal [198] is functional, but has not yet been released for public use. 73 High-level Specification. A key challenge in using AMT arises because the HIT is too low-level of an abstraction for large-scale groundtruth collection. 
Satyam elevates the abstraction for groundtruth collection by observing that machine vision tasks naturally fall into a small number of categories (§5.2), e.g., classification (labeling objects in an image or a video), detection (identifying objects by drawing bounding boxes), segmentation (marking pixels corresponding to areas of interest) and a few others described in §5.2. Satyam allows users to specify their groundtruth requirements by providing customizable specification templates for each of these tasks.

Automated Quality Control. Satyam automates quality control in the face of an untrained workforce and eliminates the need for manual curation (§5.3). It requests annotations from multiple workers for each image/video clip. Based on the assumption that different workers make independent errors in the groundtruth annotation, Satyam employs novel groundtruth-fusion techniques that identify and piece together the “correct parts” of the annotations from each worker, while rejecting the incorrect ones and requesting additional annotations until the fused result is satisfactory.

Automated HIT Management - Pricing, Creation, Payment and Worker Filtering. Satyam automates posting HITs in AMT for each image/video in the specified storage location until the groundtruth for that image/video has been obtained. Instrumentation in Satyam’s annotation web-UIs allows it to measure the time taken for each HIT. Satyam uses this information to adaptively adjust the price to be paid for various HITs and ensures that it matches the requester’s user-defined hourly wage rate. Satyam determines whether or not a worker deserves payment by comparing their work against the final generated groundtruth and disburses payments to deserving workers. When recruiting workers, it uses past performance to filter out under-performing workers.

Implementation and Deployment. Satyam’s implementation is architected as a collection of cloud functions that can be auto-scaled, that support asynchronous result handling with humans in the loop, and that can be evolved and extended easily. Using an implementation of Satyam on Azure, we evaluate (§5.5) various aspects of Satyam. We find that Satyam’s groundtruth almost perfectly matches groundtruth from well known publicly available image/video data sets such as KITTI [93], which were annotated by experts or by using sophisticated equipment. Further, ML models re-trained using Satyam groundtruth perform comparably to the same models re-trained with these benchmark datasets. We have used Satyam for over a year, launching over 162,000 HITs on AMT to over 12,000 unique workers.

Examples of groundtruth generated by Satyam. Figure 5.1 shows examples of the groundtruth generated by Satyam for detection, segmentation, and tracking, and how these compare with groundtruth from benchmark datasets. More examples are available in Figures 5.15, 5.16 and 5.17 and at [218, 71].

5.1 Satyam Overview
Satyam is designed to minimize friction for non-expert users when collecting groundtruth for ML-based machine vision systems. It uses AMT but eliminates the need for users to develop complex Web-UIs or manually intervene in quality control or HIT management.

5.1.1 Design Goals
We now briefly describe the key design goals that shaped the architectural design of Satyam.

Ease of use. Satyam’s primary design goal is ease of use.
Satyam users should only be required to provide a high-level declarative specification of ground-truth collection: what kind of ground-truth is needed (e.g., class labels, bounding boxes), for which set of images/videos, how much the user is willing to pay, etc. Satyam should not require user intervention in designing UIs, assessing result quality, or ensuring the appropriate HIT price, but must perform these automatically.

Scalability. Satyam will be used by multiple concurrent users, each running several different ground-truth collection activities. Each activity in turn might spawn several tens of thousands of requests to workers, and each request might involve generating web user interfaces, assessing results, and spawning additional requests, all of which might involve significant compute and storage.

Asynchronous Operation. Because Satyam relies on humans in the loop, its design needs to be able to tolerate large and unpredictable delays. Workers may complete some HITs within a few seconds of launch, or may take days to process HITs.

Evolvability. In designing algorithms for automating ground truth collection, Satyam needs to take several design dimensions into account: the requirements of the user, the constraints of the underlying AMT platform, variability in worker capabilities, and the complexity of assessing visual annotation quality. These algorithms are complex, and will evolve over time, and Satyam’s design must accommodate this evolution.

Extensibility. ML for machine vision is rapidly evolving, and future users will need new kinds of groundtruth data which Satyam must be able to accommodate.

5.1.2 Satyam Abstractions
Satyam achieves ease of use and asynchronous operation by introducing different abstractions to represent logical units of work in groundtruth annotation (depicted in Figure 5.2).

Satyam-job. Users specify their groundtruth collection requirements at a high level as a Satyam-job, which has several parameters: the set of images/video clips, the kind of groundtruth desired (e.g., bounding rectangles for cars), payment rate ($/image or $/hour), the AMT requester account information to manage HITs on the user’s behalf, etc. At any instant, Satyam might be running multiple Satyam-jobs (Figure 5.3).

Satyam-tasks. Satyam renders jobs into Satyam-tasks, which represent the smallest unit of groundtruth work sent to a worker. For example, a Satyam-task might consist of a single image in which a worker is asked to annotate all the bounding rectangles and their classes, or a short video clip in which a worker is asked to track one or more objects. A single Satyam-job might spawn hundreds to several tens of thousands of Satyam-tasks.

HIT. A HIT is an AMT abstraction for the smallest unit of work for which a worker is paid. Satyam decouples HITs from Satyam-tasks, for two reasons. First, Satyam may batch multiple Satyam-tasks in a HIT. For example, it might show a worker 20 different images (each a different Satyam-task) and ask her to classify the images as a part of a single HIT. Batching increases the price per HIT, thereby incentivizing workers more, and also increases worker throughput. Second, it allows a single Satyam-task to be associated with multiple HITs, one per worker: this permits Satyam to obtain groundtruth for the same image from multiple workers to ensure high quality results.
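To make the Satyam-job abstraction concrete, the sketch below shows what a declarative job specification might look like. It is illustrative only: the field names and defaults are our own assumptions, not Satyam's actual schema (whose implementation is in C#), and the storage URL is hypothetical.

    # Illustrative sketch of a declarative Satyam-job specification (hypothetical schema).
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SatyamJob:
        storage_url: str                  # cloud folder holding the images/video clips
        template: str                     # e.g., "image_classification", "object_detection"
        class_labels: List[str] = field(default_factory=list)
        price_per_hour: float = 9.0       # desired hourly wage rate in $ (assumed example)
        amt_requester_account: str = ""   # AMT account used to manage HITs on the user's behalf

    job = SatyamJob(
        storage_url="https://example.blob.core.windows.net/traffic-cam-3",
        template="object_detection",
        class_labels=["car", "pedestrian", "cyclist"],
    )

Everything else described in this chapter (UI generation, quality control, HIT management) would be driven by a record of roughly this shape.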
Figure 5.2: Satyam’s jobs, tasks and HITs

Figure 5.3: Satyam Job Categories
    Job Category           Description                                                                Coverage
    Image Classification   Select class name of displayed image                                       22.5%
    Video Classification   Select class name of displayed video                                       5.3%
    Object Detection       Draw/edit bounding boxes and select class labels for each object of
                           interest in an image                                                       25.9%
    Object Tracking        Draw/edit bounding boxes and select class labels for each object of
                           interest in a video clip                                                   10.9%
    Object Segmentation    Draw arbitrary polygons around various areas of interest in an image       33.1%
    Object Counting        Count the number of objects in an image or a video clip                    0%
    OCR                    Recognize texts in an image                                                2.1%

Figure 5.4: Overview of Satyam’s components

5.1.3 Satyam Architecture
Satyam is architected as a collection of components (Figure 5.4), each implemented as a cloud function (e.g., an Azure function or an Amazon lambda) communicating through persistent storage. This design achieves several of Satyam’s goals. Each component can be scaled and evolved independently. Components can be triggered by users requesting new jobs or workers completing HITs and can thereby accommodate asynchronous operation. Finally, only some components need to be modified in order to extend Satyam to new types of ground truth collection.

Satyam’s components can be grouped into three high-level functional units, as shown in Figure 5.4: Job Rendition, Quality Control and HIT Management. We describe these components, and their functional units, in the subsequent sections.

Job Rendition. This functional unit raises the level of abstraction of groundtruth collection (Section 5.2). It is responsible for translating the user’s high-level groundtruth collection requirements to AMT HITs and then compiling the AMT worker results into a presentable format for users. Users primarily interact with the Job-Submissions Portal [198] where they submit their groundtruth collection requirements. Submitted jobs are written to the Job-Table. Based on the job descriptions in the Job-Table, the Pre-processor may perform data manipulations such as splitting videos into smaller chunks. The Task Generator decomposes the Satyam-job into Satyam-tasks, one for each image/video chunk. The Task Portal is a web application that dynamically renders Satyam-tasks into web pages (based on their specifications) displayed to AMT workers. Finally, Groundtruth Compilation assembles the final results from the workers for the entire job and provides them to Satyam users as a JSON-format file.

Quality Control. AMT workers are typically untrained in groundtruth collection tasks and Satyam has little or no direct control over them. Further, some of the workers might even be bots intending to commit fraud [165]. The quality control components are responsible for ensuring that the groundtruth generated by Satyam is of high quality. To achieve this, Satyam sends the same task to multiple non-colluding workers and combines their results. The Result Aggregator identifies and fuses the “accurate parts” of workers’ results while rejecting the “inaccurate parts” using the groundtruth-fusion algorithms described in §5.3. For certain tasks, the Aggregator might determine that it requires more results to arrive at a conclusive high quality result. In that case, it presents the task to more workers until a high quality groundtruth is produced.
The Results Evaluator compares fused results with each individual worker’s results to determine whether the worker performed acceptably or not and indicates this in the Result-Table.

HIT-Management. These components directly interact with the Amazon AMT platform and manage HITs through their life-cycle (§5.4). The HIT Generator reads the task table and launches HITs in AMT, ensuring that no unfinished task is left without a HIT. It is also responsible for adaptive pricing – adaptively adjusting the HIT price by measuring the median time to task completion – and worker filtering – ensuring that under-performing workers are not recruited again. The HIT Payments component reads the results table and pays workers who have completed a task acceptably, while rejecting payments for those who have not. The Task Purge component removes tasks from the task table that have already been aggregated, so that they are not launched as HITs again, and the HIT Disposer removes any pending HITs for a completed job.

5.2 Job Rendition
To achieve ease of use, Satyam needs to provide users with an expressive high-level specification framework for ground truth collection. Satyam leverages the observation that, in the past few years, the machine vision community has organized its efforts around a few well-defined categories of vision tasks: classification, detection or localization, tracking, segmentation, and so forth. Satyam’s job specification is based on the observation that different ground-truth collection jobs within the same vision task category (e.g., classification of vehicles vs. classification of animals) share significant commonality, while ground-truth collection for different vision task categories (e.g., classifying vehicles vs. tracking vehicles) is qualitatively different.

Job Categories. Satyam defines a small number of job categories where each category has similar ground-truth collection requirements. Users can customize groundtruth collection by parameterizing a job category template. For example, to collect class label groundtruth for vehicles (e.g., car, truck, etc.), a user would select an image classification job template and specify the various vehicle class labels. Templatizing job categories also enables Satyam to automate all steps of ground-truth collection. The web UIs presented to AMT workers for different ground truth collection jobs in the same category (e.g., classification) are similar, so Satyam can automatically generate these from Web-UI templates. Moreover, quality control algorithms for ground truth collection in the same category are similar (modulo simple parametrization), so Satyam can also automate these.

To determine which job categories to support, we examined the top 400 publicly available groundtruth datasets used by machine vision researchers [236] and categorized them with respect to the Web-UI requirements for obtaining the groundtruth (Figure 5.3). The coverage column indicates the fraction of datasets falling into each category. Satyam currently supports the first six categories in Figure 5.3, which together account for the groundtruth requirements of more than 98.1% of popular datasets in machine vision. We now briefly describe a few of the most commonly used templates currently available in Satyam.

Figure 5.5: Image Classification Task Page
Figure 5.6: Object Counting Task Page
Figure 5.7: Object Detection and Localization Task Page
Figure 5.8: Object Segmentation Task Page
Figure 5.9: Multi-Object Tracking Task Page

Image and Video Classification.
The desired groundtruth in this category is the label (or labels), from among a list of provided class labels, that most appropriately describes the image/video. Class labels can describe objects in images, such as cars or pedestrians, and actions in video clips, such as walking, running, and dancing. Satyam users customize (Figure 5.5) the corresponding job templates by providing the list of class labels and a link or description for them. To the workers, the web-UI displays the image/video clip with the appropriate instructions and a radio-button list of class labels.

Object Counting in Images and Videos. The desired groundtruth for this job category is a count of objects of a certain class, or of events, in an image or video (e.g., the number of cars in a parking lot or the number of people entering a certain mall or airport). The user provides a description of the object/event. In the web-UI, the worker is shown an image/video clip (Figure 5.6) and the provided description of the object/event of interest, for which the worker is asked to provide a count.

Object Detection in Images. The desired groundtruth in this category is a set of bounding boxes marking parts of interest in an image, along with a class label that most appropriately describes each box. For example, in a traffic surveillance scene, the objects of interest might be all the cars and pedestrians, and the groundtruth for each image is the set of bounding boxes enclosing these objects together with their respective class names. Satyam users customize this template (Figure 5.7) by specifying the object classes for which workers should draw bounding boxes; they may also specify a set of polygons describing the areas of interest within the images. Workers see an image and a radio-button list of object classes; they select one class and can then draw/edit (using the mouse) bounding boxes around all objects of that class, e.g., all pedestrians in a traffic surveillance image, in one shot.

Object Segmentation in Images. The desired groundtruth in this category is pixel-level annotations of various objects/areas of interest (e.g., people, cars, the sky). This template is similar to the object detection template except that it lets workers annotate arbitrary shapes by drawing a set of polygons (Figure 5.8).

Object Tracking in Videos. The desired groundtruth in this category, an extension of object detection to videos, requires bounding boxes for each distinct object/event of interest in successive frames of a video clip. This groundtruth can be used to train object trackers. Satyam users can select (Figure 5.9) the video tracking job category, and specify the object classes that need to be tracked, instructions to workers on how to track them, the frame rate at which the video should be annotated, and polygons that delineate areas of interest within frames. Workers are presented (Figure 5.9) with a short video sequence, together with the categories of interest, and can annotate bounding boxes for each object on each frame of the video.
For annotation, we have modified an existing open-source video annotation tool [224] and integrated it into Satyam.

Job Rendition Components. When a user wishes to initiate ground truth collection, she uses the Job Submission Portal to select a job category template, and fills in the parameters required for that template. Beyond the category-specific parameters described above, users provide a cloud storage location containing the images or videos to be annotated, and indicate the price they are willing to pay. After the user submits the job specification, the Portal generates a globally unique ID (GUID) for the job, and stores the job description in the Job-Table. Then, the following components perform job rendition.

Figure 5.10: Amazon MTurk HITs Web Portal

Pre-processor. After a job is submitted via the Job Submission Portal, the images/video clips might need to be pre-processed. In our current implementation, Satyam supports preprocessing for video annotations. Specifically, large videos (greater than 3 seconds in duration) are broken into smaller chunks (with a small overlap between successive chunks to facilitate reconstruction or stitching, see below) to diminish cognitive load on workers. They are then downsampled based on the user’s requirements, and converted into a browser-friendly format (e.g., MP4).

Task Generator. This component creates a Satyam-task for each image or video chunk. A Satyam-task encapsulates all the necessary information (image/video URI, user customizations, the associated Job-GUID, etc.) required to render a web page for the image/video clip. The Satyam-task is stored as a JSON string in the Task-Table. The Task-Table stores additional information regarding the task, such as the number of workers who have attempted it.

Task Web-UI Portal. An AMT worker sees HITs listed by the title of the template and the price promised for completing the HIT (Figure 5.10). (At any given instant, Satyam can be running multiple Satyam-jobs for each supported template.) When the worker accepts a HIT, she is directed to the Satyam Task Web-UI Portal, which dynamically generates a web page containing one or more Satyam-tasks. For example, Figure 5.9 shows a Web-UI page for the tracking template. The generated web page appears as an IFrame within the AMT website. When the worker submits the HIT, the results are entered into the Result-Table and AMT is notified of the HIT completion.

When dynamically generating the web page, Satyam needs to determine which Satyam-tasks to present to the worker. Listing HITs only by task portal and by price allows delayed binding of a worker to Satyam-tasks. Satyam uses this flexibility to (a) achieve uniform progress on Satyam-tasks and (b) avoid issuing the same task to the same worker. When a worker picks a HIT for template T and price p, Satyam selects the Satyam-task with the same T and p which has been worked upon the least (using a random choice to break ties). There is one exception to this least-worked-on approach. Satyam may need to selectively finish aggregating a few tasks to gather statistics for dynamic price adjustment (described later). In such instances, the least-worked-on mechanism and randomization are restricted to a smaller subgroup rather than the whole task pool, so that the subgroup completes quickly. To avoid issuing the same task to the same worker, Satyam can determine, from the task table, if the worker has already worked on this task (it may present the same task to multiple workers to improve result quality, §5.3). A single HIT may contain multiple Satyam-tasks, so Satyam repeats this procedure until enough tasks have been assigned to the HIT.
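The task-selection logic just described can be sketched as follows. This is an illustrative sketch only (the actual implementation is in C#); the field names 'template', 'price', 'num_results', and 'workers' are our own assumptions about the Task-Table contents.

    import random

    def select_task(tasks, worker_id, template, price):
        """Sketch of least-worked-on task selection with random tie-breaking.
        `tasks` is assumed to be a list of dicts with hypothetical fields
        'template', 'price', 'num_results' and 'workers'."""
        candidates = [t for t in tasks
                      if t["template"] == template and t["price"] == price
                      and worker_id not in t["workers"]]   # never re-issue a task to the same worker
        if not candidates:
            return None
        least = min(t["num_results"] for t in candidates)   # prefer the least-worked-on tasks
        return random.choice([t for t in candidates if t["num_results"] == least])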
Groundtruth Compilation. Once all the tasks corresponding to a job have been purged (§5.4), this component compiles all the aggregated results corresponding to this job into a JSON file, stores that file at a user-specified location and notifies the user. Before ground-truth compilation, Satyam may need to post-process the results. Specifically, for video-based job categories like tracking, Satyam must stitch video chunks together to get one seamless groundtruth for the video. We omit the details of the stitching algorithm, but it uses the same techniques as the groundtruth-fusion tracking algorithm (§5.3.2) to associate elements in one chunk with those in the overlapped frames of the next chunk.

Figure 5.11: Quality Control Loop in Satyam
Figure 5.12: Example groundtruth fusion in Multi-object Detection
Figure 5.13: Groundtruth fusion in Multi-object Tracking is a 3-D extension of Multi-object Detection

5.3 Quality Control
Satyam’s quality control relies on the wisdom of the crowds [212]: when a large enough number of non-colluding workers independently agree on an observation, it must be “close” to the groundtruth. To achieve this, Satyam solicits groundtruth for the same image/video clip from multiple workers and only accepts elements of the groundtruth that have been corroborated by multiple workers. For instance, in a detection task with several bounding boxes, only those for which at least 3 workers have drawn similar bounding boxes are accepted.

Figure 5.11 depicts Satyam’s quality control loop. One instance of the loop is applied to each Satyam-task. Satyam first sends the same task to n_min workers to obtain their results. n_min depends on the job category and is typically higher for more complex tasks (described in more detail below). In the groundtruth-fusion step, Satyam attempts to corroborate and fuse each groundtruth element (e.g., bounding box) using a job-category-specific groundtruth-fusion algorithm. If the fraction of corroborated elements in an image/video clip is less than the coverage threshold (η_cov), Satyam determines that more results need to be solicited and relaunches more HITs, one at a time. For some images/videos, agreeing on groundtruth may be difficult even for humans. For such tasks we place a maximum limit n_max (20 in our current implementation) on the number of workers we solicit groundtruth from. The task is marked “aggregated” and removed from the task list either if we reach the maximum limit or if the fraction of corroborated elements exceeds η_cov.

5.3.1 Dominant Compact Cluster
All the groundtruth-fusion algorithms in Satyam are based on finding the Dominant Compact Cluster (DCC), which represents the set of similar results that the largest number of workers agree on. If the number of elements in the dominant compact cluster is greater than n_corr, the groundtruth for that element is deemed corroborated.

Definition. Suppose that n workers have generated n versions of the groundtruth E_1, E_2, ..., E_n for a particular element in the image/video (as in Figure 5.12, where each of the 4 workers has drawn a bounding box around the orange car). For each job category, we define a distance metric D(E_i, E_j) that is higher the more dissimilar E_i and E_j are.
A fusion function F_fusion(E_1, E_2, ..., E_k) = E_fused specifies how different versions of the groundtruth can be combined into one (e.g., by averaging multiple bounding boxes into one). All groundtruth-fusion algorithms start by clustering E_1, E_2, ..., E_n based on D while guaranteeing that no element of a cluster is farther than distance τ from the fused element, i.e., D(E_fused, E_k) < τ for all E_k within a cluster. τ, the compactness constraint, ensures that a cluster does not contain results that are too dissimilar from each other. After the clustering, the cluster with the largest number of elements is deemed the dominant compact cluster, and E_fused computed over this cluster is deemed the cluster head.

Greedy Hierarchical Clustering to find the DCC. Finding the DCC is NP-hard, so we use greedy hierarchical clustering. We start with n clusters, the i-th cluster containing the single element E_i. At each step, the two clusters with the closest cluster heads are merged, provided that the merged cluster does not violate the compactness constraint. The clustering stops as soon as no clusters can be merged any longer.

Variations across different templates. While finding the DCC is common across all groundtruth-fusion algorithms, the specific values and functions n_min, D(E_i, E_j), n_corr, F_fusion, η_cov and τ differ for each fusion algorithm. In the rest of this section, we describe the choices we use for these values and functions.

5.3.2 Fusion Details
Image and Video Classification. For this template, Satyam uses a super-majority criterion, selecting the class for which the fraction of workers that agree on the class exceeds β ∈ (0, 1) (we chose β = 0.7, §5.5.8). This is equivalent to the DCC algorithm with the distance function D = 0 if two workers choose the same category and ∞ if they do not, τ = 0, and n_corr = βn, where n is the number of results.

Counting in Images/Videos. Given n counts by n workers, our goal is to robustly remove all outliers and arrive at a reliable count. We use the DCC for this, with D(C_i, C_j) = |C_i − C_j|, where C_i and C_j are the counts from the i-th and j-th workers. F_fusion is chosen as the average of all the counts, and τ = ⌊ε·C̄⌋, where C̄ is the average count of the cluster, i.e., two counts are deemed similar only if their deviation is less than a ±ε fraction of the average count. We chose ε = 0.1 in our implementation. n_min and n_max are chosen to be 10 and 20 respectively (§5.5.8).

Object Detection in Images. To provide intuition into the groundtruth-fusion algorithm for this template, we use the example in Figure 5.12, where four workers have drawn bounding boxes around cars in an image. The j-th bounding box drawn by the i-th worker is denoted B_ij. A worker may not draw bounding boxes for all cars (e.g., W_2 and W_4), and two different workers may draw bounding boxes on the same image in a different order (e.g., W_1 draws a box around the red car first, but W_3 does it last). Furthermore, workers may not draw bounding boxes consistently: W_3’s box around the orange car is off-center, and box B_11 is not tightly drawn around the red car. Our fusion algorithm is designed to be robust to these variations.
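Before describing the detection-specific association and fusion steps, the following minimal sketch illustrates the generic greedy DCC computation of §5.3.1. It is illustrative only (the actual implementation is in C#); the distance function dist, fusion function fuse and threshold tau are supplied per template, and all names are our own.

    def dominant_compact_cluster(results, dist, fuse, tau):
        """Greedy hierarchical clustering sketch: repeatedly merge the two clusters
        with the closest heads while the merged cluster stays within the compactness
        bound tau; return the largest cluster and its fused head."""
        clusters = [[r] for r in results]          # start with singleton clusters
        heads = [fuse(c) for c in clusters]
        while True:
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    merged = clusters[i] + clusters[j]
                    head = fuse(merged)
                    # compactness: every member must be within tau of the fused head
                    if all(dist(head, r) < tau for r in merged):
                        d = dist(heads[i], heads[j])
                        if best is None or d < best[0]:
                            best = (d, i, j, merged, head)
            if best is None:                        # no mergeable pair remains
                break
            _, i, j, merged, head = best
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
            heads = [h for k, h in enumerate(heads) if k not in (i, j)] + [head]
        largest = max(range(len(clusters)), key=lambda k: len(clusters[k]))
        return clusters[largest], heads[largest]

For counting, for instance, dist would be |C_i − C_j|, fuse the average of the counts, and tau the ⌊ε·C̄⌋ bound described above.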
Bounding Box Association. Since different workers might draw boxes in a different order, we first find the correspondence between the boxes drawn by the different workers. In Figure 5.12, this corresponds to grouping the boxes into three sets, G_1 = {B_11, B_22, B_33}, G_2 = {B_12, B_31, B_41} and G_3 = {B_13, B_21, B_32, B_42}, where each set contains the boxes belonging to the same car. We model this problem as a multipartite matching problem where each partition corresponds to the bounding boxes of one worker, and the goal is to match the bounding boxes that different workers drew for the same car.

To determine the matching, we use a similarity metric between two bounding boxes, Intersection over Union (IoU), which is the ratio of the intersection of the two bounding boxes to their union. Since the matching problem is NP-hard, we use an iterative greedy approach. For a total of N bounding boxes, we start with N sets, one bounding box per set. At each iteration, we merge the two sets with the highest average similarity while ensuring that a set may contain at most one bounding box from each partition. The algorithm terminates when there are no more sets that can be merged. In the end, each set corresponds to all the boxes drawn by different workers for one distinct object in the image.

Applying groundtruth-fusion on each object. Once we know which bounding boxes correspond to each other, we can use the DCC for fusion. Let bounding box B_i = (x_i^tl, y_i^tl, x_i^br, y_i^br), where (x_i^tl, y_i^tl) and (x_i^br, y_i^br) are the top-left and bottom-right pixel coordinates respectively. We choose D(B_i, B_j) = max(|x_i^tl − x_j^tl|, |y_i^tl − y_j^tl|, |x_i^br − x_j^br|, |y_i^br − y_j^br|), n_corr = 3, τ = 15 (pixels), η_cov = 0.9, n_min = 5, n_max = 20. The fusion function F_fusion generates a fused bounding box by averaging the top-left and bottom-right pixel coordinates of all the bounding boxes being fused. Thus, for two boxes to be similar, none of their corners may deviate by more than τ pixels along the x or y axis. The minimum number of workers required to corroborate each box is 3, and 90% of the boxes need to be corroborated before the quality control loop terminates. We arrived at these parameters through a sensitivity analysis (§5.5.8).

Object Segmentation in Images. The fusion algorithm used for image segmentation is almost identical to that used for multi-object detection, except that bounding boxes are replaced by segments: arbitrary collections of pixels. Thus, when associating segments instead of bounding boxes, the IoU metric is computed over the individual pixels common to the two segments. For F_fusion, a pixel is included in the fused segment only if it was included in the annotations of at least 3 different workers. We use τ = 1/0.3, n_corr = 3, n_min = 10, n_max = 20 and η_cov = 0.9.

Object Tracking in Videos. The fusion algorithm for multi-object tracking simply extends that used for multi-object detection to determine a fused bounding volume, a 3-D extension of a bounding box (as shown in Figure 5.13). We extend the definition of IoU to a bounding volume by computing and summing intersections and unions over each frame, which we call 3D-IoU. For F_fusion, we average the bounding boxes across workers at each frame independently; this is because different workers may start and end a track at different frames. We use τ = 1/0.3, n_corr = 3, n_min = 5, n_max = 20 and η_cov = 0.9.
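As a concrete illustration of the box association and fusion just described for detection (the tracking variant replaces IoU with 3D-IoU computed over frames), here is a minimal sketch under our own naming; it is not Satyam's C# implementation, and the stopping rule for association is simplified.

    import numpy as np

    def iou(a, b):
        """IoU of two boxes given as (x_tl, y_tl, x_br, y_br)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    def associate(worker_boxes):
        """Greedy multipartite matching sketch: worker_boxes maps worker_id -> list of boxes.
        Repeatedly merge the two groups with the highest average pairwise IoU, never
        allowing two boxes from the same worker in one group."""
        groups = [{w: b} for w, boxes in worker_boxes.items() for b in boxes]
        def avg_iou(g1, g2):
            return np.mean([iou(b1, b2) for b1 in g1.values() for b2 in g2.values()])
        while True:
            best = None
            for i in range(len(groups)):
                for j in range(i + 1, len(groups)):
                    if set(groups[i]) & set(groups[j]):   # at most one box per worker per group
                        continue
                    s = avg_iou(groups[i], groups[j])
                    if s > 0 and (best is None or s > best[0]):
                        best = (s, i, j)
            if best is None:
                break
            _, i, j = best
            merged = {**groups[i], **groups[j]}
            groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
        return groups          # each group: boxes drawn by different workers for one object

    def fuse_boxes(boxes):
        """F_fusion for detection: average the corner coordinates of the associated boxes."""
        return tuple(float(np.mean([b[i] for b in boxes])) for i in range(4))

Each group returned by associate would then be passed through the DCC step (with the max-corner-deviation distance and τ = 15 pixels) before fuse_boxes produces the final box.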
5.3.3 Result Evaluation
After all the results for a task have been fused, Satyam approves and pays, or rejects, each worker’s HIT (§5.4). For image and video classification, Satyam approves all HITs in which the worker’s selected class matches that of the aggregated result. When no class label achieves a super-majority (§5.3.2), it ranks all classes in descending order of the number of workers who selected them, then chooses the minimum number of classes such that the combined number of workers who selected them forms a super-majority, and approves all of their HITs. For counting, Satyam approves each worker whose counting error is within ε of the fused count (§5.3.2). For object detection, segmentation and tracking, Satyam approves each worker whose work has contributed to most of the objects in the image/video. Specifically, Satyam approves a worker if the bounding boxes generated by the worker appear in more than half of the dominant compact clusters (§5.3.2) for objects in the image.

5.4 HIT Management
These components manage the interactions between Satyam and AMT: launching HITs for the tasks, estimating and adapting the price of HITs to match user specifications, filtering under-performing workers, submitting results to the quality control components, and finally, making or rejecting payments for tasks that have completed.

HIT Generator. This component creates HITs in AMT using the web-service API that AMT provides [164] and associates these HITs with an entry in the HIT-Table (which also contains pricing metadata, as well as job/task identification). It ensures that every unfinished task in the Satyam-task table has at least one HIT associated with it in AMT. It does this by comparing the number of unfinished tasks in the Task-Table for each GUID and price level against the number of unfinished HITs in the HIT-Table and determining the deficit. Because a single HIT may comprise multiple tasks, Satyam computes the number of extra HITs needed to fill any deficit and launches them. To determine which HITs have been worked on, as soon as a worker submits a HIT, Satyam records this in the HIT-Table.

HIT Price Adaptation. Several organizational and state laws require hourly minimum-wage payments. Moreover, hourly wages are easier for users to specify. However, payments in AMT are disbursed at the granularity of a HIT. Thus, Satyam must be able to estimate the “reasonable” time taken to do a HIT and translate it into a price per HIT based on the desired hourly rate. The time taken for a HIT can vary from a few seconds to several minutes and depends on three factors: (a) the type of the template (e.g., segmentation tasks take much longer than classification tasks); (b) even within the same template, more complex jobs can take longer, e.g., scenes with more cars at a busy intersection; and (c) finally, different workers work at different rates. To estimate HIT completion times, Satyam instruments the web-UIs provided to workers and measures the time taken by the worker on the HIT. As each job progresses, Satyam continuously estimates the median time to HIT completion per job (considering only approved HITs). It uses this value to adjust the price for each future HIT in this particular job. Using this, Satyam’s price per HIT converges to conform to hourly minimum-wage payments (§5.5.6).

Satyam HIT Payments. Once a task is aggregated, deserving workers must be paid. Satyam relies on the fusion algorithms to determine whether a result should be accepted or not (§5.3). A single HIT may include multiple Satyam-tasks; Satyam’s HIT Payments component computes the fraction of accepted results in a HIT across all of these tasks and pays the worker if this fraction is above a threshold.
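The price adaptation described above amounts to converting a running median of approved HIT completion times into a per-HIT price at the requester's hourly rate. A minimal sketch (illustrative names and an assumed $9/hour target; not Satyam's actual code):

    import statistics

    def price_per_hit(approved_hit_times_sec, hourly_rate):
        """Convert the median completion time of approved HITs into a per-HIT
        price that matches the requester's desired hourly wage rate."""
        median_sec = statistics.median(approved_hit_times_sec)
        return round(hourly_rate * (median_sec / 3600.0), 2)

    # e.g., with a median of 47 seconds per HIT and a $9/hour target,
    # each HIT would be priced at about $0.12.
    print(price_per_hit([40, 47, 55, 46, 52], hourly_rate=9.0))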
Worker Filtering. Worker performance can vary across templates (e.g., good at classification but not segmentation), and across jobs within a given template (e.g., good for less complex scenes but not for more complex ones). To minimize rejected payments, Satyam tracks worker performance and avoids recruiting workers for tasks they might perform poorly at. To do this, as Satyam rejects payments to undeserving workers for a certain task, it tracks worker approval rates (using the AMT-supplied opaque workerID) for each job and does not serve HITs to workers that have low approval rates (lower than 50% in our implementation). While serving HITs to workers with a history of high performance allows Satyam to be efficient, Satyam must also explore and be able to discover new workers. Thus, Satyam allows workers with good approval rates to work on 80% of the HITs, reserving the rest for workers for whom it does not have any history. As shown in our evaluation (§5.5.7), worker filtering results in far fewer overall rejections.

Satyam Task Purge. This component, triggered whenever a result is aggregated, removes completed tasks from the Task-Table so that they no longer show up in any future HITs.

5.5 Evaluation
We have implemented all components (Figure 5.4) of Satyam on Azure. Our implementation is 13,635 lines of C# code. Using this, we evaluate Satyam by comparing the fidelity of its groundtruth against public ML benchmark datasets. In these benchmarks, groundtruth was curated/generated by trained experts or by using specialized equipment in controlled settings. To demonstrate Satyam’s effectiveness in a real-world deployment, we generate a data set by extracting images from four video surveillance streams at major traffic intersections in two US cities. We evaluate Satyam along the following dimensions: (a) the quality of ground truth obtained by Satyam compared with that available in popular benchmark data sets; (b) the accuracy of deep neural networks trained using groundtruth obtained by Satyam compared with those trained using benchmark data sets; (c) the efficacy of fine-tuning in a deployed real-world vision-based system; (d) the cost and time to obtain groundtruth data using Satyam; (e) the efficacy of our adaptive pricing and worker filtering algorithms; and (f) the sensitivity of the groundtruth-fusion algorithms to parameters.

5.5.1 ML Benchmark Datasets
Image Classification (ImageNet-10). We create this dataset by picking all the images corresponding to 10 classes commonly seen in video surveillance cameras from the ImageNet [69] dataset. Our dataset contains 12,482 images covering these classes: cat, dog, bicycle, lorry-truck, motorcycle, SUV, van, female person and male person.

Video Classification (JHMDB-10). For this data set we pick all the video clips from the JHMDB [123] data set corresponding to 10 common human activities: clap, jump, pick, push, run, sit, stand, throw, walk, and wave (a total of 411 video clips).

Counting in Images (CARPK-1). We create this data set by selecting 164 drone images taken from one parking lot from CARPK [112] (a total of 3,616 cars).

Object Detection in Images (KITTI-Object). We create this data set by considering 3 out of 8 classes (cars, pedestrians and cyclists) in the KITTI [93] data set with 8000 images (a total of 20,174 objects). The groundtruth in KITTI was established using a LiDAR mounted on the car.

Object Segmentation in Images (PASCAL-VOC-Seg).
PASCAL-VOC [80] is a standardized image dataset for object classification, detection, segmentation, action classification, and person layout. We create this data set by choosing 353 images from PASCAL-VOC that have segmentation labels, including the groundtruth of both class- and instance-level segmentation, corresponding to a total of 841 objects of 20 different classes.

Tracking in Videos (KITTI-Trac). For this dataset we chose all 21 video clips that were collected from a moving car in KITTI [93] (about 8000 frames), but evaluate tracks only for 2 classes – cars and pedestrians. During the pre-processing step, these 21 video clips were broken into 276 chunks of length 3 seconds each with a 0.5 second overlap between consecutive chunks.

Traffic Surveillance Camera Stream Data (CAM). We extracted images at 1 frame/minute from the video streams of 4 live HD quality traffic surveillance cameras, over one week (7 days) between 7:00 am and 7:00 pm each day. These cameras are located at major intersections in two U.S. cities. We label the dataset corresponding to each of the four cameras as CAM-1, CAM-2, CAM-3 and CAM-4 respectively.

Figure 5.14: Satyam Accuracy, Latency, and Cost
                              Video Class.  Image Class.   Counting   Detection    Segmentation       Tracking
                              (JHMDB-10)    (ImageNet-10)  (CARPK-1)  (KITTI-Obj)  (PASCAL-VOC-Seg)   (KITTI-Trac)
    # Objects Annotated       411           12,482         3,616      20,174       841                1,845
    Precision                 99.29%        98.56%         96.92%     99.01%       94.77%             94.61%
    Recall                    99.22%        99.16%         N/A        97.13%       94.65%             95.86%
    Median Latency [hrs]      7.4           8.75           9.5        170.69       72.5               64.2
    Latency/Object [sec]      64.82         2.5            9.46       30.46        310.34             125.27
    Avg. # Paid Results/Task  2.94          5.38           6.89       8.61         8.26               11.74
    Median Time/Task [sec]    8.7           5.67           32.57      46.86        172.5              557
    Mean # Objects/Task       1             1              22.04      2.52         2.38               6.68
    Median Time/Object [sec]  8.7           5.67           1.48       18.6         72.48              83.38
    Person-Seconds/Object     25.58         30.5           10.18      160.11       598.68             978.92

Figure 5.15: Example Results of Satyam Detection from KITTI
Figure 5.16: Example Results of Satyam Segmentation from PASCAL
Figure 5.17: Example Results of Satyam Tracking from KITTI

5.5.2 Quality of Satyam Groundtruth
To demonstrate that Satyam groundtruth is comparable to that in the ML benchmarks, we launched a job in Satyam for each of the six benchmark data sets described in Figure 5.14. Figures 5.15, 5.16 and 5.17 show some examples of groundtruth obtained using Satyam for the detection, segmentation and tracking templates respectively. For this comparison, we evaluate match-precision (the degree to which Satyam’s groundtruth matches that of the benchmark) and match-recall (the degree to which Satyam’s workers identify groundtruth elements in the benchmark). Figure 5.14 summarizes Satyam’s accuracy for the various templates relative to the benchmarks. Satyam has uniformly high match-precision (95-99%) and high match-recall (>95%) for the relevant benchmarks. We find that Satyam often deviates from the benchmark because there are fundamental limits to achieving accuracy with respect to popular benchmark data sets, for two reasons.
First, some of the benchmarks were annotated/curated by human experts and have a small fraction of errors or ambiguous annotations themselves. Some of the ambiguity, especially in classification, arises from linguistic confusion between class labels (e.g., distinguishing between van and truck). Second, in others that were generated using specialized equipment (e.g., LiDAR), part of the generated groundtruth is not perceivable to the human eye. In the rest of this section, we describe our methodology for each job category and elaborate on these fundamental limits.

Figure 5.18: Linguistic confusion between van and truck
Figure 5.19: Confusion Matrix of Satyam Result on ImageNet-10
Figure 5.20: Confusion Matrix of Satyam Result on JHMDB
Figure 5.21: Example of counting error resulting from partially visible cars

Image Classification. Satyam groundtruth for ImageNet-10 has a match-precision of 98.5% and a match-recall of 99.1%. The confusion matrix (Figure 5.19) for the categories in ImageNet-10 shows that the largest source of mismatch is from 10% of vans in ImageNet being classified as lorry-trucks by Satyam. We found that all of the vehicles categorized as vans in ImageNet are in fact food or delivery trucks (e.g., Figure 5.18), indicating linguistic confusion on the part of workers. The only other significant off-diagonal entry in Figure 5.19, at 1.6%, results from linguistic confusion between vans and SUVs. Discounting these two sources of error, Satyam matches 99.9% of the groundtruth.

Video Classification. Satyam’s groundtruth for this category has a match-precision and match-recall exceeding 99%. The confusion matrix (Figure 5.20) for the 10 categories in JHMDB-10 reveals only 3 mismatches compared to the benchmark groundtruth.
We examined each case, and found that the errors resulted from class label confusion or from incorrectly labeled groundtruth in the benchmark: a person picking up his shoes was labeled as standing instead of picking; a person moving fast to catch a taxi was labeled as walking instead of running; and finally, a person who was picking up garbage bags and throwing them into a garbage truck was labeled as picking up in JHMDB-10, while Satyam’s label was throwing. Discounting these cases, Satyam matches 100% of the groundtruth.

Counting in Images. Satyam’s car counts deviate from the CARPK-1 benchmark’s count groundtruth by 3% (Figure 5.14), which corresponds to an error of 1 car in a parking lot with 30 cars. This arises because of cars that are only partially visible in the image (e.g., Figure 5.21), where workers were unsure whether to include these cars in the count or not. By inspecting the images we found that between 3 and 10% of the cars in each image were partially visible.

Object Detection in Images. To quantify the accuracy for this template, we adopt the methodology recommended by the KITTI benchmark – two bounding boxes are said to match if their IoU is higher than a threshold. Satyam has a high match-precision of 99% and match-recall of 97% (Figure 5.14). The match-recall is expected to be lower than match-precision: the LiDAR mounted on KITTI’s data collection vehicle can sometimes detect objects that may not be visible to the human eye.

Object Segmentation in Images. We use Average Precision (AP) [80] to quantify the accuracy. We use a range of IoUs (0.5-0.95 with steps of 0.05) to compute the average to avoid a bias towards a specific value. Satyam achieves an AP of 90.03%. We also provide a match-precision of 94.77% and a match-recall of 94.56% using an IoU of 0.5. The dominant cause of false positives is missing annotations in the ground-truth. Figure 5.22 shows examples of such missing annotations from PASCAL that Satyam’s users were able to produce. The primary cause of false negatives is that our experiments used a lower value of η_cov than appropriate for this task; we are rectifying this currently.

Object Tracking in Videos. A track is a sequence of bounding boxes across multiple frames. Consequently, we use the same match criterion for this template as for detection, applied across all the video frames. As seen from Figure 5.14, Satyam has a match-precision and match-recall of around 95%. To understand why, we explored worker performance at different positions in the chunk: we found that, as workers get to the end of a chunk, they tend not to start tracking new objects. Decreasing the chunk size and increasing the overlap among consecutive chunks would increase accuracy, at higher cost.

Figure 5.22: Example of missing segmentation labels from PASCAL. From left to right: raw image, PASCAL label, Satyam label. Satyam segments a small truck in the top left corner which was not present in the ground truth.
Figure 5.23: Training Performance of Satyam
Figure 5.24: End to End Training using Satyam Labels
Figure 5.25: CDF of Time Spent Per Task of All Job Categories
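The match-precision/match-recall numbers for detection and tracking follow the IoU-threshold matching criterion described above. A minimal, self-contained sketch of such an evaluation (illustrative only; not Satyam's evaluation code, and the greedy one-to-one matching is our own simplification):

    def iou(a, b):
        """IoU of two boxes given as (x_tl, y_tl, x_br, y_br)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    def match_precision_recall(satyam_boxes, benchmark_boxes, iou_threshold=0.5):
        """Greedily match fused Satyam boxes against benchmark boxes; two boxes
        match if their IoU exceeds the threshold."""
        unmatched = list(benchmark_boxes)
        matches = 0
        for box in satyam_boxes:
            best = max(unmatched, key=lambda b: iou(box, b), default=None)
            if best is not None and iou(box, best) > iou_threshold:
                unmatched.remove(best)
                matches += 1
        precision = matches / len(satyam_boxes) if satyam_boxes else 0.0
        recall = matches / len(benchmark_boxes) if benchmark_boxes else 0.0
        return precision, recall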
5.5.3 Re-training Models using Satyam
A common use case for Satyam is fine-tuning ML models to improve their performance using data specific to a deployment. In this section, we validate this observation by showing that (Figure 5.23): a) re-training ML models using Satyam groundtruth outperforms off-the-shelf pre-trained models, and b) models retrained using Satyam groundtruth perform comparably with models retrained using benchmarks. When re-training and testing a model, either with Satyam or benchmark groundtruth, we use standard methodology to train on 80% of the data, and test on 20%. In all cases, we retrain the last layer using accepted methodology [72, 193].

Image Classification. For this job category, we evaluate retraining a well-known state-of-the-art image classification neural network, Inception (V3) [213]. The original model was pre-trained on ImageNet-1000 [69] for 1000 different classes of objects. Using this model as-is on ImageNet-10 yields a classification accuracy (F1-score [98]) of about 60% (Figure 5.23). Retraining the model using the ImageNet-10 groundtruth increases its accuracy to 94.76%, while retraining on Satyam groundtruth results in an accuracy of 95.46%.

Object Detection in Images. For this category, we evaluate YOLO [191], pre-trained on the MS-COCO dataset. Our measure of accuracy is the mean average precision [81], a standard metric for object detection and localization that combines precision and recall by averaging precision over all recall values. The pre-trained YOLO model has high (80%) mean average precision, but retraining it using KITTI-Object increases this to 90.1%. Retraining YOLO using Satyam groundtruth matches KITTI-Object’s performance, with a mean average precision of 91.0% (Figure 5.23).

Tracking in Videos. As of this writing, the highest ranked open-source tracker on the KITTI Tracking Benchmark leaderboard is MDP [49], so we evaluate this tracker (with YOLO as the underlying object detector) using the standard Multi-Object Tracking Accuracy (MOTA [48]) metric, which also combines precision and recall. MDP using YOLO-CoCo’s detections achieves a MOTA of 61.83%, as depicted in Figure 5.23, but fine-tuning YOLO’s last layer using the labels from KITTI and Satyam improves MOTA to 78% and 77.77% respectively. Further investigation reveals that the improvement in MOTA from fine-tuning was primarily due to improvement in recall – while precision was already high (98%) before fine-tuning, recall was only 63.54%. After fine-tuning, recall improved to 81.70% for KITTI and 83.24% for Satyam.

5.5.4 Satyam In Real-World Deployments
In order to evaluate the impact of using Satyam in the real world, we extracted images at 1 frame/minute from the video streams of 4 live HD quality traffic surveillance cameras (labeled CAM-1 to CAM-4), over one week (7 days) between 7:00 am and 7:00 pm each day. These cameras are located at major intersections in two U.S. cities.

Figure 5.26: Improvement of performance with fine-tuned YOLO.
Figure 5.27: Histogram of Time Per Counting Task over Different Datasets

We now show that using Satyam groundtruth to fine-tune ML models can result in improved classification, detection, and tracking performance.
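As context for these experiments, the last-layer retraining methodology described in §5.5.3 can be sketched in a generic Keras-style form. This is only an illustrative sketch under assumed choices (an Inception V3 backbone, Adam optimizer, and hypothetical dataset handles), not the training code used in this chapter.

    import tensorflow as tf

    def finetune_last_layer(num_classes, train_ds, val_ds, epochs=10):
        """Sketch: freeze a pretrained backbone and retrain only the final
        classification layer on deployment-specific (e.g., Satyam) labels."""
        base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet",
                                                 pooling="avg")
        base.trainable = False                       # freeze all pretrained layers
        head = tf.keras.layers.Dense(num_classes, activation="softmax")
        model = tf.keras.Sequential([base, head])    # only `head` is trained
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(train_ds, validation_data=val_ds, epochs=epochs)
        return model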
For this, we use the CAM dataset, which has surveillance camera images from four intersections, to obtain ground-truth with Satyam, then re-trained YOLO-CoCo [72, 193] with 80% of the ground-truth and tested on the remaining 20%. Satyam re-training improves YOLO-CoCo performance uniformly across the four surveillance cameras (Figure 5.26). The average precision improves from 52-61% for the pre-trained models to 73-88% for the fine-tuned models – an improvement of 20-28%. This validates our assertion that camera fine-tuning will be essential for practical deployments, motivating the need for a system like Satyam.

Figure 5.24 demonstrates that these benefits carry over to other job categories as well, and shows fine-tuning Inception V3 for classification for one of the cameras, CAM-3. To compute this result, we used our groundtruth data from CAM-3 for the detection task, where workers also labeled objects, then trained Inception to focus on one object type, namely cars. While the pre-trained Inception model works poorly on CAM-3, fine-tuning the model results in an almost perfect classifier. Similarly, fine-tuning also results in an almost 40% improvement in the MOTA metric for the tracker.

5.5.5 Time-to-Completion and Cost
Figure 5.14 also shows the median time to complete an entire job, which ranges from 7 hours to 7 days. From this, we can derive the median latency per object, which ranges from 2.5 seconds/image for image classification to 125 seconds/object for tracking. That figure also shows the cost of annotating an object in person-seconds/object: the actual dollar figure paid is proportional to this (§5.5.6). By this metric (Figure 5.14), image and video classification and counting cost a few tens of person-seconds per object, while detection, segmentation, and tracking require roughly 160, 599, and 978 person-seconds per object respectively.

Figure 5.28: Adaptive Pricing on Counting Task

Figure 5.29: Approval Rates for various Satyam templates
    Web-UI Template          Original   After Filtering
    Image Classification     89.0%      90.3%
    Video Classification     88.7%      91.4%
    Counting in Images       92.4%      94.0%
    Detection in Images      75.0%      82.9%
    Segmentation in Images   65.8%      81.4%
    Tracking in Video        68.1%      86.6%

5.5.6 Price Adaptation
Figure 5.25 is a CDF of the times taken by workers for all the various job categories in our evaluation. It clearly shows that the time taken to complete a task can vary by 3 orders of magnitude across our job categories. Figure 5.27 depicts the pdf of the times taken for the same category – the counting task – but for two different data sets, i.e., CARPK-1 and KITTI-Obj. KITTI-Obj has around 10 vehicles on average, and CARPK-1 around 45, in each image, and the distribution of worker task completion times varies significantly across these datasets. (As an aside, both these figures have a long tail: we have seen several cases where workers start tasks but finish them hours later.) These differences motivate price adaptation.

To demonstrate price adaptation in Satyam, we show the temporal evolution of price per HIT for CARPK-1 and KITTI-Obj in Figure 5.28. The HIT price for KITTI-Obj converges within 200 results to the ideal target value (corresponding to the median task completion time). CARPK-1 convergence is slightly slower due to its larger variability in task completion times (Figure 5.27).
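The person-seconds/object figures in Figure 5.14 are consistent with combining the per-task columns of that figure as follows; this is a worked consistency check, not a claim about Satyam's exact bookkeeping.

    def person_seconds_per_object(avg_paid_results_per_task, median_time_per_task_sec,
                                  mean_objects_per_task):
        """Total human time spent per annotated object."""
        return avg_paid_results_per_task * median_time_per_task_sec / mean_objects_per_task

    # Worked check against the detection column of Figure 5.14:
    # 8.61 paid results/task * 46.86 s/task / 2.52 objects/task ~= 160.1 person-seconds/object.
    print(round(person_seconds_per_object(8.61, 46.86, 2.52), 1))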
Figure 5.30: Accuracy, Latency and Cost: Image Classification
Figure 5.31: Accuracy, Latency and Cost: Detection
Figure 5.32: Accuracy, Latency and Cost: Tracking

5.5.7 Worker Filtering
To evaluate the efficacy of worker filtering, we ran Satyam with and without worker filtering turned on for each of the templates. Figure 5.29 shows that for classification, counting and detection the approval rate is already quite high, close to 90%, and thus worker filtering brings about a modest increase in approval rates. For more involved tasks such as tracking and segmentation, the approval rates show a dramatic increase, from about 66-68% to over 80%.

5.5.8 Parameter Sensitivity
Satyam’s groundtruth fusion and result evaluation algorithms have several parameters, and Figure 5.14 presents results for the best parameter choice. We have analyzed the entire space of parameters to determine the parameters on which Satyam’s performance most crucially depends in terms of accuracy, latency and cost. Image classification is sensitive only to two parameters: n_min, the minimum number of results before aggregation can commence, and β, the fraction determining the super-majority. The upper graph in Figure 5.30 shows how classification accuracy varies as a function of these two parameters. Because the cost and latency of groundtruth collection vary with the parameters, the lower graph shows the cost (blue bars) and the latency (red bars) for each parameter choice. From this, we can see that when n_min ≥ 3 and β ≥ 0.7 the accuracy does not improve significantly; however, cost and latency increase. This indicates that n_min = 3 and β = 0.7 are good parameter choices, with high accuracy, while having moderate cost and latency.

We have conducted similar analyses for video classification, counting, object detection (Figure 5.31), segmentation, and tracking (Figure 5.32). The key conclusions are: (a) all job categories are sensitive to n_min, the minimum number of results before Satyam attempts to aggregate results; (b) each category is sensitive to one other parameter – for classification, this is the β parameter that determines the super-majority criterion; for counting, it is the error tolerance; for detection and tracking, it is η_cov, the fraction of corroborated groundtruth elements; and (c) in each case, there exists a parameter setting that provides good groundtruth performance at moderate cost and latency.

5.6 Related Work
Image recognition using crowdtasking. ImageNet training data for classification was generated using AMT, and uses majority voting for consensus [69]. Prior work [211] has also shown crowdtasking to be successful for detection: unlike Satyam, in this work, quality control is achieved by using workers to rate other workers, and majority voting picks the best bounding box. These use one-off systems to automate HIT management and consensus, but do not consider payment management. Satyam achieves comparable performance to these systems but supports more vision tasks. Third-party commercial crowdtasking systems exist to collect groundtruth for machine vision [85, 208].
Other approaches have developed one-off systems built on top of AMT for more complex vision tasks, including feature generation for sub-class labeling [70] and sentence-level textual descriptions [143, 173]. More generally, future machine vision systems will need annotated groundtruth for other complex annotations including scene characterization, activity recognition, and visual story-telling [140]; we have left it to future work to extend Satyam to support these.

Crowdtasking cost, quality, and latency. Prior work has extensively used multiple worker annotations and majority voting to improve quality [69, 211]. For binary classification tasks in a one-shot setting, lower-cost solutions exist to achieve high quality [132] or low latency [142]. For top-k classification (e.g., finding the k least blurred images in a set), several algorithms can be used for improving crowdtasking consensus [242]. Other work has explored this cost-quality tradeoff [91] in different crowd-tasking settings: de-aliasing entity descriptions [226, 134], or determining answers to a set of questions [135]. Satyam devises novel automated consensus algorithms for image recognition tasks based on the degree of pixel overlap between answers.

Crowdtasking platforms. Many marketplaces put workers in touch with requesters for freelance work [104, 86, 36], for coders [216], for software testing [163, 223], or for generic problem solving [120]. Satyam adds automation on top of an existing generic marketplace, AMT. Other systems add similar kinds of automation, but for different purposes. Turkit [152] and Medusa [189] provide an imperative high-level programming language for human-in-the-loop computations and sensing respectively. Collaborative crowdsourcing [119] automates the decomposition of more complex tasks into simpler ones, and manages their execution.

Chapter 6
Hybrid Human-Machine Labeling to Reduce Annotation Cost

Groundtruth data is crucial for training and testing ML models. Generating accurate ground-truth was cumbersome until the emergence, in recent years, of commercial cloud-based human annotation services [3, 11, 8]. Users of these services submit their datasets and receive, in return, annotations on each data item in the dataset. Because these services typically employ humans to generate groundtruth, annotation costs can be prohibitively high, especially for large data sets. In this chapter, we explore using a hybrid human-machine approach to reduce annotation costs (in $), in which only a subset of the data items are annotated by humans and the rest by a machine learning model trained on this annotated data.¹ Given that the accuracy of a model trained on a subset of the data set will typically be inferior to that of human annotators, using such a model will impact annotation quality. Moreover, as we show in the evaluation (§6.4), for data sets that are hard to classify, the cost of training (in $) a model to a high accuracy might itself be counterproductive; it may be more cost-effective to train a model to a lower accuracy and use it to label a small subset of data items while labeling the rest of the data items using human labeling. Consequently, we ask the question: How can we design a hybrid human-machine annotation scheme that minimizes the overall cost of annotation (including the cost of training the model) while providing a guarantee that the overall error rate is lower than a pre-specified acceptable value (e.g., 5%) compared to human annotations?
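As a back-of-the-envelope illustration of this question, the snippet below compares full human labeling against one hypothetical hybrid split under the error bound. All numbers here (dataset size, per-label price, training cost, and machine error rate) are invented for illustration and are not results from this chapter.

def hybrid_cost(n_human, cost_per_label, training_cost):
    """Dollar cost when n_human items are human-labeled (including the
    training set) and the remainder is labeled by the trained model."""
    return n_human * cost_per_label + training_cost

def meets_error_bound(n_total, n_machine, machine_error_rate, eps=0.05):
    """Overall error if the model labels n_machine items and humans label the
    rest (human labels treated as error-free), checked against the bound eps."""
    return (n_machine / n_total) * machine_error_rate <= eps

n_total, price = 50_000, 0.04                  # illustrative values only
full_human = n_total * price                   # $2000 to label everything by hand
hybrid = hybrid_cost(20_000, price, 300.0)     # $1100 if the model labels the other 30k
ok = meets_error_bound(n_total, 30_000, 0.06)  # 0.6 * 6% = 3.6% <= 5%, so acceptable
print(full_human, hybrid, ok)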
¹ Amazon SageMaker reportedly uses human-machine labeling for groundtruth, but details of their approach are not available.

As a model is trained over more annotated data, the corresponding gains per unit training cost diminish, making it less cost effective to train further. Thus, a key question to answer is, "when is it optimal to stop training?" This training cost versus accuracy tradeoff differs significantly from data set to data set, as well as with the specifics of the classifier (DNN architecture) being used to train, which is, typically, not known apriori. While some data sets may be "hard" to train, others are "easier". Similarly, while complex DNN models may provide a high accuracy, their training costs may be too high and potentially offset the gains obtained through machine generated annotations. Thus, a key challenge is to model this tradeoff, and dynamically (on the fly) refine this model, to predict and make the optimal choices for human-machine labeling.

Since the training accuracy versus cost tradeoff depends on the particular subset of the data items the classifier is trained on, our scheme must carefully select and obtain human annotations for the most "effective" subset of the data items that can be used to train the classifier. Further, while the trained classifier may not be accurate over the entire data set, it may still be able to accurately classify a subset of the data. Thus, another challenge is to determine which of the remaining data items can be "safely" labeled using the classifier without violating the quality constraints.

In this chapter we propose a novel technique, MCAL (Minimum Cost Active Labeling), that addresses these challenges and is able to minimize annotation cost across data sets with widely different characteristics and different human annotation services that offer widely different pricing. Further, given the choice of selecting from a set of predefined DNN architectures (and pre-specified hyper-parameters), MCAL is also able to dynamically select the most cost-effective architecture for annotating the data.

MCAL adopts active learning [202] to identify the data items that are potentially most effective for training a given classifier. Thus, as with active learning, MCAL is iterative, and in each iteration it trains the classifier using the human-annotated data obtained thus far. It then uses inferences on the rest of the un-annotated data to identify a subset of samples that are likely to be most effective in training the classifier in the subsequent iteration. From this point, MCAL differs from standard active learning: unlike active learning, which aims to train a specific classifier to high accuracy with the minimum annotations possible, MCAL aims to minimize the overall cost of annotating an entire data set.

While performing active learning, MCAL also simultaneously measures training costs and accuracy to construct and refine models that predict the cost-accuracy tradeoff (§6.2). MCAL learns a set of models to predict the accuracy as a function of the size of the training data for various architectures. It then uses these models to predict the error rate for different numbers of machine-generated labels and hence the savings due to machine labeling. In order to predict the AL training cost, MCAL models it as a quadratic function of the training data size and estimates its parameters. Both these predictive functions are then used to model the overall cost as a function of training data size.
This allows MCAL to optimize for and determine various choices such as the right time to stop training as well as choosing the right DNN architecture. In order to identify the data items that can be safely machine labeled, MCAL uses the constructed models and active learning principles to select the largest subset that will not violate quality constraints. MCAL jointly estimates these two quantities to minimize overall cost while satisfying the error constraint.

To this end, MCAL makes the following contributions:

• It casts the minimum-cost labeling problem in an optimization framework (§6.1) that precisely describes how to jointly select which samples to human label and which to machine label.

• This framing requires two models: one for cost, and the other for the relationship between each combination of the number of samples to human label and those to machine label, and the overall error rate (§6.2). For the former, MCAL assumes that total training cost at each step is proportional to training set size (and derives the cost model parameters using profiling on real hardware). For the latter, MCAL leverages the growing body of literature suggesting that a truncated power-law governs the relationship between model error and training set size [84, 59, 111, 127, 197].

• The MCAL algorithm (§6.3) iteratively refines the power-law parameters at each step, then does a fast search for the combination of human-labeled and machine-labeled samples that minimizes the total cost. When the user supplies multiple candidate architectures for the classifier, the algorithm trains each classifier up to the point where it is able to confidently predict which architecture can achieve the lowest overall cost.

[Figure 6.1: Differences between MCAL and Active Learning. Active learning outputs an ML model using few samples from the data set. MCAL completely annotates and outputs the dataset. MCAL must also use the ML model to annotate samples reliably (red arrow), unlike active learning schemes.]

Evaluations (§6.4) of MCAL on various popular benchmark datasets of different levels of difficulty (Fashion-MNIST [233], CIFAR-10 [145] and CIFAR-100 [145]) show that it achieves lower cost than the lowest-cost labeling achieved by an active learning strategy that trains a classifier to label the dataset. It selects a strategy that matches the complexity of the data set and the classification task. For example, it labels the Fashion dataset, the easiest to classify, using a trained classifier. At the other end, it chooses to label CIFAR-100 using humans almost completely; for this data set, it estimates training costs to be prohibitive. Finally, it labels a little over half of CIFAR-10 using a classifier. MCAL is up to 6× cheaper for some data sets compared to human labeling all images. It is able to achieve these savings, in part, by carefully determining active learning batch sizes while accounting for training costs; cost savings due to active learning range from 20-32% for Fashion and CIFAR-10.

MCAL borrows from active learning, but is different in several respects (as illustrated in Figure 6.1). Active learning iteratively minimizes labeling costs by selecting the most informative samples to label at each step of its iteration; MCAL uses the same metrics to decide which samples to select. However, the two approaches differ in their goals; active learning seeks to train a classifier with a given target accuracy, while MCAL attempts to label a complete dataset within a given error bound.
Active learning does not consider training costs, as MCAL does. For this reason, MCAL jointly selects the best combination, at each step, of the number of samples to human label and to machine label. Finally, MCAL must reduce cost by using the ML model to annotate.

6.1 Problem Formulation

In this section, we formalize the intuitions presented above. Let X be the dataset to be labeled with a target error rate bound ε. Suppose that a classifier D(B) is trained using human-generated labels for B ⊂ X. Let the error rate of D(B) over the remaining unlabeled data X\B be ε(X\B). If D(B) were used to generate labels for this remaining data, the overall groundtruth error rate for X would be ε(X) = (1 − |B|/|X|) · ε(X\B).² If ε(X) ≥ ε, then this would violate the maximum error rate requirement. However, D(B) might still be able to generate accurate labels for a carefully chosen subset S(D,B) ⊂ X\B (e.g., comprising only those that D(B) is very confident about). After generating labels for S(D,B) using D(B), labels for the remaining X\B \ S(D,B) can once again be generated by humans. The overall error rate of the generated groundtruth then would be (|S(D,B)|/|X|) · ε(S(D,B)), where ε(S(D,B)) is the error rate of generating labels over S(D,B) using D(B) and is, in general, higher for larger |S(D,B)|.

Let S*(D,B) be the largest possible S(D,B) that ensures that the overall error rate is less than ε. Then, the overall cost of generating labels in this manner is:

    C = |X \ S*(D,B)| · C_h + C_t(D(B))    (6.1)

where C_h is the cost of human labeling for a single data item and C_t(D(B)) is the total cost of generating D(B), including the cost of finding B and training D(B), but not including the human labeling cost |B| · C_h.

² To evaluate machine labeling accuracy compared to human beings, we assume human labeling has 0% error for B.

The key contribution in this chapter is min-cost active labeling (MCAL), an approach that minimizes C as follows³:

    C* = argmin_{S*(D,B), B} C    s.t.    (|S*(D,B)|/|X|) · ε(S*(D,B)) < ε

MCAL iteratively generates B. In each iteration, it ranks data X\B using a function M(.) that measures their "informativeness", based on D(B)'s classification uncertainty (e.g., entropy [65] or margin [199, 124, 202]). MCAL then obtains human labels for the δ (batch size) most informative ones, adds them to B, and (repeatedly) re-trains D using B. Subsequent sections describe MCAL in detail.

6.2 Cost Prediction Models

MCAL must determine the optimal value of B and S*(D,B) that minimizes C. In order to make optimal choices for various parameters, MCAL must be able to predict C as a function of the choices. C in turn depends on |S*(D,B)| and C_t(D(B)) (Equation 6.1). Thus, MCAL actually constructs two predictors, one each for |S*(D,B)| and C_t(D(B)).

6.2.1 Predicting |S*(D,B)| as a function of |B|

After training the classifier D on the human-labeled data B generated thus far, MCAL sorts each unlabeled data item in X\B using M(.), a measure of the item's informativeness. (An example of M(.) is the margin metric [199, 124, 202], which is the difference in classifier scores between the highest ranked and second highest ranked class labels.) Let the set S_θ(D(B)) contain the θ · |X\B| least informative data items (θ ∈ (0, 1)); these are the items that D is most confident about. Thus, S*(D,B) corresponds to an S_θ(D(B)) for a maximal value θ* that does not violate the overall groundtruth accuracy constraint (|S_θ(D(B))|/|X|) · ε(S_θ(D(B))) < ε. MCAL constructs a predictor for ε(S_θ(D(B))) and uses it to predict |S*(D,B)| by searching for θ*.
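The sketch below illustrates this construction: ranking unlabeled items by margin, forming S_θ from the most confident ones, and searching a grid of θ values for θ*. It assumes softmax probabilities from D(B) as input, and err_of_theta stands in for the per-θ error predictor developed in the rest of this section; the function names and the 0.05 grid are illustrative choices, not MCAL's implementation.

import numpy as np

def margin_scores(probs):
    """Margin informativeness: the gap between the top-1 and top-2 class
    probabilities. A small margin means D(B) is uncertain (informative item)."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def select_s_theta(probs_unlabeled, theta):
    """Indices of the theta fraction of unlabeled items with the largest
    margins, i.e. the items D(B) is most confident about (S_theta)."""
    m = margin_scores(probs_unlabeled)
    k = int(theta * len(m))
    return np.argsort(m)[::-1][:k]

def search_theta_star(n_unlabeled, n_total, err_of_theta, eps=0.05,
                      thetas=np.arange(0.05, 1.0001, 0.05)):
    """Largest theta whose predicted contribution to the overall labeling
    error, (|S_theta| / |X|) * err(S_theta), stays below eps."""
    best = 0.0
    for t in thetas:
        s_size = t * n_unlabeled
        if (s_size / n_total) * err_of_theta(t) < eps:
            best = max(best, t)
    return best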
³ This formulation assumes an error constraint; however, MCAL can be generalized to other constraints such as a fixed budget; see §6.3.

[Figure 6.2: Relationship between training set size and generalization error ε(S_θ(D(B))), fit to a power law and a truncated power law, for CIFAR-10 using RESNET18 for various θ (5%, 20%, 50%).]
[Figure 6.3: Dependence of ε(S_θ(D(B))) on δ is "small", especially towards the end of active learning. Here, |B| = 16,000 for CIFAR-10 using RESNET18.]
[Figure 6.4: Error prediction improves with increasing number of error estimates for CIFAR-10 using RESNET18.]

To predict ε(S_θ(D(B))), we leverage recent empirical work (§6.5) that observes that, for many tasks and many models, the generalization error vs. training set size is well-modeled by a power-law [111, 127, 197, 84, 59] of the form ε(S_θ(D(B))) = α_θ |B|^(−γ_θ). However, it is well-known that most power-laws experience a fall-off [53] at high values of the independent variable. To model this, we use an upper-truncated power-law [53]:

    ε(S_θ(D(B))) = α_θ |B|^(−γ_θ) e^(−|B|/k_θ)    (6.2)

where α_θ, γ_θ, k_θ are the power-law parameters. This model better represents the generalization error vs. training set size relationship than a power-law. In Figure 6.2, we use active learning on the CIFAR-10 [145] data set, with RESNET18 [107]. We fit both a power-law and a truncated power-law. As Figure 6.2 shows, the truncated power-law is able to better predict the generalization error at larger values of |B|. We validate this observation on CIFAR-100 for three different models in §6.4.5.

ε(S_θ(D(B))) is expected to increase monotonically with θ, as increasing θ has the effect of adding data that D is progressively less confident about. Lacking a parametric model for this dependence, to find θ*, we generate power-law models ε(S_θ(D(B))) for various discrete values of θ ∈ (0, 1) as described in §6.3. θ* for a given B is then predicted by searching across the predicted ε(S_θ(D(B))) corresponding to the discrete values of θ.

6.2.2 Modeling Active Learning Training Costs

Active learning [202] iteratively obtains human labels for the δ most informative items ranked using M(.) and adds them to B. It then retrains the classifier D using the entire set B. A smaller δ typically makes the active learning more effective: it allows for achieving a lower error for a potentially smaller B through more frequent sampling, but also significantly increases the training cost due to frequent re-training of D. Choosing an appropriate δ is thus an important aspect of minimizing overall cost.

The training cost (in $) depends on the training time, which in turn is proportional to the data size (|B|) and the number of epochs used to train the model (each epoch running over the entire B). A common strategy in active learning approaches is to use a fixed number of epochs per iteration, so the training cost in each iteration is proportional to |B|. Since in each iteration δ new data samples are added to B, the total training cost accumulated over all the previous and current iterations is:

    C_t(D(B)) = k · |B| · (|B|/δ + 1)    (6.3)

where k is a constant of proportionality.
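A minimal sketch of both models follows: fitting Equation 6.2 with SciPy's curve_fit and evaluating an Equation 6.3-style cumulative cost. The initial guesses passed to the fit and the per-sample cost constant are assumptions for illustration, not values used by MCAL.

import numpy as np
from scipy.optimize import curve_fit

def truncated_power_law(b, alpha, gamma, k):
    """Eq. 6.2: predicted error on S_theta as a function of training set size |B|."""
    return alpha * np.power(b, -gamma) * np.exp(-b / k)

def fit_error_model(sizes, errors):
    """Fit (alpha_theta, gamma_theta, k_theta) to the (|B_i|, measured error)
    pairs collected so far for one value of theta."""
    p0 = [1.0, 0.5, 1e5]   # illustrative initial guesses
    params, _ = curve_fit(truncated_power_law, np.asarray(sizes, dtype=float),
                          np.asarray(errors, dtype=float), p0=p0, maxfev=10000)
    return params

def cumulative_training_cost(b_size, delta, cost_per_sample):
    """Eq. 6.3-style total cost: each iteration retrains on all of B, and B
    grows by delta per iteration, so the accumulated cost is quadratic in |B|."""
    return cost_per_sample * b_size * (b_size / delta + 1)

# Example: predict the error at |B| = 16,000 from earlier (illustrative) measurements.
sizes, errors = [1000, 2000, 4000, 8000], [0.12, 0.08, 0.05, 0.03]
alpha, gamma, k = fit_error_model(sizes, errors)
print(truncated_power_law(16000, alpha, gamma, k))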
In §6.4, we validate this quadratic dependence of training cost on |B| for different δ. MCAL does not depend on the specific form of the training cost, and so can accommodate other cost models (e.g., if the number of epochs is proportional to |B|, in which case C_t(D(B)) can have a cubic dependency on |B|).

While ε(S_θ(D(B))) also depends on δ in theory, in practice this dependence is insignificant relative to C_t(D(B)). To illustrate this, Figure 6.3 depicts values of ε(S_θ(D(B))) for various θ for CIFAR-10 using RESNET18, across a range of values of δ. This variation is less than 1%, especially for smaller values of θ.

6.3 The MCAL Algorithm

MCAL is described in Algorithm 2. It takes as input an active learning metric M(.), the data set X, the specific classifier D (e.g., RESNET18), and the parametric models for training cost (e.g., Equation 6.3) and for error rate as a function of training size (e.g., the truncated power law in Equation 6.2).

Algorithm 2 MCAL: Minimum-Cost Active Labeling
Input: An active learning metric M(.), a classifier D, a set of unlabeled images X, a parametric model to predict C_t(D(B)), and a parametric model to predict ε(S_θ(D(B)))
 1: Obtain human-generated labels for a randomly sampled test set T ⊂ X, and let X = X \ T.
 2: Obtain human-generated labels for randomly sampled data items B_0 ⊂ X, |B_0| = δ_0.
 3: Train D(B_0) and test the classifier over T.
 4: Record the training cost C_t(D(B_0)).
 5: for θ ∈ {θ_min, ..., θ_max} do
 6:   Estimate ε_T(S_θ(D(B_0))) using T and M(.)
 7: end for
 8: Initialization: C*_new = 0, C*_old = 0, δ = δ_0, i = 1, B_opt = B_0
 9: while C* < C(B_opt + δ) do
10:   Obtain human-generated labels for b_i ⊂ X \ B_{i−1}, comprising the |b_i| = δ most informative samples ranked using M(.)
11:   B_i = B_{i−1} ∪ b_i
12:   Train D(B_i) and test the classifier over T
13:   Record C_t(D(B_i)) and estimate C_t(D(B)) using ⟨|B_k|, C_t(D(B_k))⟩, ∀k
14:   for θ ∈ [θ_min, ..., θ_max] do
15:     Estimate ε_T(S_θ(D(B_i))) using T and M(.)
16:     Estimate and update the error model parameters (α_θ, γ_θ, k_θ from Equation 6.2) for ε(S_θ(D(B))) using ⟨|B_k|, ε_T(S_θ(D(B_k)))⟩, ∀k
17:   end for
18:   Find C*_new = C*, B_opt as described in Section 6.2
19:   if |C*_new − C*_old| / |C*_new| < Δ then
20:     δ_opt = argmin_N (|B_opt| − |B_i|)/N, s.t. C < C*(1 + β)
21:     δ = δ_opt
22:   end if
23:   C*_old = C*_new
24:   i = i + 1
25: end while
26: Use D(B_opt) and M(.) to find S*(D, B_opt)
27: Annotate the residual X \ B \ S*(D, B_opt)

The algorithm operates in two phases. In the first phase, it uses estimates obtained during active learning to learn the parameters of the truncated power-law model for various θ, and the cost measurements to learn the parameters of the training cost model. In the second phase, having the models, it can estimate and refine the S*(D,B) and B that produce the optimal cost C*. It can also estimate the optimal batch size δ_opt for this cost. It terminates when adding more samples to B is counterproductive. It then trains a classifier to label S*(D,B) and uses human labels for the remaining unlabeled samples.

The first four steps perform initialization. The first step (Line 1) randomly selects a test set T (|T| = 5% of |X|) and obtains human labels to test and measure the performance of D. Line 2 initializes B = B_0 by randomly selecting δ_0 (1% of X in our implementation) samples from X and obtaining human labels for these. Line 3 trains D using B_0 and uses T to estimate the generalization errors ε_T(S_θ(D(B_0))) for various values of θ ∈ (0, 1) (we chose θ in increments of 0.05: {0.05, 0.1, ..., 1}), using T and M(.).
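One way to read the per-θ estimates in Lines 5–7 (and again in Lines 14–17) is: run D(B) on the human-labeled test set T, keep the θ fraction of T that the classifier is most confident about under M(.), and measure its error on just that subset. The sketch below, which complements the earlier margin-ranking sketch, spells out this reading; the reading itself is an assumption, and the function names and the 0.05 grid are illustrative.

import numpy as np

def error_on_confident_fraction(probs_T, labels_T, theta):
    """Estimate eps_T(S_theta(D(B))): the error of D(B) on the theta fraction
    of the labeled test set T that it is most confident about (largest margin)."""
    top2 = np.sort(probs_T, axis=1)[:, -2:]
    margins = top2[:, 1] - top2[:, 0]
    keep = np.argsort(margins)[::-1][:max(1, int(theta * len(labels_T)))]
    preds = probs_T[keep].argmax(axis=1)
    return float(np.mean(preds != labels_T[keep]))

def per_theta_errors(probs_T, labels_T, thetas=np.arange(0.05, 1.0001, 0.05)):
    """One error estimate per theta in {0.05, 0.10, ..., 1.0}."""
    return {round(float(t), 2): error_on_confident_fraction(probs_T, labels_T, t)
            for t in thetas}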
After these initial steps, the main loop of min-cost active labeling begins (Line 9). In each step, as with standard active learning, MCAL selects the δ most informative samples according to M(.), obtains their labels and adds them to B (Line 11), then trains D on them (Line 12). The primary difference with active learning is that MCAL, in every iteration, estimates the model parameters for C_t(D(B)) (Line 13) and ε(S_θ(D(B))) (Line 16), then uses these to estimate C* and B_opt (Line 18). At the end of this step, MCAL can answer the question: "How many human-generated labels must be obtained into B to train D, in order to minimize C?" (§6.2).

The estimated model parameters for C_t(D(B)) and ε(S_θ(D(B))) may not be stable in the first few iterations (in our experience, about 3 when using the truncated power law), given limited data for the fit. To determine if the model parameters are stable, MCAL compares the estimated C* (in dollars) obtained from the previous iteration to the current one. If the difference is small (≤ 5%, in our implementation), the model is considered to be stable for use (Line 19).

After the predictive models have stabilized, we can rely on the estimates of B_opt, the final number of labels to be obtained into B, and consequently the remaining number of samples needed, B_opt \ B_i. At this point MCAL adjusts δ (Line 21) to reduce the training cost when it is possible to do so. MCAL can do this because it targets relatively high accuracy for D(B). For these high targets, it is important to continue to improve model parameter estimates (e.g., the parameters for the truncated power law), and active learning can help achieve this. Figure 6.4 shows how the fit to the truncated power law improves as more points are added. Finally, unlike active learning, MCAL adapts δ to achieve lower training cost. Figure 6.3 shows that, for most values of θ, the choice of δ does not affect classifier accuracy significantly. While the choice of active learning batch size does not affect the final classifier accuracy, it can significantly impact training cost (§6.2).

This loop terminates when the total cost obtained in a step is higher than that obtained in the previous step. At this point, MCAL simply trains the classifier using the last value of B_opt, then human-labels any remaining unlabeled samples (Lines 26, 27).

Extending MCAL to selecting the cheapest DNN architecture. In what we have described so far, we have assumed that min-cost labeling is given a candidate DNN architecture for D. However, it is trivial to extend MCAL to the case when the data set curator supplies a small number m (typically 2-4) of candidate classifier architectures {D_1, D_2, ...}. In this case, MCAL can generate separate prediction models for each of the classifiers and pick the one that minimizes C once the model parameters have stabilized. This does not inflate the cost significantly, since the training costs up to this point are incurred over small sizes of B.

Accommodating a budget constraint. Instead of a constraint on labeling error, MCAL can be modified to accommodate other constraints, such as a limited total budget. Its algorithm can search for the lowest error satisfying a given budget constraint. Specifically, with the same models for estimating network error and total cost (§6.2), instead of searching for the S*(D,B) with minimum total cost while the error constraint is satisfied (Line 18 of Algorithm 2), we can search for the minimum estimated error while the total cost is within budget.
The difference is: in the former case, we can always resort to human labeling when the error constraint cannot be satisfied; in the latter case, when the budget is too low, one can only sacrifice overall accuracy by stopping the training process and taking the model's output.

6.4 Evaluation

In this section, we evaluate the performance of MCAL over three popular classification data sets: Fashion-MNIST [233], CIFAR-10 [145] and CIFAR-100 [145]. We chose these three data sets to demonstrate that MCAL can work effectively across different difficulty levels, Fashion-MNIST being the "easiest" and CIFAR-100 the "hardest" among the three. We use three popular DNN architectures: RESNET50, RESNET18 [107], and CNN18 (RESNET18 without the skip connections). These architectures span the range of architectural complexity with differing training costs and achievable accuracy. CNN18 has a very low training cost but yields lower accuracy, while RESNET50 has a very high training cost and typically yields the highest accuracy. This allows us to demonstrate how MCAL can effectively select the most cost-efficient architecture among available choices. We also use two different human labeling services: Amazon labeling services [3] at 0.04 USD/image and Satyam [183] at 0.003 USD/image. Satyam labels images 10× cheaper by leveraging untrained inexpensive workers. This allows us to demonstrate how MCAL adapts to changing human labeling costs. Finally, model fitting and inference costs in our evaluation are negligible compared to training and human labeling costs.

MCAL uses a popular active learning metric (margin [199, 124, 202]) to rank and select samples for all our results in this section. At each active learning iteration, it trains the model over 200 epochs with a 10× learning rate reduction at 80, 120, 160, and 180 epochs, and a mini-batch size of 256 samples [13]. We have left it to future work to incorporate hyper-parameter search into the optimization.

Table 6.1: Training Costs (USD/Image) for various DNNs.

  DNN    CNN-18⁴    RESNET18   RESNET50
  Cost   0.00007    0.0003     0.0009

Table 6.1 depicts the training cost per image for three different DNNs trained on CIFAR-10. We use a virtual machine with 4 NVIDIA K80 GPUs at 3.6 USD/hr and maintain over 90% utilization on each GPU during the training process. In all experiments, unless otherwise specified, the overall labeling accuracy requirement ε was set at 5%.

6.4.1 Reduction in Labeling Costs using MCAL

MCAL automatically makes three key decisions to minimize overall labeling cost. It a) selects the subset of images that the classifier should be trained on (B_opt), b) adapts δ across active learning iterations to keep training costs in check, and c) selects the best DNN architecture from among a set of candidates (CNN18, RESNET18 and RESNET50). In this section, we demonstrate that MCAL provides significant overall cost benefits at the expense of an ε (5%) degradation in label quality. Further, it even outperforms naive active learning assisted by an oracle that helps choose the optimal fixed δ value.

Figure 6.5 depicts the total labeling costs incurred when using Amazon labeling services for three different schemes: i) when humans are used to label the entire data set using Amazon labeling services, ii) MCAL for ε = 5%, and iii) active learning with an oracle to choose δ for the DNN architectures CNN18, RESNET18 and RESNET50.
[Figure 6.5: Total cost of labeling for various data sets (Fashion, CIFAR-10, CIFAR-100), for i) human labeling, ii) MCAL, and iii) oracle-assisted AL for various DNN architectures.]

Table 6.2 lists the numerical values of the costs (in $) for human labeling and MCAL. To calculate the total labeling error, we compare the machine labeling results on S*(D, B_opt) and human labeling results on X \ S*(D, B_opt) against the groundtruth. The human labeling costs are calculated based on the prices of Amazon labeling services [3] and Satyam [183].

Table 6.2: Summary of Results.

  Data Set    Labeling   |B|/|X|   |S|/|X|   DNN        Labeling   Human Labeling   MCAL Labeling   Savings
              Service                        Selected   Error      Cost ($)         Cost ($)
  Fashion     Amazon     6.1%      85.0%     RESNET18   4.0%       2800             400             86%
  Fashion     Satyam     8.4%      85.0%     RESNET18   4.0%       210              29              86%
  CIFAR-10    Amazon     22.2%     65.0%     RESNET18   2.4%       2400             792             67%
  CIFAR-10    Satyam     27.0%     65.0%     RESNET18   2.4%       180              63              65%
  CIFAR-100   Amazon     32.0%     10.0%     RESNET18   0.4%       2400             1698            29%
  CIFAR-100   Satyam     57.6%     20.0%     RESNET18   1%         180              139             23%

Cost Saving Compared to Human Labeling. From Figure 6.5 and Table 6.2, MCAL provides an overall cost saving of 86%, 67% and 30% for Fashion, CIFAR-10 and CIFAR-100 respectively. As expected, the savings depend on the difficulty of the classification task: the "harder" the dataset, the lower the savings. Table 6.2 also shows the number of samples in B used to train D, as well as the number of samples |S| labeled using D. For Fashion, MCAL labels only 6.1% of the data to train the classifier and uses it to label 85% of the data set. For CIFAR-10, it trains using 22% of the data set and labels about 65% of the data using the classifier. CIFAR-100 requires more data to train the classifier to a high accuracy, and so MCAL is able to label only 10% of the data using the classifier. Table 6.2 shows that MCAL, for each data set and labeling service, is able to achieve ε < 5%.

[Figure 6.6: Performance of MCAL compared to active learning with different batch sizes δ on Fashion using Amazon labeling.]
[Figure 6.7: Performance of MCAL compared to active learning with different batch sizes δ on CIFAR-10 using Amazon labeling.]
[Figure 6.8: Performance of MCAL compared to active learning with different batch sizes δ on CIFAR-100 using Amazon labeling.]
[Figure 6.9: AL training cost (in $) as a function of batch size (δ) for the Fashion data set using CNN18, RESNET18 and RESNET50.]
[Figure 6.10: AL training cost (in $) as a function of batch size (δ) for the CIFAR-10 data set using CNN18, RESNET18 and RESNET50.]
[Figure 6.11: AL training cost (in $) as a function of batch size (δ) for the CIFAR-100 data set using CNN18, RESNET18 and RESNET50.]

Cost Savings from Naive Active Learning.
In the absence of MCAL, how much cost savings would one obtain using naive active learning? As described in §6.2, the overall cost of AL depends on the batch size δ, the DNN architecture used for classification, as well as how "hard" it is to classify the data set. In order to examine these dependencies, we performed active learning using values of δ between 1% and 20% of |X| to label each data set using the different DNN architectures until the desired overall labeling error constraint was met. The contribution to overall labeling error is zero for human-annotated images and dictated by the classifier performance for machine-labeled images. Figures 6.6, 6.7 and 6.8 depict the dependence of overall labeling cost as a function of δ for each data set for each of the DNN architectures.

MCAL vs. AL. Figures 6.6, 6.7 and 6.8 show the overall cost (sum of human labeling and training cost) for each of the three data sets using different DNN architectures. The optimal value of δ is indicated by a circle in each of the figures. Further, the cost of labeling using humans only as well as the MCAL cost are indicated using dashed lines for reference. As shown in the figures, MCAL outperforms AL even with the optimal choice of δ. The choice of δ can significantly affect the overall cost, by up to 4-5× for hard data sets.

Training Cost. Figures 6.9, 6.10 and 6.11 depict the AL training costs for each of the data sets using three different DNN architectures. The training cost can vary significantly with δ. While for Fashion there is a 2× reduction in training costs, it is about 5× for the CIFAR-10 and CIFAR-100 data sets.

Dependence on δ. Figure 6.12 depicts the fraction of images that were machine labeled for the different data sets and DNN architectures as a function of δ (as a % of data set size). As seen in Figure 6.12, lower δ values allow AL to adapt at a finer granularity, resulting in a higher number of machine-labeled images. Increasing δ from 1% to 15% results in 10-15% fewer images being machine labeled. It is also evident from Figure 6.12 that Fashion is the "easiest" to train while CIFAR-100 is the "hardest".

Dependence on DNN Architecture. While a larger DNN architecture has the potential for higher accuracy, its training costs may be significantly higher and potentially offset savings due to machine labeling. As seen from Figures 6.6–6.12, even though RESNET50 is able to achieve a higher quality of prediction and consequently machine label a larger fraction of images, its high training cost offsets these gains. CNN18, on the other hand, incurs much lower training costs; however, its poor performance leads to few images being machine labeled. RESNET18 provides a better compromise, resulting in overall lower cost.

Dependence on Dataset. As seen from Figures 6.6–6.12, the training costs as well as potential gains from machine labeling depend on the difficulty of the data set. While using a small δ (2%) is beneficial for Fashion, the number is larger for CIFAR-10. For CIFAR-100, using active learning does not help at all, as the high training costs overwhelm the benefits from machine labeling.

[Figure 6.12: Fraction of machine labeled images (|S*(D(B))|/|X|) using naive AL for different fixed δ values, for all data set and DNN architecture combinations.]
Table 6.3: Oracle Assisted Active Learning Summary.

                         CNN-18                        RESNET18                      RESNET50
  Data Set    Labeling   δ_opt    Cost ($)   Savings   δ_opt    Cost ($)   Savings   δ_opt    Cost ($)   Savings
              Service
  Fashion     Amazon     1.7%     438.4      84.3%     6.7%     429.3      84.6%     3.3%     452.4      83.8%
  Fashion     Satyam     1.7%     49.9       76.3%     6.7%     40.8       75.3%     3.3%     63.9       69.6%
  CIFAR-10    Amazon     10%      1577.2     34.3%     6.7%     891.1      62.9%     10%      1128.7     53.0%
  CIFAR-10    Satyam     16.7%    235.8      -31.0%    6.7%     129.6      28.0%     10%      149.6      16.9%
  CIFAR-100   Amazon     13.3%    2520.8     -5%       16.7%    2184.6     9.0%      13.3%    2915.8     -21.5%
  CIFAR-100   Satyam     16.7%    407.0      -126.1%   16.7%    297.6      -65.3%    16.7%    805.8      -347.6%

Table 6.3 provides numerical values of the optimal choices for δ and the optimal cost savings obtained for various architectures. Comparing these values with Table 6.2, we conclude that MCAL outperforms naive AL across various choices of DNN architectures and δ by automatically picking the right architecture, adapting δ suitably, and selecting the right subset of images to be human labeled.

6.4.2 Effect of Cheaper Labeling Costs

Intuitively, with cheaper labeling costs MCAL should use more human labeling to train the classifier. This in turn should enable a larger fraction of the data to be labeled by the classifier. To validate this, we used the Satyam [183] labeling service, which incurs a 10× lower labeling cost compared to Amazon's labeling service. The effect of this reduction is most evident for CIFAR-100 in Table 6.2, as MCAL chooses to train the classifier using 57.6% of the data (instead of 32% using the Amazon labeling service). This increases the classifier's accuracy, allowing it to label 20% of the dataset (instead of 10% using the Amazon labeling service). For other datasets, the differences are less dramatic (they use 2.5-5% more data to train the classifier). For these datasets, there is no change in |S|/|X| because our resolution in this dimension is limited to 5%, since we change θ in increments of 5%.

[Figure 6.13: Performance of MCAL compared to active learning with different batch sizes δ on Fashion using Satyam labeling.]
[Figure 6.14: Performance of MCAL compared to active learning with different batch sizes δ on CIFAR-10 using Satyam labeling.]
[Figure 6.15: Performance of MCAL compared to active learning with different batch sizes δ on CIFAR-100 using Satyam labeling.]

Figures 6.13, 6.14, and 6.15 depict the effect of using various choices of δ on overall cost for various classifiers and data sets using Satyam as the labeling service. As seen in these figures, the lower labeling cost alters the tradeoff curves. The figures also depict the corresponding MCAL cost as well as the human labeling cost for reference. The numerical values of the optimal δ as well as the corresponding cost savings are provided in Table 6.3. As seen from these results, MCAL achieves a lower overall cost compared to all these possible choices in this case as well.

6.4.3 Gains from Active Learning

In this section we ask the question, "does active learning in fact provide benefit in MCAL?
What if we used random sampling at each iteration in MCAL instead of selecting samples using active learning?" We repeat our experiments using MCAL, but without using the active learning metric to select the samples. For this experiment as well, the overall labeling error requirement was set as ε < 5%.

[Figure 6.16: Gains due to the use of active learning in MCAL using Amazon labeling for ε = 0.05.]
[Figure 6.17: Gains due to the use of active learning in MCAL using Satyam labeling for ε = 0.05.]
[Figure 6.18: Cost savings obtained by relaxing the quality requirement from ε = 0.05 to ε = 0.1 using Amazon labeling.]
[Figure 6.19: Cost savings obtained by relaxing the quality requirement from ε = 0.05 to ε = 0.1 using Satyam labeling.]

Figures 6.16 and 6.17 depict the overall labeling cost with and without using AL for the three data sets using Amazon and Satyam respectively. The percentage cost gains are depicted in brackets. While Fashion and CIFAR-10 show a gain of about 20% for both data sets, the gains are low in the case of CIFAR-100. This is because most of the images in that data set were labeled by humans, and active learning did not have an opportunity to improve significantly. The gains are higher with the Satyam service, since training costs are relatively higher in that cost model: the gains from active learning were 25-31% of Fashion's and CIFAR-10's costs, and even CIFAR-100 benefited from this.

6.4.4 Effect of Relaxing Accuracy Requirement

In this section we examine the quality-cost tradeoff by relaxing the accuracy target from 95% to 90%, to quantify its impact on additional cost savings. Figures 6.18 and 6.19 depict the labeling costs due to this reduction in accuracy requirement for labeling each of the data sets on both the Amazon and Satyam labeling services. Fashion achieves 30% cost reductions by reducing the accuracy target by 5%; many more images are labeled by the classifier. CIFAR-10 and CIFAR-100 also show 10-15% gains.

Table 6.4: Values of optimal choices discovered by MCAL as a fraction of |X| for ε = 10%.

  Data Set    |B|/|X|   |S|/|X|   DNN Selected   Labeling Accuracy
  Fashion     4.4%      90.0%     RESNET18       91.9%
  CIFAR-10    25.9%     75.0%     RESNET18       94.7%
  CIFAR-100   64.0%     25.0%     RESNET18       98.4%

Table 6.4 depicts the fraction of images that were machine-labeled by the classifier (|S|/|X|) and the number of samples used to train it (|B|/|X|) for each of the data sets. As seen by comparing with Table 6.2, for Fashion it predicts more images using a smaller number of training images. For CIFAR-10 and CIFAR-100, it makes a different decision – to use more training images and increase the classifier accuracy to enable more images to be labeled by the classifier. Further, RESNET18 continues to be the optimal architecture for all three data sets. As seen from Table 6.4, MCAL ensures the accuracy target of 90% for all the data sets.
Table 6.5: Cost savings with respect to human labeling for ε = 10%.

  Data Set    Labeling Service   Cost Savings w.r.t. Human Labeling
  Fashion     Amazon             88.9%
  Fashion     Satyam             86.9%
  CIFAR-10    Amazon             70.5%
  CIFAR-10    Satyam             68.9%
  CIFAR-100   Amazon             39.1%
  CIFAR-100   Satyam             34.3%

Table 6.5 captures the savings with respect to human labeling while using an accuracy guarantee of 90%. As seen by comparing against the corresponding values in Table 6.2, as expected, the cost savings increase by relaxing the accuracy guarantee. However, the savings do not increase dramatically, indicating that most of the cost gain comes from reducing the accuracy requirement from 100% to 95%.

6.4.5 Efficacy of Truncated Power Law

Figures 6.20, 6.21, and 6.22 show the power-law and truncated power-law fits for CIFAR-100 on three different architectures. All results are for θ = 50%. In all cases, a truncated power law fits better, and, while using more points gives higher accuracy and better prediction, even a few samples are sufficient to get a stable and precise prediction.

[Figure 6.20: Power-law and truncated power-law fits on CIFAR-100 using CNN18.]
[Figure 6.21: Power-law and truncated power-law fits on CIFAR-100 using RESNET18.]
[Figure 6.22: Power-law and truncated power-law fits on CIFAR-100 using RESNET50.]

6.5 Related Work

Active learning [202] aims to reduce labeling cost in training a model by iteratively selecting the most informative samples for labeling. Early work focused on designing metrics for sample selection based on margin sampling [199, 124, 202], region-based sampling [62], max entropy [65] and least confidence [64]. Recent work has focused on developing metrics tailored to specific tasks, such as classification [201], detection [52, 131], and segmentation [138, 237], or for specialized settings such as when costs depend upon the label [144], or for a hierarchy of labels [114]. Other work in this area has explored variants of the problem of sample selection: leveraging model assertions [130], model structure [229], cheaper proxy models [60], using model ensembles to improve sampling efficacy [47], incorporating both uncertainty and diversity [44], or using self-supervised mining of samples for active learning to avoid data set skew [228]. MCAL uses active learning for training the classifier D (§6.1) and can accommodate multiple sample selection metrics M(.).

More recent work has explored techniques to learn active learning strategies, using reinforcement learning [139] or one-shot meta-learning [61]. However, with the exception of [203], which designs a lightweight model to reduce the iterative training cost incurred in active learning, we have not found any work that takes training cost into account when developing an active learning strategy. Because active learning can incur significant training cost, MCAL includes training costs in its formulation.

Training cost figures prominently in the literature on hyper-parameter tuning, especially for architecture search. Prior work has attempted to predict learning curves to prune hyper-parameter search [137], develop effective search strategies within a given budget [155], or build a model to characterize maximum achievable accuracy on a given dataset to enable fast triage during architecture search [121], all with the goal of reducing training cost.
The literature on active learning has recognized the high cost of training, especially for large datasets, and has explored using larger batch sizes to reduce the number of training iterations [201] or using cheaper models [60] to select samples. MCAL solves a different problem (dataset labeling) and explicitly incorporates training cost in reasoning about which samples to select.

Also relevant is the empirical work that has observed a power-law relationship between generalization error and training set size [46, 111, 127, 197, 84, 59] across a wide variety of tasks and models. MCAL builds upon this observation, and learns the parameters of a truncated power-law model with as few samples as possible during active learning.

Chapter 7
Conclusion

From sensing to control, perception is the first component of the processing pipeline and key to reliable autonomous driving systems. This dissertation presents networked cooperative perception, which breaks the traditional boundaries of single-vehicle-based solutions. The research presented in this dissertation demonstrates the benefits of cooperative perception to autonomous driving via end-to-end evaluations. Specifically, it presents algorithms, systems and architectures that efficiently and accurately augment vehicular reality using shared sensors, coordinate and optimally schedule vehicle sensor sharing at scale, automate massive high-quality visual sensor annotation collection for machine perception training, and develop a human-machine labeling framework to reduce total cost.

The contributions of this dissertation can also be summarized from the following angles:

• From the networking perspective, this dissertation addresses key challenges in enabling clustering and coordination in a highly dynamic vehicular network where throughput demands are high and latency requirements are stringent.

• From the vehicle sensing perspective, the dissertation overcomes the barrier of occlusion and sensing range (one of the significantly challenging problems being tackled in the computer vision community) and provides a new, extended perspective for perception.

• From the autonomous driving perspective, the work presented shifts the design paradigm from single-vehicle-based human behavior imitation to multi-vehicle cooperative perception, paving the way to a collaborative driving future of edge-assisted robot swarms.

• From the perspective of machine learning for vision, the dissertation democratizes groundtruth acquisition, providing tools and access for building custom datasets for researchers and practitioners at a significantly lowered cost.

Broader Impact: The research presented in this dissertation has generated a broader societal impact. Cooperative perception was developed in collaboration with General Motors. The technology has been transferred, with two global patents granted [185, 38]. In academia, the work has been followed up from various perspectives, such as edge-assisted vehicle cooperative perception [57, 43, 66, 219], directional vehicular connectivity using millimeter wave V2X [230], vehicular fog computing [244, 92], and vehicular networking security [90] and privacy [234]. In addition, the research also impacted the augmented reality community and led to the exploration of multi-user augmented reality systems [154, 175, 241, 41, 105]. Autonomous driving vehicles are generating terabytes of data in field tests to further improve the reliability and lower the disengagement rate.
Data labeling is often prohibitively expensive, which can limit the generalization of machine learning algorithms. More broadly, it limits transfer learning from reaching a wider range of new settings and applications. By reducing the overall dataset building cost, Satyam and Active Labeling ease the access to groundtruth acquisition. These systems were developed and deployed in collaboration with Microsoft, where three teams, including Azure ML and internal incubation, have put significant effort behind pushing the system to become publicly available as an online service. We see research opportunities in applying Satyam and Active Labeling to production systems for building novel datasets and inventing capabilities not previously demonstrated.

7.1 Future Directions

This dissertation explores the first few key steps towards enabling cooperative perception. The architecture design, the algorithm frameworks, and the specific techniques generalize along several dimensions. First, the cooperative perception architecture is not dependent on sensing modalities. This dissertation shows extended vision built from both stereo cameras as well as LiDARs. In fact, the architecture embraces sensor fusion across different sensing modalities by providing common semantics for sharing and the interpretation of different data representations. Second, the techniques developed to identify and extract relevant objects to be exchanged extend beyond vehicles and pedestrians. They apply to any form of object, with or without communication capability, and are not dependent on any pre-defined vocabulary. For example, traffic cones at road construction sites or obstacles left on the highway are extremely critical to share ahead of time so that nearby vehicles are aware of them. Lastly, the cooperative perception scheme is agnostic to the specific machine vision tasks for which the shared sensor data is used. To support this, the Satyam annotation platform is designed to accommodate all popular machine vision tasks, such as detection, tracking, and segmentation.

The collaborative perspective extends beyond the sensing and perception problems addressed in this dissertation. The fact that vehicles can "see" the world from several vantage points inspires many brand new approaches to several aspects of autonomous driving (e.g., robotics, computer vision, machine learning, transportation infrastructure, security, etc.). 1) Standalone Agents vs. Collaborative Driving: in addition to providing extended visibility of the environment, sensor sharing also enables interaction among the autonomous agents themselves based on a consensus view of the environment, such that the agents can coordinate and collaborate with each other in trajectory planning to avoid collisions and improve transportation efficiency. 2) Infrastructure for Human Drivers vs. Autonomous Agents: current road infrastructure is designed for human drivers. In order to operate safely, autonomous agents need to handle unnecessary recognition overhead, from dashed vs. solid yellow lines to partial occlusions and unpredictable driving intentions. Cooperative perception sheds light on what an infrastructure for autonomous agents could be in the future, where each vehicle enjoys a fully transparent perspective without occlusion, dynamics are detected and shared via V2V communication, and the unnecessary passive recognition of visual traffic signs and regulators designed for humans can be replaced by information actively delivered by the infrastructure in other modalities.
3) Swarm Security: cooperative perception also brings new challenges to security. In addition to preventing an adversary from taking control of a particular vehicle, enabling trust among vehicle communications is crucial to realizing cooperative perception.

Fundamentally, this dissertation lays the foundation of cooperative perception, which provides an alternative to designing autonomous driving systems from the human driver's perspective. The dissertation contributes to the ongoing joint effort towards more robust and efficient autonomous driving and intelligent transportation systems.

Reference List

[1] 5G Automotive Association. https://5gaa.org/.
[2] Acm mobisys 2018.
[3] Amazon sagemaker data labeling. https://docs.aws.amazon.com/sagemaker/latest/dg/sms-data-labeling.html.
[4] Carla autonomous driving challenge.
[5] Cloud enhanced open software defined mobile wireless testbed for city-scale deployment (cosmos).
[6] Connected Ann Arbor. http://www.mtc.umich.edu/deployments/connected-ann-arbor.
[7] Data annotation: The billion dollar business behind ai breakthroughs.
[8] Figure eight data annotation platform. https://www.figure-eight.com/.
[9] The fourseasons dataset.
[10] Get Under the Hood of Parker, Our Newest SOC for Autonomous Vehicles. https://blogs.nvidia.com/blog/2016/08/22/parker-for-self-driving-cars/.
[11] Google ai platform data labeling service. https://cloud.google.com/data-labeling/docs/.
[12] ismartways performance measurement.
[13] Keras Documentation: Training Resnet on CIFAR-10. https://keras.io/examples/cifar10_resnet/.
[14] LTE-Advanced Is the Real 4G. http://spectrum.ieee.org/telecom/standards/lte-advanced-is-the-real-4g.
[15] LTE Direct Overview. http://s3.amazonaws.com/sdieee/205-LTE+Direct+IEEE+VTC+San+Diego.pdf.
[16] Microsoft ignite 2019.
[17] NVidia Drive PX 2. http://www.nvidia.com/object/drive-px.html.
[18] The MobileBroadband LTE-Advanced Standard. http://www.3gpp.org/technologies/keywords-acronyms/97-lte-advanced.
[19] The MobileBroadband LTE Standard. http://www.3gpp.org/technologies/keywords-acronyms/98-lte.
[20] Two Second Rule. https://en.wikipedia.org/wiki/Two-second_rule.
[21] U.S. details plans for car-to-car safety communications. http://www.autonews.com/article/20140818/OEM11/140819888/u.s.-details-plans-for-car-to-car-safety-communications.
[22] Vehicle Stopping Distance and Time. https://nacto.org/docs/usdg/vehicle_stopping_distance_and_time_upenn.pdf.
[23] Velodyne 64-beam lidar.
[24] Velodyne LiDAR HDL-64E Datasheet. http://velodynelidar.com/docs/datasheet/63-9194%20Rev-E_HDL-64E_S3_Spec%20Sheet_Web.pdf.
[25] ZED Stereo Camera Datasheet. https://www.stereolabs.com/zed/specs/.
[26] Ieee standard for information technology – local and metropolitan area networks – specific requirements – part 11: Wireless lan medium access control (mac) and physical layer (phy) specifications amendment 6: Wireless access in vehicular environments. IEEE Std 802.11p-2010 (Amendment to IEEE Std 802.11-2007 as amended by IEEE Std 802.11k-2008, IEEE Std 802.11r-2008, IEEE Std 802.11y-2008, IEEE Std 802.11n-2009, and IEEE Std 802.11w-2009), pages 1–51, July 2010.
[27] 18 awesome innovations in the new mercedes e-class, 2016.
[28] Tr 36.785, 2016, vehicle to vehicle (v2v) services based on lte sidelink; user equipment (ue) radio transmission and reception. Technical report, 3GPP, 2016.
[29] Tr 36.885, 2016, study on lte-based v2x services. Technical report, 3GPP, 2016.
[30] DARPA Grand Challenge 2007, 2018.
Abstract
Using advanced 3D sensors and sophisticated deep learning models, autonomous cars are already transforming how people commute daily. However, anticipating and managing corner cases remains a significant challenge to their further advancement. Human drivers, on the other hand, are extremely good at handling corner cases, so if autonomous vehicles are to be widely accepted, they must achieve human-level reliability.

In this dissertation, we show that wireless connectivity has the potential to fundamentally re-architect transportation systems beyond human-driver performance. The dissertation presents algorithms and systems for cooperative perception among networked vehicles and infrastructure sensors, which can substantially augment perception and driving capabilities. It makes several fundamental contributions. Augmented Vehicular Reality (AVR) is the first system in which vehicles exchange 3D sensor information to help each other see through obstacles and thereby make more reliable and efficient driving decisions. To scale cooperative perception to multiple autonomous vehicles, the dissertation presents AutoCast, a scalable cooperative perception and coordination framework that enables efficient exchange of sensor data and metadata in practical, dense traffic scenarios. To effectively leverage shared sensors to improve the robustness of perception and driving, autonomous cars also need large, high-quality datasets. Satyam is an open-source annotation platform that automates human annotation collection and applies quality-control techniques to deal with untrained workers, human errors, and spammers. Finally, the dissertation includes active labeling, a hybrid human-machine labeling framework that leverages machine labeling to significantly reduce total annotation cost. These annotations can be used to build novel, large-scale datasets tailored to new applications, such as training agents using extended vision.
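As a concrete illustration of the cooperative perception idea summarized above, the sketch below merges a peer vehicle's shared 3D points into the ego vehicle's coordinate frame using a known relative pose; in highly simplified form, this is the geometric core that lets an ego vehicle "see through" an obstacle using a peer's sensor data. The function name and the assumption that an accurate 4x4 relative transform is already available are illustrative simplifications; systems such as AVR and AutoCast must additionally solve relative localization, motion compensation, and bandwidth-efficient sharing.

    # Minimal sketch (assumptions noted above): fuse a peer's shared point cloud
    # into the ego frame, given a 4x4 homogeneous transform from peer to ego.
    import numpy as np

    def merge_point_clouds(ego_points, peer_points, T_peer_to_ego):
        """ego_points: (N, 3) and peer_points: (M, 3) arrays of x, y, z coordinates."""
        ones = np.ones((peer_points.shape[0], 1))
        peer_homogeneous = np.hstack([peer_points, ones])            # (M, 4)
        peer_in_ego = (T_peer_to_ego @ peer_homogeneous.T).T[:, :3]  # rotate + translate
        return np.vstack([ego_points, peer_in_ego])                  # fused (N+M, 3) cloud

    # Example: a peer 10 m ahead of the ego vehicle shares 5 points.
    T = np.eye(4)
    T[0, 3] = 10.0
    fused = merge_point_clouds(np.zeros((5, 3)), np.ones((5, 3)), T)  # shape (10, 3)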
Conceptually similar
Enabling virtual and augmented reality over dense wireless networks
Performant, scalable, and efficient deployment of network function virtualization
Relative positioning, network formation, and routing in robotic wireless networks
Towards building a live 3D digital twin of the world
Rate adaptation in networks of wireless sensors
Enabling massive distributed MIMO for small cell networks
Point-based representations for 3D perception and reconstruction
Reliable languages and systems for sensor networks
Efficient pipelines for vision-based context sensing
Making web transfers more efficient
Performance and incentive schemes for peer-to-peer systems
Cooperation in wireless networks with selfish users
Scaling-out traffic management in the cloud
High-performance distributed computing techniques for wireless IoT and connected vehicle systems
Efficient and accurate in-network processing for monitoring applications in wireless sensor networks
Personalized driver assistance systems based on driver/vehicle models
Remote exploration with robotic networks: queue-aware autonomy and collaborative localization
Modeling intermittently connected vehicular networks
Improving efficiency, privacy and robustness for crowd-sensing applications
Enabling efficient service enumeration through smart selection of measurements
Asset Metadata
Creator: Qiu, Hang (author)
Core Title: Networked cooperative perception: towards robust and efficient autonomous driving
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 01/20/2021
Defense Date: 11/25/2020
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: autonomous driving, cooperative perception, vehicular network, OAI-PMH Harvest
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Govindan, Ramesh (committee chair), Lim, Joseph (committee member), Psounis, Konstantinos (committee member)
Creator Email: hangqiu@usc.edu, hangqiu7@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-415444
Unique identifier: UC11667411
Identifier: etd-QiuHang-9236.pdf (filename), usctheses-c89-415444 (legacy record id)
Legacy Identifier: etd-QiuHang-9236.pdf
Dmrecord: 415444
Document Type: Dissertation
Rights: Qiu, Hang
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA