EFFICIENT CROWD-BASED VISUAL LEARNING FOR EDGE DEVICES

by

Giorgos Constantinou

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2021

Copyright 2021 Giorgos Constantinou

Dedication

Dedicated to my family and friends for their endless guidance and support.

Acknowledgements

I wish to thank my Ph.D. advisor and committee chair, Prof. Cyrus Shahabi, for the tremendous support throughout my graduate studies. He helped me find and focus on the research topic, define experiments, work on results, assess the effectiveness of new algorithms, present in a top-down approach, and give honest feedback to always strive for improvement. It was an honor to be a doctoral student under his supervision. Special thanks to Prof. C.-C. Jay Kuo and Prof. Bhaskar Krishnamachari for serving as my committee members and for their valuable feedback on this research work.

I am grateful to have worked on several projects, such as iWatch, MediaQ and Crosstown, at the Integrated Media Systems Center (IMSC), the Information Laboratory (InfoLab) and the Annenberg School for Communication and Journalism, under the supervision of Prof. Cyrus Shahabi, Dr. Seon Ho Kim and Prof. Gabriel Kahn. I owe a deep sense of gratitude to Dr. Seon Ho Kim for spending countless hours discussing research ideas, and to Prof. Gabriel Kahn for providing emotional and financial support throughout my research studies. At IMSC and InfoLab, I was lucky enough to collaborate with a group of talented people. I profusely thank Dr. Hien To, Dr. Mohammad Asghari, Dr. Ying Lu, Dr. Abdullah Alfarrarjeh, Dr. Gowri Sankar Ramachandran, Dr. Luan Tran, Dimitris Stripelis, Chrysovalantis Anastasiou, Haowen Lin, Jiao Sun, Mingxuan Yue, Arvin Hekmati, Ritesh Ahuja, Sina Shaham, Dr. Luciano Nocera, Prof. Yao-Yi Chiang and Lauren Whaley.

I would like to express my deep appreciation to my wife Maria, my parents Constantinos and Maria, my siblings Stefanos, Myrianni and Christiana, and my relatives and friends for their love and support during my academic career. I am especially thankful to my friends back in Cyprus, George Paschalis, Michalis Senekkis, Prokopis Prokopiou, Theophilos Phokas and Constantina Ioannou, for always believing in me and supporting me from thousands of miles away, and to the friends I met in Los Angeles: Dr. Chrysostomos Marasinou, Dr. Panayiotis Petousis and Panayiota Loizidou. Thank you all!

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Motivation
  1.2 Thesis Statement
  1.3 Thesis Outline

Chapter 2: Crowd-based Image Learning Framework using Edge Computing for Smart City Applications
  2.1 Proposed Framework
    2.1.1 Edge Devices
      2.1.1.1 Model download module
      2.1.1.2 Video frame extraction module
      2.1.1.3 Visual feature extraction module
      2.1.1.4 Inference module
      2.1.1.5 Inference quality control module
      2.1.1.6 Policy module
    2.1.2 ED-Server
      2.1.2.1 Model distribution module
      2.1.2.2 Training module
      2.1.2.3 Database
      2.1.2.4 Retrain module
      2.1.2.5 Model metadata
  2.2 Crowd-based Learning
    2.2.1 Image Quality
    2.2.2 Image Metadata
    2.2.3 Image Label
  2.3 Experiments
    2.3.1 Datasets
    2.3.2 Model Retraining
      2.3.2.1 Width Multiplier and Resolution vs. Accuracy
      2.3.2.2 Model Size vs. Accuracy
      2.3.2.3 Inference Time vs. Model
      2.3.2.4 Feature Size vs. Accuracy
      2.3.2.5 Location-based retraining

Chapter 3: Spatial Keyframe Extraction From Urban Videos to Enable Object Detection at the Edge
  3.1 Spatial Keyframe Extraction
    3.1.1 Preliminaries
    3.1.2 Proposed Algorithms
    3.1.3 Baselines for Comparison
  3.2 Experiments
    3.2.1 Performance
    3.2.2 Impact of Spatial Weight

Chapter 4: Towards Scalable and Efficient Client Selection for Federated Object Detection
  4.1 Federated Object Detection
    4.1.1 FL Server
    4.1.2 FL Clients
  4.2 Client Selection Methods
    4.2.1 Random (R)
    4.2.2 Random Label (RL)
    4.2.3 Random Samples (RS)
    4.2.4 Fair Label Entropy (FLE)
    4.2.5 Coverage Label Entropy (CLE)
  4.3 Experiments
    4.3.1 Datasets
      4.3.1.1 IID Data
      4.3.1.2 Non-IID Data
    4.3.2 Experimental Results

Chapter 5: Placement of DNN Models on Mobile Edge Devices for Effective Video Analysis
  5.1 Background
    5.1.1 Edge Devices
    5.1.2 Field-of-View
    5.1.3 DNN Model Classes
    5.1.4 DNN Model Versions
    5.1.5 Objects of Interest
    5.1.6 Coverage Grid
      5.1.6.1 Weighting Schemes
      5.1.6.2 Overlapping FOVs
  5.2 Model Placement
    5.2.1 MIQCP Formulation
      5.2.1.1 Inference Latency & Utility
      5.2.1.2 Coverage
      5.2.1.3 Utilization
      5.2.1.4 Objective
    5.2.2 Baseline Heuristics
      5.2.2.1 Random (R)
      5.2.2.2 Least Memory First (LMF)
      5.2.2.3 Most Accurate First (MAF)
      5.2.2.4 Most Popular First (MPF)
      5.2.2.5 Most Popular On Route First (MPORF)
      5.2.2.6 Most Important First (MIF)
      5.2.2.7 Most Profitable First (MPRF)
      5.2.2.8 Spatial Coverage (SC)
  5.3 Experiments
    5.3.1 Experimental Setup
    5.3.2 Results

Chapter 6: Related Work
  6.1 Related Work for Edge AI
  6.2 Related Work for Keyframe Extraction
  6.3 Related Work for FL client selection
  6.4 Related Work for Model Placement
    6.4.1 Deep Learning with Edge Computing
      6.4.1.1 Edge Node Inference
      6.4.1.2 Cross-Device Inference
      6.4.1.3 Edge Device Inference

Chapter 7: Conclusion and Future Work
  7.1 Summary
  7.2 Future Work

Bibliography

List of Tables

2.1 Object Detection Time on Various Platforms
4.1 Notation table.
4.2 Faster-RCNN, RetinaNet and EfficientDet rounds required to reach target mAP for Pascal VOC, KITTI and FedAI datasets while varying client local epoch E and batch size B.
5.1 Notations
5.2 Utilization cost parameters
5.3 Devices and Model Versions Profile
5.4 Devices and Model Versions Latency

List of Figures

2.1 The Design of the Proposed Framework
2.2 Accuracy for D_SC and D_CAL256 Datasets for Various Models
2.3 Model Size vs. Accuracy for D_SC and D_CAL256 Datasets for Various Models
2.4 Inference Time vs. Model for the D_SC Dataset
2.5 Average Feature Size vs. Accuracy for D_SC
2.6 Bandwidth vs. Accuracy
2.7 Location-based Feature Selection (M1 is trained on D_SC \ D_SC_DTLA, while M2 is trained on D_SC \ D_SC_DTLA_test)
3.1 Field-Of-View (FOV) model and Coverage MBR.
3.2 Avg. Extraction Time per Video.
3.3 Hit-Ratio performance.
4.1 The proposed federated object detection workflow diagram. A centralized server learns a global object detection model by asking distributed clients to locally train and upload model weight updates. These updates are aggregated to improve the current global model, and the process is repeated for multiple rounds until a target accuracy or time limit is reached. The red colored parts indicate the core components of the proposed federated object detection system to which we made major contributions.
4.2 FOV [17] model from training image and GPS, compass and LiDAR data of the KITTI dataset.
4.3 Toy example for selecting clients using CLE with 3 clients represented by square, circular and triangle shapes, 9 cells {c_1, c_2, ..., c_9} and an object detection task with 3 labels. The data label frequency of each cell is shown under each client. For example, in cell c_1 the circular user u_1 has label frequency n_{1,1} = 3, n_{1,2} = 2, n_{1,3} = 7.
4.4 IID partition for K = {50, 100} clients for Pascal VOC and KITTI and K = {20} clients for FEDAI. Each client receives an almost identical number of labels.
4.5 Non-IID partition for K = {50, 100} clients for Pascal VOC and KITTI and K = {20} clients for FEDAI. The label distribution varies across clients.
4.6 Faster-RCNN on Pascal VOC, KITTI and FEDAI.
4.7 RetinaNet on Pascal VOC, KITTI and FEDAI.
4.8 EfficientDet on Pascal VOC, KITTI and FEDAI.
4.9 Faster-RCNN on VOC and KITTI for varying number of clients K = {1, 50, 100}.
4.10 Client selection method comparison for Non-IID datasets. The proposed FLE and CLE outperform the conventional R.
4.11 Model performance on KITTI and Virtual KITTI 2 datasets.
5.1 Motivating example with two edge devices e_1 and e_2 and two models m_1 and m_2.
5.2 Field-Of-View (FOV) model and Coverage MBR.
5.3 Spatial, directional and user-assigned weight examples.
5.4 Estimating the utilization cost function, which penalizes under- and over-utilization.
5.5 Baseline placement algorithm
5.6 FOVs selection algorithm
5.7 R algorithm
5.8 Place algorithm
5.9 LMF algorithm
5.10 MPF algorithm
5.11 MPORF algorithm
5.12 MIF algorithm
5.13 SC algorithm
5.14 Candidate solution construction algorithm
5.15 Update candidate solutions algorithm
5.16 Device solution algorithm
5.17 The dataset consists of 165 trajectories and 190 objects of interest.
5.18 Weighting schemes impact on SC.
5.19 Cell size impact on SC.
5.20 Recall and Performance of Placement Heuristics.
5.21 Placement misses of Placement Heuristics.
5.22 Utility, Utilization Cost and Coverage of heuristics.
6.1 DNN inference architectures with Edge Computing.

Abstract

With the availability of massive amounts of visual media covering large geographical regions, several applications have emerged, including classifying the street cleanliness level and detecting forest fires or road hazards. Such applications share similar characteristics, as they need to 1) detect specific objects or events (what), 2) associate the detected object with a location (where), and 3) know the time that the event happened (when). Advancements in image-based machine learning (ML) benefit these applications, as they can automate the detection of objects of interest. Along with the edge computing (EC) paradigm, we are entering the Edge AI era, where the processing cost of ML is offloaded to the edge devices where the data are generated, hence reducing latency, communication cost and the risk of privacy leaks. Moreover, at acquisition time, sensors on the edge devices (e.g., GPS, gyroscope) enrich the collected data with other metadata. However, a shortcoming of existing approaches is that they rely on pre-trained "static" models. Nonetheless, crowdsourced data at diverse locations can be leveraged to iteratively improve the effectiveness and efficiency of a model. We refer to the aforementioned strategy as "spatial crowd-based visual learning".

To support this class of applications, we design a Crowd-based Visual Learning Framework that integrates ML, crowdsourcing, and EC, allowing edge devices with diverse resource capabilities to perform machine learning. The framework highlights the importance of maintaining versions of variable-complexity models that can support heterogeneous edge devices with uncertain resource capacities at different geographical regions (e.g., the model owner may not be the edge device owner, which is especially true for crowd-based and incentive-driven smart city applications).

However, because of the high video frame rate, edge devices are unable to process deep learning models at the same speed. Any approach that consecutively feeds frames to the model compromises both the quality (by missing important frames) and the efficiency (by processing redundantly similar frames) of video analysis. Focusing on outdoor urban videos, we utilize the spatial metadata of frames to select an optimal subset of frames that maximizes the coverage area of the footage.

We further extend the idea of exploiting the spatial coverage of the collected data to the Federated Learning (FL) domain. Federated Learning provides a promising solution to learn a model from decentralized data. Despite the advances in FL, the diversity of the regions in which clients operate and the Non-IID nature of the crowdsourced datasets reduce the accuracy of models significantly. To address this problem, we introduce a novel FL object detection system to efficiently train models with heterogeneous client datasets, designing lightweight client selection methods to learn object detection models faster. Our method leverages the metadata of the training images (e.g., location, direction, depth) to select clients which maximize the coverage of diverse geographical regions.
The pervasive deployment of IoT devices, along with the advancements in Deep Neural Network (DNN) models, requires novel solutions to decide how to place models on edge devices effectively. Current approaches require the model developer to make the placement decision by manually assigning models to edge devices. Models deployed on moving IoT devices (e.g., attached to vehicles) operating in a large geographical region pose new challenges, because models can have varying effectiveness depending on the location. To tackle these challenges, we mathematically formulate the placement problem, propose a location-aware algorithm to place models on edge devices, and quantify the effectiveness of our techniques.

Chapter 1
Introduction

1.1 Motivation

The recent advancements in processing power and the explosive growth of IoT devices (20.4 billion by 2020 [50]) enabled the continuous development of Edge Computing (EC) systems. The EC paradigm focuses on processing information close to the edge devices, where the data are generated, to reduce long-distance traffic and latency. Shifting computation to the edge has several benefits: a) applications do not suffer from latency and communication bandwidth restrictions because they are able to process and store data locally in real time, b) it reduces the costs of processing the collected data at a cloud-based or centralized server, and c) it provides a privacy mechanism so that raw sensitive data (e.g., fingerprints) do not need to be directly shared outside the device.

Additionally, along with the broad deployment of EC, the hardware improvements of edge devices and the advancements of Deep Neural Networks (DNNs) made it possible to analyze the visual data captured at the device in real time, a setting also known as Edge AI. The combination of EC and ML has enabled many applications that automate classification or object detection tasks, such as road damage detection [91, 10] to improve the infrastructure, street cleanliness monitoring [5] to prioritize sanitation efforts, surface detection for graffiti removal [130, 9] to improve quality of life and reduce gang crime, and bicycle/pedestrian counters [121, 75, 116] to improve transportation, to name a few.

Such image-based ML applications share a common lifecycle: 1) collect training data, 2) train models, 3) deploy the models to the edge devices, 4) run model inference at the edge device, 5) update models with new data, and 6) redeploy models. However, edge devices come in different shapes and sizes, and their processing performance heavily depends on their hardware resources; some are equipped with powerful GPUs, such as the Nvidia Jetson TX2, while others are less capable, such as the Raspberry Pis.

However, there exist challenges. First, the video frame rate at the edge device is larger than the DNN inference rate. Blindly feeding newly captured video frames to the DNN as soon as the processing of an older frame is completed wastes resources and can miss important visual content, because it relies on the processing speed of the device. Second, systems with thousands of participating edge devices can generate massive amounts of visual data; hence, they require new approaches to prioritize the acquisition process for crowdsourced training data. Third, it is impossible for a human operator to manually deploy models to hundreds of devices. An Edge AI solution could entail hundreds of devices with various resource capabilities and different versions of DNN models, hence rendering manual placement ineffective.
Additionally, the placement of models on mobile IoT devices (e.g., attached to vehicles) in a large geographical region poses new challenges, because a specific model can be more effective (i.e., detect objects of interest) in a particular region compared to alternative models. The problem of model placement becomes more challenging if multiple DNN models can be deployed on the same device (co-placement), which can negatively affect the inference latency and resource utilization in exchange for multi-model region coverage. Effective model placement requires considering the resource capacity of edge devices while maximizing their coverage area, which affects the inference latency and increases the hit-ratio of detecting objects, respectively.

Due to the ubiquity of edge devices with cameras and complementary sensors, such as GPS and gyroscope, the generated visual data and inference results can be associated with rich geospatial metadata at a very fine level (i.e., per video frame [17]). Such metadata not only improve the practicality of the aforementioned applications, but can also be exploited to design more efficient and effective algorithms.

1.2 Thesis Statement

The goal of this work is to utilize the geospatial metadata of the participating devices to provide a solution to the above-mentioned challenges. We argue that visual-based machine learning deployed at the edge can benefit from the context provided by the geospatial metadata generated at the edge. Specifically, to decide which frames to feed to the DNN of a resource-constrained device, new lightweight and efficient algorithms, based on the coverage area of the captured video, are designed that improve the detection rate of objects. Additionally, the label distribution of training data collected across a geographical region from the crowd is used to prioritize the training process and improve the effectiveness of ML models. Finally, the heterogeneity of mobile edge devices deployed in various regions on the one hand, and the model characteristics on the other, require adaptively placing models such that the resource constraints of edge devices are satisfied while the efficiency of the visual analysis is ensured. More formally, the thesis statement is:

Exploiting the geospatial metadata of multimedia content while adaptively placing the models at resource-constrained edge devices improves the effectiveness and efficiency of Crowd-based Visual Learning frameworks.

1.3 Thesis Outline

The rest of the report is organized as follows:

In Chapter 2, we propose a crowd-based learning framework that integrates EC and ML to enable Edge AI for resource-constrained devices. We study the trade-off between inference accuracy and the resource constraints of the edge devices. We further investigate the effects of reducing the model complexity on the inference accuracy and performance, which is necessary to support wide classes of edge devices.

In Chapter 3, building on the proposed framework of Chapter 2, we investigate a variety of approaches to extract frames from urban mobile videos to detect objects. We design and propose a novel greedy algorithm, Greedy-SKE, for an edge object detection model, which constructs a coverage-aware grid structure from the captured FOV metadata to efficiently detect objects. Greedy-SKE highlights the importance of using spatial metadata and optimizes the frame selection based on the coverage area of the video. The algorithm reduces the computational complexity and achieves a higher hit-ratio than the alternatives.
Greedy-SKE can be applied in many domains; for example, to enhance situation awareness in case of disasters.

In Chapter 4, we present a novel federated object detection system to efficiently train models with heterogeneous data at clients. We developed lightweight client selection methods to learn object detection models accurately and effectively. The fair client selection method, based on the object data distribution of clients, achieves a significant reduction in required federated rounds compared to conventional approaches. We further extend this method by leveraging the geospatial metadata of object detection training images and their FOVs at the edge to adapt the grid-based coverage structure of Chapter 3, and to effectively select clients in a Federated Learning framework that maximize the coverage of diverse geographical regions.

In Chapter 5, we extend the coverage data structures utilized in Chapters 3 and 4 to enable an optimization-based approach and adaptively place various models on resource-constrained edge devices. We present various methods to automatically place DNN models on a diverse set of mobile edge devices with pre-determined trajectories. Considering the spatial metadata generated by the devices, their resource capabilities and the characteristics of the trained models, we first mathematically formulate the model placement as an optimization problem, which is proved to be NP-Hard. We then propose several heuristics to solve it efficiently.

In Chapter 6, we review the major related work on image-based ML frameworks, keyframe extraction, FL client selection and model placement in Edge AI.

In Chapter 7, we summarize our contributions and discuss potential future work.

Chapter 2
Crowd-based Image Learning Framework using Edge Computing for Smart City Applications

The edge computing (EC) paradigm focuses on processing information close to the data source to reduce long-distance traffic and latency. Applications in the context of smart cities are expected to benefit from EC due to the increased deployment of smart sensors such as IoT devices [77] and CCTV cameras [5, 10]. Such deployments provide application stakeholders with the ability to recognize objects and events of interest by processing media content close to the data source. Besides, platforms such as MediaQ [71] provide crowdsourcing of media content, wherein users can collect and share media content in a centralized media repository, creating a geospatial media library [25, 89, 8] for smart cities. Crowdsourced image data are typically used to improve the algorithms that detect potholes [10], graffiti [103, 11], and traffic flow [72], to name a few. In all these applications, the edge devices, such as smartphones, CCTV cameras, drones, police cars or Raspberry Pis, are equipped with high-resolution cameras and can capture videos at a high frame rate (ranging from 30 FPS to 60 FPS).

Recent advances in neural networks, along with the processing power of edge devices, have made it possible to run deep learning models locally on an edge device, which provides great potential for utilizing many mobile edge devices in the public domain for urban applications. However, there exist challenges.
Heterogeneous devices exhibit different capabilities concerning processing power, memory, communication bandwidth, and battery capacity, which influences the inference time of machine learning (ML) algorithms and subsequently the performance of applications. To demonstrate this, we performed object detection using identical software developed with the YOLO [108] library on a Raspberry Pi, a desktop computer without a GPU, and a server machine with a GPU, to study the performance of various classes of platforms. As shown in Table 2.1, an edge server with a GPU outperforms the other platforms, even though all platforms are capable of performing object detection. Although the devices are capable of detecting objects using the given model, the processing time depends on the resource capacity of the device, which may limit the practical application of the model.

Table 2.1: Object Detection Time on Various Platforms

  Platform         CPU (GHz)   GPU   Processing Time (s)
  Raspberry Pi     0.6         No    360
  PC               1.596       No    29
  Low-end server   1.386       Yes   0.2

One fundamental challenge is the uncertainty and heterogeneity of participating edge devices. Most of the existing frameworks train a single model on the server side and then distribute it to edge devices [108]. This approach works well when the model owner is well aware of the processing and communication capabilities of the participating edge devices. However, in many cases, the model owner may not be the edge device owner, which is especially true for crowd-based and incentive-driven smart city applications [77]; thus, the processing power of edge devices is unknown. Having a single model for a diverse set of edge devices with different processing capabilities introduces new limitations: a high-end device could run a more complex version of the model, which can potentially provide more accurate results, while, on the other side of the spectrum, a low-end device could run a simpler version of the model much faster but with less accurate results. In addition, edge devices are generating new multimedia data, which can be used to retrain the model for fine-tuning the inference accuracy. Existing frameworks lack appropriate mechanisms to integrate big datasets from the edge devices themselves into the retraining process for quality and performance enhancement.

To address the above challenges, we integrate crowdsourcing, ML, and EC in one framework, dubbed a "crowd-based learning framework using edge computing". The framework enhances the ML algorithm of interest by utilizing the crowdsourced data collected by edge devices. One straightforward approach for a crowd-based learning framework is to exploit all of the crowdsourced data; however, such a mechanism is infeasible in image ML due to network constraints, the high computational requirements at the server, and the potential degradation of the accuracy of the learning model.∗ Consequently, we propose a distributed selection algorithm that prioritizes the crowdsourced data, transfers only a selected subset of data, and still efficiently upgrades the learning model at the server end.

∗ Several factors might cause the degradation of the accuracy of the learning model (e.g., the crowdsourced data might be of low quality, or the new data might change the distribution of the training dataset, which may generate a biased learning model).

The proposed generic crowd-based learning framework for EC applications makes the following key contributions:

• An EC framework based on the client-server architecture with built-in support for dispatching ML models to edge devices based on the resource capacity and the bandwidth availability.

• Classification, creation, and maintenance of image ML models for a wide array of edge devices with heterogeneous resource and bandwidth capacities.
• A model enhancement algorithm which takes into account the geospatial and temporal properties of the newly collected data at edge devices to intelligently decide when to retrain the model.

2.1 Proposed Framework

Figure 2.1 presents our proposed framework in two layers: 1) the devices at the edge (denoted as Edge Devices) and 2) the edge device server (denoted as ED-Server). The communication between the edge server and the edge devices is realized through conventional technologies such as cellular, WiFi, DSRC, and LPWAN.

Figure 2.1: The Design of the Proposed Framework

2.1.1 Edge Devices

An edge device is capable of collecting media content from a camera, pre-processing the data, performing local inference using an ML model provided by the edge server, extracting visual feature vectors (VFVs) and metadata based on the policy set by the edge server, labeling, and transmitting the metadata and the VFVs to the ED-Server. In the framework, an edge device joins and downloads a suitable model from the edge server based on its resource capacity and accuracy goals. The central edge server maintains a list of models capable of running on a wide variety of edge devices. Conventional approaches use a pre-trained model and do not continuously upgrade the model for accuracy and performance because of the bandwidth and energy requirements associated with transmitting the entire media content to the edge server.

Example applications in the context of smart cities that we are currently working on with the City of Los Angeles include: the detection of waste-dumped objects [5], graffiti [103], road damage and potholes [10], traffic flow [72], and natural disasters [3]. Users and devices participating in reporting the presence of graffiti, potholes, and garbage on streets have been relying on a static model provided by the edge server. To increase the detection accuracy and the resource efficiency, we introduce a feature detection algorithm on the edge devices, which detects the key features relevant for the application and sends them to the edge server for reinforcement learning to continuously improve the accuracy of the models. Note that the devices reporting the presence of graffiti, potholes or garbage currently do not send the entire media content, due to the bandwidth limitation and energy consumption, which means the models are not dynamically trained using the real-world data generated by the application.

The core components of an edge device are discussed below:

2.1.1.1 Model download module

This module communicates the device capability and user constraints to the edge server and downloads an ML model that can operate within the resource capacity of the device. Each device, when running the crowd-based learning framework, maintains a resource profile which is used to identify the suitable model for the edge device. The user or the owner of the edge device may configure the edge device to determine the resource allocation for the crowd-based learning framework. For example, the user constraints may include the maximum memory space a model can consume, the maximum time to run an inference, the energy budget per inference, etc.
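To make the handshake concrete, the following is a minimal sketch of how a device could pick a model from the server's catalog given its resource profile. The catalog entries, field names and numbers below are illustrative assumptions, not the framework's actual API.

```python
# Hypothetical sketch of the model download handshake: the device picks the
# most accurate catalog entry that fits its resource profile. All entries,
# field names and numbers are illustrative.

def select_model(catalog, profile):
    """Return the most accurate catalog entry satisfying the device profile."""
    feasible = [m for m in catalog
                if m["memory_mb"] <= profile["max_memory_mb"]
                and m["inference_ms"] <= profile["max_inference_ms"]]
    # If nothing fits, the device could fall back to server-side inference.
    return max(feasible, key=lambda m: m["accuracy"]) if feasible else None

catalog = [
    {"name": "mobilenet_v1_025_128", "memory_mb": 4,  "inference_ms": 40,  "accuracy": 0.69},
    {"name": "mobilenet_v1_100_224", "memory_mb": 17, "inference_ms": 120, "accuracy": 0.77},
    {"name": "inception_v3_299",     "memory_mb": 92, "inference_ms": 400, "accuracy": 0.76},
]
profile = {"max_memory_mb": 32, "max_inference_ms": 150}  # user-set constraints
print(select_model(catalog, profile)["name"])  # -> mobilenet_v1_100_224
```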
2.1.1.2 Video frame extraction module

This module is responsible for capturing video content and converting it into a series of keyframes [134]. Extracted frames can be fed into the inference module to identify the desired objects or classes.

2.1.1.3 Visual feature extraction module

During phase one of the inference, the module resizes the raw image to the trained input resolution and feeds the resized image to the trained model, which uses the layer just before the final output layer (the one that actually does the classification) to extract the VFV values.

2.1.1.4 Inference module

During phase two of the inference, the VFVs extracted in the previous step are fed into the inference module, which uses the last layer of the trained model to identify the desired object or feature.

2.1.1.5 Inference quality control module

This module decides whether to submit the inference results and the extracted VFVs to the edge server. The decision is enforced by the policy module, which dictates when data need to be transmitted to the server. For example, a quality control policy can specify that only inference results above a certain threshold are reported to the edge server, and/or that only the feature vectors of images captured at a particular location or time are uploaded. Such a policy allows the edge device to preserve bandwidth and energy, while the server is able to control the quality and semantic importance of the crowdsourced results.

2.1.1.6 Policy module

The policy module maintains the desired policies set by the ED-Server based on the resource capacity. This module ensures that the results generated by edge devices are of high quality, as the reporting of sub-optimal results may waste bandwidth while adding little or no value to the overall application. In addition, the policies provide the flexibility for the server to orchestrate the selection of quality, volume, and priority of the extracted VFVs. For example, a policy can allow some devices to prioritize sending their data based on image location, time, utility (e.g., an image that can be used to retrain two classification tasks), or semantics (e.g., underrepresented labels). Besides, a policy can specify which features to send based on their size: low-, medium-, or high-resolution images, or the size of the feature vectors.
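As an illustration only (the concrete policy schema is not specified here), a server-issued policy and its on-device enforcement might look like the following; the thresholds and the priority region are assumptions.

```python
# Illustrative policy enforcement on the device. The schema (threshold,
# priority region) is assumed; the ED-Server would dispatch such a policy
# alongside the model.

POLICY = {
    "min_confidence": 0.8,                                # quality threshold
    "priority_region": (34.02, 34.07, -118.28, -118.22),  # lat/lon bounding box
}

def should_upload(result, policy=POLICY):
    """Upload a VFV only if it is confident and inside the prioritized region."""
    lat_lo, lat_hi, lon_lo, lon_hi = policy["priority_region"]
    in_region = (lat_lo <= result["lat"] <= lat_hi
                 and lon_lo <= result["lon"] <= lon_hi)
    return result["confidence"] >= policy["min_confidence"] and in_region

result = {"label": "bulky_items", "confidence": 0.91, "lat": 34.04, "lon": -118.25}
print(should_upload(result))  # -> True
```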
2.1.2 ED-Server

The ED-Server is a central repository consisting of a set of ML models and an algorithm to create models. Media content from various online sources can be used to generate a (static) model for the edge devices. Typically, the edge server maintains a generic model for processing media content. In our framework, we introduce a closed-loop training system to fine-tune the models on the edge server using real-world data, for a wide class of edge devices. An edge device that joins the application network requests the edge server to provide a model for a particular application by announcing its location and resource capacity. Based on the information received from the edge device, the server dispatches a suitable model. Whenever the edge device captures media content, a feature extraction algorithm is executed to collect the relevant features. Instead of transmitting the entire media content over a resource-constrained network, our framework extracts the key features necessary to improve the accuracy of the model and sends them to the edge server only when they satisfy the rules set by the edge server's policy configuration module. Through continuous feedback, our edge server maintains a set of models with different resource requirements and accuracy levels to support a wide array of edge devices with varying resource capacities and inference quality.

2.1.2.1 Model distribution module

This module processes the metadata provided by an edge device when it joins the crowd-based learning framework. Based on the resource capacities and the additional constraints enforced by the user, the model distribution module dispatches a suitable model and the associated policy configuration to the edge device.

2.1.2.2 Training module

This module is responsible for creating models from existing datasets. A set of labeled images or video frames is fed to one of the convolutional neural network (CNN) frameworks (e.g., Caffe [64], TensorFlow [1]) to train an initial model. Our implementation used TensorFlow for the extraction of VFVs and training.

2.1.2.3 Database

The ED-Server maintains a database as the storage layer for trained models, model metadata, classes of edge devices, VFVs, inference results, and other ED and model metadata.

2.1.2.4 Retrain module

This module determines when to initiate retraining. The VFVs collected from edge devices are fed to the retrain module to determine the uniqueness of the data. When a newly submitted VFV is unique, it is added to the dataset for retraining. This module ensures that the models maintained for each type of edge device are of high quality by continuously upgrading the models through the labeled data submitted by edge devices.

2.1.2.5 Model metadata

Along with a trained model, its model metadata is exported and stored. Model metadata contains valuable information that is used to distribute or retrain the model. For example, a spatial coverage field is used to prioritize VFV collection from under-represented locations (more details in Section 2.2.2), a labels' counts field is used to collect under-represented classes, and a FLOPS field can provide insight into the model complexity and a rough estimate of the inference time.

2.2 Crowd-based Learning

Our framework uses an available set of images (referenced as D_i) for training an initial model (referenced as M) to be distributed to all edge devices. Subsequently, edge devices perform two tasks: collecting new images (a.k.a. crowdsourced images, referenced as D_c) and analyzing the content of each image (i.e., d ∈ D_c) using the equipped model M to predict a label describing the image. Over time, the edge devices can aggregate a large amount of images (D_c) that can be used to evolve M and create a better model (referenced as M'). One straightforward approach is to use all of the new images D_c for obtaining M'. However, this approach may suffer from multiple challenges: 1) a high network bandwidth requirement to transmit the whole D_c, 2) potential degradation of model accuracy (e.g., due to including bad-quality images), and 3) the long computational processing time required for creating M' using D_i + D_c at the edge server. Subsequently, our framework aims at selecting a subset of D_c (referenced as D'_c, i.e., D'_c ⊆ D_c) that can be used to evolve M into an efficient M'. Towards this, there are two paradigms: centralized (i.e., the selection process is performed on the edge server) and decentralized (i.e., the selection process is performed on the edge devices). We focus on the decentralized paradigm, which utilizes the model metadata for selecting appropriate images that can potentially enhance the model. In what follows, we investigate several metrics that affect the selection procedure.
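Before turning to the individual metrics, a minimal sketch (with placeholder scoring functions, not the framework's actual implementation) illustrates how an edge device might combine them to pick D'_c:

```python
# Placeholder sketch of the decentralized selection of D'_c from D_c.
# Each scoring function stands in for one of the metrics of
# Sections 2.2.1-2.2.3 (quality, spatiotemporal coverage, label).

def select_subset(D_c, model_meta, budget, score_fns):
    """Rank crowdsourced images by a combined score and keep the top `budget`."""
    ranked = sorted(D_c, reverse=True,
                    key=lambda img: sum(f(img, model_meta) for f in score_fns))
    return ranked[:budget]

# Toy metrics: penalize extreme illumination, reward uncovered grid cells,
# and reward confident predictions of under-represented labels.
quality  = lambda img, meta: 1.0 - abs(img["illumination"] - 0.5)
coverage = lambda img, meta: 0.0 if img["cell"] in meta["covered_cells"] else 1.0
label    = lambda img, meta: img["confidence"] / (1 + meta["label_counts"].get(img["label"], 0))

D_c = [{"illumination": 0.55, "cell": (3, 7), "label": "encampment", "confidence": 0.9},
       {"illumination": 0.95, "cell": (1, 1), "label": "clean",      "confidence": 0.6}]
meta = {"covered_cells": {(1, 1)}, "label_counts": {"clean": 6727, "encampment": 7007}}
print(select_subset(D_c, meta, budget=1, score_fns=[quality, coverage, label]))
```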
2.2.1 Image Quality

Since the framework includes heterogeneous types of edge devices (i.e., equipped with various types of cameras), the quality of the captured images varies. The quality of an image can be characterized by different specifications, including illumination and image resolution. In particular, extreme illumination potentially affects the model negatively: on the one hand, very high illumination blurs the image, while very low illumination makes the image vague and dark. Both cases make the image content unclear and harden visual perception. Hence, edge devices can discard the images with such very high or low illumination. On the other hand, image resolution also potentially affects the model efficiency. Since the model metadata contains the image resolution of D_i, each image available at an edge device needs to be processed to a resolution similar to the one in the model metadata.

2.2.2 Image Metadata

Edge devices typically contain a GPS receiver and a digital compass to estimate the geospatial location of a captured image. Therefore, the images are automatically tagged with spatiotemporal metadata (i.e., location, viewing direction, and time). Such metadata effectively enable the framework to select a subset of images that expands the spatial and temporal coverage of D_i. Since the model metadata include information about both the spatial and temporal coverage of D_i, the edge devices can identify the images of D_c that are tagged with locations or timestamps not covered by D_i.

At the edge server, the spatial coverage of D_i can be calculated using a grid index structure. In particular, the global area of D_i can be divided into grid cells. Thereafter, the coverage of each cell is represented either by a Boolean or a percentage value. The Boolean value indicates whether the cell contains at least one image, while the percentage represents how much area is visually covered by the images contained in that cell.† Similarly, the edge server can also measure the temporal coverage of D_i. Thereafter, the server augments the model metadata with both the temporal and spatial coverage. Subsequently, the edge devices use the spatial and temporal coverage of D_i to prioritize the images that fall outside it.

† The spatial coverage of a cell for D_i can be measured using a spatial coverage model [6] that utilizes various image metadata such as geo-location, viewing direction, and spatial extent (referred to as image field of view [17] or scene location [7]).
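A minimal sketch of the Boolean variant of this grid bookkeeping follows; the cell size, field names and coordinates are illustrative assumptions.

```python
# Minimal grid-index sketch for the Boolean spatial coverage of D_i:
# a cell is "covered" if it contains at least one training image.
import math

CELL_DEG = 0.005  # assumed cell size in degrees (roughly 500 m)

def cell_of(lat, lon, cell_deg=CELL_DEG):
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def build_coverage(images):
    """Mark every grid cell containing at least one image of D_i."""
    return {cell_of(img["lat"], img["lon"]) for img in images}

def prioritize_uncovered(candidates, covered_cells):
    """Keep crowdsourced images whose cells are not yet covered by D_i."""
    return [img for img in candidates
            if cell_of(img["lat"], img["lon"]) not in covered_cells]

D_i = [{"lat": 34.0522, "lon": -118.2437}]   # initial training images
D_c = [{"lat": 34.0523, "lon": -118.2438},   # same cell as D_i: low priority
       {"lat": 34.1000, "lon": -118.3000}]   # uncovered cell: prioritized
print(prioritize_uncovered(D_c, build_coverage(D_i)))
```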
2.2.3 Image Label

Using M, an edge device can predict a label describing the content of each image d ∈ D_c. The label is also associated with a confidence score indicating the prediction certainty of the model. The user can provide feedback on the predicted label; in such cases, the user's feedback is used instead if the edge device is operated by a human (e.g., a smartphone). The image label feedback is used in selecting D'_c to enhance M. Since the model metadata, which is stored at each edge device, includes the image distribution of D_i among the various labels, the edge device can select a subset of D_c associated with labels that are less available in D_i. Furthermore, the edge device may choose only "certainly labeled" images (i.e., minimizing the noise in the newly created model) by discarding the images associated with low confidence scores.

Edge devices may contain multiple trained models for various smart-city applications. In this case, after evaluating the above metrics for D_c, the qualified image subset is prioritized based on the metrics of the multiple models.

2.3 Experiments

2.3.1 Datasets

To demonstrate the effectiveness of our proposed framework, we used a real-world dataset: a labeled collection of Los Angeles County street images. We investigated an approach for the automation of street scene classification based on the cleanliness level, using real geo-tagged images from the Los Angeles Sanitation Department (LASAN). Each image has geotagged metadata, i.e., the location where the image was taken is known. The dataset contains 42,331 images with five distinct labels: 14,495 bulky items, 7,120 illegal dumping, 7,007 encampment, 6,982 overgrown vegetation, and 6,727 clean.‡ We denote the Street Cleanliness dataset as D_SC, which represents unbalanced real-world data in our experiments. Note that an unbalanced dataset may generate a biased learning model. In addition, we included the Caltech 256 [52] dataset, denoted as D_CAL256, which contains 256 labels and 30,608 images in total, with a minimum of 80 images per label and 119 on average. D_CAL256 represents a well-defined dataset for retraining in our experiments.

‡ The dataset and classification for identifying the level of street cleanliness are adopted from [5].

In our experiments, we trained models to classify predefined objects/classes in images. To save computing power in terms of GPU-hours, instead of training new models from scratch, we used transfer learning. More specifically, we extract the VFVs from powerful models pre-trained on the ImageNet dataset and train a softmax layer on top. We used three pre-trained models to compare our results: Inception V3 [122], MobileNet V1 [57], and MobileNet V2 [112]. Although Inception V3 is not designed for mobile and embedded devices, we include it in our experiments for comparison.

Figure 2.2: Accuracy for D_SC and D_CAL256 Datasets for Various Models

Figure 2.3: Model Size vs. Accuracy for D_SC and D_CAL256 Datasets for Various Models

2.3.2 Model Retraining

Nowadays, a large amount of imagery information is captured in a continuous and streaming fashion. Once a model is trained, it must be adapted to capture changes over time. To better illustrate the importance of retraining, consider the following example: smart city deployments are starting to use video streams captured by sanitary trucks [95] to train models in order to automate the prioritization of street cleaning based on the cleanliness level of the streets. The model predicts five classes: clean, bulky items, encampment, overgrown vegetation, and illegal dumping [5]. The city plans to release the model as part of a mobile application to the public to collect more data for improving the efficiency of its cleaning process. Sanitary trucks are typically driven around the city during specific working hours of the day. In addition, parts of the city may not be covered because some trucks lack video cameras. In such a scenario, if the model is fed with new data obtained by the public that differs significantly from the data used for training (e.g., obtained from a different viewing angle, during different hours of the day, etc.), the model could be made more robust through retraining.

In the following experiments, we retrain our models from scratch. Unless otherwise noted, we use the initial training dataset D_i along with the new dataset obtained by edge devices, D'_c, as the new training dataset (i.e., D = D_i ∪ D'_c).
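The transfer-learning setup described in Section 2.3.1 can be reproduced roughly as follows; this is a sketch in TensorFlow/Keras terms, and the optimizer, loss and exact backbone flavor are assumptions rather than the thesis's exact configuration.

```python
# Sketch of the transfer-learning recipe: freeze an ImageNet-pretrained
# backbone, treat its pooled penultimate activations as VFVs, and train
# only a softmax layer for the target labels (5 classes for D_SC).
import tensorflow as tf

NUM_CLASSES = 5  # clean, bulky items, encampment, overgrown vegetation, illegal dumping

backbone = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False,
    pooling="avg", weights="imagenet")
backbone.trainable = False  # only the new softmax head is trained

model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# VFVs uploaded by edge devices can be fed directly to a standalone head,
# so server-side retraining never needs the raw images:
vfv_dim = backbone.output_shape[-1]  # 1280 for MobileNetV2 with alpha = 1.0
head = tf.keras.Sequential([
    tf.keras.Input(shape=(vfv_dim,)),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
```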
2.3.2.1 Width Multiplier and Resolution vs. Accuracy

MobileNets allow tuning the width multiplier (α) and input resolution (ρ) hyperparameters, which in turn control the model size and complexity. Here, the role of the width multiplier is to thin the network connectivity uniformly at each layer of the underlying neural network. Specifically, the input channels M and output channels N at a given layer become αM and αN, respectively. This, in effect, reduces the number of parameters and the computational cost by roughly α². The input resolution hyperparameter ρ is applied to the input image of the CNN and the internal representation at each layer, reducing the computational cost by roughly ρ² [57]. For MobileNet V1, we varied the width multiplier α ∈ {100%, 75%, 50%, 25%} and the input resolution ρ ∈ {224, 192, 160, 128}. Similarly, for MobileNet V2, we varied the width multiplier α ∈ {140%, 130%, 100%, 75%, 50%, 35%} and the input resolution ρ ∈ {224, 192, 160, 128, 96}. Inception V3 has a fixed input resolution of 299.

Figure 2.2 plots the test accuracy (y-axis) for the D_SC and D_CAL256 datasets on 10% of stratified images, after training using 5-fold cross validation on the different models (x-axis). The results showed that the accuracy highly depends on the width multiplier and input resolution. A given model flavor may not be suitable for an arbitrary edge device, as some models may require more resources for inference, demonstrating a clear trade-off between model complexity, accuracy, and the resource capacity of the edge devices. For example, D_SC had the best accuracy of 77.08% when trained on MobileNet V2, α = 1.4, ρ = 224, and the worst accuracy of 65.32% when trained on MobileNet V1, α = 0.25, ρ = 128. Although most users strive for the most accurate model, the resource constraints of an edge device might render it impracticable. In such cases, an alternative model, such as MobileNet V1, α = 0.5, ρ = 128, can be chosen, which is less accurate (70.07%) but less computationally demanding. In addition, each model configuration affects the D_CAL256 and D_SC datasets differently. For example, with MobileNet V1, α = 0.5, the accuracy on the D_CAL256 dataset degraded significantly compared to its best case (a 37% decrease on average), whereas the D_SC dataset was less affected for the same configuration (a 12.5% decrease on average).

Figure 2.4: Inference Time vs. Model for the D_SC Dataset

2.3.2.2 Model Size vs. Accuracy

Figure 2.3 plots the accuracy compared to the model disk size§ (storage space on the device). A more complex model (with a higher width multiplier) resulted in a larger model size and a higher inference accuracy. Although disk space might not be a concern (most devices have plenty of available disk capacity), large neural network models have a larger memory footprint, require more time to persist in and out of an edge device's memory, and demand more bandwidth for model distribution. Hence, it is often desirable to keep the model as small as possible while not sacrificing much accuracy. As shown in Figure 2.3, D_SC trained on MobileNet V1, α = 1, achieved 76.5% accuracy with a model that requires 13 MB of disk space. MobileNet V1, α = 0.25, achieved an accuracy of 69.14% with 860 KB: a 9.6% accuracy loss for a 93% disk space reduction. Also, in some cases, size does not increase accuracy. For example, despite Inception V3 generating the largest models of 89 MB and 87 MB for D_CAL256 and D_SC, respectively, their accuracy is lower than the best of the MobileNets. Hence, a model distribution middleware is particularly important to dispatch the right model based on the resource capacity of the devices.

§ The input resolution does not affect the model disk size; hence we omit it from the graph for brevity. Here, the accuracy corresponds to the highest image resolution (i.e., 224 for MobileNets and 299 for Inception V3).
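As a back-of-the-envelope check of the scaling behavior described in Section 2.3.2.1, the relative cost of a MobileNet configuration can be approximated as α²·(ρ/224)², treating the full-size model as 1.0; exact FLOP counts depend on the architecture details.

```python
# Rough relative multiply-add cost of a MobileNet configuration versus the
# alpha = 1.0, rho = 224 baseline, using the ~alpha^2 * rho^2 approximation.

def relative_cost(alpha, rho, base_rho=224):
    return (alpha ** 2) * (rho / base_rho) ** 2

for alpha, rho in [(1.0, 224), (0.5, 128), (0.25, 128)]:
    print(f"alpha={alpha}, rho={rho}: {relative_cost(alpha, rho):.3f}x")
# alpha=1.0, rho=224: 1.000x
# alpha=0.5, rho=128: 0.082x   (roughly 12x cheaper than the baseline)
# alpha=0.25, rho=128: 0.020x
```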
2.3.2.3 Inference Time vs. Model

After generating models for the D_SC and D_CAL256 datasets, we measured the average time to perform an inference on various edge devices with different resource capacities. The results for D_SC¶ are shown in Figure 2.4. We ran each model on 50 random inputs and plotted the mean inference time, in milliseconds and on a logarithmic scale, required for each prediction. The Raspberry Pi had limited resources compared to desktop-class devices, requiring thousands of milliseconds of inference time; on average, it was 1.5 orders of magnitude slower than the desktop-class devices. The desktop-class devices were capable of processing the VFVs faster (tens of milliseconds in most cases) for models with various complexities and image sizes. Our framework dispatches the suitable model with prior knowledge of the resource capacities of edge devices and their performance capabilities, which not only accounts for the resource consumption but also aims to achieve the best result possible within the available resource budget.

¶ Similar trends were observed for the D_CAL256 dataset, so it is omitted.

2.3.2.4 Feature Size vs. Accuracy

A large number of edge devices run on battery, and extending their lifetime is crucial to ensure longevity and to reduce maintenance cost. In our framework, we categorize power consumption into two main categories: 1) CPU/GPU processing and 2) data transmission. The former includes the power consumed, for example, to capture an image, load a model, extract VFVs and infer the class of the images, whereas the latter includes the power consumed to download the model and to send the raw image, or the VFVs and inference results, to the ED-Server through a communication channel (WiFi, Bluetooth, etc.).

Uploading only VFVs instead of raw images from edge devices to the server is crucial in saving bandwidth and battery power. Besides, a crowd-based learning mechanism to submit VFVs is an essential part of the model retraining module. Figure 2.5 plots the average size in KB (left y-axis) of the extracted VFVs and their corresponding accuracy (right y-axis). For example, the average size of the VFV for MobileNet V1, α = 1, ρ = 224, was around 10 KB. Usually, the larger the size of the VFVs, the higher the accuracy, because they carry a more detailed summary of the image.

Figure 2.5: Average Feature Size vs. Accuracy for D_SC

To make our analysis simpler and widely applicable, instead of limiting the power consumption, which is device-specific, we impose limitations on the total consumed bandwidth capacity, which is more generic, and investigate how the bandwidth limitation affects the accuracy of the model. Figures 2.6a and 2.6b depict the trade-off between bandwidth and accuracy for four different models. For each of the datasets, D_CAL256 and D_SC, we first split off 10% stratified test subsets (D_CAL256_test and D_SC_test, respectively) and 10% stratified train subsets (D_CAL256_i and D_SC_i, respectively). Then, according to the total bandwidth limitation, the train datasets are augmented to include labeled VFVs uploaded from edge devices (i.e., D_CAL256_i ∪ D'_CAL256_c and D_SC_i ∪ D'_SC_c, respectively).
For a given available upload size, the device can send more or fewer VFVs for a given model, depending on their size. For example, with 0MB of available bandwidth capacity, no VFVs are sent and all models are trained on the 10% initial training dataset. For 5MB, the training datasets for the models mobilenet_v1_100_128, mobilenet_v1_100_160, mobilenet_v1_100_192 and mobilenet_v1_100_224 are augmented with 588, 550, 524 and 505 new VFVs, respectively; for 25MB, they are augmented with 2939, 2756, 2626 and 2529 new VFVs, and so on. We ensure that at the maximum bandwidth capacity, all models receive 100% of the dataset's VFVs, that is, D'_CAL256_c = D_CAL256_c and D'_SC_c = D_SC_c.

Figures 2.6a and 2.6b show how the accuracy is affected by different bandwidth capacities. The accuracy of the model improved with the increase in the VFV set, but it came at the cost of high bandwidth. Figures 2.6a and 2.6b show that devices with higher bandwidth can send more VFVs to the ED-Server for retraining the model to enhance accuracy. This is more apparent for the D_CAL256 dataset, which includes 256 labels and requires more VFVs to classify the test images correctly. Interestingly, at 125MB, the accuracy of less complex models is higher. This is because the server can receive a larger number of VFVs which, although they contain a more compact summary, can help to distinguish the labels more accurately. The accuracy of the models settled to a steady state at bandwidths above 25MB, since the size of the VFV set no longer influenced the accuracy once the edge device had sent 125MB worth of VFVs for retraining. Consequently, the retraining and model-dispatching decisions for the edge devices should be taken based on the bandwidth capacity and the number of labels in the dataset.

In summary, our server maintains a set of models which are suitable for devices with various bandwidth capacities. When an edge device operating at a given bandwidth capacity submits a VFV to the server, the server uses the received VFV set to enhance the model by initiating the retraining sequence. Our approach is therefore capable of not only performing inference on resource-constrained devices such as the Raspberry Pi and medium-end smartphones, but also provides the ability to enhance the models for devices with limited bandwidth. In some cases, an edge device may have processing capacity but limited bandwidth. Such scenarios require models that are targeted for bandwidth-constrained devices rather than processing-constrained devices.

Figure 2.6: Bandwidth vs. Accuracy ((a) on the D_CAL256 Dataset; (b) on the D_SC Dataset)

2.3.2.5 Location-based Retraining

ML models deployed on edge devices for Smart City applications should have the flexibility to evolve and adapt the model to account for newly fed data. Due to the dynamic nature of the environment, and the vast amount of images and videos captured by edge devices, a VFV selection mechanism should be deployed which will orchestrate the feature acquisition process. One such feature selection mechanism is based on the location where the visual data were captured. The policies enforced by the ED-Server orchestrator on the edge devices prioritize the collection of VFVs at under-represented locations to capture unique information. Intuitively, the infrastructure and landscape vary significantly in different parts of the city, resulting in images with different backgrounds, unique architectural styles, and VFVs. A minimal sketch of such a policy follows.
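The sketch below is one hypothetical realization of a location-based acquisition policy; the grid scheme, quota and all names are illustrative assumptions, not a prescribed interface of our framework. The ED-Server keeps per-cell counts of received VFVs and asks a device to upload only when its current cell is under-represented.

from collections import Counter

# Hypothetical sketch of a location-based VFV acquisition policy:
# upload a VFV only if its grid cell is under-represented so far.
vfv_counts = Counter()          # cell_id -> number of VFVs received
TARGET_PER_CELL = 100           # illustrative per-cell quota

def cell_id(lat, lon, cell_deg=0.01):
    """Snap a coordinate to a coarse grid cell (roughly 1 km wide)."""
    return (round(lat / cell_deg), round(lon / cell_deg))

def should_upload(lat, lon):
    """Ask the device to upload only for under-represented cells."""
    return vfv_counts[cell_id(lat, lon)] < TARGET_PER_CELL

def register_vfv(lat, lon):
    """Record a received VFV against its cell."""
    vfv_counts[cell_id(lat, lon)] += 1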
In our experiment, we used the D_SC dataset to highlight the importance of location-based retraining. We analyzed the location metadata of D_SC and removed all images that fall within the Downtown Los Angeles (DTLA) area (referred to as D_SC_DTLA). The architecture of DTLA is significantly different compared to other locations in Los Angeles. The dataset D_SC_DTLA is then split into 50%-50% stratified train (D_SC_train_DTLA) and test (D_SC_test_DTLA) datasets. Subsequently, we trained two models: a model M1 on the dataset which does not include images from DTLA at all, i.e., D_SC \ D_SC_DTLA, and a model M2 that includes the train dataset from DTLA, i.e., D_SC \ D_SC_test_DTLA. After training, both models were tested on the unseen dataset D_SC_test_DTLA.

Figure 2.7: Location-based Feature Selection (M1 is trained on D_SC \ D_SC_DTLA, while M2 is trained on D_SC \ D_SC_test_DTLA)

Our results, depicted in Figure 2.7, show that under-represented regions (such as DTLA) significantly affect the accuracy for all variations of M1; sometimes there is more than a 15% drop in accuracy, while M2 is more robust, with no significant drop. Although we do not include experiments for time-based image selection, we expect that the temporal information of collected images is equally important. Hence, smart city crowd-based learning applications should not only collect image data, but also enrich the collected datasets with location and time information, to enable spatiotemporal VFV selection.

Chapter 3
Spatial Keyframe Extraction From Urban Videos to Enable Object Detection at the Edge

Recent advancements in processing power and the explosive growth of IoT devices (20.4 billion by 2020 [50]) enabled the continuous development of edge computing (EC) systems. With the EC paradigm, the processing cost of information is offloaded close to the edge devices, where data is generated. Shifting computation to the edge has several benefits: a) applications do not suffer from latency and communication bandwidth restrictions because they are able to process and store data locally in real-time, b) it reduces the cost of processing the collected data at a cloud-based or centralized server, and c) it provides a privacy mechanism so that raw sensitive data (e.g., fingerprints) does not need to be directly shared outside the device.

In addition to the wide applicability of EC, the improvements in neural networks have made it possible to run deep learning models locally for classification or object detection tasks. Image-based Machine Learning (ML) applications for smart cities have already made their appearance, e.g., detecting road damage [91, 10] to improve the infrastructure, monitoring street cleanliness [5] to prioritize sanitation efforts, detecting surfaces for graffiti removal [130, 9] to improve quality of life and reduce gang crime, and bicycle/pedestrian counters [121, 75, 116] to improve transportation, to name a few.

Cameras embedded on edge devices are nowadays able to capture videos at a very high frame rate (e.g., 24-60 FPS). The only way for a Convolutional Neural Network (CNN) (e.g., YOLOv3 [109], Caffe [64]) to process such a visual data volume at this rate is by using powerful CPUs and GPUs, which cannot be found on edge devices. At the edge, the frame acquisition rate is much higher than what a deployed ML model can handle, hence only a subset of frames can be processed in real-time. Then, how do we decide which frames to feed to a CNN?
Naively, applications use the CNN processing rate after sampling some frames, i.e., they feed the newly captured frame to the CNN as soon as the processing of an older frame is completed, which can result in missing important visual content, especially when using a mobile device.

In this work, we focus on efficient frame selection from urban mobile videos while considering the limited resources of edge devices in image ML applications. Instead of directly analyzing the raw video frames, which requires lots of computing power, our algorithms leverage the geospatial metadata of images (captured at recording time by the devices' sensors) to reduce the image processing cost by maximizing the spatial coverage while minimizing the number of selected frames. In particular, we partition an area of interest into grid cells and use the number of unique cells covered by each frame (using its geospatial metadata) as the preference criterion to select or drop a frame. We formally define this coverage problem, prove its NP-hardness, and propose heuristics to efficiently solve it. We compare our heuristics with traditional keyframe selections based on visual content. Experimental results using a real dataset show our hit-ratio is at least 25% better than conventional approaches.

3.1 Spatial Keyframe Extraction

3.1.1 Preliminaries

Definition 3.1.1. Field-Of-View: A video v is represented as a set of individual video frames F = {f_1, f_2, ..., f_i, ..., f_n}, ordered by the time t_i at which the frame was captured. We use the Field-Of-View (FOV) model [17], as shown in Figure 3.1a, to represent the coverage of the viewable scene of f_i. Hereafter, we denote with f_i the video frame and its FOV, interchangeably. The FOV f_i is in the form of ⟨p, θ, R, α⟩, where p is the camera position consisting of the latitude and longitude coordinates read from the GPS sensor in a mobile device, 0° ≤ θ < 360° is the angle of the camera viewing direction with respect to the North obtained from the digital compass sensor, R is the maximum visible distance, and 0° < α < 360° denotes the visible angle obtained from the camera lens property at the current zoom level. We use the dot notation to access properties, i.e., f_i.p refers to the camera point of the FOV f_i.

Figure 3.1: Field-Of-View (FOV) model and Coverage MBR ((a) The 2-D FOV model; (b) 5 FOVs and their CMBR).

Definition 3.1.2. Coverage Minimum Bounding Rectangle: Given a set of FOVs F, the Coverage Minimum Bounding Rectangle CMBR is defined as the minimum bounding box which contains all FOVs∗, as illustrated in Figure 3.1b.

∗ To reduce the computational complexity of the pie slice shape [17], we represent the FOVs as triangles.

Definition 3.1.3. Coverage Grid and Cell Set: Given a CMBR and cell size w, we partition the CMBR into a set of square cells G = {c_1, c_2, ..., c_m} of width w, forming the Coverage Grid. Given a set of FOVs F and the grid G, the Coverage Cell Set C ⊆ G contains all the cells which are covered by at least one FOV. In addition, we define with C_i ⊆ C the subset of the cells which are covered by f_i. For example, in Figure 3.1b, the coverage cell set C_5 for f_5 is highlighted in dark grey color. For simplicity, a cell is considered as covered if any FOV covers its center. The set C_f = {C_1, C_2, ..., C_n} contains all such subsets. Hence, ∪_{i=1}^{n} C_i = C. Symmetrically, we define with F_j ⊆ F the subset of FOVs which cover cell c_j ∈ C. The set F_c = {F_1, F_2, ..., F_m} contains all such subsets. Thus, ∪_{j=1}^{m} F_j = F.
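To make the cell-coverage test of Definition 3.1.3 concrete, the following minimal sketch works in a local planar coordinate frame for brevity (the actual FOVs use latitude/longitude and the triangle approximation noted above); all helper names are hypothetical.

import math

def covers(fov, cell_center):
    """Planar test of whether an FOV <p, theta, R, alpha> covers a cell center.
    fov = (px, py, theta_deg, R, alpha_deg) with y pointing north;
    cell_center = (cx, cy) in the same local metric coordinates."""
    px, py, theta, R, alpha = fov
    cx, cy = cell_center
    dx, dy = cx - px, cy - py
    if math.hypot(dx, dy) > R:          # farther than the max visible distance
        return False
    # Compass bearing of the cell center (0 deg = north, clockwise).
    bearing = math.degrees(math.atan2(dx, dy)) % 360
    diff = abs((bearing - theta + 180) % 360 - 180)  # smallest angular gap
    return diff <= alpha / 2

def coverage_cell_set(fov, grid_cells):
    """C_i: the subset of grid cells whose centers fall inside the FOV."""
    return {c for c in grid_cells if covers(fov, c)}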
3.1.2 Proposed Algorithms

User-generated urban videos contain a lot of overlapping regions. Even if the camera is moving, due to variations in direction, a cell can be covered by multiple FOVs. To model the importance of regions and FOVs, we introduce a cell spatial weight which depends on the FOVs that cover them.

Cell Spatial Weight: The spatial weight of an FOV f_i and a cell c_j ∈ C_i is defined as:

    w_{i,j} = \begin{cases} 1 - \frac{d(f_i.p,\, c_j.p)}{f_i.R}, & \text{if } d(f_i.p,\, c_j.p) \le f_i.R \\ 0, & \text{otherwise} \end{cases}    (3.1)

The function d(f_i.p, c_j.p) calculates the distance† between the camera location f_i.p and the cell center c_j.p. The distance is then normalized by the maximum viewable distance f_i.R. Cells closer to the camera location are assigned a higher weight, whereas cells outside the FOV's coverage area, i.e., d(f_i.p, c_j.p) > f_i.R, are assigned zero weight. This ensures that the weight is 0 ≤ w_{i,j} ≤ 1.

† There are several distance approximation functions between points on the Earth's surface: euclidean distance [120], haversine, or Vincenty's [131] formula.

Cell Overlap Weight Function: We define a function f: X → Y, where X = {x ∈ R^{|F'_j|} | x = (w_{i,j}), f_i ∈ F'_j ⊆ F_j, c_j ∈ C_i} and Y = {y ∈ R | 0 ≤ y ≤ 1}. The function f accepts as input the weights w_{i,j} for cell c_j contributed by the selected FOVs F'_j which cover it, and assigns a new weight in Y. f defines what the new spatial weight of the cell is when multiple FOVs cover it and are selected by the solution. This function is application-specific. For example, a new weight can be the average sum of all weights contributed by the selected FOVs.

Residual Overlap Weight: For a current frame selection S, the residual overlap weight for f_i and its cells C_i is computed as follows: for cells c_j ∈ C_i not covered by any other FOV already in S, the residual weight w^r_{i,j} is equal to w_{i,j}. Otherwise, we use the Cell Overlap Weight Function f to calculate the new overlap weight w^o_{i,j} if S were to include f_i, and take the difference with the current selection, i.e., w^r_{i,j} = w^o_{i,j} - w_{i,j}. The intuition is that when a new FOV is added, it increases the total weight by only the weight difference.

Problem 1. Maximum Weighted Overlap Coverage Problem: Given a set of FOVs F = {f_1, f_2, ..., f_n}, the weights w_{i,j}, the cell overlap weight function f, the set of covered cells C_i = {c_1, c_2, ..., c_m} for each FOV, and the maximum budget of frames B, the Maximum Weighted Overlap Coverage Problem (MWOCP) finds a subset F', s.t. |F'| ≤ B, which maximizes the weighted sum of covered cells in the sets F'_j.

Theorem 1. The MWOCP is NP-Hard.

Proof. The proof of MWOCP's hardness comes from the reduction of the Maximum Coverage Problem (MCP), i.e., MCP ≤_p MWOCP. Given a finite set of elements, called the universe X = {x_1, x_2, ..., x_m}, a collection of sets S = {S_1, S_2, ..., S_n}, S_i ⊆ X, whose union equals the universe, i.e., ∪_{i=1}^{n} S_i = X, and a budget value k, Maximum Coverage is the problem of finding a subset S' ⊆ S, s.t. |S'| ≤ k and |∪_{S_i ∈ S'} S_i| is maximized. For any MCP instance, we reduce it to an instance of MWOCP in polynomial time. The reduction is straightforward. The maximum number of frames k is passed as the input budget B to the MWOCP. For each element x_j in the MCP, we create a cell c_j in the MWOCP, hence X = G. This construction takes O(m) time. Additionally, each set S_i in MCP is mapped to an FOV Coverage Set C_i in MWOCP, assigning a weight w_{i,j} = 1 when element x_j ∈ S_i, and w_{i,j} = 0 otherwise.
The cell weight function f is deliberately set to return 1 when c_j (thus x_j) is not part of the current solution, otherwise 0, to prevent double-counting an element. This mapping takes O(nm) time. The construction ensures that elements covered by S_i are represented as covered cells by FOV f_i in C_i (S = C_f). Therefore, MWOCP's output of FOVs is exactly S' in MCP, which completes the proof.

Algorithm 1 Greedy-SKE
1: procedure Greedy(F, C, W, f, B)
2:   S = ∅                                ▷ Solution set
3:   U = C                                ▷ Cell universe
4:   F' = F                               ▷ Candidate FOVs
5:   while |S| < B do                     ▷ Up to B FOVs
6:     bestFOV = bestCellSet = null; bestWeight = 0
7:     for all f_i ∈ F' do
8:       FOVCells = getFOVCells(f_i)      ▷ Cell set C_i
9:       uncoveredCells = FOVCells ∩ U
10:      weight = computeWeight(f_i, uncoveredCells)
11:      coveredCells = FOVCells \ U
12:      if coveredCells ≠ ∅ then
13:        weight += computeResidual(f_i, S, coveredCells)
14:      end if
15:      if weight > bestWeight then
16:        bestFOV = f_i
17:        bestCellSet = FOVCells
18:        bestWeight = weight
19:      end if
20:    end for
21:    if bestFOV == null then break
22:    end if
23:    S = S ∪ bestFOV
24:    F' = F' \ bestFOV
25:    U = U \ bestCellSet
26:  end while
27:  return S                             ▷ The greedy solution to MWOCP
28: end procedure

As MWOCP is NP-Hard, we propose a polynomial-time greedy algorithm to solve it efficiently. Our approach extends the Generalized Maximum Coverage Problem [29] (GMCP) to support overlaps (an element can belong to multiple bins).

Spatial Keyframe Extraction (SKE) algorithm: Algorithm 1 shows our proposed greedy algorithm, referred to as Greedy-SKE, to solve the MWOCP. Lines 2-4 initialize the solution set, the cell universe and the list of candidate FOVs F'. The algorithm iterates until the budget of frames B (Line 5) is exhausted or no new FOV contributes to the total weight of the solution (Line 21). At each step of the main loop, the greedy algorithm tries to find the FOV with the highest weight and updates the current best at Lines 15-19. For each candidate FOV, the total weight is measured as the sum of weights from uncovered and covered cells. Uncovered cells are the cells which are not covered by any FOV in the current solution S, while covered cells are the cells which are covered by at least one FOV in the existing solution S. For uncovered cells, the weight is calculated as described in Section 3.1.2. For covered cells, with the help of a subroutine computeResidual, we calculate the residual overlap weight. When an FOV is selected, it is added to the current solution S and removed from the candidate frame set F', and the cell universe U of uncovered cells is updated at Lines 23-25.
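Algorithm 1 leaves the computeResidual subroutine abstract. The following minimal sketch is one realization, under the assumption that the overlap weight function is max; cell_weight(g, c) is a hypothetical accessor returning the precomputed spatial weight w_{g,c} of Eq. (3.1).

def compute_residual(fov, S, covered_cells, cell_weight):
    """Sketch of computeResidual, assuming the overlap function is max.
    S is the current solution and is non-empty whenever covered_cells
    is non-empty (a covered cell has at least one FOV in S)."""
    residual = 0.0
    for c in covered_cells:
        current = max(cell_weight(g, c) for g in S)   # weight already in S
        proposed = max(current, cell_weight(fov, c))  # weight if fov is added
        residual += proposed - current                # count only the gain
    return residual

With max as the overlap function, the residual is never negative, so adding an FOV can only keep or improve the total weight of the cells it overlaps.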
3.1.3 Baselines for Comparison

Using Visual Features: A straightforward method of selecting distinct video frames is by comparing visual features. Clustering-based extraction utilizes VSUMM's [14] technique. Initially, a set of frames is sampled every half-second. Then, a histogram of 50 bins is constructed from the HSV color space (20-20-10 bins for each component, respectively) for each frame image, and these histograms are used as the feature vectors. Finally, given a budget value B as the maximum number of frames that the algorithm can extract from each video, the k-means algorithm is applied to select the frame closest to each cluster centroid.

Using Frame FOV Metadata: The previous baseline requires analyzing image content in order to decide whether there are significant visual changes between frames. This is a chicken-and-egg problem, because the whole goal is to avoid analyzing all frames while detecting objects within a video. Approaches that require visual analysis are expensive both in terms of processing time and battery consumption, especially when analyzing video at the edge. The techniques introduced here do not require seeing the image frame content. Instead, they rely solely on the spatial metadata captured along with the video.

Temporal selects frames based on a predefined sampling rate t_s (e.g., 2 frames per second). t_s can be adjusted in a way such that it matches the processing capacity of an edge device. This method can be fast, but it may introduce unnecessary redundant frame processing in the case of fast sampling, or capture similar frames when the camera is slowly moving.

Trajectory-SKE selects frames based on the camera location of the FOV metadata. For each frame of a video, sorted by capture time, a predefined radius threshold t_r is used to determine whether the camera location of the frame is farther than the previous frame's radius. Sampling the trajectory by adjusting t_r ensures that the selected frames are captured at different locations, hence partially addressing the problem of processing redundant frame content. However, it may still process duplicate frames when t_r is very small or the FOVs cover a particular region.

Naive-SKE uses a max-heap to retrieve cells in order of their cumulative spatial weight over all FOVs. For the current cell c_j, a random FOV f_i ∈ F_j is selected and added to the solution S. Additionally, all cells c_l ∈ C_i are removed from the heap. The algorithm stops when B is reached or the heap is emptied.
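Both metadata-only baselines reduce to a few lines each. The sketch below is illustrative: it assumes frame objects carrying a capture time t (in milliseconds) and a camera location p, plus a haversine distance helper, none of which are fixed by our implementation.

def temporal_sample(frames, t_s):
    """Temporal baseline: keep one frame every t_s milliseconds
    (frames are sorted by capture time f.t)."""
    selected, next_t = [], None
    for f in frames:
        if next_t is None or f.t >= next_t:
            selected.append(f)
            next_t = f.t + t_s
    return selected

def trajectory_sample(frames, t_r, haversine):
    """Trajectory-SKE baseline: keep a frame only when the camera
    moved more than t_r meters since the last kept frame."""
    selected = []
    for f in frames:
        if not selected or haversine(selected[-1].p, f.p) > t_r:
            selected.append(f)
    return selected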
3.2 Experiments

To demonstrate the effectiveness of our proposed techniques, we conducted experiments on a real dataset. With the MediaQ mobile app [73], we collected 25 FHD videos at 30 FPS, generating 69K frames (2872 frames per video on average) along with their FOV metadata. All videos were recorded so that they intentionally contain frames that capture a Starbucks coffee shop. The experimental setup asks how to efficiently detect Starbucks logos in the collected videos using the discussed frame selection approaches. For the actual logo detection, we used the Google Vision API to detect the Starbucks logo in each frame of every video and logged the detected frames with a confidence ≥ 70%, resulting in 5.5K frames with the detected logo. Experiments were performed on two types of devices: a powerful desktop running Ubuntu OS 16.04, equipped with a quad-core Intel(R) Xeon(R) CPU E3-1240 v5 at 3.50GHz and 64GB of RAM, and a less capable Raspberry Pi (RPI) 3 Model B device.

In our analysis, the budget B is the maximum number of frames an algorithm can extract from each video and is given as an input parameter. It is an application/device-specific value based on the constrained resources of an edge device, such as power consumption and data transfer bandwidth capacity [30]. The actual number of extracted frames is denoted by K ≤ B. Note that it is possible that K < B, because for Greedy-SKE the weighted overlap coverage is reached and adding a new FOV does not increase the total weight, or, for the baselines, the sampling rate results in fewer frames for short videos.

Figure 3.2: Avg. Extraction Time per Video ((a) Desktop performance; (b) Raspberry Pi performance).

We set the grid width w = 15 meters (similar trends were observed when we varied the width w ∈ {5, 10, 15, 20, 25}, hence results are omitted). Additionally, for Temporal we fixed t_s = 500ms, and for Trajectory-SKE we empirically found that the sampling radius t_r = 2m performs best to select enough frames given the length of the videos. Greedy-SKE uses haversine as the distance function d(f_i.p, c_j.p) to calculate the spatial weight (results were similar with euclidean distance). After testing both avg and max, we chose max (it performs slightly better) as the residual overlap function f, which assigns to the cell the maximum weight contributed by the selected FOVs that cover it.

3.2.1 Performance

Figure 3.2 shows the computation time, in msec and on a logarithmic scale, of the various approaches while we vary B on both the desktop (Figure 3.2a) and the RPI (Figure 3.2b). The Clustering approach suffers the most (i.e., 2+ orders of magnitude slower than the others) because of the time complexity of creating each frame's histogram bins and running the k-means algorithm. ML-based frame extraction techniques, due to their processing cost, are not suitable to run on edge devices, so Clustering is omitted from Figure 3.2b. The techniques share the same patterns on both the desktop and the RPI. The RPI, due to its limited resources, requires 1 order of magnitude more time. For example, Greedy-SKE needs 1 sec to select 50 frames on the RPI, compared to 100 msec on the desktop, which is practical, since the majority of edge devices are not able to process frames at 50 FPS. Greedy-SKE needs to keep track of the residual weight each time a new FOV is selected; however, it scales linearly with the maximum number of frames B. Temporal and Naive-SKE are extremely fast. Temporal relies solely on the temporal information of each frame (i.e., it extracts a frame per half second). Naive-SKE efficiently uses a max-heap to pick the next uncovered cell with the highest total weight and chooses any FOV that covers it. Trajectory-SKE is slightly slower than Temporal and Naive-SKE due to the calculation of the haversine distance to identify whether a new FOV was already sampled within t_r.

3.2.2 Impact of Spatial Weight

To demonstrate the effectiveness of assigning a spatial weight to each cell, we constructed a field experiment for automated logo visibility analysis. We compare the frames selected by the approaches to the actual frames which contain the Starbucks logo. If at least one of the selected frames contains the logo, it is considered a hit. Figure 3.3a depicts the hit-ratio while varying B ∈ {1, 2, ..., 50}. The graph plots the average number of selected frames K. Notice that K < B. For example, Greedy-SKE saturates at K = 26.96 frames, i.e., no other frame adds any value to the total weight, so it stops. Although the Naive-SKE algorithm is fast to execute, it is inferior to Greedy-SKE. Naive-SKE considers the cell's total weight contributed by all FOVs, whereas Greedy-SKE smartly adjusts the overlapping cell weight as new frames are selected. Trajectory-SKE, due to t_r, requires more frames to reach the same hit-ratio.

Figure 3.3: Hit-Ratio performance ((a) Hit-Ratio @ Budget; (b) Frames for Hit-Ratio@30).
Greedy-SKE does not require any user input and is able to reach 80% with only 15 frames (compared to 41% and 18% for Naive-SKE and Trajectory-SKE, respectively). Trajectory-SKE has the limitation that it needs to set a t_r such that the whole trajectory is sampled, which is not a realistic assumption when the video length is unknown.

Figure 3.3b illustrates the output number of selected frames K (y-axis) while varying the target hit-ratio at a fixed budget B = 30 (x-axis), which is reasonable given the average length of our video dataset. Our Greedy-SKE algorithm outperforms all other approaches. It is able to detect the logo in 66% of the videos with 6 frames in 17ms, in 75% of the videos with 10 frames in 26ms, and reaches 80% with 15 frames in 39ms. None of the other approaches reaches beyond 54%, despite the fact that all videos contain frames which capture the Starbucks logo. Naive-SKE requires 4 times the number of frames of Greedy-SKE (20 vs 5, respectively) to reach a hit-ratio of 50%. Despite consuming more time, Clustering does not achieve a high hit-ratio in urban videos because it is unaware of the coverage area. Frames that are spatially close are not necessarily visually close, due to background changes and directional differences. Temporal yields worse results than Trajectory-SKE because it selects near-duplicate frames or misses important frames due to the sampling rate t_s. Greedy-SKE considers the overlap coverage area through the effect of the spatial weight and selects the optimal frames well ahead of any other technique.

Chapter 4
Towards Scalable and Efficient Client Selection for Federated Object Detection

The hardware advancements in IoT edge devices increased their processing capabilities and enabled the execution of complex machine learning (ML) models at the edge. Advanced sensors that can capture various visual content (e.g., videos and images) at high resolution and frame rate, together with the continuous improvement of Deep Neural Networks (DNNs), made it possible to analyze the visual data in real-time.

Successfully training ML-driven object detection models requires a large amount of high quality datasets that are often heterogeneous (e.g., different image resolutions) and distributed at varying quantities on different platforms (e.g., stationary or moving) and physical locations (e.g., edge devices). Such trained ML models need to adapt to the deployment environment. It is often necessary to collect new training data from the deployment environment in order to improve an existing model. For example, self-driving cars introduced in a new geographical region during the winter season should be able to detect the objects of interest covered in snow or detect new road signs. However, collecting and sharing data from the devices at a centralized location is subject to privacy concerns.

Figure 4.1: The proposed federated object detection workflow diagram. A centralized server learns a global object detection model by asking distributed clients to locally train and upload model weight updates.
These updates are aggregated to improve the current global model, and the process is repeated for multiple rounds until a target accuracy or time limit is reached. The red-colored parts indicate the core components of the proposed federated object detection system to which we made major contributions.

For instance, data obtained from self-driving cars may capture faces and license plates, which can expose the locations of individuals. Organizations and individuals are unwilling to share sensitive data. In addition, the General Data Protection Regulation (GDPR) poses new restrictions on how organizations share data.

Federated Learning (FL) [94] offers a potential solution by enabling multiple decentralized devices or servers (i.e., clients) holding local training data to collaboratively learn a shared prediction model. With FL, a centralized server acts as the orchestrator of many clients which train a machine learning model collaboratively, as shown in Figure 4.1. The training data are decentralized, i.e., each client (edge/mobile device, organization, etc.) locally trains the model with data stored on the device. Instead of aggregating data, the server aggregates model updates received from the clients over several rounds of communication. The benefits applications gain from FL include: a) Device owners do not have to share sensitive data outside of the device; instead, only a model update summary is communicated to a centralized server. Hence, FL provides a data privacy mechanism by design. b) The size of the raw data is usually larger than the size of the summarized model update, hence the communication cost is reduced. c) Training a DNN with a large amount of data on a centralized server requires powerful CPU/GPU processing power and memory. Developers, instead of relying on cloud servers to meet these requirements, offload the processing cost onto the edge devices, which has lower associated costs. d) Devices have the freedom to improve the local model as new data are collected. This is crucial for the object detection task, enabling the detection algorithms to adapt to new local environments and object changes.

One of the key challenges is the selection of clients to participate in the federated training. Selecting a fraction of clients at each round is more efficient [94]; adding more clients slightly improves results while incurring higher communication costs. In the FL setting, the training data on clients' devices is by nature Non-IID, which introduces new sources of bias [67]. For example, training with clients from a particular geographical region can lead to less accurate predictions for clients from another region. Moreover, some clients can be equipped with less data than others and are thus able to compute the output of a federated round faster, but without improving the model's performance. As a result, these clients may be over-represented during training, and new methods to select clients fairly are needed.

To address the aforementioned challenges, we introduce several client selection methods. While prior work on client selection for FL focuses on simulating the random nature of participating clients [94] and device/hardware heterogeneity [98]∗, our focus is on data heterogeneity. We propose highly effective methods that fairly select clients based on their data label and geospatial distribution. Additionally, we study how clients that train on a mixture of real and synthetic data can improve the accuracy of federated object detection models.
To the best of our knowledge, no studies have looked at image/video metadata for client selection for federated object detection. We apply our techniques to state-of-the-art object detection models including Faster-RCNN [110], RetinaNet [83] and EfficientDet [123]. To summarize, the following are our key contributions:

• Introduce a novel, general FL object detection system that addresses a wide range of data heterogeneity (e.g., clean data, noisy data, real and synthetic) distributed at different geographical regions and platforms.
• Introduce location-aware training by utilizing the geospatial metadata of the dataset, which considers both data distribution and region coverage for efficient training.
• Propose new client selection algorithms based on fair data and spatial distribution.
• Evaluate the effectiveness of our approaches with real and synthetic training data in a FL environment.

The remainder of the chapter is structured as follows: Section 6.3 reviews the related work. Federated Object Detection is presented in Section 4.1. Section 4.2 discusses the various client selection methods. Experimental results are reported in Section 5.3.

∗ Our work is orthogonal to [98], and other selection criteria can be added to our framework in the future to consider a device's capabilities.

4.1 Federated Object Detection

The federated training diagram is illustrated in Figure 4.1. In the FL framework there are two main entities: a) a centralized server which acts as the coordinator for the learning procedure, and b) a number of independent, distributed and often heterogeneous clients. Below, we summarize their main components.

4.1.1 FL Server

The FL Server is the central entity which acts as the coordinator for the client devices. It orchestrates the training procedure by distributing the trained model to participating clients and aggregating model contributions from the clients.

Model Aggregation Algorithm: At each federated round, the FL Server receives a number of model updates from the clients. The model aggregation algorithm combines these model updates to obtain an improved version of the initial model. The Federated Averaging (FedAvg) algorithm [94], an adaptation of stochastic gradient descent, is widely applied in FL to optimize the learned model.

Client Selection Algorithm: The client selection algorithm is the method used to select a subset of the clients (which can be in the range of millions) to participate in a given federated round. In Section 4.2, we propose several client selection methods.

Client Stats: FL Clients periodically connect to the FL Server and upload aggregated statistics and metadata regarding their datasets, such as the number of collected images and the total frequency of each label within a geographical region. The data are queried from a database by the client selection component to decide which clients to select.

Global Model Broadcast: The current model and its weights are broadcast to the participating clients. This component also communicates the hyperparameters for the local training procedure, such as the client learning rate, batch size and optimizer.
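For concreteness, a minimal sketch of the FedAvg aggregation step described above follows. It assumes each client update arrives as a list of NumPy arrays (one per layer) together with the client's local sample count n_k, which is how FedAvg weights contributions; the interface itself is an illustrative assumption.

import numpy as np

def fedavg(client_updates):
    """FedAvg aggregation sketch. client_updates is a list of
    (weights, n_k) pairs, where weights is a list of np.ndarray layers
    and n_k is the client's local sample count."""
    total = sum(n for _, n in client_updates)
    num_layers = len(client_updates[0][0])
    aggregated = []
    for layer in range(num_layers):
        # Weighted average of this layer across all participating clients.
        aggregated.append(
            sum(w[layer] * (n / total) for w, n in client_updates)
        )
    return aggregated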
Table 4.1: Notation table.

Notation   Description
T          Number of federated rounds
K          Total number of clients
C          Fraction 0 ≤ C ≤ 1 of participating clients
B          Client batch size
E          Client local epochs
U          Set of users {u_1, u_2, ..., u_k}
G          Set of grid cells {c_1, c_2, ..., c_j}
L          Set of labels, e.g., L = {Car, Person}
n_{k,l}    Frequency of label l ∈ L for client u_k ∈ U
n_k        Total number of labels, Σ_l^{|L|} n_{k,l}, of client u_k
s_k        Total number of samples of client u_k

4.1.2 FL Clients

Participating clients in FL are dynamic; they might enter or leave the iterative training procedure at any time. In addition, they independently capture multimedia data at different rates, volumes and quality. Inherently, the data are Non-IID, since clients operate in a particular region, at specific times of the day, and capture specific kinds of objects.

Client Model Training: After selecting the participating clients, the FL Server transmits the model and its weights along with the training parameters (e.g., optimizer, batch size, epochs, training variables). Each client independently trains the received model with its local dataset and uploads the model weight updates to the centralized server.

4.2 Client Selection Methods

Data collected by edge devices are by nature Non-IID. The most common reasons that clients' datasets deviate from IID are variations in both data volume and distribution [67]:

Label distribution: Skewness in the label distribution is present due to clients operating in particular regions, capturing specific labels/objects. For example, trucks appear more often on roads next to industrial zones.

Different features for the same label: Different clients can have different features for the same label due to cultural differences, geo-location, or weather conditions. For example, traffic signs vary significantly from country to country. Also, cars appear differently during different times of the day or weather conditions.

Quantity skew: Different clients collect data independently, hence skewness in data quantity can vary enormously.

This section discusses the various client selection methods.

4.2.1 Random (R)

As the name suggests, random client selection picks clients randomly at each federated round. Specifically, at each round t_i, a subset of clients U' ⊆ U, |U'| = max(C·K, 1), is picked uniformly at random to perform the training computation. Although this approach is straightforward to implement, it does not take into account the heterogeneity of clients.

4.2.2 Random Label (RL)

This strategy selects a subset of clients U' ⊆ U at each round by picking a client u_k at random with probability proportional to n_k, the number of labels collected by the client.

4.2.3 Random Samples (RS)

This strategy is similar to RL, but instead of picking a client u_k at random with probability proportional to n_k, it chooses clients U' ⊆ U at each round with probability proportional to s_k, the total number of samples collected by the client. We consider the R, RL and RS client selection methods as baselines.

4.2.4 Fair Label Entropy (FLE)

Instead of blindly selecting clients, this approach aims to prioritize clients that have both a diverse and a large number of labels. Intuitively, training with clients that hold enough diverse labels will help to learn the shared model faster. To capture the diversity of labels, we use the label entropy (LE) of a client u_k, which is defined as:

    H_{u_k} = -\sum_{l}^{|L|} p_{k,l} \log p_{k,l}    (4.1)

where p_{k,l} = n_{k,l} / n_k is the frequency of label l collected by client u_k over the total number of labels of client u_k across all collected data. By convention, when p_{k,l} = 0 we set p_{k,l} log p_{k,l} = 0.
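Eq. (4.1) translates directly into code. A minimal sketch follows, using a base-2 logarithm, which matches the numbers in the toy example of Section 4.2.5:

import math

def label_entropy(label_counts):
    """Label entropy of a client, Eq. (4.1). label_counts maps each
    label l to n_{k,l}; zero-frequency labels contribute nothing,
    following the convention above."""
    n_k = sum(label_counts.values())
    h = 0.0
    for count in label_counts.values():
        if count > 0:
            p = count / n_k
            h -= p * math.log2(p)
    return h

print(label_entropy({"Car": 10, "Person": 10}))  # balanced labels -> 1.0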
The value of H_{u_k} is maximal when the probabilities are all equally likely. However, clients with unbalanced numbers of samples can still have similar label entropy (i.e., when the label distribution is similar). To tackle this issue, clients are given an importance value relative to the size of their collected data by calculating the weighted entropy:

    \bar{H}_{u_k} = \frac{n_k}{n} H_{u_k}    (4.2)

where n is the total number of labels across all clients.

A shortcoming of this approach is that if this value is used to select clients (i.e., the clients with the highest value of H̄_{u_k}), then in a setup where specific clients always achieve the highest values of LE, this method introduces bias and will never request alternative clients to participate in the federated training computation. This problem is similar to starvation in the context of scheduling algorithms. In concurrent computing, resource starvation occurs when a process is denied resources repeatedly and cannot complete its work. Fair resource allocation is also studied in the context of wireless networks. For example, a wireless service provider needs to provide high throughput. Due to unfair resource allocation, some users may experience poor service; however, the service provider needs to ensure that all clients receive good quality of service. Similarly, in FL, clients that are never selected will not contribute to the overall training process. Even if their LE is not as high, unselected clients can contain valuable data to improve the model. Hence, in this section we incorporate fairness into the selection process.

Inspired by the work in wireless networks, we utilize a compromise-based scheduling algorithm named Proportional Fairness [102] to prioritize clients during the selection process. In proportional fairness, the scheduling decision is given by the following equation, which calculates the priority of client k:

    P_k = \underset{k \in \{1,2,...,K\}}{\arg\max} \; \frac{r_k(t)^{\alpha}}{R_k(t)^{\beta}}    (4.3)

where r_k(t) = H̄_{u_k} is the weighted entropy at federated round t for user k, and R_k(t) is its smoothed historical value, updated as in Eq. (4.4). α and β tune the fairness of the scheduler. In the extreme cases, when α = 1 and β = 0 the scheduler prioritizes clients by the weighted entropy, while when α = 0 and β = 1 the scheduler prioritizes clients in a round-robin fashion, independent of the weighted entropy.

    R_i(t+1) = \begin{cases} \lambda R_i(t) + (1-\lambda)\, r_i(t), & \text{if } P_k = i \\ R_i(t), & \text{if } P_k \neq i \end{cases}    (4.4)

where λ = 1 - 1/c, with c > 1 a memory constant which controls the effective memory of the weighted low-pass filter function. Additionally, R_i(0) = δ_i is a chosen initial value.
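A minimal sketch of Eqs. (4.3) and (4.4) follows; holding clients in plain dictionaries keyed by id is an illustrative choice, and a round's participants can then be chosen by ranking the priority values.

def pf_priorities(r, R, alpha=1.0, beta=1.0):
    """Eq. (4.3): priority of each client k from its current weighted
    entropy r[k] and smoothed history R[k] (initialized to delta_i > 0,
    so the division is safe)."""
    return {k: (r[k] ** alpha) / (R[k] ** beta) for k in r}

def pf_update(R, r, selected, c=5.0):
    """Eq. (4.4): low-pass update of the history for the selected clients,
    with lambda = 1 - 1/c and memory constant c > 1."""
    lam = 1.0 - 1.0 / c
    for k in selected:
        R[k] = lam * R[k] + (1.0 - lam) * r[k]
    return R

# Picking the participants of a round is then a ranking, e.g.:
# chosen = sorted(prio, key=prio.get, reverse=True)[:num_clients]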
4.2.5 Coverage Label Entropy (CLE)

With this method, instead of calculating the global value of the label entropy of the training dataset at each client, we utilize auxiliary metadata, such as location data from the GPS sensor, directional data from the gyroscope and depth data from LiDAR, to assign the label entropy at a finer granularity while maximizing the coverage area of the training dataset. Utilizing data from diverse regions can help to learn otherwise under-represented features.

Figure 4.2: FOV [17] model from a training image and the GPS, compass and LiDAR data of the KITTI dataset.

Our structure partitions the world into smaller grid cells using geohashing and models each image of the training dataset as a Field-Of-View (FOV) [17], as illustrated in Figure 4.2. Given a latitude and longitude, its geohash can uniquely identify a rectangular cell using a short alphanumeric string at different user-defined precision levels. For instance, at precision 9 any point will be placed in a cell of width 4.77m. The FOV is expressed as a tuple ⟨p, θ, R, α⟩, where p is the camera location obtained from the GPS, θ is the angle of the viewing direction captured by the compass sensor, R is the maximum visible distance, and α denotes the visible angle obtained from the camera lens property.

Given the training RGB image, the object bounding boxes, their observation angles and a depth map projected from the LiDAR point cloud†, we map each object to a new location using the direct form of Vincenty's formula [131]. For example, in Figure 4.2 the two cars on the training image (left) are projected to the physical world as green dots in two separate grid cells (right). Hence, if we summarize the dataset of each client, we can obtain the grid shown in Figure 4.3.

† Depth data can be obtained from a sensor such as LiDAR, or inferred from the RGB image if such a sensor is not available.

Figure 4.3: Toy example for selecting clients using CLE, with 3 clients represented by square, circular and triangle shapes, 9 cells {c_1, c_2, ..., c_9} and an object detection task with 3 labels. The data label frequency of each cell is shown under each client. For example, in cell c_1 the circular user u_1 has label frequencies n_{1,1} = 3, n_{1,2} = 2, n_{1,3} = 7.

Figure 4.3 consists of 3 clients {u_1, u_2, u_3}, represented by circular, triangle and square shapes, respectively, and a grid of 9 cells, G = {c_1, c_2, ..., c_9}. For simplicity, assume that there are 3 labels that can be identified by the model. The frequency of each object is shown as a column in each cell under each client. Notice that some clients may capture objects in the same cells (i.e., c_3 and c_5), some clients operate alone in some cells (i.e., c_1, c_4 and c_9), and some cells have no client data (i.e., c_2, c_6, c_7 and c_8). We extend the notation n_{k,j} to refer to the number of labels collected by user u_k at cell c_j. Additionally, G_{u_k} refers to the subset of cells G where u_k captured at least one label (e.g., G_{u_1} = {c_1, c_3}). Similarly, H_{u_k,c_j} represents the value of the label entropy for objects collected by user u_k at cell c_j, thus H_{u_k} = Σ_{c_j ∈ G_{u_k}} H_{u_k,c_j}. Finally, H_{u_{k,..,k'},c_j} is the aggregated value of the entropy at cell c_j contributed by users u_k, ..., u_{k'}.

To obtain the label distribution across a geographical area, a middle layer is added to facilitate the communication between the FL Server and Clients (see Section 4.1). The FL Server initially shares the desired geohash precision, which defines the spatial granularity. Subsequently, the FL Clients periodically report the label distribution of their collected data.

Algorithm 2 CLE
1: procedure CLE(U, G, C)
2:   K = |U|                                ▷ Number of clients
3:   S = ∅                                  ▷ Solution set
4:   G' = G                                 ▷ Cell universe
5:   U' = U                                 ▷ Candidate clients
6:   while |S| < C·K do                     ▷ At most C·K clients
7:     bestClient = bestCellSet = null; bestWeight = 0
8:     for all u_k ∈ U' do
9:       clientCells = getClientCells(u_k)  ▷ Cell set G_{u_k}
10:      uncoveredCells = clientCells ∩ G'
11:      weight = entropy(uncoveredCells)
12:      coveredCells = clientCells \ G'
13:      if coveredCells ≠ ∅ then
14:        weight += residualEntropy(S, coveredCells)
15:      end if
16:      if weight > bestWeight then
17:        bestClient = u_k
18:        bestCellSet = clientCells
19:        bestWeight = weight
20:      end if
21:    end for
22:    S = S ∪ bestClient
23:    U' = U' \ bestClient
24:    G' = G' \ bestCellSet
25:  end while
26:  return S                               ▷ The selected clients
27: end procedure
The CLE Client Selection Algorithm 2 adapts the greedy algorithm introduced in [31], which approximately solves the Maximum Weighted Overlap Coverage Problem. The algorithm iterates over all clients and identifies the most contributing client by aggregating the weighted label entropy at each cell. The value of the total weight depends on whether the cell is already covered by other selected clients. For example, at the beginning of the toy scenario of Figure 4.3, the server state is empty and none of the cells are covered. Hence H_{u_3} = (20/22) H_{u_3,c_3} + (10/10) H_{u_3,c_4} + (1/4) H_{u_3,c_5} = 2.485. The client u_3, having the largest weighted label entropy value, is selected, and the server state is updated as shown in the bottom center of Figure 4.3 to include the label distribution of each cell; hence S = {u_3}.

For the next candidate selection, the CLE algorithm considers both the existing solution (server state) coverage and the contribution of the client. For each candidate client, the label entropy of the cells uncovered by the existing solution is computed as before. However, for cells covered by a previous client selection, the residual label entropy is computed instead. The residual entropy is the difference between the solution obtained by adding a candidate client and the previous solution. For instance, u_1 and u_3 both collected images in cell c_3. If the algorithm were to select u_1, then the new solution S' = {u_3, u_1} would have weight (12/12) H_{u_1,c_1} + (22/22) H_{u_{1,3},c_3} + (10/10) H_{u_3,c_4} + (1/4) H_{u_3,c_5} = 1.384 + 1.240 + 1.361 + 0 = 3.985. The candidate solution S' improves S by 3.985 - 2.485 = 1.5. In this case, the residual weight is calculated on c_3, because it is shared by both u_1 and u_3. If the algorithm were to select u_2, then the weight of S' = {u_3, u_2} would be (20/22) H_{u_3,c_3} + (10/10) H_{u_3,c_4} + (4/4) H_{u_{2,3},c_5} + (17/17) H_{u_2,c_9} = 1.123 + 1.361 + 0.811 + 1.548 = 4.843. The candidate solution S' improves S by 4.843 - 2.485 = 2.359. Here, the residual entropy is calculated for cell c_5. u_2 is selected, and the server state in the bottom right of Figure 4.3 is obtained. This process is repeated until C·K clients are selected.

Intuitively, CLE chooses clients that maximize the weighted label entropy. By calculating the residual entropy in cases where the cell is already covered by another client in the solution, the algorithm favors cells that are still uncovered.
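The weight computation behind the toy example (Lines 10-15 of Algorithm 2) can be sketched as follows, reusing label_entropy from the sketch in Section 4.2.4; the dictionary-based interface is a hypothetical assumption.

def client_gain(client_cells, server_state, cell_totals):
    """Sketch of a candidate client's weight in Algorithm 2.
    client_cells and server_state map cell_id -> {label: count};
    cell_totals[cell] is the label count at the cell over all clients,
    matching the weighting used in the toy example."""
    gain = 0.0
    for cell, counts in client_cells.items():
        if cell not in server_state:
            # Uncovered cell: full weighted entropy contribution.
            w = sum(counts.values()) / cell_totals[cell]
            gain += w * label_entropy(counts)
        else:
            # Covered cell: residual entropy, i.e., the improvement of the
            # merged distribution over the one already in the server state.
            merged = dict(server_state[cell])
            for label, cnt in counts.items():
                merged[label] = merged.get(label, 0) + cnt
            w_new = sum(merged.values()) / cell_totals[cell]
            w_old = sum(server_state[cell].values()) / cell_totals[cell]
            gain += w_new * label_entropy(merged) \
                    - w_old * label_entropy(server_state[cell])
    return gain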
4.3 Experiments

All code is written in Python 3.7 and uses the TensorFlow Federated framework (TFF) [128] to simulate the federated environment. For the object detection models we use the TensorFlow Object Detection API [129]. We apply our federated object detection technique to the state-of-the-art object detection models Faster-RCNN, RetinaNet and EfficientDet and conduct extensive experiments. The initial state of the models is initialized with the COCO [84] pre-trained weights for faster convergence. For the evaluation of the various models we use the standard Pascal VOC challenge [42] protocol, which computes the mAP based on 50% IOU. At each round we report the best performing model up to that round (i.e., the best cumulative mAP).

Figure 4.4: IID partition for K = {50, 100} clients for Pascal VOC and KITTI, and K = {20} clients for FEDAI. Each client receives an almost identical number of labels.

4.3.1 Datasets

Pascal VOC [42] 2007 and 2012 datasets contain data for 20 objects/labels. The training dataset consists of 16551 images (2007 and 2012 combined) and 4952 test images.

KITTI [51] dataset contains data for 8 object categories. It consists of 7481 images for training, of which 500 are used for testing.

Virtual KITTI [47, 24] is a synthetic dataset generated by the Unity game engine which recreates real-world videos from the original KITTI dataset. It contains 5 different video scenes, each with 7 different variations of the clone images, like changes in the weather (fog, clouds, and rain) and time of day (overcast, morning and sunset). Each variation for all 5 scenes contains 2066 images. Unlike KITTI, Virtual KITTI contains a subset of 3 object categories (i.e., car, truck, van). When we compare KITTI and Virtual KITTI, we remove the labels from KITTI not used in Virtual KITTI for a fair comparison. For testing, we use the same 500 images from the original KITTI, with the unused labels removed.

FedAI [90] dataset contains 956 real-world images captured by CCTV cameras and labels 7 common objects. FedAI was specifically designed to target FL research. The label metadata splits the images into 5 or 20 client devices.

4.3.1.1 IID Data

In the IID data partition, we divide and distribute the training data discussed above randomly and evenly to clients. The resulting entropy and sample counts are shown in Figure 4.4 for K = 50 and K = 100 clients for each dataset. For FedAI, we ignore the client ids in the metadata to ensure an IID split between clients, and the number of clients is fixed to K = 20. When the data are partitioned in an IID fashion, all clients have almost identical label entropy and number of samples. This scenario is ideal for converging faster during the federated training; however, it is not a realistic scenario. In most cases, clients have heterogeneous datasets. In the next section we discuss how to partition the dataset in a Non-IID fashion.

4.3.1.2 Non-IID Data

Clients usually act independently in a FL setting. The training data living on a client's device depends on the usage of the particular mobile device. In the Non-IID data partition, we want to ensure: a) clients receive different amounts of data, and b) different object category distributions. For a), the number of samples for each client is sampled from a multinomial distribution with probabilities generated from a Dirichlet distribution (see the sketch at the end of this subsection). For b), the extreme case would be for each client to have images for a single object category. However, this is not possible because, for object detection, an image can contain multiple objects. Removing the bounding boxes for other objects is also not desirable, as the model would not learn from that image and would treat it as background. To generate Non-IID labels, we follow a simple strategy. First, we sort the label names. Then, for each label, we retrieve all images that contain a bounding box with that label and remove them from the available images. During a label-image matching, the image can contain multiple other labels. Regardless, in practice this method generates Non-IID data, as shown in Figure 4.5. For the FedAI dataset, the device id in the metadata is used to split it to clients.
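A minimal sketch of the quantity-skew step for a) follows; the concentration parameter value is illustrative (smaller values yield more skewed splits), and the function name is hypothetical.

import numpy as np

def dirichlet_quantity_split(num_samples, num_clients, alpha=0.5, seed=0):
    """Non-IID quantity skew: draw client proportions from a Dirichlet
    distribution, then assign sample counts multinomially."""
    rng = np.random.default_rng(seed)
    proportions = rng.dirichlet(alpha * np.ones(num_clients))
    counts = rng.multinomial(num_samples, proportions)
    return counts  # counts[k] samples go to client k

print(dirichlet_quantity_split(16551, 100))  # e.g., the Pascal VOC train set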
4.3.2 Experimental Results

mAP@.50 for IID and Non-IID: Figure 4.6 illustrates the performance of Faster-RCNN on the Pascal VOC and KITTI datasets for K = 100 clients. We fix C = 0.1 and use the random (R) client selection method while varying the client epoch E = {1, 5} and batch size B = {4, 8}. Although with a decreased B and an increased E the client needs to perform more local computation, this can significantly reduce the number of rounds required to reach convergence. Our proposed FLE and CLE perform similarly to R when data are IID and are omitted from the graphs for clarity. For Non-IID data, we compare the best performing configurations of R along with the baselines RL and RS in subsequent experiments.

            Faster-RCNN              RetinaNet                EfficientDet
Pascal VOC  IID (54%)  Non-IID (21%)  IID (61%)  Non-IID (35%)  IID (45%)  Non-IID (5%)
E=1, B=8    981        933            353        136            794        32
E=1, B=4    409        621            113        114            295        245
E=5, B=8    98         82             46         56             207        868
E=5, B=4    51         129            38         63             125        695

KITTI       IID (24%)  Non-IID (24%)  IID (34%)  Non-IID (34%)  IID (22%)  Non-IID (14%)
E=1, B=8    951        402            -          481            952        347
E=1, B=4    322        133            320        202            277        550
E=5, B=8    98         59             70         69             291        271
E=5, B=4    49         31             60         31             35         455

FedAI       IID (45%)  Non-IID (35%)  IID (31%)  Non-IID (39%)  IID (14%)  Non-IID (15%)
E=1, B=8    849        446            924        994            -          -
E=1, B=4    196        319            278        435            -          -
E=5, B=8    50         78             50         104            976        831
E=5, B=4    20         13             30         75             522        270

Table 4.2: Faster-RCNN, RetinaNet and EfficientDet rounds required to reach the target mAP (shown in parentheses) for the Pascal VOC, KITTI and FedAI datasets while varying the client local epoch E and batch size B.

For the IID Pascal VOC dataset and target mAP = 54%, decreasing B from 8 to 4 and increasing E from 1 to 5 reduces the number of communication rounds from 981 to 51, a 19.2x improvement. For the IID KITTI dataset and target mAP = 24%, decreasing B from 8 to 4 and increasing E from 1 to 5 reduces the number of communication rounds from 951 to 49, a 19.4x improvement. The Non-IID partition of the datasets to clients significantly reduces the performance of the models. The effect of Non-IID data is more apparent on Pascal VOC, where no configuration achieves more than 40% during the first T = 1000 rounds. Figures 4.7 and 4.8 show the corresponding graphs for the performance of RetinaNet and EfficientDet. Table 4.2 summarizes these results.

Impact of Participating Clients: Figure 4.9 shows the effect of varying the total number of participating clients K = {1, 50, 100} with fixed E = 5, B = 4 and C = 0.1, for the Pascal VOC and KITTI datasets for Faster-RCNN. When the datasets are distributed to a smaller number of clients (i.e., from K = 100 to K = 50), fewer rounds are required to reach a higher mAP, regardless of whether the data are IID or Non-IID.
In the extreme case when K = 1, the federated training becomes centralized. Similar trends were observed for RetinaNet and EfficientDet and are omitted due to lack of space.

Impact of Client Selection Methods: This experiment examines how the client selection methods can improve the mAP in the case of Non-IID datasets. Figure 4.10 illustrates the mAP for Pascal VOC (left), KITTI (middle) and FEDAI (right) for the Non-IID data partition, as discussed in Section 4.3.1.2, while varying the different selection methods. For FLE, we empirically set α = β = δ_i = 1 and c = 5. For CLE, the geohash precision was set to 10, resulting in 600 distinct grid cells. The R client selection method, which randomly chooses clients at each round, requires more federated rounds to reach the same target mAP. For instance, Faster-RCNN for Pascal VOC requires 130 rounds to reach 35% with FLE, while R requires 500 rounds. RetinaNet for Pascal VOC shows similar results. FLE takes 15 rounds to reach 37% mAP for Pascal VOC, while R reaches the same mAP at round 105, RS in 131 and RL in 160. The proposed CLE technique requires a geotagged training dataset, hence it is only shown for the KITTI dataset. FLE and CLE both outperform R, RS and RL during the early rounds of the federated learning process and improve the models to reach a higher mAP. Faster-RCNN reaches the target mAP of 34% in 110 rounds with CLE, 120 rounds with FLE and 290 rounds with R. RL, which weights clients higher based on the number of labels, performs better than RS (140 vs 200 rounds). RetinaNet reaches the target mAP of 36% in 3 rounds with FLE, 3 rounds with CLE and 179 rounds with R. EfficientDet reaches the target mAP of 21% in 3 rounds with CLE, 86 rounds with FLE and 916 rounds with R. RL reaches the target mAP at round 235, compared to RS which requires 516. However, after round 500, RL outperforms RS. After round 500, CLE achieves about a 5-8% difference compared to the baselines in all models. CLE not only provides comparable efficiency to FLE, but also considers the geospatial coverage of the training dataset. The proposed FLE and CLE (where applicable) selection methods outperform the other baselines in all models across all datasets.

Impact of Synthetic Data: This experiment analyzes the impact of synthetic datasets in a federated learning environment. Specifically, we use the Virtual KITTI dataset, as discussed in Section 4.3.1, to include clients that train on real, synthetic or a mixture of data. We fix the number of clients K = 100 and the client fraction C = 0.1, and construct the following variations of clients: 1) kitti, which uses the original KITTI with only 3 labels, 2) vkitti2, which uses Virtual KITTI with the clone variation, 3) kitti+vkitti2, which combines 1) and 2), and 4) kitti+vkitti2noisy, which uses the overcast, rain and sunset variations. When partitioning the data to clients, we make sure that each client receives either synthetic or real data, but not mixed. Figure 4.11 shows the performance of RetinaNet and Faster-RCNN. vkitti2 performs worse than the others in both cases, but this is expected because the test dataset uses only real images from the original KITTI dataset. When adding the clone variation dataset in kitti+vkitti2, the performance is comparable to kitti (55% and 54%, respectively) for RetinaNet. A mAP improvement of 2% is more apparent for kitti+vkitti2noisy. The augmented scenes, which include weather (i.e., rain, overcast) and time (i.e., sunset) variations in the synthetic dataset, enhance the model accuracy.
Figure 4.5: Non-IID partition for K = {50, 100} clients for Pascal VOC and KITTI and K = 20 clients for FedAI. The label distribution varies across clients.

Figure 4.6: Faster-RCNN on Pascal VOC, KITTI and FedAI.

Figure 4.7: RetinaNet on Pascal VOC, KITTI and FedAI.
Figure 4.8: EfficientDet on Pascal VOC, KITTI and FedAI.

Figure 4.9: Faster-RCNN on VOC and KITTI for a varying number of clients K = {1, 50, 100}.

Figure 4.10: Client selection method comparison for Non-IID datasets. The proposed FLE and CLE outperform the conventional R.

Figure 4.11: Model performance on the KITTI and Virtual KITTI 2 datasets.

Chapter 5

Placement of DNN Models on Mobile Edge Devices for Effective Video Analysis

Over the last five years, the enormous growth in the number of IoT devices [49], combined with improvements in Edge Computing (EC) systems and Artificial Intelligence (AI) [27], has made it possible to execute computationally demanding Deep Neural Network (DNN) models on edge devices [46], commonly known as Edge AI. Edge AI is crucial for large applications that need to perform cost-effective analytical tasks. It enables devices to process models locally at the edge, reducing infrastructure and cloud storage costs, bandwidth consumption, and the risk of privacy leakage. As a result, Edge AI applications enjoy several benefits.
First, by analyzing the data locally instead of in cloud-based data centers, it reduces the associated processing costs, which can constitute up to 90% of machine learning infrastructure costs [16]. Second, far less data needs to be transmitted to cloud servers, which is essential for applications with IoT devices generating large-scale video data. Instead of sending vast amounts of video data for analysis, which is usually not practical, devices can locally process the content and transmit small inference results; consequently, the application can scale to support more devices with low bandwidth as well as low cloud storage. Finally, the raw data, which could otherwise expose sensitive information, remain at the device, which reduces the risk of privacy leakage.

Figure 5.1: Motivating example with two edge devices e_1 and e_2 and two models m_1 and m_2.

Of particular interest are edge devices equipped with cameras, such as NVIDIA's Jetson Nano [99], Google's Coral Edge TPU [33] or Amazon's DeepLens [15], which can be used to automate a range of visual analysis applications. For instance, LA Sanitation & Environment (LASAN) reported the installation of camera systems on sanitation trucks [78] to automate the process of assigning a cleanliness score to road segments [41, 20, 138] in order to effectively schedule clean-up crews. Other examples include DNN models used to assess road damage [10, 13, 55] to improve the road network, detect fires [66, 80] to prevent fire damage, and recognize license plates [118] to enhance law enforcement.

Nonetheless, edge devices come in different shapes and sizes, and their processing performance heavily depends on their hardware resources; some are equipped with powerful GPUs, such as NVIDIA's Jetson Nano, while others are less capable, such as the barebone Raspberry Pis. This leads to designing different DNN models for resource-constrained devices by varying the number of model parameters, hence reducing the memory required to load and feed-forward the model. These models include MobileNets [57, 112, 58], EfficientNet [124], SqueezeNet [61], Single Shot Detector (SSD) [86] and YOLO [109], which are designed to execute inference fast and with high accuracy.

Inherently, there arises a problem of assigning models to edges for multiple applications. We generalize this class of applications with diverse devices and DNN models with the running example shown in Figure 5.1. There are two participating edge devices of different types, e_1 and e_2, attached to moving vehicles with known trajectories (e.g., sanitation trucks, buses, mail trucks or police patrols), shown as green and blue dotted lines, respectively. The devices are equipped with cameras, each generating a set of Field-of-Views (FOVs, i.e., spatial representations of visual coverage) [17]: F^1 = {f^1_1, f^1_2, ..., f^1_9} for e_1 and F^2 = {f^2_1, f^2_2, ..., f^2_6} for e_2. Additionally, we show two objects of interest, tents and trash, from prior knowledge (i.e., historic homeless population counts and regions that frequently get dirty, respectively). There are two available model classes, M = {m_1, m_2}, where m_1 can detect tents (as an estimation for homeless counting) and m_2 can detect trash (for cleanliness assessment). Given a diverse number of mobile edge devices, how can a set of models be placed on the devices in the most effective way? Answering this question is critical in a large-scale application, in particular when a flexible and cost-efficient configuration of video analytics is needed in a dynamically changing environment.
Existing solutions require developers to consider various properties of the models and the capabilities of the devices, and thereafter to manually place the models on devices. This approach suffers from several drawbacks. First, manually deploying models is labor intensive and cannot scale as the number of models and devices increases. A bad configuration due to human error may incur a high utilization cost; for example, the placement of a computation-intensive model on a limited-capacity device can lead to a high inference latency and consequently to a low detection rate. Second, existing solutions do not consider edge devices that are mobile. For instance, the placement of models on devices with pre-determined trajectories, all of which may or may not overlap certain regions, can benefit from knowledge of the coverage of the underlying region.

To keep the running example in Figure 5.1 simple, we assume that each device has enough resources to run one of the models (we relax this assumption later). However, e_2 is less capable than e_1 and can only run inference on every other frame (dotted FOVs represent frames without inference results). Without prior knowledge of the underlying region, an operator may place the models randomly on the devices. In that case, if m_1 is deployed to e_2 and m_2 to e_1, then no tents will be detected and only a little trash is found. Nonetheless, a smarter placement procedure can use prior knowledge of the regions where homeless populations tend to concentrate, or which tend to get dirty frequently, to deploy models where they have a higher success rate of detecting objects. Thus, an effective model placement needs to be considered.

The problem of model placement becomes more challenging if we consider other factors in reality: a) devices can have overlapping FOVs among themselves, e.g., f^1_1 and f^1_7, or across devices, e.g., f^1_9 and f^2_5; b) device resources are heterogeneous, with devices exhibiting various constraints related to inference latency, available memory size and utilization cost (for instance, e_1 is a Jetson Nano device which can get prediction results from m_1 much faster than e_2, a Raspberry Pi); and c) it is possible to run multiple models on the same device at the expense of a higher latency per model inference.

In this paper, we provide the definitions of the model placement problem, then formulate it as a Mixed Integer Quadratically Constrained Programming (MIQCP) problem which models the capabilities of the devices, the characteristics of the DNN models and the region coverage with a set of variables and constraints. MIQCP problems are NP-Hard, hence we implement several heuristics to solve the placement problem efficiently. Our Spatial Coverage (SC) heuristic utilizes the spatial property of videos in a novel way which uniquely considers the coverage region of the edge devices, and achieves a higher recall than any of the alternatives.

The remainder of this paper is structured as follows. Section 5.1 introduces the groundwork and the necessary definitions used throughout the paper. Section 5.2 presents the MIQCP formulation of the model placement along with the proposed heuristics. Section 5.3 evaluates the effectiveness of the proposed approximation techniques.

5.1 Background

In this section we provide the definitions used throughout this work.
Table 5.1: Notations

Notation                              Meaning
E = {e_1, ..., e_|E|}                 Edge devices
F^e = {f^e_1, ..., f^e_|F^e|}         FOVs of device e
M = {m_1, ..., m_|M|}                 Model classes
MV_m = {mv_1, ..., mv_|MV_m|}         Model versions
P = {p_1, ..., p_|P|}                 Objects of interest
P_m                                   Objects that can be detected by m
P_{e,m}                               Objects that can be detected by m at e
G_{n,m} = {c_{1,1}, ..., c_{n,m}}     2D grid
C^e_i                                 Cells covered by f_i ∈ F^e
F_{i,j}                               FOVs that cover c_{i,j}
w_{c,f} = w_{(i,j),k}                 Weight of cell c_{i,j} by FOV f_k
p^e_{mv_i}                            Binary placement variable for mv_i
x^m_{f^e_k}                           Binary FOV selection variable
L^e_{mv_i}                            Inference latency for mv_i at e
IL^e_{mv_i}                           Total inference latency for mv_i at e
K^e_{mv_i}                            Max. number of selected FOVs at e
a^e_{mv_i}                            Interference coefficient
ρ_e                                   Utilization of device e
U                                     Avg. utility
C                                     Avg. coverage
UC                                    Avg. utilization cost

5.1.1 Edge Devices

The system consists of a set of edge devices equipped with cameras and the computing capability to perform DNN visual tasks, such as Jetson Nanos, Raspberry Pis and Coral Edge TPUs: E = {e_1, e_2, ..., e, ..., e_|E|}. Each edge device e_i has the properties ⟨mc, k, F⟩, where e_i.mc refers to the memory capacity of the device, e_i.k to the maximum number of co-located models and e_i.F to the set of FOVs generated by the device's camera. We assume the edge devices are attached to vehicles with known trajectories.

5.1.2 Field-of-View

Each edge device e generates video footage v, which can be represented as a set of individual video frames ordered by their capture time t_i: F^e = {f_1, f_2, ..., f_i, ..., f_|F^e|}. To model the viewable region of a frame f_i we use the Field-Of-View (FOV) structure [17]. Each FOV f_i is of the form ⟨p, θ, R, α⟩, as shown in Figure 5.2a, where f_i.p is the camera position consisting of the latitude and longitude coordinates read from the GPS sensor of the device, 0° ≤ f_i.θ < 360° is the angle of the camera viewing direction with respect to the North, obtained from the digital compass sensor, f_i.R is the maximum visible distance, and 0° < f_i.α < 360° denotes the visible angle obtained from the camera lens properties. We refer to the set of FOVs of device e by F^e or e.F, interchangeably. Additionally, we drop the index notation (f.R instead of f_i.R) for clarity when possible.

Figure 5.2: Field-Of-View (FOV) model and Coverage MBR. (a) The 2-D FOV model. (b) 7 FOVs from two devices and their CMBR.

5.1.3 DNN Model Classes

We define the set of DNN model classes M = {m_1, m_2, ..., m_|M|}. Each model class represents a different ML visual detection or classification task; e.g., m_1 is a model class that can detect tents and m_2 can detect trash, as in Figure 5.1. Each model class m_i defines its own latency requirement m_i.max_l depending on the visual task.

5.1.4 DNN Model Versions

Model versions refer to trained models that belong to model classes, MV_m = {mv_1, mv_2, ..., mv_|MV_m|}. Model versions are a generic way of representing different model architectures (e.g., MobileNets [58], EfficientNet [124]) or model complexities (e.g., MobileNet with width multiplier 100% and image input resolution 224px vs. MobileNet with width multiplier 50% and image input resolution 128px). Each model version mv_i is of the form ⟨mem, acc⟩, where mv_i.mem is the minimum memory required to load the model version on a device and mv_i.acc is its reported test accuracy.
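To make the notation concrete, the entities above can be captured with a few plain data classes. This is only a minimal sketch (Python dataclasses whose field names mirror the notation), not an implementation from this chapter:

from dataclasses import dataclass, field
from typing import List

@dataclass
class FOV:
    lat: float      # f.p latitude
    lon: float      # f.p longitude
    theta: float    # f.theta: viewing direction w.r.t. North (degrees)
    R: float        # f.R: maximum visible distance (meters)
    alpha: float    # f.alpha: visible angle (degrees)

@dataclass
class EdgeDevice:
    mc: float       # e.mc: memory capacity
    k: int          # e.k: max number of co-located models
    fovs: List[FOV] = field(default_factory=list)   # e.F

@dataclass
class ModelVersion:
    model_class: str  # the class m the version belongs to
    mem: float        # mv.mem: memory required to load the version
    acc: float        # mv.acc: reported test accuracy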
5.1.5 Objects of Interest

We define the set of objects of interest P = {p_1, p_2, ..., p_|P|}, each with an id p_i.id, which represent unique objects that can be detected by model versions. P_m ⊆ P denotes the subset of objects that belong to model class m and hence can be detected by the model versions MV_m. In addition, for a device e, P_e ⊆ P represents the set of objects that are covered by the device's FOVs F^e (we relax the assumption of knowing which objects will appear on a device's trajectory in Section 5.2.2.8). Finally, P_{e,m} ⊆ P_e ⊆ P contains the set of objects that can be detected by model versions in class m when placed on device e.

5.1.6 Coverage Grid

Given the set of edge devices E and their FOVs F^e, ∀e ∈ E, we define the Coverage Minimum Bounding Rectangle (CMBR), which encloses all FOVs. From the CMBR and a cell width w, we partition the region of interest into square cells, generating a 2-dimensional grid G_{n,m} = {c_{1,1}, c_{1,2}, ..., c_{i,j}, ..., c_{n,m}}. For each device e we keep track of the set of cells that are covered by its respective FOVs with C^e_i ⊆ G, the subset of the cells covered by f_i ∈ F^e. Additionally, for each cell c_{i,j} ∈ G_{n,m}, we keep track of the FOVs that cover it across all devices, i.e., F_{i,j} ⊆ F, F = ∪_{e=1}^{|E|} F^e.

Figure 5.2b illustrates the CMBR and the grid G_{12,12} with 7 FOVs generated from devices e_1 and e_2, F^1 = {f_1, ..., f_5} and F^2 = {f_1, f_2}, respectively. The cell set covered by FOV f^1_1 is C^1_1, highlighted in dark grey. For simplicity, a cell is considered covered by an FOV if the FOV covers its center. The set F_{6,6} = {f^1_4, f^2_2} contains the FOVs that cover cell c_{6,6}.

5.1.6.1 Weighting Schemes

To model the importance of regions and FOVs, we introduce several weighting schemes. Each scheme applies a different weighting function to each pair of FOV and cell. Then, depending on which weighting scheme is used, the proposed placement algorithm decides which regions are covered enough and which are not.

Binary (BIN): The binary weight of an FOV f and a cell c is defined as:

w^B_{c,f} = \begin{cases} 1, & \text{if } d(f.p, c.p) \le f.R \\ 0, & \text{otherwise} \end{cases}   (5.1)

where d(f.p, c.p) is the haversine distance between the camera location f.p and the cell center c.p. The binary weight assigns a unitary weight to all cells covered by the corresponding FOV.

Spatial (SP): Extending the binary weight, a cell is weighted based on its distance from the camera location. More formally:

w^S_{c,f} = \begin{cases} 1 - \frac{d(f.p, c.p)}{f.R}, & \text{if } d(f.p, c.p) \le f.R \\ 0, & \text{otherwise} \end{cases}   (5.2)

We limit the weight to 0 ≤ w^S_{c,f} ≤ 1 by normalizing the distance by the maximum viewable distance f.R. As a result, cells closer to the camera location are assigned a higher weight, whereas cells far away are assigned zero weight.

Directional (DIR): The directional weight of an FOV f and a cell c ∈ C_k is defined as:

w^D_{c,f} = \begin{cases} 1 - \frac{2\,\Delta(\angle_{f.p,c.p},\, f.\theta)}{f.\alpha}, & \text{if } \Delta(\angle_{f.p,c.p},\, f.\theta) \le \frac{f.\alpha}{2} \\ 0, & \text{otherwise} \end{cases}   (5.3)

The angle between the camera location and the cell center with respect to the North is given by 0° ≤ ∠_{f.p,c.p} < 360°. Then, the function 0° ≤ Δ(∠_{f.p,c.p}, f.θ) < 360° measures the difference between the camera-to-cell angle and the camera direction. Cells closer to the center of the camera direction f.θ are assigned a higher weight, whereas cells outside the FOV's viewing angle are assigned zero weight. Similar to the spatial weight, this controls the directional weight range to be within 0 ≤ w^D_{c,f} ≤ 1.

User-Assigned (UA): The user-assigned weight of a cell c is an application-specific value. Unlike the spatial and directional weight functions, which are related to the FOV that covers the cell, the user-assigned weight 0 ≤ w^U_c ≤ 1 is independent of the FOV. In our system, without loss of generality, given P as input, the user-assigned weight of each cell is:

w^U_c = \begin{cases} \max_{p \in P} \exp\!\left(-\frac{d(p.p,\, c.p)^2}{2\sigma^2}\right), & \text{if } d(p.p, c.p) \le 3\sigma \\ 0, & \text{otherwise} \end{cases}   (5.4)

For each object of interest p, we apply a radial basis function, where d(p.p, c.p) is the haversine distance. Cells closer to the object location are assigned a higher weight, with the spread controlled by the variance σ².

Figure 5.3 shows the three weighting schemes: in Figure 5.3a, cell c_{2,2} receives a lower spatial weight compared to c_{6,3}; in Figure 5.3b, the directional weights of cells c_{3,5} and c_{5,4} are similar despite being at different distances from the camera location; and in Figure 5.3c the grid is represented as a heatmap, where cells in red areas are assigned a higher weight than cells in blue areas, hence w^U_{(5,4)} < w^U_{(3,5)}.

Figure 5.3: Spatial, directional and user-assigned weight examples. (a) Spatial. (b) Directional. (c) User-assigned.

Distance-weighted Location (DWL): The distance-weighted location scheme uses the spatial weight as a multiplier for the user-assigned weighting scheme: w^{SU}_{c,f} = w^S_{c,f} · w^U_c. Since 0 ≤ w^S_{c,f}, w^U_c ≤ 1, also 0 ≤ w^{SU}_{c,f} ≤ 1. The weight of FOV f and cell c is high when the cell is important (a high value of w^U_c) and the FOV location is close to cell c.
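A minimal sketch of these weighting functions follows, reusing the FOV dataclass from the earlier sketch and assuming a cell object with lat/lon attributes at its center; the haversine and bearing helpers are written out so the snippet is self-contained.

import math

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters."""
    R = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))

def bearing(lat1, lon1, lat2, lon2):
    """Angle from point 1 to point 2 w.r.t. North, in [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb))
    return (math.degrees(math.atan2(y, x)) + 360.0) % 360.0

def w_spatial(fov, cell):
    """Eq. (5.2): decays linearly with distance from the camera."""
    d = haversine(fov.lat, fov.lon, cell.lat, cell.lon)
    return 1.0 - d / fov.R if d <= fov.R else 0.0

def w_directional(fov, cell):
    """Eq. (5.3): decays with angular offset from the viewing direction."""
    ang = bearing(fov.lat, fov.lon, cell.lat, cell.lon)
    delta = abs((ang - fov.theta + 180.0) % 360.0 - 180.0)  # smallest diff
    return 1.0 - 2.0 * delta / fov.alpha if delta <= fov.alpha / 2.0 else 0.0

def w_dwl(fov, cell, w_user):
    """DWL: spatial weight times the user-assigned cell weight w^U_c."""
    return w_spatial(fov, cell) * w_user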
5.1.6.2 Overlapping FOVs

It is often possible that a selected subset of FOVs F'_{i,j} of the same or different edge devices overlap, i.e., F'_{i,j} ⊆ F_{i,j}. For instance, F'_{6,6} = F_{6,6} = {f^1_4, f^2_2} covers the same cell c_{6,6} in Figure 5.2b. To define the weight of cells covered by multiple FOVs, we introduce the weight function f: X → Y, where X = {x ∈ R^{|F'_{i,j}|} | x = w_{(i,j),k}, ∀f_k ∈ F'_{i,j} ⊆ F_{i,j}} and Y = {y ∈ R | 0 ≤ y ≤ 1}, which computes a new weight in Y given the weights X = w_{(i,j),1}, w_{(i,j),2}, ..., w_{(i,j),|F'_{i,j}|} of cell c_{i,j} for the selected FOVs F'_{i,j} that cover it.

5.2 Model Placement

The model placement problem considers three aspects: a) the coverage area of the deployed model versions, b) the utilization cost of running the model versions on the edge devices and c) the utility value obtained by considering both the accuracy and the latency of the deployed model versions.

We first formulate the model placement problem as a Mixed Integer Quadratically Constrained Programming (MIQCP) problem. Given that the model placement problem is NP-Hard, it is inefficient to obtain optimal solutions for large-scale setups; thus, we propose several heuristics to solve it efficiently.

5.2.1 MIQCP Formulation

The placement decision of model version mv_i on edge device e is defined by a binary variable p^e_{mv_i}. Edge devices are assumed to run one model version per model class, hence we restrict the placement of multiple model versions for the same visual task:

\sum_{mv_i \in MV_m} p^e_{mv_i} \le 1, \quad \forall m \in M, \forall e \in E   (5.5)

Moreover, edge devices restrict the number of co-located model versions according to their maximum value e.k:

\sum_{m \in M} \sum_{mv_i \in MV_m} p^e_{mv_i} \le e.k, \quad \forall e \in E   (5.6)

5.2.1.1 Inference Latency & Utility

The inference latency of a model version mv_i running on edge device e depends on the resource constraints of the device. In addition, the successful detection of objects depends on the accuracy mv_i.acc. Usually, lighter models sacrifice inference accuracy for lower inference latency. The inference latency can also increase when multiple model versions are co-located on the same edge device.
Consequently, even if a more accurate model version is placed on the device, the increased latency can result in more missed objects, whereas a less accurate model version with lower latency can be more suitable. In this section, we model both the inference latency of a model version and the utility provided by its accuracy.

To model the inference latency, we consider the case where multiple model versions are deployed to the same edge device [21]:

IL^e_{mv_i} = L^e_{mv_i} + \sum_{m' \in M} \sum_{\substack{mv'_k \in MV_{m'} \\ mv_i \ne mv'_k}} a^e_{mv'_k}\, p^e_{mv'_k}\, L^e_{mv'_k}   (5.7)

The inference latency IL^e_{mv_i} is estimated via the interference coefficient a^e_{mv'_k} in terms of the inference latencies of the other model versions. By setting a^e_{mv'_k} = 0, placing a model version on a device will not have any effect on the inference latency of the other model versions, while a^e_{mv'_k} > 0 linearly increases the inference latency.

The model version placement is restricted to devices that can meet the maximum latency requirement:

IL^e_{mv_i}\, p^e_{mv_i} \le m.max\_l, \quad \forall m \in M, mv_i \in MV_m, \forall e \in E   (5.8)

To balance latency and accuracy, we define the utility of a model version [100]:

U^e_{mv_i} = mv_i.acc \cdot \frac{m.max\_l - IL^e_{mv_i}}{|m.max\_l - L^e_{mv_i}|}   (5.9)

The smaller m.max_l is, the more sensitive the utility is to model versions with lower inference latency, whereas larger values of m.max_l help to choose model versions with higher inference accuracy. Then the average utility is given by:

U = \frac{1}{\sum_{e \in E} \sum_{m \in M} \sum_{mv_i \in MV_m} p^e_{mv_i}} \sum_{e \in E} \sum_{m \in M} \sum_{mv_i \in MV_m} U^e_{mv_i}\, p^e_{mv_i}   (5.10)
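A small sketch of Eqs. (5.7) and (5.9), assuming the base latencies L and interference coefficients a are given as dictionaries keyed by model version (the names and example values are illustrative only):

def inference_latency(mv, placed, L, a):
    """Eq. (5.7): base latency of mv plus interference from the other
    model versions currently placed on the same device."""
    return L[mv] + sum(a[other] * L[other] for other in placed if other != mv)

def utility(mv_acc, max_l, il, base_l):
    """Eq. (5.9): accuracy scaled by the latency headroom left under the
    model class' latency requirement max_l."""
    return mv_acc * (max_l - il) / abs(max_l - base_l)

# Co-locating a second model inflates mv1's latency and lowers its utility.
L = {"mv1": 15.0, "mv2": 224.0}
a = {"mv1": 0.1, "mv2": 0.2}
il = inference_latency("mv1", placed={"mv1", "mv2"}, L=L, a=a)  # 59.8 ms
print(il, utility(0.72, max_l=1000.0, il=il, base_l=L["mv1"]))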
5.2.1.2 Coverage

We construct a coverage grid G^m for each model class. The idea is that whenever a model version mv ∈ MV_m is placed on a device e, the FOVs of the device F^e contribute to the coverage of the model class m. The weight of an FOV covering a cell is determined by one of the weighting schemes discussed in Section 5.1.6.1. By selecting FOVs that gradually increase the weight of m, we favor the placement of model versions towards less covered areas, in order to increase the chances of detecting objects. Additionally, keeping a separate coverage grid for each model class provides the flexibility to assign different weights in the user-assigned weighting scheme of Eq. (5.4).

More formally, we define binary variables over all devices' FOVs and cells for each model class, x^m_{f^e_k} and x^m_{c_{i,j}}, respectively. x^m_{f^e_k} indicates whether f^e_k is selected, and x^m_{c_{i,j}} whether cell c_{i,j} is covered by at least one selected FOV for model class m. Note that we do not need to create variables per model version, since only one version is allowed to be deployed per model class, according to constraint (5.5).

A cell for model class m is considered covered if a model version of the same class is deployed and at least one FOV that covers the cell is selected:

\sum_{f^e_k \in F^{e,m}_{i,j}} p^e_{mv_i}\, x^m_{f^e_k} \ge x^m_{c_{i,j}}, \quad \forall m \in M, mv_i \in MV_m, \forall e \in E   (5.11)

The value K^e_{mv_i} of a model version mv_i depends on the latency IL^e_{mv_i} and is bounded by the total number of the device's FOVs:

K^e_{mv_i}\, IL^e_{mv_i} \le p^e_{mv_i}\, |F^e|, \quad \forall m \in M, mv_i \in MV_m, \forall e \in E   (5.12)

Then, the number of selected FOVs of a device e for model class m should not exceed the maximum number K^e_{mv_i} that the inference latency allows:

\sum_{f^e_k \in F^{e,m}} x^m_{f^e_k} \le K^e_{mv_i}, \quad \forall m \in M, mv_i \in MV_m, \forall e \in E   (5.13)

Then, to capture the individual weight of a cell c_{i,j} covered by a selected FOV f^e_k for model class m, we create auxiliary variables n^m_{c_{i,j}, f^e_k}:

n^m_{c_{i,j}, f^e_k} = w_{(i,j),k}\, x^m_{f^e_k}   (5.14)

The final value of the cell c_{i,j} is obtained by assigning the maximum value over all contributing FOVs:

v^m_{c_{i,j}} = \max_{f^e_k \in F^{e,m}_{i,j}} n^m_{c_{i,j}, f^e_k}   (5.15)

Finally, we compute the average coverage:

C = \frac{1}{|M|\,|G|} \sum_{m \in M} \sum_{c_{i,j} \in G^m} v^m_{c_{i,j}}\, x^m_{c_{i,j}}   (5.16)

5.2.1.3 Utilization

Utilization represents the usage of the limited resources of the edge devices. Focusing on the memory capacity of each edge device, we define the utilization as:

\rho_e = \frac{1}{e.mc} \sum_{m \in M} \sum_{mv_i \in MV_m} p^e_{mv_i}\, mv_i.mem \le 1, \quad \forall e \in E   (5.17)

During model placement, the utilization of resources should be balanced across devices. For example, it is risky to have devices working close to 100% capacity, because they become more sensitive to processes running in parallel, which can cause congestion. Inspired by research in traffic engineering [45, 26], we introduce a utilization cost function which increases exponentially as the device approaches full capacity, i.e., over-utilization. In addition, we expect all edge devices to participate in the model placement; hence, we adapt the cost function to also penalize edge devices with low utilization, i.e., under-utilization.

We approximate the utilization cost function with a piecewise linear function Y, as illustrated in Figure 5.4. Device utilization between 30%-60% does not incur any cost. Under-utilization between 0%-30% incurs a cost which grows exponentially towards 0. Over-utilization between 60%-100% also grows exponentially, at a higher rate than under-utilization. For each linear segment y_i ∈ Y and device e we apply y_i(ρ_e) = a·ρ_e - b, with the parameters listed in Table 5.2. Thereafter, the utilization cost κ_e of a device is approximated by the following constraint:

\kappa_e \ge y_i(\rho_e), \quad \forall e \in E, \forall y_i \in Y   (5.18)

Figure 5.4: Estimating the utilization cost function, which penalizes both under- and over-utilization.

The average utilization cost across all edge devices is then calculated as:

UC = \frac{1}{|E|} \sum_{e \in E} \kappa_e   (5.19)

Table 5.2: Utilization cost parameters

Under-utilization parameters
a:  -336   -84    -16    -4     -1     0
b:  -4.8   -3.2   -1.6   -0.8   -0.3   0

Over-utilization parameters
a:  1     2      4      8      16      32      64      128
b:  0.6   1.25   2.65   5.65   12.05   25.65   54.45   115.25
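Since κ_e is lower-bounded by every segment, the cost at a given utilization is simply the maximum of the linear pieces. A small sketch using the Table 5.2 parameters:

# Piecewise-linear utilization cost of Table 5.2: each segment is
# y_i(rho) = a * rho - b, and the cost is the maximum over all segments,
# which is exactly what the constraints kappa_e >= y_i(rho_e) enforce.
SEGMENTS = list(zip(
    [-336, -84, -16, -4, -1, 0, 1, 2, 4, 8, 16, 32, 64, 128],
    [-4.8, -3.2, -1.6, -0.8, -0.3, 0, 0.6, 1.25, 2.65, 5.65,
     12.05, 25.65, 54.45, 115.25]))

def utilization_cost(rho):
    """Cost kappa_e for device utilization rho in [0, 1]."""
    return max(a * rho - b for a, b in SEGMENTS)

# 45% utilization falls in the no-cost band; 95% is heavily penalized.
print(utilization_cost(0.45), utilization_cost(0.95))   # 0.0, 6.35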
5.2.1.4 Objective

The model placement's objective is to maximize the average utility (5.10) and the average coverage (5.16), while minimizing the average utilization cost (5.19) at the edge devices:

\max \; w_1 U + w_2 C - w_3 UC
\text{s.t.} \quad \sum_{i=1}^{3} w_i = 1,
\text{constraints } (5.5), (5.6), (5.8), (5.11), (5.12), (5.13), (5.15), (5.17), (5.18)   (5.20)

5.2.2 Baseline Heuristics

Solving the placement problem optimally for a large number of edge devices and models is impractical. In this section we provide several baseline heuristics that solve the model placement problem approximately.

The outline of the baselines is given in Algorithm 5.5. The set MV is created, containing all model versions of all model classes (Line 2). Then, two subroutines (Lines 3 and 4) are called to place model versions on the devices and, given the placement, to select the FOVs, respectively. For the placement in Line 3, one of the random or greedy heuristics discussed below is called. Subsequently, Algorithm 5.6, with the placement p as input, iterates over the model versions and selects a subset of FOVs (Lines 3-9). The maximum number of FOVs K^e_{mv_i} selected by the subroutine selectSubsetFOVs is restricted by Eq. (5.12). selectSubsetFOVs uniformly selects and returns a subset of FOVs F'^e ⊆ F^e, |F'^e| = K^e_{mv_i}, and x^m_{f^e_k} is set to 1 for all FOVs in F'^e.

1: procedure BaselinePlacement(E, M, P)
2:   MV ← ∪_{m∈M} MV_m
3:   p ← one of R, LMF, MAF, MPF, MPORF, MIF
4:   x ← selectFOVs(p)
5:   U ← Utility(p)        ▷ Eq. (5.10)
6:   C ← Coverage(x)       ▷ Eq. (5.16)
7:   UC ← Utilization(p)   ▷ Eq. (5.19)
8: end procedure

Figure 5.5: Baseline placement algorithm

1: procedure selectFOVs(p)
2:   x ← 0
3:   for all (mv_i, e) : p^e_{mv_i} == 1 do
4:     K^e_{mv_i} ← |F^e| / IL^e_{mv_i}
5:     F'^e ← selectSubsetFOVs(F^e, K^e_{mv_i})
6:     for all f^e_k ∈ F'^e do
7:       x^m_{f^e_k} ← 1
8:     end for
9:   end for
10:  return x
11: end procedure

Figure 5.6: FOV selection algorithm

Finally, given the placement and FOV selection, Algorithm 5.5 computes the utility, coverage and utilization cost (Lines 5-7), calculated by Eqs. (5.10), (5.16) and (5.19), respectively.

5.2.2.1 Random (R)

The R placement in Algorithm 5.7 iterates over all devices and randomly shuffles the model version set (Line 4). Then, it attempts to place model versions on the device until the memory capacity is reached.

1: procedure R(E, MV)
2:   p ← 0
3:   for all e ∈ E do
4:     MV' ← shuffle(MV)
5:     p^e_{mv_i} ← Place(e, p, MV')
6:   end for
7:   return p
8: end procedure

Figure 5.7: R algorithm

The subroutine Place in Algorithm 5.8 checks whether placing model version mv_i on device e, by setting p^e_{mv_i} = 1, violates any of the constraints on model version classes (5.5), maximum co-located models (5.6), maximum latency (5.8) or device utilization (5.17); if not, mv_i is placed on device e.

1: procedure Place(e, p, MV')
2:   for all mv_i ∈ MV' do
3:     if p^e_{mv_i} == 1 violates any of (5.5), (5.6), (5.8), (5.17) then
4:       p^e_{mv_i} ← 0
5:     else
6:       p^e_{mv_i} ← 1
7:     end if
8:   end for
9:   return p
10: end procedure

Figure 5.8: Place algorithm

5.2.2.2 Least Memory First (LMF)

The LMF heuristic in Algorithm 5.9 prioritizes model versions which consume the least memory and tries to fit as many as possible until the resources are exhausted. Before the placement, the model version set MV is sorted by the memory requirement of each mv_i in ascending order (Line 3).

1: procedure LMF(E, MV)
2:   p ← 0
3:   MV' ← sort(MV, "asc", mv_i.mem)
4:   for all e ∈ E do
5:     p^e_{mv_i} ← Place(e, p, MV')
6:   end for
7:   return p
8: end procedure

Figure 5.9: LMF algorithm

5.2.2.3 Most Accurate First (MAF)

The MAF heuristic greedily picks the most accurate model versions first. The procedure is similar to LMF, except that the model versions are sorted in descending order of mv_i.acc.
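The baselines above share one skeleton: order the model versions somehow, then greedily place them per device while the constraints permit. A compact sketch of that shared pattern; the constraint checks of Algorithm 5.8 are abstracted into a feasible() predicate passed by the caller, which is assumed rather than shown.

import random

def greedy_place(devices, versions, feasible, order_key=None, shuffle=False):
    """Shared skeleton of R / LMF / MAF: order the model versions, then
    try to place each one on every device while feasible() (i.e.,
    constraints (5.5), (5.6), (5.8) and (5.17)) holds."""
    placement = {e: [] for e in devices}
    for e in devices:
        mv_order = list(versions)
        if shuffle:
            random.shuffle(mv_order)          # R
        elif order_key is not None:
            mv_order.sort(key=order_key)      # LMF / MAF / ...
        for mv in mv_order:
            if feasible(e, placement[e], mv):
                placement[e].append(mv)
    return placement

# LMF: least memory first; MAF: most accurate first.
# lmf = greedy_place(E, MV, feasible, order_key=lambda mv: mv.mem)
# maf = greedy_place(E, MV, feasible, order_key=lambda mv: -mv.acc)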
5.2.2.4 Most Popular First (MPF)

The MPF heuristic prioritizes the placement of models that can detect the most popular (or frequent) objects, placing those model versions first in order to maximize the detection rate. Note that MPF assumes that the distribution of objects of interest is known in advance, which is not a realistic assumption in a real setting; regardless, we consider this baseline heuristic to compare our techniques against. The MPF algorithm in Algorithm 5.10 sorts the model versions in order of the most popular object (Line 3). Then, each device attempts to place the model versions in that order.

1: procedure MPF(E, MV, P)
2:   p ← 0
3:   MV' ← sort(MV, "desc", |P_m|)
4:   for all e ∈ E do
5:     p^e_{mv_i} ← Place(e, p, MV')
6:   end for
7:   return p
8: end procedure

Figure 5.10: MPF algorithm

5.2.2.5 Most Popular On Route First (MPORF)

MPORF in Algorithm 5.11 extends the previous heuristic to sort the model versions (Line 4) by the most popular objects P_e on each device e's route.

1: procedure MPORF(E, MV, P)
2:   p ← 0
3:   for all e ∈ E do
4:     MV' ← sort(MV, "desc", |P_{e,m}|), P_{e,m} ⊆ P_e
5:     p^e_{mv_i} ← Place(e, p, MV')
6:   end for
7:   return p
8: end procedure

Figure 5.11: MPORF algorithm

5.2.2.6 Most Important First (MIF)

MIF in Algorithm 5.12 calculates an importance score on the route of each device based on the term frequency (tf) and inverse document frequency (idf) of each object. For each object p_i ∈ P_e on the route of edge device e, the tf and idf are calculated (Lines 3-8). For each edge device e and model class m, the importance score sums the tf-idf scores of objects of the same model class (Line 10). The final Score(e, m) assigns a higher score to model classes whose objects appear many times and within a small number of devices. Model versions are then sorted based on the importance score.

1: procedure MIF(E, MV, P)
2:   p ← 0
3:   for all p_i ∈ P_e do
4:     for all e ∈ E do
5:       tf(p_i, e) ← f_{p_i,e} / Σ_{p_j ∈ P_e} f_{p_j,e}
6:     end for
7:     idf(p_i, E) ← log( |E| / (1 + |{e ∈ E : p_i ∈ P_e}|) )
8:   end for
9:   for all e ∈ E do
10:    Score(e, m) ← Σ_{p_i ∈ P_{e,m}} tf(p_i, e) · idf(p_i, E)
11:    MV' ← sort(MV, "desc", Score(e, m)), mv_i ∈ MV
12:    p^e_{mv_i} ← Place(e, p, MV')
13:  end for
14:  return p
15: end procedure

Figure 5.12: MIF algorithm

5.2.2.7 Most Profitable First (MPRF)

The MPRF heuristic formulates the assignment of model versions to edge devices as a combinatorial optimization problem. Given the set MV, where each model version has as weight its required memory mv_i.mem and an associated value, and an edge device e, how can model versions be placed on e such that the total weight of the selected mv_i does not exceed the memory capacity e.mc and the total value is as large as possible? The value is defined as the importance score combined with the utility, Score(e, m) · U^e_{mv_i}. The intuition behind MPRF is that Score(e, m) first assigns the highest score to model class m for device e, as discussed before. However, there are multiple model versions to choose from within model class m; hence, we use the utility U^e_{mv_i}, which is a function of the model version's accuracy and inference latency, to choose among them.

Theorem 2. MPRF is NP-Hard.

Proof. The NP-Hardness of MPRF can be derived directly from a reduction of the 0-1 Knapsack Problem (0/1KP), i.e., 0/1KP ≤_p MPRF. Given a finite set of n elements X = {x_1, x_2, ..., x_n}, x_i ∈ {0, 1}, each with a weight w_i and a value v_i, and the maximum capacity of the knapsack W, 0/1KP is the problem of maximizing the total value of the selected elements, which are restricted to one copy each, while the sum of their weights is less than or equal to the knapsack capacity W. For any 0/1KP instance, we reduce it to an instance of MPRF in polynomial time.
The reduction is straightforward. The knapsack of 0/1KP is represented as an edge device e in MPRF, restricted by the maximum memory capacity, i.e., e.mc = W. For each element x_i in 0/1KP, we create a new model class m and a model version mv_i in MPRF. For all model classes we set Score(e, m) = 1. We set the utility of mv_i to U^e_{mv_i} = v_i; hence, the value of mv_i is equal to v_i. Additionally, we set the memory requirement of mv_i to be equal to the weight of the element, i.e., mv_i.mem = w_i. This construction takes O(n) time. Thereafter, the output with the selected model versions in MPRF is used to set x_i = 1 in 0/1KP, which completes the proof.

Note that, because the utility is a function of the inference latency IL^e_{mv_i}, which changes value depending on which mv_i are already placed on the device e (Eq. 5.7), we can use solutions that approximate the quadratic knapsack problem (QKP) [48]. To solve the MPRF problem we incorporate a dynamic programming heuristic which, although it yields only relatively good quality solutions, can serve as a lower bound of the optimal solution [44]:

c(i, w) = \begin{cases} \max\{c(i-1, w),\; c(i-1, w - mv_i.mem) + Score(e, m) \cdot U^e_{mv_i}\}, & \text{if } mv_i.mem \le w \\ c(i-1, w), & \text{otherwise} \end{cases}   (5.21)

5.2.2.8 Spatial Coverage (SC)

Unlike the previous heuristics, which do not account for FOV coverage during the placement, the SC heuristic prioritizes model versions on devices that increase the coverage by the largest margin. More formally, the SC problem is stated as follows.

Problem 2. SC Problem: Given the devices E with their generated FOVs F^e, a set of model classes M along with their model versions MV_m and coverage grids G^m, and the weights w_{(i,j),k} for each pair of FOV and cell, the SC placement problem finds the subset MV' ⊆ MV and the FOVs F'^e to select at each device e, s.t. the total weight of the covered cells is maximized while none of the constraints (5.5), (5.6), (5.8), (5.17) is violated.

Theorem 3. SC is NP-Hard.

Proof. The proof of SC's hardness comes from a reduction of the Maximum Coverage Problem (MCP), i.e., MCP ≤_p SC. The reduction proof is similar to the approach in our prior work [32]. Given a finite set of elements, called the universe X = {x_1, x_2, ..., x_m}, a collection of sets S = {S_1, S_2, ..., S_n}, S_i ⊆ X, whose union equals the universe, i.e., ∪_{i=1}^{n} S_i = X, and a budget value k, Maximum Coverage is the problem of finding a subset S' ⊆ S s.t. |S'| ≤ k and |∪_{S_i ∈ S'} S_i| is maximized. For any MCP instance, we reduce it to an instance of SC in polynomial time.

First, we create a dummy edge device e, model class m and model version mv. Elements x_j in MCP are represented as cells c_j in SC, X = G^m. This construction takes O(|X|) time. Additionally, for each set S_i (a collection of cells) in MCP, we create the FOV f^e_i and the FOV coverage set C^e_i in SC. We assign a weight w^B_{i,j} = 1 when element x_j ∈ S_i, and w^B_{i,j} = 0 otherwise, according to Eq. (5.1). The cell weight function f, which defines the new weight for overlapping FOVs, is set to return 1 when c_j (thus x_j) is not part of the current solution, and 0 otherwise, to prevent double-counting an element. This mapping takes O(nm) time. The construction ensures that elements covered by S_i are represented as cells covered by FOV f_i in C_i (S = C_f). Based on the maximum budget k, we set IL^e_{mv} = |F^e| / k, s.t. K^e_{mv} = k. Therefore, SC's output of FOVs is exactly S' in MCP, which completes the proof.

Given that SC cannot be solved optimally in polynomial time, we propose an approximate solution to the problem.
The outline of the algorithm is shown in Algorithm 5.13. The algorithm starts with an empty global tentative solution GS and, at each iteration, adds the pair ⟨e, mv⟩ that improves the coverage by the largest margin across all model classes. To pick the best placement efficiently from multiple candidate ⟨e, mv⟩ pairs, we use a priority queue PQ (implemented with a maximum heap) that keeps the candidate solutions in order of their largest weight difference. Candidate solutions in PQ are updated as needed in each iteration.

At first, Algorithm 5.13 initializes the global solution GS, the solution state MS, the uncovered cells MU, the FOV-cell pairs MF for each model class, and the priority queue PQ (Lines 2-13).

Then, the subroutine candidateSolutions (Algorithm 5.14) is called to construct the candidate solutions for each model class and insert them into the priority queue. Algorithm 5.14, for each pair ⟨e, mv⟩ that can be placed (Line 5), calculates a device solution DS by calling getDeviceSolution (Algorithm 5.16). The DS contains the best subset of FOVs (limited by K^e_{mv}) that increases the coverage the most. A new candidate solution cps is created and added to the set CPS. Finally, all candidate solutions in CPS are returned. The PQ, implemented as a maximum heap, keeps them in order of cps.weightDiff.

1: procedure SC(E, M, MV)
2:   GS ← ∅                 ▷ Global solution
3:   MS ← ∅                 ▷ Solution state for each model class
4:   MU ← ∅                 ▷ Uncovered cells for each model class
5:   MF ← ∅                 ▷ FOV-cell pairs for each model class
6:   PQ ← ∅                 ▷ Priority queue for all candidate solutions
7:   for all m ∈ M do       ▷ Initialize MS, MU, MF
8:     MS.insert(m, ∅)
9:     MU.insert(m, C)
10:    for all e ∈ E do
11:      MF.insert(m, (F^e, C^e))
12:    end for
13:  end for
14:  for all m ∈ M do       ▷ Initial candidate construction
15:    S ← MS.get(m)
16:    U ← MU.get(m)
17:    (F, C) ← MF.get(m)
18:    CPS ← candidateSolutions(m, U, F, C, S, GS)
19:    PQ.insertAll(CPS)
20:  end for
21:  while true do          ▷ Iteratively improve the solution
22:    updateSolutions(PQ, GS, MS, MU, MF)
23:    bestCPS ← PQ.peek()
24:    if bestCPS == null then
25:      break
26:    end if
27:    DS ← bestCPS.DS
28:    e ← bestCPS.e
29:    mv ← bestCPS.mv
30:    m ← mv.m
31:    PQ.poll()
32:    U ← MU.get(m)
33:    U ← U \ DS.C
34:    GS ← GS ∪ (e, mv, DS.F)
35:  end while
36: end procedure

Figure 5.13: SC algorithm

1: procedure candidateSolutions(m, U, F, C, S, GS)
2:   CPS ← ∅
3:   for all e ∈ E do
4:     for all mv ∈ MV_m do
5:       if p^e_{mv} == 1 violates any of (5.5), (5.6), (5.8), (5.17) then
6:         continue
7:       end if
8:       K^e_{mv} ← getMaxNumberOfSelectedFOVs(GS, e, mv)
9:       DS ← getDeviceSolution(S, K^e_{mv}, F^e, C^e, U)
10:      previousWeight ← getSolutionWeight(S)
11:      newWeight ← getSolutionWeight(DS)
12:      weightDiff ← newWeight - previousWeight
13:      cps ← candidateSolution(weightDiff, DS, mv, e)
14:      CPS ← CPS ∪ cps
15:    end for
16:  end for
17:  return CPS
18: end procedure

Figure 5.14: Candidate solution construction algorithm

Algorithm 5.13 then enters the main loop (Line 21), which iteratively improves GS until no improvement can be made (Line 24). At each iteration, a search for the best candidate solution begins (Line 22). The procedure updateSolutions in Algorithm 5.15 makes the necessary updates to ensure that the best candidate solution is at the top of the PQ.

Algorithm 5.15 starts by polling candidate solutions from the PQ until one does not violate any constraints (Line 7). Given the current global solution GS (which might have been updated in previous rounds) and the pair ⟨e, mv⟩, the procedure getMaxNumberOfSelectedFOVs returns K^e_{mv}. Subsequently, a new device solution DS is generated (Line 14) and the candidate solution is updated. getDeviceSolution considers both the existing coverage solution S and the current uncovered cells U to select the best subset of FOVs. The candidate solution updates its weight difference (Lines 16-19), which is the difference in coverage weight between the previous and new device solutions. An important observation is that this weight difference is monotonically decreasing, since the coverage of cells only increases at each iteration. Hence, if the next candidate solution has less weight than the updated solution, we do not need to update it and the loop exits (Line 30); otherwise, we keep updating candidate solutions in the PQ.

1: procedure updateSolutions(PQ, GS, MS, MU, MF)
2:   while PQ ≠ ∅ do
3:     cps ← PQ.poll()
4:     e ← cps.e
5:     mv ← cps.mv
6:     m ← mv.m
7:     if p^e_{mv} == 1 violates any of (5.5), (5.6), (5.8), (5.17) then
8:       continue
9:     end if
10:    S ← MS.get(m)         ▷ Current solution state for the model
11:    U ← MU.get(m)         ▷ Set of uncovered cells for the model
12:    (F, C) ← MF.get(m)    ▷ Coverage info for the model
13:    K^e_{mv} ← getMaxNumberOfSelectedFOVs(GS, e, mv)
14:    DS ← getDeviceSolution(S, K^e_{mv}, F^e, C^e, U)
15:    cps.DS ← DS
16:    previousWeight ← getSolutionWeight(S)
17:    newWeight ← getSolutionWeight(DS)
18:    weightDiff ← newWeight - previousWeight
19:    cps.weightDiff ← weightDiff
20:    ocps ← null
21:    while PQ ≠ ∅ do
22:      ocps ← PQ.peek()
23:      if p^e_{mv} == 1 does not violate any of (5.5), (5.6), (5.8), (5.17) then
24:        break
25:      end if
26:      PQ.poll()
27:    end while
28:    PQ.insert(cps)
29:    if ocps == null OR weightDiff ≥ ocps.weightDiff then
30:      break
31:    end if
32:  end while
33: end procedure

Figure 5.15: Update candidate solutions algorithm

1: procedure getDeviceSolution(S, K, F, C, U)
2:   DS ← S
3:   U' ← U
4:   F' ← F
5:   currSolutionSize ← |DS|
6:   while (|DS| - currSolutionSize) ≤ K do
7:     bestFOV ← null
8:     bestCellSet ← null
9:     bestWeight ← 0
10:    for all f_i ∈ F' do
11:      uncoveredCells ← C_i ∩ U'
12:      weight ← computeWeight(f, uncoveredCells)
13:      coveredCells ← C_i \ U'
14:      if coveredCells ≠ ∅ then
15:        weight += computeResidual(f, DS, coveredCells)
16:      end if
17:      if weight > bestWeight then
18:        bestFOV ← f_i, bestCellSet ← C_i, bestWeight ← weight
19:      end if
20:    end for
21:    if bestFOV == null then
22:      break
23:    end if
24:    DS ← DS ∪ bestFOV, F' ← F' \ bestFOV, U' ← U' \ bestCellSet
25:  end while
26:  return DS
27: end procedure

Figure 5.16: Device solution algorithm
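This lazy-update strategy is the same idea as lazy greedy evaluation for monotone coverage functions: since a candidate's marginal gain can only shrink as the solution grows, a heap entry whose freshly recomputed gain still tops the heap is guaranteed to be the best choice. A minimal sketch of that pattern, with gain() standing in for getDeviceSolution's weight difference (an assumed callable, not the actual subroutine):

import heapq

def lazy_greedy(candidates, gain, budget):
    """Pick up to `budget` candidates by marginal gain, re-evaluating
    lazily. gain(cand, chosen) must be non-increasing as the chosen set
    grows, which SC's coverage weights guarantee."""
    chosen = []
    # Max-heap via negated gains; each entry remembers the round in
    # which its gain was last computed.
    heap = [(-gain(c, chosen), id(c), 0, c) for c in candidates]
    heapq.heapify(heap)
    while heap and len(chosen) < budget:
        neg_g, _, stamp, cand = heapq.heappop(heap)
        if stamp == len(chosen):
            if -neg_g <= 0:
                break                  # no candidate improves the coverage
            chosen.append(cand)        # gain is current: greedily take it
        else:
            g = gain(cand, chosen)     # stale entry: recompute and reinsert
            heapq.heappush(heap, (-g, id(cand), len(chosen), cand))
    return chosen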
The getDeviceSolution considers both the existing coverage solutionS and current uncovered cellsU to se- lect the best subset of FOVs. The candidate solution updates its weight dierence (Lines 16-19) which is the dierence in coverage weight between the previous and new device solution. An important observation is that this weight dierence is always monotonically decreasing, since the coverage of cells is increased at each iteration. Hence, if the next candidate solution has less weight than the updated solution, we do not need to update it and the loop exits (Line 30), otherwise we keep updating candidate solutions in thePQ. 85 1: procedureupdateSolutions(PQ;GS;MS;MU;MF) 2: whilePQ6=;do 3: cps PQ:poll() 4: e cps:e 5: mv cps:mv 6: m mv:m 7: if (p e mv == 1 violates any (5.5), (5.6), (5.8) (5.17)then 8: continue 9: endif 10: S MS:get(m) . Current solution state for model 11: U MU:get(m) . Set of uncovered cells for model 12: (F;C) MF:get(m) . Coverage info for model 13: K e mv getMaxNumberOfSelectedFOVs(GS;e;mv) 14: DSgetDeviceSolution(S;K e mv ;F e ;C e ;U) 15: cps:DS =DS 16: previousWeight getSolutionWeight(S) 17: newWeight getSolutionWeight(DS) 18: weightDiff newWeight -previousWeight 19: cps:weightDiff weightDiff 20: ocps null 21: whilePQ6=;do 22: ocps PQ:peek() 23: if (p e mv == 1 does not violate any (5.5), (5.6), (5.8) (5.17)then 24: break 25: endif 26: PQ:poll() 27: endwhile 28: PQ:insert(cps) 29: ifocps ==null ORweightDiffocps:weightDiff then 30: break 31: endif 32: endwhile 33: endprocedure Figure 5.15: Update candidate solutions algorithm 86 1: proceduregetDeviceSolution(S,K,F,C,U) 2: DS S 3: U 0 U 4: F 0 F 5: currSolutionSize jDSj 6: while (jDSjcurrSolutionSize)K do 7: bestFOV null 8: bestCellSet null 9: bestWeight 0 10: forallf i 2F 0 do 11: uncoveredCells C i \U 0 12: weight computeWeight(f;uncoveredCells) 13: coveredCells C i nU 0 14: ifcoveredCells6=;then 15: weight+ =computeResidual(f;DS;coveredCells) 16: endif 17: ifweight>bestWeightthen 18: bestFOV f i ,bestCellSet C i ,bestWeight weight 19: endif 20: endfor 21: ifbestFOV ==nullthen 22: break 23: endif 24: DS DS[bestFOV ,F 0 F 0 nbestFOV ,U 0 U 0 nbestCellSet 25: endwhile 26: returnDS 27: endprocedure Figure 5.16: Device solution algorithm 87 Table 5.3: Devices and Model Versions Prole DeviceType e:mc e:k ModelVersion mem acc JetsonNano 4 3 MobileNetV2 0.5 0.72 CoralEdgeTPU 1 3 ResNet 1 0.76 RaspberryPI 1 3 EcientNet 2 0.81 5.3 Experiments To evaluate the eectiveness and compare the proposed heuristics, we have conducted experiments in a real-world setting. 5.3.1 ExperimentalSetup All experiments were performed on an Ubuntu OS 18.04 equipped with a Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz and 128GB of RAM. We have collected 165 bus routes from the San Francisco Municipal Transportation Agency in General Transit Feed Specication (GTFS) format [115]. On each bus, we assume there is an edge device installed, with its camera collecting video as the bus travels on its pre-dened trajectory. To generate more ne- grained trajectories, the locations on the route were interpolated every 5 meters. For each point in the bus trajectory, we generated a FOV with sampled from a uniform distribution(40°50°), = 51 and R = 50m. To emulate the performance of edge devices, we introduce 3 types of devices currently used in the market: a) Jetson Nano, b) Coral Edge TPU and c) Raspberry PI Model B. In all experiments we split the types of edge devices equally to the total number of participating devices. 
For the set of objects of interest P, we collected 190 locations of 3 POI types: a) 42 Starbucks, b) 63 UPS and c) 85 landmarks. The dataset of bus routes and POIs is illustrated in Figure 5.17a. For each FOV and each bus route, we generate the set of covered POIs. Figure 5.17b plots the histogram of object ids and their covering bus routes, and Figure 5.17c the histogram of object ids and the number of FOVs. The plots suggest a power-law distribution, where objects at the city center (37.79, -122.40) are covered by multiple routes, while objects away from the city center have few FOVs and routes covering them.

Table 5.4: Devices and model versions latency

Device Type      Model Version   IL^e_mv (ms)   a^e_mv
Jetson Nano      MobileNetV2     15             0.1
                 ResNet          224            0.2
                 EfficientNet    631            0.3
Coral Edge TPU   MobileNetV2     8              0.3
                 ResNet          59             0.6
                 EfficientNet    31             0.9
Raspberry PI     MobileNetV2     400            0.5
                 ResNet          1661           1
                 EfficientNet    8144           1.5

Figure 5.17: The dataset consists of 165 trajectories and 190 objects of interest. (a) The map of bus trajectories and POIs. (b) Bus routes and object ids. (c) FOVs and object ids.

For each object type, we generated a model class m ∈ M, and for each model class m, 3 different model versions, one for each of MobileNetV2, ResNet and EfficientNet. The profiles of both the edge devices and the model versions are shown in Table 5.3. Without loss of generality, the model versions for each m share the same characteristics. The inference latency IL^e_mv and interference coefficient a^e_mv for all pairs of device type and model version are listed in Table 5.4. The maximum latency m.max_l for each model class was deliberately set to a high enough value to relax the constraint in Eq. (5.8).

5.3.2 Results

Weighting schemes: The SC heuristic supports several weighting schemes, discussed in Section 5.1.6. Figure 5.18a illustrates the recall for all weighting schemes, weight-averaged over a varying number of devices |E| = {15, ..., 165} in increments of 15. DWL, which combines both the cell weight and the distance from the FOV location, yields the best recall. We adopt this weighting scheme for the subsequent experiments.

Figure 5.18: Weighting schemes impact on SC. (a) Weighting schemes recall. (b) Cell size recall.

Impact of cell size: The effect of the grid cell size on the SC heuristic is shown in Figures 5.18b and 5.19a. We vary the cell size from w = R = 50 (the viewable FOV distance) down to w = 15 in decrements of 5.

Figure 5.19: Cell size impact on SC. (a) Cell size performance. (b) SC recall for varying noisy data.

Decreasing the cell size allows SC to assess the weighted spatial coverage at a finer granularity, which improves the recall; however, it comes at a higher performance cost, which grows quadratically. This is
because the coverage grid is larger, which impacts the performance of the getDeviceSolution procedure. At w = 20, decreasing the cell size further does not yield any improvement in recall but comes at a 2.2x cost in required time. We adopt this cell size in the experiments below.

Figure 5.20: Recall and performance of the placement heuristics. (a) Heuristics recall. (b) Heuristics average time.

Trade-off between Recall and Heuristics: In Figure 5.20a we compare the different heuristics in terms of their effectiveness in detecting objects by measuring the recall (y-axis) at different total numbers of participating edge devices (x-axis). At each number of edge devices, the set of objects taken into account in the recall calculation is obtained from the ground truth. Then, after the placement occurs, if any of the placed model versions with any of the selected FOVs matches the type of the object, we consider it a hit. There is a general upward trend where the recall increases with the number of participating devices. This is because, as the number of participating devices increases, the overlapping FOVs of multiple bus trajectories increase; hence, an object previously missed by one edge device can now be detected by another.

MAF performs the worst (< 20% recall), regardless of the number of devices. The reason is twofold: placing the most accurate model versions first consumes more memory, which prevents other model versions from being placed on the same device, and the placement might end up on routes with different objects. Not surprisingly, the random placement of R performs somewhere in the middle. The random assignment of model versions suffers from the same drawbacks as MAF; however, it performs better due to the placement of some lightweight model versions on devices, which allows others to be co-placed. LMF, which places the model versions consuming the least memory first, can fit multiple model versions on a device, which increases the recall to 68% at |E| = 165. MPF, MPORF, MIF and MPRF require perfect knowledge of the object locations. MPF sorts model versions based on the most popular object type, while MPORF performs slightly better since it adapts the popularity order to each edge device's route. SC does not require perfect knowledge of the objects and still outperforms the other heuristics, reaching 90% recall. SC considers the spatial coverage weight to decide how to place the model versions on the edge devices, which improves the placement.

Trade-off between Performance and Heuristics: In Figure 5.20b we run each placement heuristic 10 times and plot the average time (y-axis in logarithmic scale) required to place the model versions on a varying number of edge devices. In general, the required time grows linearly with the number of edge devices. For the SC heuristic, the high recall comes at a cost in running time: SC must search, at each round, for the candidate solution that increases the coverage by the largest margin. However, SC requires less than 5 minutes to place models on hundreds of devices (about 2.72 minutes for |E| = 165). This is acceptable because the model placement runs offline and needs to be updated only when the system setup changes (i.e., devices or models are added or changed) or the underlying location context changes.
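The hit criterion above translates directly into code. A minimal sketch of the recall computation, assuming each object carries a type attribute (here called kind, an illustrative name) and detects() stands in for the FOV-containment test of Section 5.1.2:

def recall(objects, placement, selected_fovs, detects):
    """Fraction of ground-truth objects counted as hits: an object is a
    hit if some device runs a model version of the object's class and
    one of that device's selected FOVs covers the object."""
    hits = 0
    for obj in objects:
        hits += any(
            mv.model_class == obj.kind
            and any(detects(fov, obj) for fov in selected_fovs[e])
            for e, mvs in placement.items()
            for mv in mvs)
    return hits / len(objects)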
Figure 5.21: Placement misses of the placement heuristics. (a) Number of missed objects. (b) Percentage of missed objects.

Placement Object Misses: To better understand the recall in Figure 5.20a, we fix the number of participating devices to |E| = 165 and categorize the reasons for missed objects into 3 classes: a) no coverage, b) latency and c) accuracy. No coverage means that the object was missed because the model that detects it was not placed on the route covering that object. Latency refers to missed objects for which, although the correct model was placed on their route, the co-located models increased the inference latency and hence the FOV was not selected. Accuracy refers to the inherent accuracy of the model version placed on the device: despite the model version being placed on the right device and the FOV being selected, the model version did not detect the object. The results are plotted as the number of missed objects in Figure 5.21a and as the percentage of each category in Figure 5.21b. MAF has 0 misses due to accuracy, as expected; however, its low recall is due to the fact that 80% of the time the model was not placed on the right edge device. The majority of missed objects for the R placement are split between no coverage and latency: for R, in 60% of the cases the model was not placed on the right device, and in about 35% the latency was too high (due to placing a large model on a slow device). LMF, by placing the smaller model versions first, allows other model versions to be co-placed. For example, the Jetson Nano has enough memory to fit 3 model versions of the smaller size, and the Coral Edge TPU and Raspberry PI can fit two MobileNetV2s instead of the larger ResNet. The increase in the overall number of placed model versions decreases the number of missed objects. Nevertheless, LMF's misses are 80% due to models not being placed on the correct bus routes. Having perfect knowledge about the objects that are more probable to appear (or will appear) on the route decreases the number of missed objects: MPORF and MIF show similar results, with fewer misses due to no coverage than MPF, because they prioritize model versions per edge device. SC outperforms the other heuristics by a factor of at least 2x in the number of missed objects. SC, unlike MPF, MPORF and MIF, does not assume the exact locations of the objects. The improvement comes from adjusting the coverage of each particular model class m: by continuously maximizing the weighted coverage while considering the existing coverage, it allows models to be placed on routes where they can detect the objects.

Figure 5.22: Utility, utilization cost and coverage of the heuristics. (a) Utility. (b) Utilization cost. (c) Coverage.

Robustness to noisy input: To evaluate the robustness of SC, we added 2D Gaussian noise to the latitude and longitude coordinates of the locations of the objects of interest P.
More precisely, for each p_i ∈ P, we generate a noise vector with a uniformly random direction in [0, 360) and a Gaussian-distributed magnitude drawn from N(0, σ²). In Figure 5.19b we plot the recall (y-axis) while varying the variance of the object locations for the SC heuristic. The recall of SC drops by only 6% on average when the variance is 500. The discrete grid-cell coverage structures allow SC to capture the importance of cells more accurately, even in the presence of noise.

Utility, Utilization Cost and Coverage: The utility, utilization cost and coverage are plotted in Figures 5.22a, 5.22b and 5.22c, respectively. MPF, MPORF and MIF share the same utility (0.72) and utilization cost (8.5) as LMF, because all of them first identify which model class is suitable based on the distribution of objects (either overall or on the trajectory) and then place the models with the least memory requirement first. The strategy of placing as many smaller model versions as possible helps LMF, MPF, MIF and MPORF achieve high coverage; however, they require accurate knowledge of the object locations. SC achieves better coverage, with balanced utility and utilization cost, without exact prior knowledge of the objects' locations.

Chapter 6

Related Work

6.1 Related Work for Edge AI

This section reviews the literature and explains how our approach advances the state of the art in the EC domain.

Balan et al. [19] propose the concept of cyber foraging for resource-constrained mobile platforms. Cyber foraging refers to the offloading of computation- and storage-heavy tasks to the nearest resource-rich servers. Our approach applies the principles of cyber foraging to resource-constrained mobile and IoT platforms. Flinn et al. [43] contribute Spectra for offloading computation based on the resource availability of neighboring servers. Spectra deals with the trade-off among performance, energy consumption, and Quality-of-Service requirements. Unlike Spectra, our work focuses on reducing the bandwidth requirement when offloading media streams. Pillai et al. [104] introduce Sprout, a framework for gesture-driven gaming applications that parallelizes media processing by distributing the computations to nearby servers, with a focus on increasing throughput while minimizing latency. Sprout concentrates on offloading challenges in a static environment, while our work concentrates on dynamic applications with evolving processing and storage requirements.

Iida et al. [62] present GPUrpc, a remote procedure call extension for offloading computation to powerful Graphics Processing Units (GPUs). GPUrpc is tested on wired networks for data mining and image processing applications, whereas our approach targets wireless networks with severe resource constraints. Ra et al. [105] introduce Odessa, a lightweight run-time for mobile devices, which offloads computation-intensive tasks to Internet-connected servers and parallelizes computation tasks on multi-core processing platforms. Odessa continuously monitors the resource availability of the platform, and adaptively offloads and parallelizes computations to improve accuracy and responsiveness. Saurez et al. [114] contribute a programming infrastructure to efficiently offload computations using APIs while providing support for dynamically adding or removing computation platforms in the fog.

EC for IoT: Due to the resource-constrained nature of IoT platforms, several solutions have been proposed to offload computation processes to an edge server [34, 113, 111, 2].
An offloading approach is introduced for an environmental monitoring application, in which the computations are transformed into a lightweight process and offloaded to an edge server close to the data sources [34]. Satyanarayanan et al. [113] highlight the increasing use of video cameras in smart spaces and contribute GigaSight for searching media data at Internet scale using cloudlets, which sit between the mobile phones and the cloud, for bandwidth optimization. Renart et al. [111] propose an edge processing framework for the IoT that schedules tasks based on the location of the source device and the availability of computation resources.

Crowdsourcing, Crowd Learning, and Crowd-based Learning: Crowdsourcing [23] is the practice of engaging a "crowd", or group of people, to perform a task (e.g., translating an article or labeling an image). Due to the ubiquity of smartphones, crowdsourcing has been extended to spatial crowdsourcing [69, 36, 4], which requires workers to be physically present at the location of a task in order to perform it (e.g., asking a journalist to record an event occurring at a certain place). With the increasing commercial adoption of both crowdsourcing (e.g., Amazon Mechanical Turk [12]) and spatial crowdsourcing (e.g., TaskRabbit [126]) in industry, researchers started investigating incentive-based learning algorithms that maximize the total reward of workers by exploiting their historical activities. This problem is known as crowd learning [101, 87]. In this work, we focus on integrating crowdsourcing, ML, and EC seamlessly. In particular, crowd-based learning refers to the mechanism of evolving a supervised ML algorithm (e.g., a learning model for classification or object detection) by exploiting the power of crowdsourced data (i.e., data collected by a crowd) and the crowd's feedback.

6.2 Related Work for Keyframe Extraction

Video frame extraction can be classified into three broader categories based on the methods used to extract frames: 1) visual, 2) machine learning, and 3) spatial metadata analysis.

Visual-based Video Frame Extraction: Dimitrova et al. [39] presented a method that compares the luminance (Y) and chrominance (Cr and Cb) of sequential video frames to identify significant differences, and then uses image histograms to filter keyframes. Ejaz et al. [40] proposed a strategy that initially samples frames with a skip factor; afterwards, it calculates a contribution value for consecutive frames based on color and structural properties and extracts the frames whose contribution value exceeds a threshold.

ML-based Video Frame Extraction: The first ML approaches to video frame extraction utilized unsupervised clustering-based techniques. The main idea is to create clusters of visually similar images based on image content (e.g., color histograms) and select a frame from each cluster based on a similarity metric (e.g., the frame closest to the cluster mean). De Avila et al. [14] presented the VSUMM (Video SUMMarization) mechanism, which uses pre-sampling to reduce the number of frames, constructs a 16-bin histogram from the HSV color space to obtain the feature vector, and uses the k-means algorithm to extract keyframes by selecting the frame closest to each cluster's centroid. Wu et al. [135] presented AdaFrame, an ML framework that adaptively selects frames for video recognition. AdaFrame contains an LSTM network augmented with a global memory that provides context information for searching which frames to use over time.

Spatial Metadata-based Video Frame Extraction: Kim et al.
[74] exploit the spatial metadata of frames to automatically generate panoramic images from crowdsourced mobile videos. Each video frame is augmented with a field-of-view [17] to calculate the coverage and overlap ratio.

6.3 Related Work for FL Client Selection

This section discusses the related work in federated learning and object detection.

Object Detection: Recent advancements in DNNs enable the development of high-performance deep-learning-driven object detectors such as Faster R-CNN [110], the Single Shot MultiBox Detector (SSD) [86], RetinaNet [83], EfficientDet [123] and You Only Look Once (YOLO) [109], to name a few. These state-of-the-art detection techniques aim to achieve high accuracy while running at high speed. Huang et al. [59] analyze the trade-off between accuracy and speed: the authors implement the Faster R-CNN [110], SSD [86] and R-FCN (Region-based Fully Convolutional Networks [35]) meta-architectures in a unified framework and vary the underlying feature extractor for a fair comparison.

Federated Learning (FL): FL was recently proposed by Google [76, 94]. The authors introduce the Federated Averaging (FedAvg) algorithm, which aggregates the client models into a single global model by periodically averaging the resulting model updates at the central server. Experiments were performed on MNIST and CIFAR-10 for classification models and on an LSTM language model, demonstrating the effectiveness of FedAvg. The authors investigated the effect of distributing the training dataset to clients in IID and Non-IID fashion, along with other hyperparameters such as the client fraction C, batch size B, and number of local epochs E. Since its introduction in [94], FL has gained much attention, with work focusing on solving its statistical and system challenges [119, 139, 98, 22, 133]. In [119], MOCHA, a federated multi-task learning framework, is proposed to train separate but related models simultaneously. To improve the convergence of the trained model on Non-IID datasets, a small global training dataset was proposed in [139], and a theoretical convergence analysis for Non-IID datasets is established in [82].

FedCS [98] proposes a client selection protocol to improve the training procedure in the presence of stragglers (devices with limited computational resources). Initially, the protocol requires the clients to share their resource capacities, and then the centralized server estimates the time required to distribute, update and upload the trained model. The problem is formalized as an optimization problem, reduced to the knapsack problem, with the number of selected clients as the value to maximize and the time to complete the federated round as the constraint; a greedy algorithm is proposed to solve it approximately.

Agnostic Federated Learning (AFL) [96] introduces a minimax optimization scheme to optimize the performance of the single worst device. q-Fair Federated Learning [81] extends [96] and aims to make the accuracy distribution across devices more uniform by minimizing an aggregated reweighted loss and adjusting a fairness parameter q. Open problems in FL and the role of resource-constrained IoT devices in FL are surveyed in [67] and [63], respectively.

FL for Object Detection: Despite the popularity of FL, research applying federated learning to object detection problems is somewhat limited [88, 90, 136]. A federated object detection platform, FedVision, is introduced in [88].
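All of these systems build on the same FedAvg-style server step: a dataset-size-weighted average of the clients' locally trained weights. A minimal sketch of that step (our illustration, assuming weights are NumPy arrays keyed by parameter name; not code from [94] or [88]):

```python
import numpy as np

# FedAvg aggregation sketch: the server forms the new global model as a
# weighted average of client weights, weighting each client by the number
# of local training samples it holds. Illustrative only.
def fedavg(client_weights, num_samples):
    """client_weights: list of dicts {param_name: np.ndarray}, one per client;
    num_samples: list of local dataset sizes, aligned with client_weights."""
    total = float(sum(num_samples))
    return {
        name: sum((n / total) * w[name]
                  for w, n in zip(client_weights, num_samples))
        for name in client_weights[0]
    }
```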
FedVision is presented from a system's perspective, and its main components are discussed: 1) crowdsourced image annotation, 2) federated model training, and 3) federated model update. It allows clients to annotate their crowdsourced data locally and adopts YOLOv3 [109] as the detector to detect hazards (fire, smoke and disaster) from deployed CCTV cameras. The authors of [88] also created a real-world image dataset [90] for federated object detection, consisting of 900 annotated images with 7 object labels generated from 26 street cameras. Experiments with YOLOv3 and Faster R-CNN compare the accuracy and communication efficiency of the two models. A method to optimize federated object detection learning is introduced in [136]: the authors quantify the effect of Non-IID data using the Kullback-Leibler divergence (KLD) and propose Abnormal Weights Suppression, which trims weights that are far away from those of the nearest clients to reduce the divergence.

The closest studies to our work are in the area of client selection. Initial work on client selection [94] focused on simulating the dynamic nature of federated learning, where some clients may not be available to participate; for example, some clients can be offline due to battery level, network connectivity, etc. Hence, clients were selected uniformly at random (with replacement). This unbiased sampling scheme, also used by recent studies [82, 132], shares the same convergence properties as centralized SGD methods because, in expectation, it equals full client participation. In this work, we introduce biased client selection methods. Although our methods do not guarantee full client participation, our experiments show faster convergence compared to the random selection strategy.

In [98], client selection was used to tackle the presence of stragglers. Stragglers are online clients who, due to limited capabilities or restrictions (e.g., low processing power, low memory, limited bandwidth, low battery), are unable to complete the training round on time; in a synchronous FL procedure, training is delayed until the server receives responses from all selected clients. The authors in [98] select as many clients as possible that are able to train and upload the locally trained model within a predefined time schedule. Other studies [82, 68] consider selecting clients from the distribution p_k, the fraction of the dataset held at each client, independently and with replacement. Our work focuses on selecting clients based on the label distribution of the training images and across the geographical areas where the data were collected. Client selection approaches based on device capabilities (such as those proposed in [98]) are orthogonal to our work and can be combined with our selection algorithms.

6.4 Related Work for Model Placement

Prior research on mobile edge computing focused on offloading the processing cost of highly computationally demanding applications to edge nodes or cloud servers (e.g., [18]), due to the inability of the devices to run complex tasks. However, recent hardware developments in edge devices, which are now equipped with powerful CPUs and GPUs, along with the fast-growing research on edge-optimized Deep Neural Networks, have made it possible to execute computationally demanding models at the edge (e.g., [46]). Organizations are increasingly interested in combining ML with EC to reduce network traffic and improve the response time of their applications.
Sending camera frames to the cloud for inference introduces additional network/queuing delays and is unsuitable for applications requiring strict low latency. Additionally, sending raw data, which may include sensitive information such as license plates, faces or speech, to the cloud is prone to privacy leaks, as users may be wary of how these data are being used by the application. As a result, new solutions that efficiently integrate ML with EC are continuously emerging from the AI research community and industry.

6.4.1 Deep Learning with Edge Computing

Various architectures have been proposed for performing DNN inference on edge devices. They can be grouped into three broader designs depending on where the computation is performed, as shown in Figure 6.1: a) at edge nodes close to the edge devices, b) divided between edge devices, edge nodes and cloud servers, and c) at the edge device itself. Other variations of these architectures are reported in [27].

6.4.1.1 Edge Node Inference

Large and complex DNNs require substantial computing resources for real-time inference, which may not be available on the edge devices. Hence, instead of computing the inference at the edge, the data are offloaded to a nearby edge node, which computes and returns the result to the edge device, as shown in Figure 6.1a. The edge node, as opposed to a cloud server, is close to the edge device so as to reduce the communication latency and answer inference queries quickly.

In this setting, studies focus on two aspects: a) data preprocessing at the edge device and b) resource management at the edge node. With data preprocessing, the raw data are processed at the edge device to reduce their size and, subsequently, the communication cost of uploading them to the edge node. For instance, images of low quality are dropped and the remaining ones are cropped to include the objects of interest before being uploaded to identify types of food [85]. To support real-time object detection on mobile devices, Glimpse [28] selectively offloads video frames to an edge node by tracking the camera frames and filtering them with a change detection filter. Edge nodes receive inference requests from many edge devices; hence, to ensure stable performance, their resources need to be efficiently managed and shared. Various works in this domain examine the trade-offs between latency and accuracy to choose the best DNN configuration [137], consider computation across hierarchical edge and cloud nodes [60], and share common layers via transfer learning to reduce the overall computational resources needed per request [65].

6.4.1.2 Cross-Device Inference

In the previous section, the whole DNN computation is offloaded to and performed on the edge node. However, as edge devices become more powerful, they are capable of selectively executing DNN inference locally, as shown in Figure 6.1b. Studies of methods that intelligently offload the computation from edge devices to edge nodes and cloud servers can be classified by the offloading method: a) a binary decision is made whether to offload the whole DNN computation or not, b) a part of the computation is offloaded, c) the offloading is hierarchical, or d) the computation is distributed across multiple peer devices.

Several studies that follow a binary decision on whether to offload the DNN computation to edge nodes rely on factors such as power consumption, network bandwidth, latency and cost [53, 107, 106].
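Such a binary policy can be reduced to comparing estimated end-to-end costs for the local and remote paths. The sketch below is a simplified illustration with assumed profiling inputs; it is not taken from any of the cited systems.

```python
# Binary offload decision sketch: offload only if the remote path is both
# faster and no more energy-hungry than running the model locally.
# All inputs are assumed to come from empirical profiling.
def should_offload(frame_bytes, bandwidth_bps, local_latency_s,
                   remote_compute_s, local_energy_j, tx_energy_j_per_byte):
    upload_s = 8.0 * frame_bytes / bandwidth_bps     # transmission delay
    remote_latency_s = upload_s + remote_compute_s   # end-to-end remote path
    remote_energy_j = tx_energy_j_per_byte * frame_bytes
    return (remote_latency_s < local_latency_s and
            remote_energy_j <= local_energy_j)
```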
In these studies, an optimization-based approach uses empirical measurements of accuracy, latency and power consumption to decide intelligently whether to offload or not. In a typical case of binary offloading, edge nodes are equipped with a powerful version of a model, while a simpler model is used on the edge device.

Partial offloading partitions the model at different layers of the DNN; for example, the base layers of the model are computed at the edge device and the top layers are computed on an edge node. Partial offloading can increase the bandwidth savings because raw data, such as images or video frames, are larger in size than the intermediate results calculated by the base layers [140]. Some studies shift the offloading process from edge nodes to cloud servers [79] or use a hierarchy of devices [127]: despite increasing the communication latency, the powerful cloud servers can process inference requests much faster, reducing the overall processing time and potentially improving the rate at which requests reaching the edge nodes are served. Other studies follow a distributed computing approach that offloads the DNN computation from an edge device to other peer edge devices according to their processing and memory constraints [140, 93].

6.4.1.3 Edge Device Inference

To reduce the communication cost and to support real-time applications, researchers have focused on optimizing model architectures and the hardware of resource-constrained edge devices to enable fast on-device inference, as shown in Figure 6.1c.

Various DNN models have been designed to reduce the number of model parameters and hence the memory required to load and feed-forward the model. MobileNet V1 [57] and its subsequent improvements [112, 58], EfficientNet [124], SqueezeNet [61], the Single Shot Detector (SSD) [86] and YOLO [109] are designed to execute inference quickly and with high accuracy. These open-source models and their pre-trained weights are available for download on many popular platforms such as TensorFlow [1].

The models can be further reduced in size and complexity, while preserving similar accuracy, with model compression techniques such as parameter quantization, pruning and knowledge distillation. With quantization, floating-point model parameters are converted to types of reduced precision, such as 16-bit floats or 8-bit integers, reducing the model size by up to 75% and speeding up inference by about 2x. With model pruning, less important parameters that have low impact on accuracy are removed from the model, reducing the model complexity further [54]. With knowledge distillation, a cumbersome DNN model is used to "teach" a more compact model [56]; typically, the logits of the larger network (the inputs to the final softmax layer) are used to train the smaller network to approximate its learned function.

The emergence of devices such as NVIDIA graphics processing units (GPUs) and Google tensor processing units (TPUs), along with improvements in specialized AI chips embedded in edge devices, has sped up inference. Customized AI accelerator application-specific integrated circuits (ASICs) are now integrated into mobile and IoT devices, such as Google's Pixel Neural Core, which instantiates the Edge TPU architecture, Apple's Neural Engine, and Intel's Movidius Vision Processing Units (VPUs). Additionally, vendors offer products such as the NVIDIA Jetson, Intel Neural Compute Stick 2 and AWS DeepLens that bring AI computing to the edge out-of-the-box.
This dedicated neural network hardware is able to compute hundreds of billions of operations per second while being more energy-efficient than either the mobile CPU or GPU.

IoT & Sensor Placement: Optimal sensor placement is a well-studied field. Closest to our work is the sensor placement research that considers effective coverage [38, 37]. However, that problem differs from ours: in our setting, the edge devices are either already placed or operate on predefined trajectories, and we investigate how to place the DNN models on them considering their resources, model complexity and geospatial coverage.

Data and Services Placement: Many research efforts in the context of fog computing aim to reduce the latency of IoT applications by optimizing the placement of either IoT services [92, 117, 125] or their data [97], considering the limited resources of the devices. The main focus of these works is to reduce latency between nodes and devices to promote Quality of Service (QoS), whereas our focus is to detect more objects at the edge device given its movement in a geographical region.

[Figure 6.1: DNN inference architectures with Edge Computing: (a) inference on an edge node close to the edge device; (b) inference across devices (edge device, edge node and cloud server exchanging intermediate results); (c) inference on the edge device.]

DNN Placement: The study closest to our DNN placement work is by Bensalem et al. [21], which considers utilization cost and latency when performing inference at edge nodes close to devices. Our work considers utilization cost, incorporates latency as part of the utility function and, in addition, models location and coverage during placement.

Chapter 7

Conclusion and Future Work

7.1 Summary

In this thesis, we investigated how the exploitation of geospatial metadata in multimedia content improves visual learning frameworks:

• First, we presented a crowd-based learning framework, which combines the principles of EC with ML to enable heterogeneous edge devices to participate in and contribute to smart city applications. The proposed framework dispatches the ML model to the edge device based on its resource capacity. Furthermore, the proposed intelligent retraining algorithm enables the edge device to share new information back to the edge server to enhance the model, by assessing the novelty of the data, i.e., the uniqueness of its geospatial and temporal information at the edge device. By reducing the model complexity, the inference accuracy and performance vary, which shows that the models can be tuned to support wide classes of edge devices.

• Second, models deployed at edge devices require new methods to select frames for inference. We presented a variety of approaches to extract frames from urban mobile videos for object detection. Greedy-SKE highlights the importance of using spatial metadata and optimizes the frame selection based on the coverage area of the video. Experiments show that it outperforms other approaches by achieving a higher hit ratio with fewer selected keyframes in a short amount of time.

• Third, we proposed lightweight client selection methods to learn object detection models accurately and effectively. Our fair client selection method, based on the object data distribution of clients, achieves a significant reduction in the required federated rounds compared to the conventional approach.
We further extended this method by leveraging the metadata of the training images to select clients that maximize the coverage of diverse geographical regions. We conducted extensive experiments with both real and synthetic datasets, which show significant improvements compared to other baselines.

• Fourth, we investigated efficient techniques for placing models on edge devices. We mathematically formulate the model placement as an optimization problem, which is proven to be NP-hard, and thereafter propose several heuristics to solve it efficiently, evaluating them with a real-world dataset of 165 bus routes' trajectories obtained from the City of San Francisco. Our SC heuristic, which prioritizes achieving maximum geospatial coverage, is superior to all other approaches, with higher recall and 2x fewer missed objects, and it is robust against noisy cell weights.

7.2 Future Work

Crowd-based Learning Framework: Test the proposed crowd-based learning framework in real-world application scenarios in the City of Los Angeles, while maintaining and expanding the model repository with support for more classes of edge devices, and integrating this solution into our visionary framework (Translational Visual Data Platform, TVDP [70]) for smart cities.

Model Placement: DNN multi-model placement on edge devices for effective inference is a relatively new research direction. For our proposed SC heuristic, it is desirable to prove theoretical bounds and to improve its performance so that dynamic updates can be handled efficiently when devices are added or removed. Additionally, auxiliary data, such as traffic data, would allow more accurate sampling of mobile edge devices' FOVs. The problem of placing DNN models on mobile edge devices with unknown trajectories (i.e., ad-hoc movement), such as devices on cars, smartphones and drones, remains open. Moreover, the existing approach for modeling the interference of co-placement assumes a linear inference latency cost; estimating how to effectively share resources between DNN models is an open issue.

Bibliography

[1] M. Abadi, P. Barham, J. Chen, et al. “TensorFlow: a system for large-scale machine learning.” In: OSDI. Vol. 16. 2016, pp. 265–283.

[2] S. G. Ahmad, C. S. Liew, M. M. Rafique, E. U. Munir, and S. U. Khan. “Data-Intensive Workflow Optimization Based on Application Task Graph Partitioning in Heterogeneous Computing Systems”. In: BdCloud. 2014, pp. 129–136.

[3] A. Alfarrarjeh, S. Agrawal, S. H. Kim, and C. Shahabi. “Geo-Spatial Multimedia Sentiment Analysis in Disasters”. In: DSAA. Mountain View, California, USA: IEEE, 2017, pp. 193–202.

[4] A. Alfarrarjeh, T. Emrich, and C. Shahabi. “Scalable spatial crowdsourcing: A study of distributed algorithms”. In: MDM. IEEE. 2015, pp. 134–144.

[5] A. Alfarrarjeh, S. H. Kim, S. Agrawal, M. Ashok, S. Y. Kim, and C. Shahabi. “Image Classification to Determine the Level of Street Cleanliness: A Case Study”. In: 2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM). Xi’an, China: IEEE, Sept. 2018, pp. 1–5.

[6] A. Alfarrarjeh, S. H. Kim, A. Deshmukh, S. Rajan, Y. Lu, and C. Shahabi. “Spatial Coverage Measurement of Geo-Tagged Visual Data: A Database Approach”. In: BigMM. Xi’an, China: IEEE, 2018, pp. 1–8.

[7] A. Alfarrarjeh, S. H. Kim, S. Rajan, A. Deshmukh, and C. Shahabi. “A Data-Centric Approach for Image Scene Localization”. In: Big Data. Seattle, WA, USA: IEEE, 2018.

[8] A. Alfarrarjeh, C. Shahabi, and S. H. Kim. “Hybrid Indexes for Spatial-Visual Search”. In: Thematic Workshops of ACM MM 2017.
ACM. 2017, pp. 75–83.

[9] A. Alfarrarjeh, D. Trivedi, S. H. Kim, H. Park, C. Huang, and C. Shahabi. “Recognizing Material of a Covered Object: A Case Study With Graffiti”. In: 2019 IEEE International Conference on Image Processing (ICIP). Taipei, Taiwan: IEEE, Sept. 2019, pp. 2491–2495.

[10] A. Alfarrarjeh, D. Trivedi, S. H. Kim, and C. Shahabi. “A Deep Learning Approach for Road Damage Detection from Smartphone Images”. In: 2018 IEEE International Conference on Big Data (Big Data). Seattle, WA, USA: IEEE, Dec. 2018, pp. 5201–5204.

[11] Abdullah Alfarrarjeh, Dweep Trivedi, Seon Ho Kim, Hyunjun Park, Chao Huang, and Cyrus Shahabi. “Recognizing Material of a Covered Object: A Case Study With Graffiti”. In: 2019 IEEE International Conference on Image Processing (ICIP). 2019, pp. 2491–2495. doi: 10.1109/ICIP.2019.8803286.

[12] Amazon Mechanical Turk. [Last Accessed: Dec. 20, 2018]. url: http://www.mturk.com/.

[13] Deeksha Arya, Hiroya Maeda, Sanjay Kumar Ghosh, Durga Toshniwal, Hiroshi Omata, Takehiro Kashiyama, and Yoshihide Sekimoto. “Global Road Damage Detection: State-of-the-art Solutions”. In: 2020 IEEE International Conference on Big Data (Big Data). 2020, pp. 5533–5539. doi: 10.1109/BigData50022.2020.9377790.

[14] Sandra Eliza Fontes de Avila, Ana Paula Brandão Lopes, Antonio da Luz, and Arnaldo de Albuquerque Araújo. “VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method”. In: Pattern Recognition Letters 32.1 (2011). Image Processing, Computer Vision and Pattern Recognition in Latin America, pp. 56–68. issn: 0167-8655.

[15] AWS. AWS DeepLens. 2021. url: https://aws.amazon.com/deeplens/.

[16] AWS. AWS Trainium. 2021. url: https://aws.amazon.com/machine-learning/trainium/.

[17] Sakire Arslan Ay, Roger Zimmermann, and Seon Ho Kim. “Viewable Scene Modeling for Geospatial Video Search”. In: Proceedings of the 16th ACM International Conference on Multimedia. MM ’08. Vancouver, British Columbia, Canada: ACM, 2008, pp. 309–318. isbn: 978-1-60558-303-7.

[18] Tayebeh Bahreini and Daniel Grosu. “Efficient Placement of Multi-Component Applications in Edge Computing Systems”. In: Proceedings of the Second ACM/IEEE Symposium on Edge Computing. SEC ’17. San Jose, California: Association for Computing Machinery, 2017. isbn: 9781450350877. doi: 10.1145/3132211.3134454.

[19] R. Balan, J. Flinn, M. Satyanarayanan, S. Sinnamohideen, and H. Yang. “The Case for Cyber Foraging”. In: ACM SIGOPS EW. Saint-Emilion, France: ACM, 2002, pp. 87–92.

[20] Chandni Balchandani, Rakshith Koravadi Hatwar, Parteek Makkar, Yanki Shah, Pooja Yelure, and Magdalini Eirinaki. “A Deep Learning Framework for Smart Street Cleaning”. In: 2017 IEEE Third International Conference on Big Data Computing Service and Applications (BigDataService). 2017, pp. 112–117. doi: 10.1109/BigDataService.2017.49.

[21] Mounir Bensalem, Jasenka Dizdarević, and Admela Jukan. “DNN Placement and Inference in Edge Computing”. In: 2020 43rd International Convention on Information, Communication and Electronic Technology (MIPRO). 2020, pp. 479–484. doi: 10.23919/MIPRO48935.2020.9245108.

[22] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečný, Stefano Mazzocchi, H Brendan McMahan, et al. “Towards federated learning at scale: System design”. In: arXiv preprint arXiv:1902.01046 (2019).

[23] D. C. Brabham. “Crowdsourcing as a model for problem solving: An introduction and cases”. In: Convergence 14.1 (2008), pp. 75–90.
[24] Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual KITTI 2. 2020. arXiv: 2001.10773 [cs.CV].

[25] Y. Cai, Y. Lu, S. H. Kim, L. Nocera, and C. Shahabi. “Gift: A geospatial image and video filtering tool for computer vision applications with geo-tagged mobile videos”. In: ICMEW. IEEE. 2015, pp. 1–6.

[26] Marcel Caria, Tamal Das, Admela Jukan, and Marco Hoffmann. “Divide and conquer: Partitioning OSPF networks with SDN”. In: 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM). 2015, pp. 467–474. doi: 10.1109/INM.2015.7140324.

[27] J. Chen and X. Ran. “Deep Learning With Edge Computing: A Review”. In: Proceedings of the IEEE 107.8 (2019), pp. 1655–1674. doi: 10.1109/JPROC.2019.2921977.

[28] Tiffany Yu-Han Chen, Hari Balakrishnan, Lenin Ravindranath, and Paramvir Bahl. “GLIMPSE: Continuous, Real-Time Object Recognition on Mobile Devices”. In: GetMobile: Mobile Comp. and Comm. 20.1 (July 2016), pp. 26–29. issn: 2375-0529. doi: 10.1145/2972413.2972423.

[29] Reuven Cohen and Liran Katzir. “The Generalized Maximum Coverage Problem”. In: Information Processing Letters 108.1 (2008), pp. 15–22. issn: 0020-0190. doi: https://doi.org/10.1016/j.ipl.2008.03.017.

[30] G. Constantinou, G. Sankar Ramachandran, A. Alfarrarjeh, S. H. Kim, B. Krishnamachari, and C. Shahabi. “A Crowd-Based Image Learning Framework using Edge Computing for Smart City Applications”. In: 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM). Singapore: IEEE, Sept. 2019, pp. 11–20.

[31] G. Constantinou, C. Shahabi, and S. H. Kim. “Spatial Keyframe Extraction Of Mobile Videos For Efficient Object Detection At The Edge”. In: 2020 IEEE International Conference on Image Processing (ICIP). 2020, pp. 1466–1470. doi: 10.1109/ICIP40778.2020.9190786.

[32] George Constantinou, Cyrus Shahabi, and Seon Ho Kim. “Spatial Keyframe Extraction Of Mobile Videos For Efficient Object Detection At The Edge”. In: 2020 IEEE International Conference on Image Processing (ICIP). 2020, pp. 1466–1470. doi: 10.1109/ICIP40778.2020.9190786.

[33] Coral. Coral Edge TPU. 2021. url: https://coral.ai/.

[34] V. Cozzolino, A. Y. Ding, J. Ott, and D. Kutscher. “Enabling Fine-Grained Edge Offloading for IoT”. In: SIGCOMM Posters and Demos. Los Angeles, CA, USA: ACM, 2017, pp. 124–126.

[35] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. “R-FCN: Object detection via region-based fully convolutional networks”. In: Advances in Neural Information Processing Systems. 2016, pp. 379–387.

[36] D. Deng, C. Shahabi, and L. Zhu. “Task matching and scheduling for multiple workers in spatial crowdsourcing”. In: SIGSPATIAL GIS. ACM. 2015, p. 21.

[37] S.S. Dhillon and K. Chakrabarty. “Sensor placement for effective coverage and surveillance in distributed sensor networks”. In: 2003 IEEE Wireless Communications and Networking, 2003. WCNC 2003. Vol. 3. 2003, 1609–1614 vol. 3. doi: 10.1109/WCNC.2003.1200627.

[38] S.S. Dhillon, K. Chakrabarty, and S.S. Iyengar. “Sensor placement for grid coverage under imprecise detections”. In: Proceedings of the Fifth International Conference on Information Fusion. FUSION 2002. (IEEE Cat. No. 02EX5997). Vol. 2. 2002, 1581–1587 vol. 2. doi: 10.1109/ICIF.2002.1021005.

[39] Nevenka Dimitrova, Thomas McGee, and Herman Elenbaas. “Video Keyframe Extraction and Filtering: A Keyframe is Not a Keyframe to Everyone”. In: Proceedings of the Sixth International Conference on Information and Knowledge Management. CIKM ’97. Las Vegas, Nevada, USA: ACM, 1997, pp. 113–120. isbn: 0-89791-970-X.

[40] Naveed Ejaz, Tayyab Bin Tariq, and Sung Wook Baik.
“Adaptive Key Frame Extraction for Video Summarization Using an Aggregation Mechanism”. In: J. Vis. Comun. Image Represent. 23.7 (Oct. 2012), pp. 1031–1040. issn: 1047-3203.

[41] Esri. LA Clean Streets. 2018. url: https://www.esri.com/en-us/maps-we-love/gallery/la-clean-streets.

[42] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. “The Pascal Visual Object Classes (VOC) Challenge”. In: International Journal of Computer Vision 88.2 (June 2010), pp. 303–338.

[43] J. Flinn, S. Park, and M. Satyanarayanan. “Balancing performance, energy, and quality in pervasive computing”. In: ICDCS. IEEE, 2002, pp. 217–226.

[44] F. D. Fomeni and A. Letchford. “A Dynamic Programming Heuristic for the Quadratic Knapsack Problem”. In: INFORMS J. Comput. 26 (2014), pp. 173–182.

[45] B. Fortz and M. Thorup. “Internet traffic engineering by optimizing OSPF weights”. In: Proceedings IEEE INFOCOM 2000. Conference on Computer Communications. Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies (Cat. No. 00CH37064). Vol. 2. 2000, 519–528 vol. 2. doi: 10.1109/INFCOM.2000.832225.

[46] A. C. Franco da Silva, P. Hirmer, and B. Mitschang. “Model-Based Operator Placement for Data Processing in IoT Environments”. In: 2019 IEEE International Conference on Smart Computing (SMARTCOMP). 2019, pp. 439–443. doi: 10.1109/SMARTCOMP.2019.00084.

[47] Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. “Virtual worlds as proxy for multi-object tracking analysis”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 4340–4349.

[48] Giorgio Gallo, Peter L Hammer, and Bruno Simeone. “Quadratic knapsack problems”. In: Combinatorial optimization. Springer, 1980, pp. 132–149.

[49] Gartner. Gartner Identifies Top 10 Strategic IoT Technologies and Trends. 2018. url: https://www.gartner.com/en/newsroom/press-releases/2018-11-07-gartner-identifies-top-10-strategic-iot-technologies-and-trends.

[50] Gartner. Gartner Says 8.4 Billion Connected "Things" Will Be in Use in 2017, Up 31 Percent From 2016. 2017. url: https://www.gartner.com/en/newsroom/press-releases/2017-02-07-gartner-says-8-billion-connected-things-will-be-in-use-in-2017-up-31-percent-from-2016.

[51] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. “Vision meets Robotics: The KITTI Dataset”. In: International Journal of Robotics Research (IJRR) (2013).

[52] G. Griffin, A. Holub, and P. Perona. Caltech-256 Object Category Dataset. Tech. rep. 7694. California Institute of Technology, 2007. url: http://authors.library.caltech.edu/7694.

[53] Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agarwal, Alec Wolman, and Arvind Krishnamurthy. “MCDNN: An Approximation-Based Execution Framework for Deep Stream Processing Under Resource Constraints”. In: Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services. MobiSys ’16. Singapore, Singapore: Association for Computing Machinery, 2016, pp. 123–136. isbn: 9781450342698. doi: 10.1145/2906388.2906396.

[54] Song Han, Huizi Mao, and William J. Dally. “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”. In: 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2016. url: http://arxiv.org/abs/1510.00149.

[55] Vinuta Hegde, Dweep Trivedi, Abdullah Alfarrarjeh, Aditi Deepak, Seon Ho Kim, and Cyrus Shahabi.
“Yet Another Deep Learning Approach for Road Damage Detection using Ensemble Learning”. In: 2020 IEEE International Conference on Big Data (Big Data). 2020, pp. 5553–5558. doi: 10.1109/BigData50022.2020.9377833.

[56] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. “Distilling the Knowledge in a Neural Network”. In: NIPS Deep Learning and Representation Learning Workshop. 2015. url: http://arxiv.org/abs/1503.02531.

[57] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”. In: CoRR abs/1704.04861 (2017).

[58] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. “Searching for MobileNetV3”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Oct. 2019.

[59] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. “Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors”. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, pp. 3296–3297.

[60] C. Hung, G. Ananthanarayanan, P. Bodik, L. Golubchik, M. Yu, P. Bahl, and M. Philipose. “VideoEdge: Processing Camera Streams using Hierarchical Clusters”. In: 2018 IEEE/ACM Symposium on Edge Computing (SEC). 2018, pp. 115–131. doi: 10.1109/SEC.2018.00016.

[61] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size”. In: CoRR abs/1602.07360 (2016). arXiv: 1602.07360. url: http://arxiv.org/abs/1602.07360.

[62] Y. Iida, Y. Fujii, T. Azumi, N. Nishio, and S. Kato. “GPUrpc: Exploring Transparent Access to Remote GPUs”. In: TECS 16.1 (2016), 17:1–17:25.

[63] Ahmed Imteaj, Urmish Thakker, Shiqiang Wang, Jian Li, and M. Hadi Amini. Federated Learning for Resource-Constrained IoT Devices: Panoramas and State-of-the-art. 2020. arXiv: 2002.10610 [cs.LG].

[64] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. “Caffe: Convolutional architecture for fast feature embedding”. In: ACM MM. ACM. 2014, pp. 675–678.

[65] Angela H. Jiang, Daniel L.-K. Wong, Christopher Canel, Lilia Tang, Ishan Misra, Michael Kaminsky, Michael A. Kozuch, Padmanabhan Pillai, David G. Andersen, and Gregory R. Ganger. “Mainstream: Dynamic Stem-Sharing for Multi-Tenant Video Processing”. In: 2018 USENIX Annual Technical Conference (USENIX ATC 18). Boston, MA: USENIX Association, July 2018, pp. 29–42. isbn: 978-1-931971-44-7. url: https://www.usenix.org/conference/atc18/presentation/jiang.

[66] Zhentian Jiao, Youmin Zhang, Jing Xin, Lingxia Mu, Yingmin Yi, Han Liu, and Ding Liu. “A Deep Learning Based Forest Fire Detection Approach Using UAV and YOLOv3”. In: 2019 1st International Conference on Industrial Artificial Intelligence (IAI). 2019, pp. 1–5. doi: 10.1109/ICIAI.2019.8850815.

[67] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, K. A. Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G.L. D’Oliveira, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B.
Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. “Advancements and Open Problems in Federated Learning”. In: 2019. url: https://arxiv.org/abs/1912.04977.

[68] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. “SCAFFOLD: Stochastic controlled averaging for federated learning”. In: International Conference on Machine Learning. PMLR. 2020, pp. 5132–5143.

[69] L. Kazemi and C. Shahabi. “GeoCrowd: enabling query answering with spatial crowdsourcing”. In: SIGSPATIAL GIS. ACM. 2012, pp. 189–198.

[70] S. H. Kim, A. Alfarrarjeh, G. Constantinou, and C. Shahabi. “TVDP: Translational Visual Data Platform for Smart Cities”. In: ICDEW. IEEE. 2019, pp. 45–52.

[71] S. H. Kim, Y. Lu, G. Constantinou, C. Shahabi, G. Wang, and R. Zimmermann. “MediaQ: Mobile Multimedia Management System”. In: MMSys. ACM, 2014, pp. 224–235.

[72] S. H. Kim, J. Shi, A. Alfarrarjeh, D. Xu, Y. Tan, and C. Shahabi. “Real-time traffic video analysis using Intel Viewmont coprocessor”. In: DNIS. Springer. 2013, pp. 150–160.

[73] Seon Ho Kim, Ying Lu, Giorgos Constantinou, Cyrus Shahabi, Guanfeng Wang, and Roger Zimmermann. “MediaQ: Mobile Multimedia Management System”. In: Proceedings of the 5th ACM Multimedia Systems Conference. MMSys ’14. Singapore, Singapore: Association for Computing Machinery, 2014, pp. 224–235. isbn: 9781450327053. doi: 10.1145/2557642.2578223.

[74] Seon Ho Kim, Ying Lu, Junyuan Shi, Abdullah Alfarrarjeh, Cyrus Shahabi, Guanfeng Wang, and Roger Zimmermann. “Key Frame Selection Algorithms for Automatic Generation of Panoramic Images from Crowdsourced Geo-tagged Videos”. In: Web and Wireless Geographical Information Systems. Ed. by Dieter Pfoser and Ki-Joune Li. Berlin, Heidelberg: Springer Berlin Heidelberg, 2014, pp. 67–84. isbn: 978-3-642-55334-9.

[75] M. K. Kocamaz, J. Gong, and B. R. Pires. “Vision-based counting of pedestrians and cyclists”. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). Lake Placid, NY, USA: IEEE, Mar. 2016, pp. 1–8.

[76] Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtarik, Ananda Theertha Suresh, and Dave Bacon. “Federated Learning: Strategies for Improving Communication Efficiency”. In: NIPS Workshop on Private Multi-Party Machine Learning. 2016. url: https://arxiv.org/abs/1610.05492.

[77] B. Krishnamachari, J. Power, S. H. Kim, and C. Shahabi. “I3: An IoT marketplace for smart communities”. In: MobiSys. ACM. 2018, pp. 498–499.

[78] LASAN. Progress Report. 2018. url: http://clkrep.lacity.org/onlinedocs/2017/17-0878-S1_misc_02-02-2018.pdf.

[79] H. Li, K. Ota, and M. Dong. “Learning IoT in Edge: Deep Learning for the Internet of Things with Edge Computing”. In: IEEE Network 32.1 (2018), pp. 96–101. doi: 10.1109/MNET.2018.1700202.

[80] Pu Li and Wangda Zhao. “Image fire detection algorithms based on convolutional neural networks”. In: Case Studies in Thermal Engineering 19 (2020), p. 100625. issn: 2214-157X. doi: https://doi.org/10.1016/j.csite.2020.100625.
[81] Tian Li, Maziar Sanjabi, Ahmad Beirami, and Virginia Smith. “Fair Resource Allocation in Federated Learning”. In: International Conference on Learning Representations. 2020. url: https://openreview.net/forum?id=ByexElSYDr.

[82] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the Convergence of FedAvg on Non-IID Data. 2019. arXiv: 1907.02189 [stat.ML].

[83] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. “Focal Loss for Dense Object Detection”. In: 2017 IEEE International Conference on Computer Vision (ICCV). 2017, pp. 2999–3007.

[84] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. “Microsoft COCO: Common Objects in Context”. In: Computer Vision – ECCV 2014. Ed. by David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars. Cham: Springer International Publishing, 2014, pp. 740–755. isbn: 978-3-319-10602-1.

[85] C. Liu, Y. Cao, Y. Luo, G. Chen, V. Vokkarane, M. Yunsheng, S. Chen, and P. Hou. “A New Deep Learning-Based Food Recognition System for Dietary Assessment on An Edge Computing Service Infrastructure”. In: IEEE Transactions on Services Computing 11.2 (2018), pp. 249–261. doi: 10.1109/TSC.2017.2662008.

[86] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. “SSD: Single Shot MultiBox Detector”. In: Computer Vision – ECCV 2016. Ed. by Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling. Cham: Springer International Publishing, 2016, pp. 21–37. isbn: 978-3-319-46448-0.

[87] Y. Liu and M. Liu. “Crowd learning: improving online decision making using crowdsourced data”. In: IJCAI. AAAI Press. 2017, pp. 317–323.

[88] Yang Liu, Anbu Huang, Yun Luo, He Huang, Youzhi Liu, Yuanyuan Chen, Lican Feng, Tianjian Chen, Han Yu, and Qiang Yang. “FedVision: An Online Visual Object Detection Platform Powered by Federated Learning.” In: AAAI. 2020, pp. 13172–13179.

[89] Y. Lu, C. Shahabi, and S. H. Kim. “Efficient indexing and retrieval of large-scale geo-tagged video databases”. In: GeoInformatica 20.4 (2016), pp. 829–857.

[90] Jiahuan Luo, Xueyang Wu, Yun Luo, Anbu Huang, Yunfeng Huang, Yang Liu, and Qiang Yang. Real-World Image Datasets for Federated Learning. 2019. arXiv: 1910.11089 [cs.CV].

[91] Hiroya Maeda, Yoshihide Sekimoto, Toshikazu Seto, Takehiro Kashiyama, and Hiroshi Omata. “Road Damage Detection and Classification Using Deep Neural Networks with Smartphone Images”. In: Computer-Aided Civil and Infrastructure Engineering 33.12 (2018), pp. 1127–1141.

[92] Adyson M. Maia, Yacine Ghamri-Doudane, Dario Vieira, and Miguel F. de Castro. “Optimized Placement of Scalable IoT Services in Edge Computing”. In: 2019 IFIP/IEEE Symposium on Integrated Network and Service Management (IM). 2019, pp. 189–197.

[93] J. Mao, X. Chen, K. W. Nixon, C. Krieger, and Y. Chen. “MoDNN: Local distributed mobile computing system for Deep Neural Network”. In: Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017. 2017, pp. 1396–1401. doi: 10.23919/DATE.2017.7927211.

[94] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. “Communication-efficient learning of deep networks from decentralized data”. In: Artificial Intelligence and Statistics. 2017, pp. 1273–1282.

[95] C. Mertz, S. Varadharajan, S. Jose, K. Sharma, L. Wander, and J. Wang. “City-wide road distress monitoring with smartphones”. In: ITS World Congress. 2014, pp. 1–9.

[96] M. Mohri, Gary Sivek, and A. T. Suresh. “Agnostic Federated Learning”.
In: ICML. 2019.

[97] Mohammed Islam Naas, Philippe Raipin Parvedy, Jalil Boukhobza, and Laurent Lemarchand. “iFogStor: An IoT Data Placement Strategy for Fog Infrastructure”. In: 2017 IEEE 1st International Conference on Fog and Edge Computing (ICFEC). 2017, pp. 97–104. doi: 10.1109/ICFEC.2017.15.

[98] T. Nishio and R. Yonetani. “Client Selection for Federated Learning with Heterogeneous Resources in Mobile Edge”. In: ICC 2019 - 2019 IEEE International Conference on Communications (ICC). 2019, pp. 1–7.

[99] NVIDIA. Jetson Nano. 2021. url: https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-nano/.

[100] Samuel S. Ogden and Tian Guo. “MDINFERENCE: Balancing Inference Accuracy and Latency for Mobile Applications”. In: 2020 IEEE International Conference on Cloud Engineering (IC2E). 2020, pp. 28–39. doi: 10.1109/IC2E48712.2020.00010.

[101] N. Padhariya and K. Raichura. “Crowdlearning: An incentive-based learning platform for crowd”. In: IC3. IEEE. 2014, pp. 44–49.

[102] Cheng-Sheng Pan, Shi-Long Sui, Chun-ling Liu, and Yu-Xin Shi. “Proportional fair scheduling algorithm based on traffic in satellite communication system”. In: Fourth Seminar on Novel Optoelectronic Detection Technology and Application. Ed. by Weiqi Jin and Ye Li. Vol. 10697. International Society for Optics and Photonics. SPIE, 2018, pp. 1330–1336. doi: 10.1117/12.2307297.

[103] A. Parra, M. Boutin, and E. J. Delp. “Automatic gang graffiti recognition and interpretation”. In: Journal of Electronic Imaging 26.5 (2017), p. 051409.

[104] P. S. Pillai, L. B. Mummert, S. W. Schlosser, R. Sukthankar, and C. J. Helfrich. “SLIPstream: Scalable Low-latency Interactive Perception on Streaming Data”. In: NOSSDAV. Williamsburg, VA, USA: ACM, 2009, pp. 43–48.

[105] M. Ra, A. Sheth, L. Mummert, P. Pillai, D. Wetherall, and R. Govindan. “Odessa: Enabling Interactive Perception Applications on Mobile Devices”. In: MobiSys. Bethesda, Maryland, USA: ACM, 2011, pp. 43–56.

[106] X. Ran, H. Chen, X. Zhu, Z. Liu, and J. Chen. “DeepDecision: A Mobile Deep Learning Framework for Edge Video Analytics”. In: IEEE INFOCOM 2018 - IEEE Conference on Computer Communications. 2018, pp. 1421–1429. doi: 10.1109/INFOCOM.2018.8485905.

[107] Xukan Ran, Haoliang Chen, Zhenming Liu, and Jiasi Chen. “Delivering Deep Learning to Mobile Devices via Offloading”. In: Proceedings of the Workshop on Virtual Reality and Augmented Reality Network. VR/AR Network ’17. Los Angeles, CA, USA: Association for Computing Machinery, 2017, pp. 42–47. isbn: 9781450350556. doi: 10.1145/3097895.3097903.

[108] J. Redmon and A. Farhadi. “YOLO9000: Better, Faster, Stronger”. In: CoRR abs/1612.08242 (2016).

[109] Joseph Redmon and Ali Farhadi. YOLOv3: An Incremental Improvement. 2018. arXiv: 1804.02767 [cs.CV].

[110] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. In: Advances in Neural Information Processing Systems 28. Ed. by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett. Curran Associates, Inc., 2015, pp. 91–99. url: http://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks.pdf.

[111] E. G. Renart, J. Diaz-Montes, and M. Parashar. “Data-Driven Stream Processing at the Edge”. In: ICFEC. 2017, pp. 31–40.

[112] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen. “Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation”. In: CoRR abs/1801.04381 (2018).

[113] M.
Satyanarayanan, P. Simoens, Y. Xiao, P. Pillai, Z. Chen, K. Ha, W. Hu, and B. Amos. “Edge Analytics in the Internet of Things”. In: IEEE Pervasive Computing 14.2 (2015), pp. 24–31.

[114] E. Saurez, K. Hong, D. Lillethun, U. Ramachandran, and B. Ottenwälder. “Incremental Deployment and Migration of Geo-distributed Situation Awareness Applications in the Fog”. In: DEBS. Irvine, California: ACM, 2016, pp. 258–269.

[115] SFMTA. SFMTA Transit Data. 2013. url: https://www.sfmta.com/reports/gtfs-transit-data.

[116] Jie Shen, Xin Xiong, Zhiyuan Xue, and Yinglong Bian. “A convolutional neural-network-based pedestrian counting model for various crowded scenes”. In: Computer-Aided Civil and Infrastructure Engineering 34.10 (2019), pp. 897–914.

[117] Olena Skarlat, Matteo Nardelli, Stefan Schulte, Michael Borkowski, and Philipp Leitner. “Optimized IoT Service Placement in the Fog”. In: Serv. Oriented Comput. Appl. 11.4 (Dec. 2017), pp. 427–443. issn: 1863-2386. doi: 10.1007/s11761-017-0219-8.

[118] SmartCow. ROADMASTER: An Intelligent Traffic Management System. 2021. url: https://www.smartcow.ai/en/solutions/roadmaster/.

[119] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. “Federated Multi-Task Learning”. In: Advances in Neural Information Processing Systems 30. Ed. by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett. Curran Associates, Inc., 2017, pp. 4424–4434. url: http://papers.nips.cc/paper/7029-federated-multi-task-learning.pdf.

[120] John P Snyder. Flattening the earth: two thousand years of map projections. Chicago, USA: University of Chicago Press, 1997.

[121] G. Somasundaram, V. Morellas, and N. Papanikolopoulos. “Counting pedestrians and bicycles in traffic scenes”. In: 2009 12th International IEEE Conference on Intelligent Transportation Systems. St. Louis, MO, USA: IEEE, Oct. 2009, pp. 1–6.

[122] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. “Rethinking the Inception Architecture for Computer Vision”. In: CVPR. 2016, pp. 2818–2826.

[123] M. Tan, R. Pang, and Q. V. Le. “EfficientDet: Scalable and Efficient Object Detection”. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020, pp. 10778–10787. doi: 10.1109/CVPR42600.2020.01079.

[124] Mingxing Tan and Quoc V. Le. “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”. In: Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA. Ed. by Kamalika Chaudhuri and Ruslan Salakhutdinov. Vol. 97. Proceedings of Machine Learning Research. PMLR, 2019, pp. 6105–6114. url: http://proceedings.mlr.press/v97/tan19a.html.

[125] Mohit Taneja and Alan Davy. “Resource aware placement of IoT application modules in Fog-Cloud Computing Paradigm”. In: 2017 IFIP/IEEE Symposium on Integrated Network and Service Management (IM). 2017, pp. 1222–1228. doi: 10.23919/INM.2017.7987464.

[126] TaskRabbit. [Last Accessed: Dec. 20, 2018]. url: https://www.taskrabbit.com/.

[127] S. Teerapittayanon, B. McDanel, and H. T. Kung. “Distributed Deep Neural Networks Over the Cloud, the Edge and End Devices”. In: 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). 2017, pp. 328–339. doi: 10.1109/ICDCS.2017.226.

[128] TensorFlow Federated: Machine Learning on Decentralized Data. https://www.tensorflow.org/federated.

[129] TensorFlow Object Detection API. https://github.com/tensorflow/models/tree/master/research/object_detection.

[130] E. K. Tokuda, R. M. Cesar, and C. T. Silva. “Quantifying the Presence of Graffiti in Urban Environments”.
In: 2019 IEEE International Conference on Big Data and Smart Computing (BigComp). Kyoto, Japan: IEEE, Feb. 2019, pp. 1–4.

[131] T. Vincenty. “Direct and Inverse Solutions of Geodesics on the Ellipsoid with Application of Nested Equations”. In: Survey Review 23.176 (1975), pp. 88–93.

[132] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. “Tackling the objective inconsistency problem in heterogeneous federated optimization”. In: arXiv preprint arXiv:2007.07481 (2020).

[133] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan. “Adaptive Federated Learning in Resource Constrained Edge Computing Systems”. In: IEEE Journal on Selected Areas in Communications 37.6 (2019), pp. 1205–1221.

[134] W. Wolf. “Key frame selection by motion analysis”. In: ICASSP. Vol. 2. IEEE. 1996, pp. 1228–1231.

[135] Zuxuan Wu, Caiming Xiong, Chih-Yao Ma, Richard Socher, and Larry S. Davis. “AdaFrame: Adaptive Frame Selection for Fast Video Recognition”. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, June 2019.

[136] Peihua Yu and Yunfeng Liu. “Federated Object Detection: Optimizing Object Detection Model with Federated Learning”. In: Proceedings of the 3rd International Conference on Vision, Image and Signal Processing. ICVISP 2019. Vancouver, BC, Canada: Association for Computing Machinery, 2019. isbn: 9781450376259. doi: 10.1145/3387168.3387181.

[137] Haoyu Zhang, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Paramvir Bahl, and Michael J. Freedman. “Live Video Analytics at Scale with Approximation and Delay-Tolerance”. In: 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). Boston, MA: USENIX Association, Mar. 2017, pp. 377–392. isbn: 978-1-931971-37-9. url: https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/zhang.

[138] Pengcheng Zhang, Qi Zhao, Jerry Gao, Wenrui Li, and Jiamin Lu. “Urban Street Cleanliness Assessment Using Mobile Edge Computing and Deep Learning”. In: IEEE Access 7 (2019), pp. 63550–63563. doi: 10.1109/ACCESS.2019.2914270.

[139] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated Learning with Non-IID Data. 2018. arXiv: 1806.00582 [cs.LG].

[140] Z. Zhao, K. M. Barijough, and A. Gerstlauer. “DeepThings: Distributed Adaptive Deep Learning Inference on Resource-Constrained IoT Edge Clusters”. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37.11 (2018), pp. 2348–2359. doi: 10.1109/TCAD.2018.2858384.
Abstract
With the availability of massive amounts of visual media covering large geographical regions, several applications have emerged, including classifying street cleanliness levels and detecting forest fires or road hazards. Such applications share similar characteristics: they need to 1) detect specific objects or events (what), 2) associate each detected object with a location (where), and 3) record the time the event happened (when). Advancements in image-based machine learning (ML) benefit these applications because they can automate the detection of objects of interest. Together with the edge computing (EC) paradigm, we are entering the Edge AI era, in which the processing cost of ML is offloaded to the edge devices where the data are generated, reducing latency, communication cost, and the risk of privacy leaks. Moreover, at acquisition time, sensors on the edge devices (e.g., GPS, gyroscope) enrich the collected data with metadata. However, a shortcoming of existing approaches is that they rely on pre-trained "static" models. In contrast, crowdsourced data from diverse locations can be leveraged to iteratively improve both the effectiveness and the efficiency of a model. We refer to this strategy as "spatial crowd-based visual learning."

To support this class of applications, we design a Crowd-based Visual Learning Framework that integrates ML, crowdsourcing, and EC, allowing edge devices with diverse resource capabilities to perform machine learning. The framework highlights the importance of maintaining model versions of varying complexity that can support heterogeneous edge devices with uncertain resource capacities in different geographical regions (e.g., the model owner may not be the edge device owner, which is especially true for crowd-based, incentive-driven smart city applications).

However, because of the high video frame rate, edge devices cannot run deep learning models at the speed at which frames arrive. Naively feeding consecutive frames to the model compromises both the quality (by missing important frames) and the efficiency (by redundantly processing similar frames) of video analysis. Focusing on outdoor urban videos, we utilize the spatial metadata of frames to select an optimal subset of frames that maximizes the coverage area of the footage.

We further extend the idea of exploiting the spatial coverage of the collected data to the Federated Learning (FL) domain. FL provides a promising way to learn a model from decentralized data, but the diversity of the regions in which clients operate and the non-IID nature of crowdsourced datasets significantly reduce model accuracy. To address this problem, we introduce a novel FL object detection system that efficiently trains models on heterogeneous client datasets, using lightweight client selection methods to learn object detection models faster. Our method leverages the metadata of the training images (e.g., location, direction, depth) to select clients that maximize the coverage of diverse geographical regions (a sketch of this coverage-greedy step follows).
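Both the spatial keyframe extraction and the FL client selection build on the same coverage-greedy idea: repeatedly pick the candidate (a frame, or a client's image set) whose geo-tagged field of view adds the most area not yet covered. The sketch below is a minimal Python illustration of that step under stated assumptions: the `FrameFOV` record, the grid-cell rasterization of a field of view, and the 10 m cell size are hypothetical stand-ins, not the dissertation's exact model.

```python
import math
from dataclasses import dataclass
from typing import List, Set, Tuple

# Hypothetical spatial metadata attached to each video frame at acquisition
# time (GPS position, compass heading, visible distance, angle of view).
@dataclass
class FrameFOV:
    lat: float
    lon: float
    direction_deg: float   # compass heading of the camera
    visible_dist_m: float  # how far the camera can "see"
    view_angle_deg: float  # horizontal angle of view

def fov_cells(f: FrameFOV, cell_m: float = 10.0) -> Set[Tuple[int, int]]:
    """Approximate the frame's field of view as a set of grid cells by
    sampling rays across the view angle (a coarse sector rasterization)."""
    cells = set()
    half = f.view_angle_deg / 2.0
    steps = max(3, int(f.view_angle_deg))  # roughly one ray per degree
    for i in range(steps + 1):
        bearing = math.radians(f.direction_deg - half + i * f.view_angle_deg / steps)
        for d in range(0, int(f.visible_dist_m) + 1, int(cell_m)):
            # Local equirectangular approximation: meters offset -> degrees.
            dlat = (d * math.cos(bearing)) / 111_320.0
            dlon = (d * math.sin(bearing)) / (111_320.0 * math.cos(math.radians(f.lat)))
            cells.add((int((f.lat + dlat) * 111_320.0 / cell_m),
                       int((f.lon + dlon) * 111_320.0 * math.cos(math.radians(f.lat)) / cell_m)))
    return cells

def select_keyframes(frames: List[FrameFOV], budget: int) -> List[int]:
    """Greedily pick up to `budget` frames, each maximizing newly covered
    cells -- the classic greedy heuristic for max-coverage."""
    fovs = [fov_cells(f) for f in frames]
    covered: Set[Tuple[int, int]] = set()
    chosen: List[int] = []
    for _ in range(budget):
        best, best_gain = -1, 0
        for i, cells in enumerate(fovs):
            if i in chosen:
                continue
            gain = len(cells - covered)
            if gain > best_gain:
                best, best_gain = i, gain
        if best < 0:  # no remaining frame adds new coverage
            break
        chosen.append(best)
        covered |= fovs[best]
    return chosen
```

The greedy rule carries the standard (1 − 1/e) approximation guarantee for max-coverage, and the same loop extends to FL client selection by treating the union of cells covered by a client's training images (derived from their location, direction, and depth metadata) as one candidate set.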
The pervasive deployment of IoT devices, along with advancements in Deep Neural Network (DNN) models, requires novel solutions for deciding how to place models on edge devices effectively. Current approaches require the model developer to make the placement decision by manually assigning models to edge devices. Models deployed on moving IoT devices (e.g., attached to vehicles) that operate over a large geographical region pose new challenges, because a model's effectiveness can vary with location. To tackle these challenges, we mathematically formulate the placement problem, propose a location-aware algorithm for assigning models to edge devices, and quantify the effectiveness of our techniques.
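To make the location-aware placement idea concrete, here is a small, hypothetical sketch in which each device receives the model whose effectiveness, weighted by where the device actually travels, is highest. The `Model` and `Device` structures, the per-region effectiveness scores, and the greedy per-device assignment are illustrative assumptions rather than the thesis's actual formulation.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical inputs: each model has a resource cost and a per-region
# effectiveness score (e.g., measured mAP of an object detector in that
# area); each mobile device has a capacity and a distribution over the
# regions it visits.
@dataclass
class Model:
    name: str
    cost: float
    effectiveness: Dict[str, float]  # region id -> score in [0, 1]

@dataclass
class Device:
    name: str
    capacity: float
    region_time: Dict[str, float]    # region id -> fraction of time spent there

def expected_effectiveness(m: Model, d: Device) -> float:
    """Location-aware utility: effectiveness weighted by where the device
    actually operates, so a model strong in visited regions scores higher."""
    return sum(t * m.effectiveness.get(r, 0.0) for r, t in d.region_time.items())

def place_models(models: List[Model], devices: List[Device]) -> Dict[str, str]:
    """Assign each device the feasible (capacity-respecting) model with the
    best expected effectiveness -- a greedy stand-in for the optimization."""
    placement: Dict[str, str] = {}
    for d in devices:
        feasible = [m for m in models if m.cost <= d.capacity]
        if feasible:
            placement[d.name] = max(feasible, key=lambda m: expected_effectiveness(m, d)).name
    return placement
```

A full formulation would add global constraints (e.g., limited copies per model or dissemination bandwidth), turning the problem into a joint combinatorial optimization rather than this independent per-device choice.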
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
GeoCrowd: a spatial crowdsourcing system implementation
Differentially private learned models for location services
AI-enabled DDoS attack detection in IoT systems
Responsible AI in spatio-temporal data processing
Efficient pipelines for vision-based context sensing
Inferring mobility behaviors from trajectory datasets
Enhancing collaboration on the edge: communication, scheduling and learning
Privacy-aware geo-marketplaces
Crowd-sourced collaborative sensing in highly mobile environments
Optimizing task assignment for collaborative computing over heterogeneous network devices
Dispersed computing in dynamic environments
Enabling spatial-visual search for geospatial image databases
High-performance distributed computing techniques for wireless IoT and connected vehicle systems
Learning, adaptation and control to enhance wireless network performance
Efficient indexing and querying of geo-tagged mobile videos
Dynamic pricing and task assignment in real-time spatial crowdsourcing platforms
Improving efficiency, privacy and robustness for crowd‐sensing applications
Federated and distributed machine learning at scale: from systems to algorithms to applications
On scheduling, timeliness and security in large scale distributed computing
Query processing in time-dependent spatial networks
Asset Metadata
Creator
Constantinou, Giorgos (author)
Core Title
Efficient crowd-based visual learning for edge devices
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2021-12
Publication Date
09/27/2021
Defense Date
08/25/2021
Publisher
University of Southern California (original); University of Southern California. Libraries (digital)
Tag
crowd-based learning, crowdsourcing, Edge AI, edge devices, Internet of Things, IoT, model placement, smart cities, spatial coverage, spatial keyframe extraction, spatial metadata, weighted coverage
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Shahabi, Cyrus (committee chair); Krishnamachari, Bhaskar (committee member); Kuo, C.-C. Jay (committee member)
Creator Email
gconstan@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC15965009
Unique identifier
UC15965009
Legacy Identifier
etd-Constantin-10103
Document Type
Dissertation
Rights
Constantinou, Giorgos
Type
texts
Source
University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu