ENABLING SPATIAL-VISUAL SEARCH FOR GEOSPATIAL IMAGE DATABASES

by

Abdullah Alfarrarjeh

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2019

Copyright 2019 Abdullah Alfarrarjeh

Dedication

To my beloved parents
To the light of my life, my children

Acknowledgments

First and foremost, all praise and gratitude to Allah (God) for giving me the opportunity, determination, and strength to complete my research work successfully throughout my graduate study. Having now gone through the process, I would like to express my gratitude to those who helped me reach this point. I have enjoyed the past six (and a half!) years with the privilege of having shared experiences, conversations, and relationships with a number of extraordinary people along the way.

I would like to thank my Ph.D. supervisor, Prof. Cyrus Shahabi, for giving me the opportunity to be a doctoral student under his supervision. I am greatly thankful that he guided me through every step of doing research, such as choosing the topic, approaching a problem, analyzing results, and making presentations. I am truly honored to have had the opportunity to learn directly from Prof. Shahabi, as he is a great teacher and mentor. Moreover, my work under the supervision of Prof. Shahabi enabled me to work on multiple research projects (e.g., MediaQ and TVDP). For these projects, I had the privilege of working with Dr. Seon Ho Kim. Dr. Kim was a great mentor and was always open to long research discussions. He continuously supported and encouraged me to work harder on developing solutions for research problems associated with real applications. I would like to thank my proposal and dissertation committee members Prof. C.-C. Jay Kuo, Prof. Aiichiro Nakano, Prof. Craig Knoblock, and Prof. Bhaskar Krishnamachari for their valuable comments, suggestions, and guidance.

Being at the USC information laboratory (infoLab), I was fortunate to share my Ph.D. experiences with a group of talented students. Special thanks go to Ying Lu, Hien To, and Giorgos Constantinou for their fruitful collaboration in multiple research projects. Also, being affiliated with the USC integrated media systems center (IMSC) as a Ph.D. student, I had the opportunity to work with many master's and undergraduate students. Their willingness to learn more motivated me to work harder on publishing research papers in different venues.

Any acknowledgment would be incomplete without expressing my gratitude to all of my friends for their support throughout my graduate study. Especially, I would like to thank my friends at USC: Anas Al Majali, Ayman Khalil, Daoud Burghal, Haitham Sumrain, Khaldun Aldarabsah, Laith Alshalalfeh, Mamon Hatmal, Mohamed Abdelbarr, Mohammad Abdel-Majeed, Mostafa Ayesh, Naumaan Nayyar, Waleed Dweik, and Yasser Ali. The gatherings we had frequently (e.g., for dinner, coffee, and long conversations after Friday (Jumaa) prayers!) made my Ph.D. life joyful. Your true friendships are appreciated, and I wish you all success in your lives.

Last but not least, I would like to express my deep appreciation to my beloved parents (Mohammad and Naela), brothers (Moataz and Moath), and sisters (Haneen, Samah, and Hadeel) for their kind understanding, unreserved support, and unselfish love over the years. Thank you for always believing in me and supporting me in pursuing the Ph.D. study.
Especially, my mom and dad: you have instilled in me the determination to always be better, and I will be truly happy the day you call me a doctor and I can feel how proud you are of your son. I am also thankful to my wife, Ola, for her love, help, and support during my graduate study journey. I also would like to thank my sweet daughters, Taleen and Raneem, who have lit up my life and always give me hope and love when I see the smiles on their beautiful faces. Daddy loves you very much, and I wish you healthy, happy, and successful lives.

Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Motivation
  1.2 Thesis Statement
  1.3 Image Scene Localization
  1.4 Spatial Aggregation of Visual Features for Image Search
  1.5 Hybrid Indexes for Spatial-Visual Search
  1.6 A Class of R*-tree Indexes for Spatial-Visual Search of Geo-tagged Street Images
  1.7 Thesis Overview
2 Background and Preliminaries
  2.1 Image Model
    2.1.1 Spatial Representation
    2.1.2 Visual Representation
  2.2 The State-of-the-art Indexes for Spatial and Visual Data
    2.2.1 R*-tree
    2.2.2 Locality Sensitive Hashing (LSH)
3 Image Scene Localization
  3.1 Problem Formulation
    3.1.1 Image Scene Location
    3.1.2 Estimating Image Scene Location
    3.1.3 Image Scene Localization
  3.2 Image Localization Using CNN
    3.2.1 Image Camera Localization with Quadtree (ICL)
    3.2.2 Image Scene Localization with R-tree (ISL)
  3.3 Hierarchical Classification for Enhanced Scene Localization
    3.3.1 Design Strategies
    3.3.2 Prediction Strategies
  3.4 Experiments
    3.4.1 Experimental Methodology
    3.4.2 Experimental Results
4 Spatial Aggregation of Visual Features for Image Search
  4.1 Background and Problem Description
    4.1.1 Geo-tagged Image Model
    4.1.2 Problem Description
  4.2 Spatially Aggregated Visual Features
    4.2.1 Spatial-Visual Image Group Selection
    4.2.2 Visual Feature Aggregation Methods
  4.3 Image Search with SVD-based Indexing Approach
  4.4 Evaluation
    4.4.1 Experimental Setup
    4.4.2 Evaluation Results
5 Hybrid Indexes for Spatial-Visual Search
  5.1 Preliminaries and Problem Description
  5.2 Indexing Approaches
    5.2.1 Baseline Index Structures
    5.2.2 Hybrid Index Structures
  5.3 Experiments
    5.3.1 Experiment Setup
    5.3.2 Result Accuracy
    5.3.3 Index Performance
6 A Class of R*-tree Indexes for Spatial-Visual Search of Geo-tagged Street Images
  6.1 Preliminaries
    6.1.1 Image Model
    6.1.2 Queries on Geo-tagged Images
  6.2 Spatial-Visual Indexes
    6.2.1 Baseline Index Structures
    6.2.2 Hybrid Index Structures
  6.3 Experiments
    6.3.1 Experiment Setup
    6.3.2 Experimental Results
7 Smart-City Applications using Spatial-Visual Search
  7.1 Image Classification to Determine the Level of Street Cleanliness: A Case Study
    7.1.1 Image Dataset and Background on Classifiers
    7.1.2 Approaches
    7.1.3 Experiments
  7.2 Recognizing Material of A Covered Object: A Case Study with Graffiti
    7.2.1 Problem Definition and Dataset
    7.2.2 Proposed Approaches
    7.2.3 Experiments
  7.3 A Deep Learning Approach for Road Damage Detection
    7.3.1 Road Damage Detection Solution
    7.3.2 Experiments
8 Related Work
  8.1 Related Work for Image Scene Localization
  8.2 Related Work for Spatially Aggregated Visual Features
  8.3 Related Work for Spatial-Visual Search
  8.4 Related Work for Smart-City Applications
    8.4.1 Street Cleanliness Classification
    8.4.2 Covered Material Recognition
9 Conclusions and Future Work
  9.1 Conclusions
  9.2 Future Work
Reference List

List of Tables

3.1 Datasets
3.2 Sizes (km²) of Regions Generated from Spatial Organization of Datasets
4.1 Geo-tagged Image Datasets
5.1 Notation Table
5.2 Space Cost of Various Index Structures
5.3 Query I/O Cost of Various Index Structures
5.4 Result Accuracy of Various Index Structures
5.5 Query Parameters
6.1 Asymptotic Analysis of the Hybrid Indexes
6.2 Street Geo-tagged Image Datasets
6.3 Index Construction Settings
6.4 Query Settings
6.5 Indexes for SF
7.1 LASAN's Categories of Street Scenes based on the Cleanliness Level
7.2 Dataset Distribution among Image Labels
7.3 Parameter Values in Geo-spatial LCS
7.4 Graffiti Dataset Distribution among the Labels of Covered Materials
7.5 Road Damage Types [117]
7.6 The Distribution of Training Datasets (Original, Augmented, and Cropped) among the Road Damage Classes
7.7 Parameter Values for Experiments
7.8 F1 Scores for the Model Trained using D
7.9 F1 Scores for the Model Trained using D_a
7.10 F1 Scores for the Model Trained using D_c

List of Figures

1.1 Data Flow of Geo-tagged Image for Smart-City Applications
1.2 Image Camera Location vs. Image Scene Location
1.3 Image Scene Localization Framework
1.4 Image Organization using LSH
1.5 SVD-based Indexing Approach for Image Search
1.6 Analysis of the locality of similar visual features of street images at four locations using a spatial radius (S) = {200m, 300m, 400m, 500m} and top-k similar images (k) = 10
2.1 Image Field-of-View (FOV) Representation
3.1 Image Scene Location Using Metadata
3.2 A Vision-based Pipeline for Constructing a Scene Location-tagged Image Dataset
3.3 Examples of Scene Location Estimation: FOV (Green), Metadata-based Scene Location (Blue), and Vision-based Scene Location (Red)
3.4 Organizing Images Spatially for Image Localization
3.5 Examples of Inaccuracies of Camera Localization
3.6 An Example of the R-tree Spatial Hierarchy of D
3.7 Example of Preliminary Classification Results using ISL-PHC
3.8 Example of Preliminary Classification Results using ISL-LHC
3.9 Heat Map of Dataset Distribution
3.10 Training Time for Localization Approaches
3.11 The Efficiency of Image Localization Approaches
4.1 A Framework for Generating SVD Descriptors for Geo-tagged Images
4.2 A Sequenced Image
4.3 A Panorama Image
4.4 Evaluation of the Effectiveness of SVD-based Indexing Approach for Similarity Search
4.5 Evaluation of the Aggregation Methods (C, P, and A): (a)-(c) for all Datasets, while (d) for the OR Dataset
4.6 Evaluation of the Selected G(I_R) using the Selection Methods (S_ψ, S_ξ, S_Ωs, and S_Ωv) for all Datasets
5.1 Double Index Structure (R*-tree (left), LSH (right))
5.2 Two-level Index Structures
5.3 Recall of Baseline vs. Hybrid w/ SU-VU
5.4 Impact of Query Selectivity on Hybrid w/ Flickr
5.5 Impact of Exploration on Recall w/ Flickr
5.6 Impact of Exploring Spatially (Bottom X Axis) and Visually (Top X Axis) on Recall w/ Flickr
5.7 Impact of Spatial Range (Bottom X Axis) and Visual Range (Top X Axis) on Recall w/ Flickr
5.8 Performance of Baseline vs. Hybrid w/ SU-VU
5.9 Impact of Query Selectivity on Hybrid w/ Flickr
5.10 Impact of Exploration on Performance w/ Flickr
5.11 Impact of Exploring Spatially/Visually on Performance w/ Flickr
6.1 An Example Geo-tagged Image Dataset and Spatial-Visual Query
6.2 Constructing and Querying SRI and VRI using the Image Dataset and Q_sv Defined in Figure 6.1
6.3 Constructing and Querying PSV and ASV using the Image Dataset and Q_sv Defined in Figure 6.1
6.4 An Illustration Example for the Inaccuracy Issue of PSV
6.5 Comparison of Indexes w.r.t. # of Nodes & Index Size
6.6 Comparison of Indexes w.r.t. Optimization Criteria
6.7 Baseline vs. Hybrid
6.8 The Impact of Varying Query Visual Range of Q_sv on the Index Efficiency
6.9 The Impact of Varying Query Visual Range of Q_sv on the Index Effectiveness
6.10 The Impact of Varying Query Spatial Range of Q_sv on the Index Efficiency
6.11 The Impact of Varying Query Spatial Range of Q_sv on the Index Effectiveness
6.12 The Effect of Varying # of Clusters for CSV
6.13 The Impact of Varying Weight of image spatial properties (α) on ASV
6.14 The Impact of Varying Dimensionality of Data Space of I.v' on ASV
6.15 The Impact of Varying Large-scale Dataset (SF)
7.1 Image Examples for the Categories of Street Scenes based on the Cleanliness Level
7.2 The Effectiveness of the Global Classification Scheme (GCS) Approach
7.3 The Effectiveness of the Geo-spatial Local Classification Scheme (LCS) using ...
7.4 Image Examples for Materials Covered by Graffiti
7.5 One-Phase Learning Approach (OLA)
7.6 Two-Phase Learning Approach (TLA)
7.7 Proposed Approaches Evaluation
7.8 Annotation Example
7.9 Image Examples for the Road Damage Types [117]
7.10 A Deep Learning Approach for Road Damage Detection and Classification
7.11 The Impact of Varying the Value of NMS using Different Models
7.12 An Image Containing Overlapped Ground-truth Boxes

Abstract

Due to continuous advances in camera technologies as well as camera-enabled devices (e.g., CCTV, smartphones, vehicle blackboxes, and GoPro), urban streets have been documented by massive amounts of images. Moreover, images are nowadays typically tagged with spatial metadata obtained from various sensors (e.g., GPS and digital compass) attached to or embedded in cameras. Such images are known as geo-tagged images. The availability of such geographical context enables several emerging image-based smart-city applications. Developing such smart-city applications requires searching for images among the massive amounts of collected images, especially to be used for training various machine learning algorithms. Thus, there is an immense need for a data management system for geo-tagged images.

Towards this end, it is paramount to build a data management system that organizes the images in structures that enable searching and retrieving them efficiently and accurately. On one hand, the data management system should overcome the challenge of lacking an accurate spatial representation for legacy images that were collected without spatial metadata, as well as represent the content of an image accurately using an enriched visual descriptor. On the other hand, the system should also enable efficient storage of images utilizing both their spatial and visual properties, and thus their retrieval based on spatial-visual queries. To address these challenges, we present a system which includes three integrated modules: a) modeling an image spatially by its scene location using a data-centric approach, b) extending the visual representation of an image with the feature set of multiple similar images located in its vicinity, and c) designing index structures that expedite the evaluation of spatial-visual queries.

Chapter 1
Introduction

1.1 Motivation

Due to continuous advances in camera technologies as well as camera-enabled devices (e.g., CCTV, smartphones, vehicle blackboxes, and GoPro), urban streets have been documented by massive amounts of images. Moreover, images are nowadays typically tagged with spatial metadata obtained from various sensors (e.g., GPS and digital compass) attached to or embedded in cameras.
Such images are known as geo-tagged images. The availability of such geographical context enables several emerging image-based smart-city applications, such as street cleanliness classification [14, 39, 100], material recognition [22, 40, 159], road damage detection [23, 117, 170], and situation awareness of events [12, 104, 113, 116]. Developing such smart-city applications requires searching for images among the massive amounts of collected images, especially to be used for training various machine learning algorithms. Thus, there is an immense need for a data management system to enable efficient and accurate image search.

One form of image search uses only the visual features of images (referred to as visual queries), while another utilizes only the spatial metadata of images (referred to as spatial queries). Each form requires organizing images differently using an index structure to support accessing images efficiently with respect to the query form. In particular, indexes for visual queries include the locality sensitive hashing (LSH) index [77], the inverted file-based index [146], and the vocabulary tree [122]. Meanwhile, for spatial queries, several indexes based on R-tree [98, 111, 92, 110] or grids [115, 33] have been proposed. However, emerging smart-city applications and image-based ML systems require a hybrid of the two forms, called spatial-visual queries.

The task of image search is primarily affected by an effective representation, which models an image spatially and visually to depict its location and content properly. For the spatial representation, an image can be represented straightforwardly using its camera location or its geographical extent (known as the image field-of-view (FOV) [32]). The shortcoming of the camera location is that the scene depicted in an image can be located in any direction and at any distance from the camera. Meanwhile, the FOV descriptor may describe the spatial extent of an image only loosely, especially when the camera capturing the image is far from the image scene. Therefore, there is a demand for a better descriptor that provides a more accurate spatial representation of an image. For visual representations of images, various techniques have been proposed to extract visual features of an image's content such as color, texture, shape, and objects. Such representations include color histogram [123], Gabor [123], SIFT [109], SURF [37], and convolutional neural network (CNN) [143, 35, 156] features. Conventionally, a high-dimensional visual descriptor of an image has been extracted independently from a single image. However, most state-of-the-art indexes utilize low-dimensional visual descriptors to avoid the computational overhead of high dimensionality in image search, which sacrifices search accuracy. Hence, there is a need for a visual descriptor that balances the trade-off between accuracy and performance in image search. Especially with the availability of the spatial metadata tagged with images, the conventional visual descriptor of an image can be enriched with the visual features of similar images located in the same neighborhood.

However, many legacy images still lack location data, which limits their utility for spatial search. Hence, many researchers have investigated approaches to estimate the locations of such images, i.e., where on the Earth an image was captured, utilizing their similarity to the available set of geo-tagged images [158, 73, 166, 93, 139, 31, 174, 155]. This task is referred to as image localization.
The task of image localization impacts the efficiency of image search. Thus, it is crucial to localize images precisely to describe their spatial context.

1.2 Thesis Statement

Figure 1.1: Data Flow of Geo-tagged Image for Smart-City Applications

Towards exploiting geo-tagged images in various smart-city applications (see Figure 1.1), it is paramount to build a data management system that organizes the images in structures that enable searching and retrieving them efficiently and accurately. On one hand, the data management system should overcome the challenge of lacking an accurate spatial representation for legacy images that were collected without spatial metadata, as well as represent the content of an image accurately using an enriched visual descriptor. On the other hand, the system should also enable efficient storage of images utilizing both their spatial and visual properties, and thus their retrieval based on spatial-visual queries. More formally, the thesis statement is:

Enabling spatial-visual search for images requires effective representation of their spatial and visual features, and then, developing efficient and accurate index structures to organize these spatial and visual features in tandem.

1.3 Image Scene Localization

Image localization requires analyzing a reference set of geo-tagged images. Thus, a precise representation of the spatial context of the reference images is crucial for accurate localization. There are different ways to represent the spatial context of an image. The most widely used one is a point location, which reflects the camera location. However, the camera location does not precisely reflect the regional location of the actual visual content of an image (referred to as the image scene location). The difference between the camera location and the scene location of an image can be clarified by the example shown in Figure 1.2. Image A was taken by a user standing at Gantry Plaza State Park, NY, and captures a scene of skyline buildings located in Manhattan, NY. Meanwhile, Image B shows the scene viewed at the location of the user who took Image A, illustrating that the view at the camera location can be completely different from the image scene location. Moreover, the distance between the camera and scene locations can be significant. In summary, the scene location is a more accurate spatial representation of an image.

Figure 1.2: Image Camera Location vs. Image Scene Location

Several researchers have tackled the challenge of image localization. Some proposed image retrieval-based approaches [73, 166, 93], while others proposed classification-based approaches [158, 155]. The retrieval approaches search for similar images in a reference set of geo-tagged images to infer the location of a query image based on the locations of the most similar retrieved images. Meanwhile, the classification-based approaches generate geographical regions from the reference images and train a classifier to predict the region containing a query image. Note that all of these approaches utilize the camera locations of the reference images and hence estimate the camera location of a query image. Alternatively, we aim at estimating the scene location of a query image, which is a more accurate representation of the image location.

In this thesis, we introduce a novel image scene localization framework [18] (see Figure 1.3) using spatial-visual classification by addressing three challenging issues.
First, scene localization requires a reference set of images tagged with scene locations; however, the publicly available reference images are mostly tagged with only camera point locations. We propose two approaches for estimating scene locations: metadata-based and vision-based. Second, the spatial organization structure, which generates the geographical classes of images, should be modified to accommodate the new spatial representation using images' scene locations, which are represented by small regions rather than points. Thus, we choose R-tree as the data organization structure. Third, any classification algorithm must deal with the error in estimating image scene locations. To alleviate classification inaccuracy, we propose multiple hierarchical classification approaches that take advantage of the spatial hierarchy of the R-tree structure to enable learning image features at different granularities.

Figure 1.3: Image Scene Localization Framework

1.4 Spatial Aggregation of Visual Features for Image Search

The search problem in a big image dataset has been the focus of various research efforts [77, 146, 122, 21]. In general, the task of image search is associated with two issues: a) modeling an image with a visual descriptor that depicts its content properly, and b) organizing an image dataset efficiently to expedite access. To address the first issue, many visual representations have been proposed to capture various features of an image's content. To address the second issue, various index structures (e.g., the locality sensitive hashing (LSH) index [77]) have been devised to enable efficient image organization and speed up image search. However, there exists a fundamental challenge: due to the high dimensionality of visual descriptors, image index structures have traded search accuracy for performance. Conventionally, a visual descriptor of an image has been extracted independently from a single image (referred to as a conventional visual descriptor (CVD)). A CVD usually consists of a high-dimensional feature vector that represents an image precisely; however, such vectors add significant complexity to searching images in a database. To speed up searching, most image indexes use various forms of dimension reduction (e.g., hashing, clustering, and principal component analysis (PCA)), which may result in missing some of the images of interest.

Figure 1.4: Image Organization using LSH

It is critical to overcome this challenge to achieve high-performance and accurate searching using an index, especially for a big image database. One approach is to consider a better visual descriptor of images that is capable of minimizing the impact of the dimension reduction techniques used for image indexing. Thus, we introduce a new way of defining a visual descriptor of an image (referred to as the reference image) considering multiple images (referred to as the image group) which are similar to the reference image and also located in its spatial neighborhood. Identifying such images is now feasible by utilizing the spatial metadata tagged with images, especially since the number of geo-tagged images has been growing due to the ubiquity of GPS-equipped cameras (e.g., smartphones). We refer to this novel visual descriptor as the Spatially Aggregated Visual Feature Descriptor (SVD). Using SVD, the feature set of an image is expanded to include the features of other images located in its vicinity.

Figure 1.5: SVD-based Indexing Approach for Image Search
Hence, our main hypothesis is that using SVD for organizing images in an index structure ensures that the result of a similarity search in a low-dimensional space potentially remains the same as the one in a high-dimensional space (i.e., minimizing the impact of dimension reduction techniques), and thus significantly improves the accuracy of image search.

To prove our hypothesis, we empirically evaluated the use of SVD for index-based similarity search. As ground truth, we performed a linear search using CVD in a high-dimensional space. Thereafter, we organized images in an index structure using two approaches: one using CVD and the other using SVD. The SVD-based indexing approach only changes the index construction mechanism to obtain a better image organization (i.e., ensuring that similar images are stored together; for an example, see Figure 1.4), while the search mechanism remains the same. As a result, both indexing approaches (i.e., SVD-based and CVD-based) retrieve images based on the similarity of their CVD descriptors. The SVD-based indexing approach is illustrated in Figure 1.5.

To obtain SVD descriptors for geo-tagged images, we introduce a framework for pre-processing images using their spatial and visual properties in a two-step pipeline [15]. For a given reference image, the first step aims at selecting a subset of images that satisfies two criteria: spatial proximity and visual similarity. To fulfill the two selection criteria, a hybrid query (referred to as a spatial-visual query) finds the top-k images that are spatially and visually close to the reference image. The second step aims at obtaining a rich visual descriptor depicting the scene captured by the reference image along with the selected top-k images. In particular, this step aggregates the visual features of the selected images with those of the reference image to generate a new visual descriptor for the reference image. Aggregating visual features can be performed by combining the selected images into a synthesized image (either a sequenced or a panorama image) or by manipulating their visual features directly.

1.5 Hybrid Indexes for Spatial-Visual Search

The large scale of geo-tagged image datasets and the demand for real-time responses make it critical to develop efficient spatial-visual query processing mechanisms. Towards this end, we focus on designing index structures that expedite the evaluation of spatial-visual queries [21]. The first intuitive indexing approach is to create two separate indexes, one for spatial data (i.e., R*-tree [38]) and another for visual data (i.e., locality sensitive hashing (LSH) [77]). Another naive approach is to organize the dataset using one of its data types, either spatial or visual, and then augment the data structure with the features of the other type. We study both variations of this approach: augmented R*-tree and augmented LSH. In addition to these three baseline approaches, we propose a set of novel hybrid index structures based on R*-tree and LSH. In particular, we propose a class of hybrid approaches with a two-level index structure consisting of one primary index associated with a set of secondary structures. There are two variations of this class: using R*-tree as the primary structure (termed the Augmented Spatial First Index) or using LSH as the primary structure (termed the Augmented Visual First Index). In Chapter 5, we study and evaluate all these variations.
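To make the two-level "primary index plus secondary structures" idea concrete, the Python sketch below evaluates a spatial-visual range query with the spatial property filtered first. It is only an illustration under simplifying assumptions: a uniform grid stands in for the R*-tree primary index, a brute-force Euclidean check stands in for the per-cell visual secondary structure, and every name here (SpatialFirstIndex, spatial_visual_query, cell_size, sigma) is a hypothetical label introduced for this sketch rather than the thesis implementation.

```python
# Minimal "spatial-first" two-level structure: a grid of cells (stand-in for
# the R*-tree primary index), each cell holding the visual descriptors needed
# to finish the query locally (stand-in for a per-cell secondary structure).
from collections import defaultdict
import math

class SpatialFirstIndex:
    def __init__(self, cell_size=0.01):
        self.cell_size = cell_size          # cell width in degrees (assumption)
        self.cells = defaultdict(list)      # (cx, cy) -> [(img_id, lat, lon, feat)]

    def _cell(self, lat, lon):
        return (int(lat // self.cell_size), int(lon // self.cell_size))

    def insert(self, img_id, lat, lon, feat):
        self.cells[self._cell(lat, lon)].append((img_id, lat, lon, feat))

    def spatial_visual_query(self, lat, lon, radius_deg, query_feat, sigma):
        """Ids of images inside the spatial range whose descriptor is within
        visual distance sigma of the query descriptor."""
        results = []
        span = int(math.ceil(radius_deg / self.cell_size))
        qx, qy = self._cell(lat, lon)
        for cx in range(qx - span, qx + span + 1):       # spatial filter first
            for cy in range(qy - span, qy + span + 1):
                for img_id, ilat, ilon, feat in self.cells.get((cx, cy), []):
                    if (ilat - lat) ** 2 + (ilon - lon) ** 2 > radius_deg ** 2:
                        continue
                    if math.dist(feat, query_feat) <= sigma:  # visual filter second
                        results.append(img_id)
        return results
```

Because each image lives in exactly one cell, a query only runs the (comparatively expensive) visual comparison on the small candidate set surviving the spatial filter, which is the intuition behind the spatial-first hybrid designs evaluated in Chapter 5.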
1.6 A Class of R*-tree Indexes for Spatial-Visual Search of Geo-tagged Street Images

Based on our prior work [21], throughout our various experiments with real-world street images, we observed that, counter to our intuition, applying the "spatial" filter first always wins over applying the "visual" filter first. This led to a key observation: nearby street images are also visually similar (in retrospect, this observation sounds obvious now!). We call this phenomenon the "locality of similar visual features of street images". As an example, we analyzed several datasets of geo-tagged urban street images by evaluating, using the root mean square error (RMSE) metric, the difference between the set of visually similar images and the set of images located in the spatial proximity of a query image. As shown in Figure 1.6, across all of the examined datasets, at least 80% of the top-10 similar images are located in the neighborhood (ranging from 200 to 500 meters) of a reference image. Consequently, these results confirm the validity of the "locality of similar visual features" observation.

This observation is important since it shows that traditional approaches may not work for the new spatial-visual search. Moreover, this characteristic distinguishes spatial-visual search from spatial-textual search [62, 175, 87, 70, 153], which has been studied in the past. In particular, the text in a geo-tagged tweet does not necessarily depict an event happening at the location of the user who posted the tweet. On the contrary, the visual content of a geo-tagged image depicts a scene at the image location. Hence, conventional hybrid indexing for spatial-textual search is not feasible for spatial-visual search. Alternatively, a spatial index (e.g., R*-tree [38]) alone can be a more likely candidate for spatial-visual search.

Figure 1.6: Analysis of the locality of similar visual features of street images at four locations using a spatial radius (S) = {200m, 300m, 400m, 500m} and top-k similar images (k) = 10.

To support spatial-visual search of geo-tagged images, we propose a class of spatial-visual indexes [17] based on R*-tree [38], which is one of the state-of-the-art spatial indexes. Our first approach is to use R*-tree to organize images using either spatial or visual features while augmenting the leaf nodes with both types of image features (referred to as the Spatial R*-tree Index (SRI) and the Visual R*-tree Index (VRI), respectively). In addition to these two baseline approaches, we propose a set of novel hybrid index structures. One is to maintain both spatial and visual features in the R*-tree (dubbed Plain Spatial-Visual R*-tree (PSV)). To distinguish between the spatial and visual properties of images and to allow emphasizing one property over the other, we propose another approach that organizes images adaptively using both properties (referred to as Adaptive Spatial-Visual R*-tree (ASV)). Finally, we propose another design that optimizes the representation of the visual features by clustering images in the tree structure (referred to as Clustered Adaptive Spatial-Visual R*-tree (CSV)).

1.7 Thesis Overview

The structure of the thesis is organized as follows. Chapter 2 reviews the well-known representations for describing an image spatially and visually and also provides a background on the state-of-the-art indexes for organizing either spatial data or visual data. Chapter 3 presents our image scene localization framework.
Chapter 4 introduces an enriched visual descriptor of images obtained by aggregating the visual features of nearby images. Chapter 5 presents novel hybrid indexes for spatial-visual search of geo-tagged images, while Chapter 6 presents novel approaches for extending R*-tree for spatial-visual search. Chapter 7 discusses several smart-city applications that can benefit from spatial-visual search. Chapter 8 reviews related work to provide the context for the contributions of the thesis. Finally, Chapter 9 summarizes the thesis and outlines possible directions for future work.

Chapter 2
Background and Preliminaries

2.1 Image Model

In an image database, each image is represented by two types of properties: spatial and visual.

2.1.1 Spatial Representation

One simple descriptor comprises the geographical coordinates (i.e., latitude and longitude) of the camera used at image capture time (referred to as the camera location). The shortcoming of this descriptor is that it does not accurately describe the spatial extent of the scene depicted in an image (e.g., the scene of an image can be located in any viewing direction and at any distance from the camera location). Alternatively, an image can be described spatially by a richer descriptor known as the image field-of-view (FOV) [32] (see Figure 2.1), comprising the camera position, viewing direction, viewable angle, and maximum visible distance. The FOV descriptor is a more accurate representation as compared to the camera location. These two spatial representations are formally defined in Definitions 1 and 2.

Definition 1 (Image Camera Location). Given an image I, the image camera location is the geographical coordinates of the camera position when capturing the image I. The geographical coordinates can be obtained as a GPS point in latitude and longitude.

Definition 2 (Image Field of View [32]). Given an image I, the image field of view is represented with four parameters (acquired at image capturing time), FOV ≡ ⟨p, θ, R, α⟩, where p is the camera position in latitude and longitude obtained from the GPS sensor, θ is the angle with respect to the North obtained from the digital compass sensor to represent the viewing direction d, R is the maximum visible distance at which an object can be recognized, and α denotes the visible angle obtained from the camera lens property at the current zoom level.

Figure 2.1: Image Field-of-View (FOV) Representation

2.1.2 Visual Representation

To enable image search, several methods in the multimedia and computer vision research areas have been developed to extract the visual features of an image. Such visual features include color, edges, texture, and SIFT [109]. These features can be used to represent images either locally or globally by utilizing aggregation methods (e.g., bag-of-words [146] and VLAD [79]). Recently, state-of-the-art computer vision algorithms have been utilized to extract rich visual features from convolutional neural networks (CNN). The CNN features are typically extracted from the last fully-connected network layer and used as a descriptor (a vector of 4096 dimensions) for image search [143, 35, 156]. Moreover, several approaches were developed for improving CNN-based descriptors [34, 151, 83] to overcome traditional limitations, including image scaling, cropping, and cluttering. One of the most promising approaches is regional maximum activations of convolutions (R-MAC) [151], which aggregates several image regions into a compact feature vector of 512 dimensions.
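The short Python sketch below gathers the pieces of this image model into one record: the camera location and FOV parameters of Definitions 1 and 2, plus a visual descriptor vector (e.g., a 4096-d CNN feature or a 512-d R-MAC feature). The flat-earth offset and the placement of the two far FOV corners at bearings θ ± α/2 and distance R are simplifying, illustrative assumptions, and all identifiers (GeoTaggedImage, fov_triangle, _offset) are hypothetical names introduced here.

```python
# Illustrative image model: spatial properties (camera location + FOV) and a
# visual descriptor, with a helper that approximates the FOV as a triangle.
import math
from dataclasses import dataclass
from typing import List, Tuple

METERS_PER_DEG_LAT = 111_320.0  # rough flat-earth constant (assumption)

@dataclass
class GeoTaggedImage:
    image_id: str
    p: Tuple[float, float]   # camera location (latitude, longitude)
    theta: float             # viewing direction, degrees clockwise from North
    R: float                 # maximum visible distance, meters
    alpha: float             # viewable angle, degrees
    v: List[float]           # visual descriptor (e.g., CNN or R-MAC vector)

def _offset(p: Tuple[float, float], bearing_deg: float, dist_m: float):
    """Destination point from p along a compass bearing (flat-earth approx.)."""
    lat, lon = p
    b = math.radians(bearing_deg)
    dlat = dist_m * math.cos(b) / METERS_PER_DEG_LAT
    dlon = dist_m * math.sin(b) / (METERS_PER_DEG_LAT * math.cos(math.radians(lat)))
    return (lat + dlat, lon + dlon)

def fov_triangle(img: GeoTaggedImage):
    """Approximate the FOV as a triangle: the camera position plus the two
    far corners at bearings theta - alpha/2 and theta + alpha/2."""
    left = _offset(img.p, img.theta - img.alpha / 2.0, img.R)
    right = _offset(img.p, img.theta + img.alpha / 2.0, img.R)
    return [img.p, left, right]
```

The triangle returned by fov_triangle is the same shape that Section 3.1.2 starts from when estimating an image's scene location from its metadata.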
2.2 The State-of-the-art Indexes for Spatial and Visual Data

2.2.1 R*-tree

R*-tree [38] is one of the state-of-the-art index structures for spatial data. R*-tree is a multiway hierarchical tree that recursively covers data spatially. Each node has at least m and at most M entries and comprises the minimum bounding rectangle (MBR) that surrounds its child entries. R*-tree extends R-tree [68] with optimization techniques targeting three criteria: minimizing node overlap, margin (i.e., perimeter length), and area. To insert a data object into an R*-tree, the tree is traversed recursively from the root node. At each step, the MBRs of the child nodes are examined, and a candidate is chosen using a heuristic strategy aiming at satisfying one or a combination of the optimization criteria. Thereafter, the traversal descends into the chosen child node until reaching a leaf node. If the leaf node overflows (i.e., the number of entries is larger than M), it is divided into sub-nodes using a Split function. This function employs the optimization criteria to first find the best data dimension (e.g., choosing between latitude and longitude in the case of GPS data) for performing the division, and then find the best data distribution among the two sub-nodes. Thereafter, the split might be propagated up to the root node. For range searching, on the other hand, the tree is recursively traversed from the root node using the query range. At each step, every MBR of the child nodes is inspected for overlap with the query range. If it overlaps, the corresponding child node is also searched. When a leaf node is reached, the corresponding objects are examined again to report the ones that satisfy the query parameters.

2.2.2 Locality Sensitive Hashing (LSH)

Given that images are represented by feature-rich vectors, finding relevant images is defined as a similarity search in a high-dimensional space. Several tree-based index structures (e.g., M-Tree [55]) have been proposed for exact-match similarity search in low-dimensional spaces. However, in high-dimensional spaces, their performance degrades below that of linear scan approaches [61]. To perform approximate similarity search, several methods have been proposed (e.g., [30], [77]), among which locality-sensitive hashing (LSH) [77] is widely used because of its theoretical guarantees and empirical performance. The key notion of LSH is to use a set of hash functions, from a hash family H in a metric space M, which map similar objects into the same bucket with higher probability than dissimilar objects. Given M and H, the LSH index structure maintains a number of hash tables containing references to the objects in the dataset. LSH [77] was originally introduced for the Hamming metric space, and later other variants were devised for other metric spaces [25]. In this thesis, we assume that M is the d-dimensional Euclidean space R^d and H is defined as follows [61]:

H(o) = ⟨h_1(o), h_2(o), ..., h_F(o)⟩
h_i(o) = ⌊(a_i · o + b_i) / W⌋,  1 ≤ i ≤ F.

Given an object o ∈ R^d, h_i ∈ H first projects o onto a random vector a_i ∈ R^d whose entries are chosen independently from the standard normal distribution N(0, 1). Subsequently, the projected value is shifted by a real number b_i drawn from the uniform distribution [0, W), where W is a user-specified constant. LSH consists of T hash tables, each with F independent hash functions.
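As a concrete reading of the hash family just defined, the sketch below implements the projection-based functions h_i and the T tables in Python with numpy. The class name E2LSH, the default parameter values, and the plain-dict table layout are illustrative choices for this sketch, not part of the thesis; refinements such as multi-probe lookups are omitted.

```python
# h_i(o) = floor((a_i . o + b_i) / W): random Gaussian projections, a uniform
# shift in [0, W), and a floor; F values per table form the bucket key, and T
# independent tables are maintained.
import numpy as np
from collections import defaultdict

class E2LSH:
    def __init__(self, dim, T=8, F=10, W=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((T, F, dim))       # one (F x dim) matrix per table
        self.B = rng.uniform(0.0, W, size=(T, F))       # one offset vector per table
        self.W = W
        self.tables = [defaultdict(list) for _ in range(T)]

    def _keys(self, o):
        o = np.asarray(o, dtype=float)
        H = np.floor((self.A @ o + self.B) / self.W).astype(int)
        return [tuple(row) for row in H]                # one F-tuple key per table

    def insert(self, obj_id, descriptor):
        for table, key in zip(self.tables, self._keys(descriptor)):
            table[key].append(obj_id)

    def candidates(self, descriptor):
        """Union of the buckets the query descriptor hashes to in every table."""
        found = set()
        for table, key in zip(self.tables, self._keys(descriptor)):
            found.update(table.get(key, []))
        return found
```

candidates() returns only the bucket contents; the distance filtering against the visual range σ described next is then applied to this candidate set rather than to the whole dataset.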
Given a query image I_Q, its visual descriptor is first extracted, and then projected and hashed to find the buckets which potentially contain similar images. The images stored in these candidate buckets are considered relevant images. Subsequently, a sub-list of relevant images whose distance to I_Q.v is within the visual range σ is retrieved. Since LSH uses the projection operation for dimensionality reduction, LSH returns a best-match list of relevant images that does not necessarily include all relevant images.

Chapter 3
Image Scene Localization

In this chapter, we propose a framework for image scene localization. First, we describe how to construct a reference dataset after defining the spatial representations of images, and we give a formal description of image scene localization. Then, we introduce our CNN-based localization approach using the spatial decomposition of image locations. Next, we extend our localization approach by utilizing hierarchical classification.

3.1 Problem Formulation

We first describe some preliminaries for designing an image scene localization framework. Next, we give a formal description of our problem.

3.1.1 Image Scene Location

As discussed in Chapter 2, an image can be represented spatially using the location of the camera capturing the image (referred to as the camera location) or using its spatial extent (referred to as the image field-of-view (FOV)). However, an FOV might describe the spatial extent of an image only loosely, especially when the scene depicted in the image is far from the camera location. Hence, we propose a more accurate spatial representation that specifies the scene location depicted in the image (referred to as the Image Scene Location), which is formally defined in Definition 3.

Definition 3 (Image Scene Location I_S). Given an image I, the image scene location I_S is the spatial extent of the scene depicted in the image. The scene location is defined by a minimum bounding box MBB which surrounds the geographical region R covering the image scene such that MBB(R) = <R_min, R_max>, where R_min = (min latitude(R), min longitude(R)) and R_max = (max latitude(R), max longitude(R)).

The image scene location most accurately represents the spatial context of an image; however, it requires extra spatial metadata to be collected at image capturing time. Unfortunately, such specific spatial metadata is not available for most images. Thus, in the following section we introduce approaches to estimate this extra metadata and construct a scene location-tagged reference dataset.

3.1.2 Estimating Image Scene Location

Using existing geo-tagged image datasets such as Flickr [1, 149], Google Street View [2], GeoVid [29], and MediaQ [89, 112], we provide two approaches for pre-processing such datasets to estimate the scene location of an image: metadata-based and vision-based.

A Metadata-based Approach

This approach assumes that each image is tagged with its FOV descriptor (i.e., camera location, viewing direction, angle, and viewable distance [32]). The shape of the FOV can be approximated by a triangle. In this triangle, one corner is the camera location while the scene lies closer to the other two corners. Thus, for estimating the scene location, we geometrically calculate the circle inscribed in the triangle, since the circle provides a tighter bound on the scene location than the FOV triangle does. For simplicity, the axis-aligned square inscribed in the circle is calculated to represent the image scene location. (This is a conservative definition; one can create a different one, such as the minimum bounding box of the circle.)
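A worked version of this metadata-based estimate is sketched below in Python: take the FOV triangle, compute its inscribed circle, and report the axis-aligned square inscribed in that circle as the scene MBB. The sketch assumes the triangle corners are already expressed in a local planar frame (meters east/north of the camera); converting the result back to latitude/longitude is left out, and the function names (incircle, scene_mbb) are illustrative additions.

```python
# Incircle of the FOV triangle, then the MBB of the axis-aligned square
# inscribed in that circle, all in a local metric frame.
import math
from typing import List, Tuple

Point = Tuple[float, float]

def incircle(tri: List[Point]) -> Tuple[Point, float]:
    """Incenter and inradius of a triangle given as three (x, y) points."""
    A, B, C = tri
    a = math.dist(B, C)   # side length opposite vertex A
    b = math.dist(C, A)   # side length opposite vertex B
    c = math.dist(A, B)   # side length opposite vertex C
    peri = a + b + c
    cx = (a * A[0] + b * B[0] + c * C[0]) / peri
    cy = (a * A[1] + b * B[1] + c * C[1]) / peri
    area = abs((B[0] - A[0]) * (C[1] - A[1]) - (C[0] - A[0]) * (B[1] - A[1])) / 2.0
    return (cx, cy), 2.0 * area / peri     # inradius = area / semi-perimeter

def scene_mbb(tri: List[Point]) -> Tuple[Point, Point]:
    """MBB of the axis-aligned square inscribed in the triangle's incircle."""
    (cx, cy), r = incircle(tri)
    h = r / math.sqrt(2.0)                 # half side of the inscribed square
    return (cx - h, cy - h), (cx + h, cy + h)

# Example: camera at the origin facing North, R = 100 m, alpha = 60 degrees.
far_y = 100.0 * math.cos(math.radians(30))
triangle = [(0.0, 0.0), (-50.0, far_y), (50.0, far_y)]
print(scene_mbb(triangle))
```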
Figure 3.1 shows an example of the scene location estimated based on an FOV descriptor.

Figure 3.1: Image Scene Location Using Metadata

A Vision-based Approach

If an image dataset does not have FOV descriptors but is tagged only with camera locations, a computer vision-based approach can be used for estimating the scene locations. In what follows, we present the pipeline (see Figure 3.2) of our vision-based approach for scene location estimation. The pipeline consists of three stages: spatial-visual search, 3D reconstruction, and scene geo-registration.

Figure 3.2: A Vision-based Pipeline for Constructing a Scene Location-tagged Image Dataset

Spatial-Visual Search: Assuming that every image in the dataset captures a scene containing distinctive objects, we aim at retrieving the images which capture a similar scene in the nearby area. Retrieval using only the spatial filter or only the visual filter is insufficient: using only spatial search might retrieve irrelevant images whose content is not similar to the scene captured by the image of interest, while using only visual search might retrieve similar images from different geographical regions. Hence, we use both of these filters in tandem to retrieve a set of images taken in the same region with a similar scene. Towards this end, we utilize an index structure, proposed in [21], which organizes a given dataset both spatially and visually to perform spatial-visual search efficiently.

Figure 3.3: Examples of Scene Location Estimation for (a) Building A and (b) Building B: FOV (Green), Metadata-based Scene Location (Blue), and Vision-based Scene Location (Red)

3D Reconstruction: Once a group of similar images in the same region is selected, we estimate extra spatial properties using 3D reconstruction. Capturing an image of a viewable scene projects the 3D points composing the scene onto 2D points in the image plane. The pinhole camera models this transformation using Eq. (3.1), which requires information about the camera parameters: the intrinsic parameter matrix K and the extrinsic parameters (i.e., the rotation matrix R and the translation vector T). Inversely, for a given image, the 3D points of the captured scene can be obtained if the parameters of the image's camera are known. When multiple images capturing a similar scene are available, the 3D points of the scene can be reconstructed using the well-known Structure from Motion (SfM) algorithm [137]. Under geometry constraints, a 3D point x can be inferred geometrically by finding two corresponding points (q, q′) in two images I and I′ (i.e., (q ≡ q′) → x | q ∈ I ∧ q′ ∈ I′). By acquiring a sufficient number of common 3D points with their corresponding 2D points in a set of similar images, the 3D points of the captured scene, as well as the camera parameters (i.e., K, R, and T), can be estimated algebraically.

[q 1]ᵀ = K(Rx + T)    (3.1)

Scene Geo-registration: In 3D reconstruction, the 3D points of the scene inferred from an image are estimated in a local coordinate system (referred to as the SfM system). However, the goal is to obtain the 3D points in the global coordinate system (referred to as the GPS system). Transforming between these two systems can be performed using Eq. (3.2), where x and g are 3D points in the SfM and GPS coordinate systems, respectively, and M is a similarity transformation matrix.

[g 1]ᵀ = M [x 1]ᵀ    (3.2)
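To make Eqs. (3.1) and (3.2) concrete before turning to how M is estimated, the short numpy sketch below projects a 3D point with the pinhole model and maps an SfM-frame point into the GPS frame with a homogeneous similarity transform. The matrices K, R, T, and M shown are arbitrary placeholders; in the actual pipeline they come from SfM and from the geo-registration step discussed next.

```python
# Pinhole projection (Eq. 3.1) and SfM-to-GPS similarity transform (Eq. 3.2).
import numpy as np

def project(K, R, T, x):
    """Pixel coordinates of 3D point x under the pinhole model."""
    q_h = K @ (R @ x + T)          # 3-vector proportional to [q; 1]
    return q_h[:2] / q_h[2]        # divide out the homogeneous scale

def sfm_to_gps(M, x):
    """Apply the 4x4 similarity transform M to an SfM-frame point x."""
    g_h = M @ np.append(x, 1.0)
    return g_h[:3] / g_h[3]

# Placeholder parameters (illustrative only).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R, T = np.eye(3), np.zeros(3)
M = np.eye(4)                      # identity: SfM frame coincides with GPS frame here
x = np.array([1.0, 0.5, 5.0])
print(project(K, R, T, x), sfm_to_gps(M, x))
```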
Estimating M requires using the camera positions of a set of images in both SfM and GPS coordinates [165]. Next, M is used to transform the 3D points of the image scene into GPS coordinates. Finally, the image scene location is the minimum bounding box surrounding the resulting GPS coordinates.

The Accuracy of the Vision-based Approach: For evaluating the accuracy of the proposed approach, we collected images in two regions (Figure 3.3; in Google Maps, we concealed building and street names for anonymity) and compared the outputs of the two scene estimation approaches: metadata-based and vision-based. The FOV descriptors were collected using a smartphone application (whose name is hidden for anonymity). We calculated the distance between the centers of the scene locations generated by the two approaches. Our results show that the average distance over all images was around 12 meters, demonstrating that both approaches produce close scene locations for the same image.

Scene Location-tagged Dataset

Finally, we generate an image dataset D (see Definition 4) where each image is tagged with both its camera and scene locations.

Definition 4 (Scene Location-tagged Image Dataset). A Scene Location-tagged Image Dataset D is a set of n images {I_1, I_2, ..., I_n}, where each image I_j is tagged with its camera location I_j^C and its estimated scene location I_j^S.

3.1.3 Image Scene Localization

Given D, where every image I_j ∈ D is tagged with its camera location I_j^C and scene location I_j^S, and a query image I_q whose location is unknown, image scene localization is formulated as estimating where the scene depicted in I_q is located:

ImageSceneLocalization(D, I_q) ⟹ I_q^S

3.2 Image Localization Using CNN

Due to the technical advances of convolutional neural networks (CNN) in the multimedia field [96], Weyand et al. [158] successfully framed the image localization problem as a classification problem with a CNN-based solution and reported more accurate results compared to computer vision-based approaches. In particular, the solution first generates a set of geographical regions containing images, using the camera locations of the images, and then a CNN is trained to predict the region which potentially contains the image location.

Designing a CNN-based solution mainly depends on the underlying spatial representation of images used to organize D spatially. When using the camera location representation, the CNN is trained to predict I_q^C (i.e., Image Camera Localization). Meanwhile, training a CNN based on the scene locations of D enables predicting I_q^S (i.e., Image Scene Localization). Below, we describe two CNN-based approaches.

Figure 3.4: Organizing Images Spatially for Image Localization: (a) Quadtree (Camera Location), (b) R-tree (Scene Location)

3.2.1 Image Camera Localization with Quadtree (ICL)

The CNN-based approach proposed by Weyand et al. [158] (known as PlaNet) organizes D based on the images' camera locations using a Quadtree. A Quadtree [138] adaptively divides the space into four equal-size regions (see Figure 3.4a), dividing a region when it contains more than a certain number (M) of images. Consequently, the generated regions (which contain images) are considered as the classes when training a CNN. When a query image I_q is classified, the model generates a probability distribution of assigning I_q to each region (i.e., a specific cell in the Quadtree).
Subsequently, the center of the region with the highest probability is reported as the camera location of I_q, i.e., I_q^C.

Since the classification approach relies on the geographical regions of images generated based on the camera location organization of D, classifying I_q intuitively leads to finding I_q^C. Since the distance between I_q^C and I_q^S can be either marginal or significant, the localization output of the approach is always accompanied by a distance-error estimation. Moreover, the classification approach based on camera location may suffer from two other types of inaccuracy. The prediction output of such classification can be false-positive (Figure 3.5a) or false-negative (Figure 3.5b), as formally defined in Definitions 5 and 6, respectively.

Definition 5 (False-positive Image Localization). Given a query image I_q and a localization kernel Φ based on a spatial distribution R = {r_1, r_2, ..., r_k}, the localization of I_q to r_i (i.e., <Φ(I_q) := r_i>) is false-positive if I_q^C ∈ r_i but I_q^S ∈ r_j and r_i ≠ r_j.

Definition 6 (False-negative Image Localization). Given a query image I_q and a localization kernel Φ based on a spatial distribution R = {r_1, r_2, ..., r_k}, the localization of I_q to r_i (i.e., <Φ(I_q) := r_i>) is false-negative if I_q^C ∉ r_i but I_q^S ∈ r_i.

The localization inaccuracy of ICL is not only due to the distance between the camera and the scene locations, but also due to class noise [43] (i.e., incorrect labeling of the reference images). Multiple images with different camera locations might capture the same scene, yet be assigned to different regions (see Figure 3.5c). Consequently, the classes that represent their assigned regions will have common visual features, and thus the observations in the learning phase will suffer from noise which negatively affects the classification accuracy.

Figure 3.5: Examples of Inaccuracies of Camera Localization: (a) False-Positive, (b) False-Negative, (c) Class Noise

3.2.2 Image Scene Localization with R-tree (ISL)

To address the inaccuracy problems encountered with ICL, we propose a new classification technique utilizing the image scene locations of D. In particular, we organize D spatially based on the scene location (a region) instead of the camera location (a point). To organize images based on their scene locations, a Quadtree is not an appropriate structure for two main reasons. First, a Quadtree divides the space into regions; thus an image whose scene location overlaps multiple regions belongs to multiple regions. Given that regions are used to represent classes, the classification accuracy can be affected negatively. Second, a Quadtree does not limit the minimum number of images contained in each region. Consequently, some of the generated regions can be dense while others are very sparse. Hence, the classifier becomes biased to better
Thus, R-tree is a more appropriate structure to efficiently organize D as shown in our experimental results. R-tree generates a spatial hierarchy of regions at different granularities. Similar to PlaNet [158], we can consider only the regions at the lowest level which correspond to the leaf nodes of the tree and train a CNN classifier to learn the visual features of the images of each region. At prediction time, the trained model classifies I q and outputs a discrete probability distribution over the regions. We choose the region with the highest probability. ThescenelocalizationapproachwithR-treeavoidsthepredictioninaccuracy encountered with ICL and minimizes the class noise problem since the images that are capturing the same scene will be assigned to the same region. However, it still incurs the distance estimation error due to the unavoidable learning error associated with any classification algorithm. We address this problem in the following section. Figure 3.6: An Example of the R-tree Spatial Hierarchical of D Example: Figure 3.6 shows an example of the spatial organization of D which is composedof3hierarchicallevelscomposedof15nodes(7internaland8leafnodes). ISL 27 constructs one classifier to discriminate among the images originated from the regions corresponding to the leaf nodes belonging to the third level. 3.3 Hierarchical Classification for Enhanced Scene local- ization The classification with ISL follows a naïve paradigm which builds one classifier to predict only the classes at leaf nodes which represent certain geographical regions. This approach has two major shortcomings. First, it builds one model to discriminate among a large number of classes. The classes corresponding to fine-grained regions potentially have common visual scenes within regions which increase the sensitivity to noised learn- ing of visual features. Second, this approach ignores learning the discriminative global characteristics across neighboring regions at a higher level. Combining small neighboring regions into a larger region enables learning the common features across small regions as well as the discriminative features within the large region. Thus, designing an approach which learns the visual characteristics at both local and global levels would improve the classification mechanism ofISL as we will show in our experimental results. Our frame- work utilizes the spatial hierarchy of geographical regions generated by the R-tree data structure for this purpose. 3.3.1 Design Strategies To construct the hierarchical learning of D for localizing new images, we propose two strategies for hierarchical classification: parent-based and level-based. The parent- based strategy utilizes the parent-child relationship among geographical regions; hence building a classifier for each internal node to discriminate between the image regions of the descending child nodes. Meanwhile, the level-based strategy utilizes the hierarchical levels which map to the granularity of the image regions; hence building a classifier for every level. 28 Parent-based Hierarchical Classification Scheme (ISL−PHC) Through the R-tree spatial hierarchy, when an image I belongs to a node X (i.e., I∈X), this implies thatI∈parent(X). Therefore, designing a classifier for parent(X) supports learning the characteristics of its contained images and discriminating between the local characteristics of the classes representing the child nodes. 
Moreover, devising another classifier onparent(parent(X)) enables recognizing the characteristics of images comprehensively at a higher level. This strategy lowers the chances of misclassification due to the decrease in the number of classes for each classifier. Example: Based on the spatial hierarchy of D depicted in Figure 3.6, ISL−PHC builds a classifier for every internal node. In total it generates 7 classifiers. For example, the classier corresponding to nodeA includes the classes corresponding to its child nodes A.1 and A.2. Level-based Hierarchical Classification Scheme (ISL−LHC) With the ISL−PHC scheme, each classifier considers locally only the observable characteristics of the images within its geographical scope discarding the images from other geographical scopes. Therefore, we propose another strategy which utilizes the levels of the spatial hierarchy of R-tree. For every level, the union of the geographical sub-regions referenced by every node covers the overall geographical region ofD but with various granularities. Hence, designing a classifier per level supports learning the char- acteristics of all images belonging to regions at different granularities (see Algorithm 1). In particular, the classifier for the level containing the leaf nodes is able to learn the local and detailed characteristics of the smallest regions, while the classifier of a higher level captures more comprehensive characteristics of the covered regions. Since the depth of R-tree is relatively small (i.e.,O(logn)), ISL−LHC potentially produces a fewer number of classifiers compared to ISL−PHC. Moreover, the classifier at the higher levels contains less number of classes which improves the classification accuracy of the corresponding classifier. 29 Example: Based on the spatial hierarchy of D depicted in Figure 3.6, ISL−LHC builds a classifier for each level. In total, it generates 3 classifiers. For example, the classier corresponding to level 1 includes the classes corresponding to sibling nodes A and B. 3.3.2 Prediction Strategies Figure 3.7: Example of Preliminary Classification Results using ISL−PHC Figure 3.8: Example of Preliminary Classification Results using ISL−LHC PredictionusingISL−PHC orISL−LHC isnotstraightforwardsincetheyproduce multiple results from the classifiers. In particular, using ISL−LHC, I q has to be classified using all classifiers at various granularities to predict the region which most likely contains I q . Since every classifier in ISL−PHC is limited to the geographical scope of the parent node, using one classifier (or a subset of classifiers) is not sufficient to find the scene location; thus I q has to be classified using all classifiers to consider the overall geographical scope. Consequently, all classifiers of ISL−PHC or ISL−LHC 30 has to be used at the prediction time and thus a strategy is needed to aggregate their preliminary results to decide the image scene location. Since classifiers in the hierarchical schemes are trained independently, their prelimi- nary classification results might include conflicts with respect to their spatial hierarchy. In particular, if the preliminary classification results imply thatI q is part of a child node X, the preliminary results should also imply thatI q is part ofparent(X) due to the spa- tial relationship between X and parent(X). However, this implication may not always hold. Suppose that the preliminary results of classification using both ISL−PHC and ISL−LHC are as shown in Figures 3.7 and 3.8, respectively. 
Based on the results of ISL−PHC, I_q is potentially assigned to the region B.1 when considering the preliminary results for all nodes at level 2. However, B (i.e., parent(B.1)) is unlikely to contain I_q based on the preliminary results of level 1. Similarly, based on the results of ISL−LHC, I_q is potentially assigned to B.2.1 using the classifier corresponding to level 3, whereas B.2 (i.e., parent(B.2.1)) is unlikely to contain I_q according to the classifier of level 2. Such inconsistency should be addressed. Hence, we investigate three prediction strategies, as explained below.

Greedy Prediction (GP)
GP is a top-down approach which traverses the spatial hierarchy using depth-first search and recursively selects the child node whose prediction probability is the highest. This strategy is simple but suffers from classification error propagation: if the classifier at a higher level misclassifies I_q, the final prediction output of GP is consequently inaccurate.
Example: Based on the preliminary classification results of ISL−PHC (Figure 3.7), GP assigns I_q to the A.2.1 region. Meanwhile, GP assigns I_q to A.1.2 based on the classification results of ISL−LHC (Figure 3.8).

Heuristic Prediction (HP)
HP exhaustively traverses the spatial hierarchy to assign a score to every path. The path score is heuristically the product of the prediction probabilities of all nodes composing the path (from the root to a leaf node). Thereafter, I_q is assigned to the region of the leaf node whose path score is the highest. By considering all paths, HP avoids the classification error propagation problem. However, HP discards the hierarchy concept by treating all nodes (from various levels) in a path identically while calculating the path score.
Example: Based on the preliminary results of ISL−PHC and ISL−LHC (Figures 3.7 and 3.8), HP assigns I_q to B.1.2.

Bayesian Prediction (BP)
BP utilizes a Bayesian network, a probabilistic graphical model that represents how causes generate effects and thus enables prediction and decision making under uncertainty. In our case, causes and effects are analogous to the parent-child relations of regions in the R-tree. The Bayesian network is used to aggregate the preliminary classification results for I_q and to perform collaborative error correction over possibly inconsistent predictions. The BP prediction is based on the conditional probabilities of parent nodes; hence the final prediction is influenced by the prediction probabilities of the parent nodes. Aggregating the preliminary classification results of the classes corresponding to a path P composed of the nodes {X_1, X_2, ..., X_n} is shown in Eq. 3.3. Similar to HP, BP avoids the classification error propagation problem since it traverses the whole spatial hierarchy. Nonetheless, BP also exploits the relations inherited from the spatial hierarchy to perform the aggregative prediction.
Example: Based on the preliminary results of ISL−PHC and ISL−LHC, BP assigns I_q to B.1.2.

P(X_1, X_2, ..., X_n) = \prod_{i=1}^{n} P(X_i \mid parent(X_i))   (3.3)

Algorithm 1 Level-based Hierarchical Classification Scheme
1: Let D denote the training set of a scene location-tagged image dataset, and R denote the R-tree which organizes D spatially.
2: for l_i ∈ Levels(R) do
3:   C_i ← {}
4:   for n_j ∈ Nodes(R, l_i) do
5:     Define a class c_ij = {I | I ∈ D ∧ I ∈ subtree(n_j)}
6:     C_i ← C_i ∪ {c_ij}
7:   end for
8:   Build a classifier g_i (corresponding to l_i) on the classes C_i
9: end for

3.4 Experiments

3.4.1 Experimental Methodology

Dataset: In our experiments, we used geo-tagged datasets [167] obtained from Google Street View for three particular geographical regions: a) part of Manhattan, New York City (referred to as GSV_MAN); b) downtown Pittsburgh, Pennsylvania (referred to as GSV_PIT); and c) downtown Orlando, Florida (referred to as GSV_ORL). A summary of the three datasets is shown in Table 3.1. Rather than evaluating localization at the world level, which requires a huge number of reference images, we chose to evaluate our localization approaches at finer granularities such as GSV_MAN, GSV_PIT, and GSV_ORL, which contain a sufficient number of images with respect to the area of their geographical regions. GSV_MAN contains the largest number of images, but its number of images per square kilometer is the lowest, and the images are unevenly distributed throughout the area (see Figure 3.9a). Meanwhile, GSV_PIT and GSV_ORL contain fewer images over smaller areas but with dense and even distributions (see Figures 3.9b and 3.9c).

Scene Location-tagged Dataset: To estimate the scene location of every I ∈ GSV_MAN, GSV_PIT, and GSV_ORL, the vision-based approach was used since the FOV metadata is not available. For performing 3D reconstruction, we used the open-source OpenSfM tool [5]. In the vision-based approach, the spatial-visual phase retrieved on average 4 similar images to perform 3D reconstruction. Note that, while general 3D reconstruction in computer vision requires many images to produce a realistic 3D object, our approach does not require the details of an object; a simple estimate of its outline is enough. Thus, a small number of images per region is satisfactory for estimating scene locations. The 3D reconstruction phase was the most time-consuming phase; it took on average 2 minutes per group of similar images. By running 4 parallel processes, the total time for performing the 3D reconstruction was around 8.8, 2, and 1.5 days for GSV_MAN, GSV_PIT, and GSV_ORL, respectively. Subsequently, scene location-tagged reference datasets were created.

Classifiers: To generate the geographical regions of each dataset, we organized the datasets spatially using a Quadtree (M = 100) and an R-tree (M = 100 and m = 50). The sizes of the regions generated from both trees are summarized in Table 3.2. Thereafter, the regions of images are used as classes, and 70% of the images per region are selected to train a CNN-based classifier. For training the CNN classifier, we adopt Caffe [81], an open-source CNN framework. To avoid building a trained model from scratch, we customized the ImageNet pre-trained model provided with Caffe and fine-tuned the last three original fully-connected layers: fc6, fc7, and fc8. In particular, the fc8 layer is modified to contain a number of neurons equal to the number of image classes (representing the regions generated by the Quadtree or R-tree). The weights in these three layers were initialized from a zero-mean Gaussian distribution with a standard deviation of 0.01 and zero bias. The rest of the layers were initialized using weights from the pre-trained model; a hedged sketch of this fine-tuning setup is given below.
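The following is a minimal, hedged sketch of the fine-tuning recipe just described. It uses PyTorch/torchvision as a stand-in for the Caffe reference model (AlexNet approximates the CaffeNet architecture), and the classifier-layer indices and the num_region_classes variable are illustrative assumptions rather than the thesis code.

    import torch
    import torch.nn as nn
    from torchvision import models

    # Sketch (not the thesis' Caffe setup): fine-tune the last three fully-connected
    # layers of an ImageNet pre-trained network for region classification.
    num_region_classes = 180                      # illustrative; set to the number of regions

    model = models.alexnet(pretrained=True)       # AlexNet ~ the CaffeNet reference model
    model.classifier[6] = nn.Linear(4096, num_region_classes)   # "fc8" resized to the classes

    for idx in (1, 4, 6):                         # the Linear layers ~ fc6, fc7, fc8
        layer = model.classifier[idx]
        nn.init.normal_(layer.weight, mean=0.0, std=0.01)        # zero-mean Gaussian, std 0.01
        nn.init.zeros_(layer.bias)                                # zero bias

    # Training schedule from the text: SGD with momentum 0.9 and a starting learning
    # rate of 1e-5 (the learning-rate decay would be added via a scheduler).
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)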
The network was trained for 500K iterations using stochastic gradient descent with a momentum of 0.9, a gamma of 0.1, and a starting learning rate of 0.00001, which we decay by a factor of 5e-4 every 10 epochs. Figure 3.10 shows the elapsed training time for our approaches: ICL, ISL, ISL−PHC, and ISL−LHC. Since the R-tree generates regions that fit tightly around a set of scene locations and restricts the minimum number of images per region, the total number of regions generated by the R-tree is smaller than that of the Quadtree, which reduces the number of classes for training. Consequently, training ISL on the R-tree regions takes less time than training ICL on the Quadtree regions. Regarding the hierarchical schemes, ISL−PHC took the longest training time since it contains the largest number of classifiers.

Evaluation Metrics: To evaluate the accuracy of our localization approaches, we calculate the localization error distance (LED), which is the Euclidean distance between the center of the predicted region and the center of the original scene location of I_q. We also calculate the LED from the image camera location (LED-Camera) to evaluate ICL. In addition, we calculate the F1 score, the harmonic mean of recall and precision, for classification accuracy. The precision is the percentage of the reported regions which actually contain I^S_q, while the recall is the percentage of I^S_q which are located in the reported regions.

We implemented our framework using Python and Java 1.7. All experiments were performed on a GPU-enabled Oracle Cloud Computing infrastructure (an instance running Ubuntu 14.04 LTS and equipped with 56 CPUs (2.2 GHz Intel Core), 187 GB of memory, and 2 GPUs (1.4805 GHz Tesla)).

Table 3.1: Datasets
Dataset  | Area (km^2) | # of images | Avg. # images / km^2
GSV_MAN  | 13.7 x 7.3  | 17,825      | 178
GSV_PIT  | 1.5 x 3.6   | 4,825       | 893
GSV_ORL  | 2.1 x 1.2   | 3,204       | 1,271

Figure 3.9: Heat Map of Dataset Distribution: (a) Manhattan, (b) Pittsburgh Downtown, (c) Orlando Downtown

Table 3.2: Sizes (km^2) of Regions Generated from the Spatial Organization of the Datasets
Dataset  | Size of R-tree Leaf Nodes | Size of Quadtree Cells
GSV_MAN  | Max: 1.41, Avg: 0.35      | Max: 3.11, Avg: 1.70
GSV_PIT  | Max: 0.40, Avg: 0.13      | Max: 0.80, Avg: 0.21
GSV_ORL  | Max: 0.33, Avg: 0.10      | Max: 0.90, Avg: 0.18

Figure 3.10: Training Time for Localization Approaches

3.4.2 Experimental Results

The Impact of Scene Location: Figure 3.11a shows the F1 scores of both ICL and ISL across the GSV_MAN, GSV_ORL, and GSV_PIT datasets.

Figure 3.11: The Efficiency of Image Localization Approaches: (a) Classification Accuracy using ICL vs. ISL, (b) Localization Error Distance using ICL vs. ISL, (c) ISL w/ Hierarchical Classification, (d) Hierarchical ISL w/ Various Prediction Strategies

In general, ISL obtained a better F1 score than ICL, as it minimizes the inaccuracies associated with ICL (as discussed earlier; see Figure 3.5). From the perspective of localization error distance, as shown in Figure 3.11b, ISL obtained a smaller error distance than ICL under both metrics, LED and LED-Camera. This shows the benefit of utilizing scene locations for spatially organizing the dataset, which enables generating proper classes of images and hence learning the visual features of regions more effectively. The error distance for GSV_MAN was the largest for both ICL and ISL due to the uneven distribution of its images. Meanwhile, the even and dense distribution of images in GSV_ORL decreased the error distance using ISL and thus achieved the best localization accuracy.
On the other hand, ICL using LED-Camera obtained better results compared to ICL using LED since the ICL was organized and trained using the camera location which implies that the predicted region can be potentially closer to the camera location than the scene location. 37 The Impact of The Hierarchical Classification Scheme Figure 3.11c shows the evaluation of the two types of hierarchical classification schemes (i.e., PHC and LHC) with ISL in terms of LED using BP. In general, both ISL−PHC and ISL−LHC obtained smaller values for error distance compared to the non-hierarchical approach (i.e.,ISL) since the hierarchical schemes make the predic- tion on a set of trained models that learned both local and global features of regions at different granularities rather than using only one trained model to discriminate the fine- grained regions visually. Moreover, the impact of hierarchical classification on decreasing theerrordistancewasevidentespeciallyforGSV MAN whichcontainsthelargestnumber of images distributed in a large geographical region; hence generating a spatial hierar- chy of diverse geographical regions in different granularities and attaining better visual learning with hierarchical classification. In particular, the highest decrease percent- age of LED was obtained by GSV MAN reaching 31% and 13% using ISL−LHC and ISL−PHC, respectively. Meanwhile, the lowest error distance decrease was obtained for GSV ORL reaching 11% and 4% using ISL−LHC and ISL−PHC, respectively. Among ISL−PHC and ISL−LHC, ISL−LHC achieved the best accuracy with a LED value less than 300 meters for GSV ORL . The Impact of The Prediction Strategy Figure 3.11d shows the average LED forISL−PHC andISL−LHC with different prediction strategies (GP, HP, and BP) across GSV MAN and GSV ORL . In general, GP with bothISL−PHC andISL−LHC obtained the largest LED due to the effect of the classification error propagation while BP obtained the minimum LED due to the utilization of the probabilistic model based on the spatial relations of regions. With GSV MAN , the improvement obtained by BP compared to GP was the best because of the expensive cost of the classification error propagation in a large geographical region such as Manhattan. 38 Chapter 4 Spatial Aggregation of Visual Features for Image Search In this Chapter, we introduce a new descriptor that balances the trade-off between accuracy and performance in image search by extending the representation of an image with the feature set of multiple similar images located in its vicinity (referred to as Spatially-Aggregated Visual Feature Descriptor (SVD)). SVD potentially preserves the visual features of images in both high and low-dimensional spaces better than conven- tional visual descriptors. Through an empirical evaluation on big datasets, indexing images usingSVD provided a significant improvement in search accuracy comparing to using conventional descriptors while maintaining the same performance. 4.1 Background and Problem Description In what follows, we describe the geo-tagged image model and formally define the spatially aggregated visual descriptor of an image. 4.1.1 Geo-tagged Image Model Our research focuses on geo-tagged images represented by two descriptors: visual and spatial (as defined in Definition 7). In this chapter, without loss of generality, the visual descriptor of an image is the CNN-based features extracted using the R-MAC algorithm [151]. This visual descriptor is referred to as the conventional visual descriptor 39 (i.e.,CVD). 
Meanwhile, the spatial descriptor is the camera location (the GPS coordinates of the camera)¹.

Definition 7 (Geo-tagged Image Dataset). A geo-tagged image dataset D is a set of m images (i.e., D = {I_1, ..., I_m}) where each image I is represented by a pair of spatial (I.s) and visual (I.v) descriptors. I.s is composed of a pair of values: latitude and longitude coordinates. Meanwhile, I.v (also referred to as CVD) is a high-dimensional vector composed of n visual features.

4.1.2 Problem Description
Our main hypothesis is that the accuracy of image search in a big dataset can be improved by representing each image using an enriched visual descriptor that depicts the visual features of multiple similar images within its spatial proximity. Constructing such a descriptor for an image (referred to as the reference image I_R) requires two steps: first, selecting a group of images (referred to as the image group G(I_R), see Definition 8) located in the same neighborhood whose visual contents are similar; then, aggregating the visual features of I_R with those of its associated G(I_R) into a new visual descriptor (referred to as the Spatially Aggregated Visual Feature Descriptor, SVD, see Definition 9).

Selecting the images of G for an I_R requires defining spatial and visual distance functions to assess whether they satisfy the selection criteria (i.e., spatial proximity and visual similarity). Without loss of generality, we use the Euclidean distance² as the metric to calculate the distance between the spatial and visual descriptors of two images, as defined in Equations 4.1 and 4.2, where MaxD_s and MaxD_v denote the maximum distance of any pair of images spatially and visually, respectively, in a normalized form³. Generating an SVD for an I_R also requires defining an aggregation kernel function, which will be discussed in Section 4.2.

¹ Other spatial and visual descriptors are discussed in Chapter 2.
² Other distances (e.g., Cosine or Manhattan) can be used. However, evaluating which distance is best for the spatial and visual descriptors is out of the scope of this thesis.
³ The spatial and visual distances (d_s and d_v) can be calculated without normalization. However, they are normalized intentionally to be used in the hybrid distance function defined in Section 4.2.

Figure 4.1: A Framework for Generating SVD Descriptors for Geo-tagged Images
Figure 4.2: A Sequenced Image
Figure 4.3: A Panorama Image

Definition 8 (Image Group G). Given a reference image I_R ∈ D, an image group G(I_R) is a subset of D (i.e., G(I_R) ⊆ D) such that each image in G(I_R) satisfies two criteria:
• Spatial Proximity: each image in G(I_R) is located in the vicinity of I_R,
• Visual Similarity: the visual content of each image in G(I_R) is similar to that of I_R.

Definition 9 (Spatially Aggregated Visual Feature Descriptor SVD). Given a reference image I_R ∈ D and its associated image group G(I_R) ⊆ D, the spatially aggregated visual descriptor SVD for I_R is defined by the following formula:
SVD(G(I_R)) = ∀ I_j ∈ G(I_R) | I_j ≠ I_R : I_R.v ⊕ I_j.v
where ⊕ is a kernel which aggregates the visual features of different images.

d_s(I_i, I_j) = \frac{\sqrt{(I_i.s_x - I_j.s_x)^2 + (I_i.s_y - I_j.s_y)^2}}{MaxD_s}   (4.1)

d_v(I_i, I_j) = \frac{\sqrt{\sum_{w=1}^{n} (I_i.v_w - I_j.v_w)^2}}{MaxD_v}   (4.2)

4.2 Spatially Aggregated Visual Features
In this section, we propose a pipeline to generate SVD for geo-tagged images (see Figure 4.1); a minimal sketch of the normalized distance functions of Equations 4.1 and 4.2 is given below.
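The sketch below is a direct, illustrative implementation of the two normalized distances; the normalization constants MaxD_s and MaxD_v are assumed to be precomputed over the dataset, and the function names are not from the thesis.

    import math

    # Sketch of the normalized spatial (Eq. 4.1) and visual (Eq. 4.2) distances.
    def d_s(img_i, img_j, max_d_s):
        """img["s"] = (x, y) camera location; max_d_s = largest pairwise spatial distance in D."""
        dx = img_i["s"][0] - img_j["s"][0]
        dy = img_i["s"][1] - img_j["s"][1]
        return math.sqrt(dx * dx + dy * dy) / max_d_s

    def d_v(img_i, img_j, max_d_v):
        """img["v"] = n-dimensional visual descriptor; max_d_v = largest pairwise visual distance in D."""
        squared = sum((a - b) ** 2 for a, b in zip(img_i["v"], img_j["v"]))
        return math.sqrt(squared) / max_d_v

Both functions return values in [0, 1], which is what allows them to be combined later in the hybrid distance function.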
4.2.1 Spatial-Visual Image Group Selection
Selecting a G(I_R) using only the spatial properties of images may result in a set of nearby images which are not necessarily visually similar to I_R. Conversely, selecting G(I_R) using only visual properties retrieves similar images, but they may be located in different geographical locations. Hence, a proper group selection should consider both types of image properties. Consequently, such group selection implies the need for a hybrid query (referred to as a spatial-visual query) which consists of both filtering parts: spatial and visual. Spatial-visual queries for group selection can be performed using two search mechanisms: range search (i.e., finding the images which are within a specific visual/spatial distance threshold from I_R) and kNN search (i.e., finding the top-k similar/closest images to I_R). Therefore, each part of the spatial-visual query can be based on either a range search or a kNN search. Using a range search for both parts of the spatial-visual query may lead to selecting a G(I_R) that consists of too many or too few images: a group composed of few images may not properly cover the visual features of a scene, while a group of many images may include a large number of visual features, which increases their processing time.

Three strategies of spatial-visual queries for group selection are investigated as follows. The first strategy is to use a composition query: it first prunes out all images which are visually dissimilar from I_R and thereafter sorts the remaining images spatially with respect to the location of I_R to select the top-k closest images. This strategy is referred to as the Visual Range Then Spatial kNN Query (S_ψ). The range search of the S_ψ query might return images that are not necessarily located in I_R's vicinity; hence, the final result might contain some images that do not satisfy the spatial proximity criterion. Such a case may happen when I_R is a generic image (i.e., an image capturing a scene in which the main visible objects can be located anywhere, such as a tree or the sky). The second strategy is to use a similar composition query that first prunes out the images which are too far away from the location of I_R and then selects the top-k images which are visually similar to I_R. This strategy is referred to as the Spatial Range Then Visual kNN Query (S_ξ). The images selected using S_ξ potentially satisfy both the spatial and visual selection criteria because images located in the same spatial neighborhood probably capture similar views. Nonetheless, the results of both the S_ψ and S_ξ strategies are influenced by the choice of the spatial or visual range thresholds (i.e., σ_s and σ_v), which might lead to selecting images that compromise the selection criterion corresponding to the range search (i.e., the visual similarity criterion in the case of S_ψ and the spatial proximity criterion in the case of S_ξ).

Alternatively, the strategy of the spatial-visual query can be to devise a spatial-visual kNN search (referred to as S_Ω). Performing a spatial-visual kNN search requires designing a hybrid distance function that is capable of measuring the distance between two images using both their spatial and visual properties. However, defining an optimal hybrid distance function is not feasible; hence, a heuristic function needs to be used, and the validity of the selected images with respect to the selection criteria depends on the quality of the chosen heuristic function. A short sketch of the two composition strategies, S_ψ and S_ξ, is given below before the heuristic is defined.
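As a concrete illustration (not the thesis implementation; the brute-force scan and the function and parameter names are assumptions), the two composition strategies can be written as filter-then-sort pipelines, where d_s and d_v are assumed to be two-argument wrappers around the normalized distances sketched earlier.

    # Sketch of the two composition strategies for selecting G(I_R).
    def select_group_visual_range_then_spatial_knn(dataset, ref, sigma_v, k, d_s, d_v):
        """S_psi: visual range filter, then keep the k spatially closest images."""
        candidates = [img for img in dataset
                      if img is not ref and d_v(ref, img) <= sigma_v]
        candidates.sort(key=lambda img: d_s(ref, img))
        return candidates[:k]

    def select_group_spatial_range_then_visual_knn(dataset, ref, sigma_s, k, d_s, d_v):
        """S_xi: spatial range filter, then keep the k most visually similar images."""
        candidates = [img for img in dataset
                      if img is not ref and d_s(ref, img) <= sigma_s]
        candidates.sort(key=lambda img: d_v(ref, img))
        return candidates[:k]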
Our proposed heuristic function (see Equations 4.3-4.6) is based on an exponential decay function which either decreases the visual similarity of two images based on their spatial distance (S_Ωs), or vice versa, i.e., decreases the spatial closeness of two images based on their visual distance (S_Ωv). The hybrid distance between two images is equal to their visual distance if they are located at the same location, and it gradually decreases (or increases) as one is located farther from the other. Conversely, it is equal to their spatial distance if they have the same visual content, and it gradually decreases (or increases) as one's visual content differs from the other's. Both the S_ψ and S_ξ strategies use the spatial and visual distance functions defined in Equations 4.1 and 4.2, respectively. Meanwhile, the S_Ω strategy uses the hybrid distance defined in Equation 4.6.

sim_v(I_R, I_k) = 1 - d_v(I_R, I_k)   (4.3)

sim_s(I_R, I_k) = 1 - d_s(I_R, I_k)   (4.4)

sim_h(I_R, I_k) = \begin{cases} sim_s(I_R, I_k)\, e^{-\gamma\, d_v(I_R, I_k)} & \text{if } S_\Omega = S_{\Omega v} \\ sim_v(I_R, I_k)\, e^{-\gamma\, d_s(I_R, I_k)} & \text{if } S_\Omega = S_{\Omega s} \end{cases}   (4.5)

d_h(I_R, I_k) = 1 - sim_h(I_R, I_k)   (4.6)

Towards evaluating the quality of the images selected by the above selection methods, an assessment function (referred to as λ, see Equation 4.7) is used, which measures how far the images of G(I_R) are from I_R both spatially and visually. This assessment function is analogous to the variance function, considering I_R as the mean value.

\lambda = \frac{1}{|D|} \sum_{I_R \in D} \sum_{x \in G(I_R)} \left( \lVert x.s - I_R.s \rVert^2 + \lVert x.v - I_R.v \rVert^2 \right)   (4.7)

4.2.2 Visual Feature Aggregation Methods
Aggregating the visual features of G(I_R) (i.e., ∀ I_j ∈ G(I_R)) into the new visual descriptor (i.e., SVD) of I_R can be achieved using several methods, as described in what follows.

Sequence-based Aggregation Method C. One straightforward method (referred to as C) is to first generate a composite image comprising the sequence of G(I_R) images aligned either vertically or horizontally (see Figure 4.2) and then extract the visual features of the generated image. This method does not require advanced image processing techniques. Moreover, it does not alter or miss any visual feature of the G(I_R) images. C is analogous to the UNION ALL operation in SQL since it retains all visual features (including the duplicate ones) of G(I_R).

Panorama-based Aggregation Method P. Another method is to use an image stitching algorithm to generate a panorama image (see Figure 4.3) from G(I_R) and then extract the visual features of the panorama image. Panorama generation produces an image that displays an overall view synthesized from multiple-view images. Thus, it provides a wide view of the area captured by a G(I_R), exploiting the overlapping fields of view of the images in G(I_R). Panorama generation has been investigated thoroughly in computer vision research [44, 10, 45, 131, 94, 172, 90, 157] and can be performed by many commercial and open-source tools (e.g., AutoStitch [45]). This method requires a proper visual overlap between adjacent input images to successfully generate a panorama image; however, such an overlap may not exist in G(I_R). Moreover, panorama generation requires significant computing power.

Average-based Aggregation Method A. This method does not require generating an intermediate image but simply aggregates the visual features of the images of a G(I_R) mathematically.
In particular, it calculates the average visual descriptor of the ones corresponding toG(I R ) images. Towards evaluating the quality of theSVD descriptor generated from each aggre- gation method, an assessment function (referred to as δ, see Equation 4.8) is used. In particular, the assessment function measures how far is anSVD from its corresponding CVD(I R ). 45 δ = 1 |D| X I R ∈D ||CVD(I R )−SVD F (G(I R ))|| 2 whereF =C∨P∨A (4.8) 4.3 Image Search withSVD-based Indexing Approach Examples of well-known index structures for image search include locality sensitive hashing (LSH) index [77], and hierarchical clustering tree (HCT) [71]. In what follows, we describe the use ofSVD in these structures. Locality Sensitive Hashing (LSH). The key notion of LSH is to use a set of hash functions which maps similar images into the same bucket with higher probabilities than dissimilar images. To search for a query image I Q , LSH maps I Q to its corresponding bucket to linearly scan all stored images in this bucket to filter out dissimilar ones. Based on the mapping of hash functions, some similar images can be stored in different buckets; however, I Q is mapped to only one of these several buckets; hence, the accuracy of the image search degrades. Therefore, by usingSVD as a visual descriptor when mapping the images to LSH buckets, LSH potentially maps all images of anG(I R ) to the same bucket; hence, I Q is subsequently mapped to the bucket that contains all of its similar images. Hierarchical Clustering Tree (HCT). HCT 4 is a tree structure which is built using nestedclusteringforvisualdescriptors. Initially,theimagesarepartitionedintocclusters such that each cluster comprises the images closest to the corresponding cluster center. Thereafter, each cluster is recursively partitioned into c sub-clusters. As a result, each internal node (including the root node) in the tree includes a set of cluster centers where each cluster center points to a certain child node. Meanwhile, each leaf node stores a subset of images that are close to its corresponding cluster center. When searching for 4 HCT is a variant of Vocabulary tree [122]. However, Vocabulary tree is built using local visual features, while HCT uses global visual features. 46 I Q , the image is propagated down the tree by comparing I Q to the c candidate cluster centers associated with an internal node and choosing the closest one until reaching a leaf node. Thereafter, the stored images in the corresponding leaf node are linearly scanned to filter out dissimilar images toI Q . In this chapter, we adapt HCT by applying clustering onSVD descriptors instead ofCVD to obtain a better organization of images while the stored images in the leaf nodes are still represented by theirCVD descriptors to perform the search using their conventional descriptors. SVD-basedvs.CVD-basedIndexingApproaches. BothLSHandHCTusedifferent techniques of dimension reduction (i.e., hashing in LSH and clustering in HCT) when organizing images to obtain a better search performance while sacrificing the accuracy. To prove thatSVD minimizes the impact of dimension reduction techniques, we design twoapproachesforeachindexstructure:SVD-basedapproachandCVD-basedapproach. In particular, usingSVD-based LSH, images are partitioned into buckets by hashing the images based on theirSVD descriptors, while usingCVD-based LSH, images are hashed based on theirCVD descriptors. 
Similarly,SVD-based HCT organizes images based on clustering for theirSVD descriptors, whileCVD-based HCT organizes images based on clustering for their CVD descriptors. Finally, the search results using the indexing approaches are compared to that of a linear search approach (ground truth) which basically filters out dissimilar images using theirCVD descriptors. 4.4 Evaluation Table 4.1: Geo-tagged Image Datasets Dataset # of images Area of Spatial Region Spatial Density (Width× Height km 2 ) (Avg. # images per 1 km 2 ) OR 3, 204 2.1×1.2 1,271 PT 4, 825 1.5×3.6 893 MA 17, 825 13.7×7.3 178 47 (a) using LSH whenθ = 1.25, 0.85, and 0.82 for OR, PT, and MA, respectively (b) for theMA dataset using Various Index Structures when θ = 0.82 (c) for the MA dataset using LSH and varying θ Figure4.4: EvaluationoftheEffectivenessofSVD-basedIndexingApproachforSimilaritySearch 4.4.1 Experimental Setup Datasets. To evaluate the effectiveness of our proposed visual descriptor, we conducted several experiments using three real large-scale datasets obtained from Google Street View API [167] for three particular geographical regions: a) downtown Orlando, Florida (referred to as OR); b) downtown Pittsburgh, Pennsylvania (referred to as PT); and c) part of Manhattan, New York City (referred to as MA). Each image is tagged with a geo-location (i.e.,I.s) represented by latitude and longitude coordinates. RegardingI.v, i.e.,CVD of each image, we extracted the R-MAC descriptor which is a high dimensional vector composed of 512 dimensions. For extracting R-MAC descriptors, we used the implementation available at https://github.com/noagarcia/keras_rmac. Table 4.1 shows a summary of the three datasets in terms of the number of images and the area of the geographical region covered by each dataset. 48 Settings forG Selection Methods. The four selection methods (S ψ ,S ξ ,S Ωv , and S Ωs ) were evaluated to select (Gs of three sizes (i.e., k = {5, 10, 15}). ForS ψ , the value of visual range (σ v ) was chosen by analyzing the visual distribution of each dataset individually. In particular, the selected value ofσ v enables retrieving on average top 10% similar images for eachI R , assuming that the remaining images are potentially dissimilar toI R . As a result, the chosen values ofσ v were 1.25, 0.85, and 0.82 for theOR,PT, and MA datasets, respectively. ForS ξ , the value of the spatial range was 150m which is the radius that constructs a rectangular region of a 200×200 m 2 area approximately. This range can include multiple buildings with their neighborhood 5 ; hence, using this spatial range is sufficient to select aG(I R ). ForS Ωv andS Ωs , the exponential decay exponent constant (i.e., γ) was fixed at 0.7. Settings for Aggregation Methods. To perform theC aggregation method, we combined the images horizontally to construct a sequenced image for eachG(I R ). On the other hand, for theP aggregation method, we used the stitching algorithm available in the OpenCV library to construct panorama images. After generating sequenced and panorama images, their R-MAC descriptors were extracted which representSVD for G(I R ) using theC andP methods. Lastly, for theA method, the individual R-MAC descriptors of the images formingG(I R ) were averaged to construct a new descriptor representingSVD. As a result for eachG(I R ), we extractedSVD C ,SVD P , andSVD A which are the descriptors aggregated using theC,P, andA methods, respectively. Settings for Similarity Search Evaluation. 
For HCT, k-Means clustering algorithm has been used for clustering the visual descriptors. We fixed the number of clusters at 15, while the clustering is performed recursively until reaching a capacity of less than 300 images. For LSH, we adapted the source code implementation available at https://github.com/JorenSix/TarsosLSH, and we used the following settings: 5 hash tables, 3 hash functions, and the Euclidean distance hash family. Note that varying the 5 Based on the government statistics (https://www.eia.gov/todayinenergy/detail.php?id=21152), until 2012, on average the size of a US home is around 1,800 m 2 . 49 settings of HCT and LSH is not the focus of this chapter; however, we briefly explored it by varying some parameters and found that the observations reported in Subsection 4.4.2 still hold. To perform the evaluation, we executed a set of visual range queries where each query was composed of a query image I Q and a visual query range threshold θ. To perform visual range queries, a sample of images were selected randomly based on a statistical analysis [95] of the size of each dataset where the size of selected samples suffices a confidence score of 95% and a margin error score of 2.5%. As a result, the number of sample queries was 1,106, 1,221, and 1,436 for theOR,PT, andMA datasets, respectively. Moreover, the values ofθ were chosen in a similar way to the mechanism of choosing the values of σ v . To report the evaluation results of the similarity search using LSH and HCT, the results were compared to that of a linear search (i.e., ground truth). The evaluation results are reported using recall scores 6 . 4.4.2 Evaluation Results We will first discuss the impact of using theSVD descriptor in index-based search, and later on discuss the evaluation of generatingSVD using different methods. Evaluation of Similarity Search UsingSVD-based Indexes. Figure 4.4a shows the evaluation of similarity search on different datasets based on the LSH index that uses eitherCVD orSVD descriptors. In general,SVD-based LSH obtained a better recall score compared toCVD-based LSH across all datasets. This observation verifies our hypothesis that using theSVD descriptor ensures indexing similar images together even with the use of dimensionality reduction techniques in index structures. Moreover, among theSVD descriptors generated from the three aggregation methods (i.e.,C,P, andA),SVD obtained fromA enables LSH to produce the best image organization; hence,SVD A -based LSH obtained the best recall scores. In particular, with respect to the recall ofCVD-based LSH, the recall usingSVD A -based LSH improved by 64%, 6 SinceCVDs are used for similarity search using all approaches (i.e., a linear search and the two approaches of an index-based search), the precision of the index-based search is 1 across all datasets. Hence, we only report the recall. 50 (a) δ When k = 5 (b) δ When k = 10 (c) δ When k = 15 (d) Avg. Processing Time Figure 4.5: Evaluation of the Aggregation Methods (C,P, andA): (a)-(c) for all Datasets, while (d) for the OR Dataset (a) λ When k = 5 (b) λ When k = 10 (c) λ When k = 15 (d) Avg. Processing Time (k = 5) Figure 4.6: Evaluation of the SelectedG(I R ) using the Selection Methods (S ψ ,S ξ ,S Ωs , andS Ωv ) for all Datasets 51 46%, and 35% for the OR, PT, and MA datasets, respectively. TheA method enables generating anSVD that represents a centroid for all images ofG(I R ). 
Therefore, the generatedSVD(G(I R )) has the property of being similar toI R as well as to the remaining images inG(I R ); hence,SVD A enables a better organization ofG(I R ) images in index structures. Meanwhile,SVD P -based index obtained the least recall scores; nonetheless, the results ofSVD P -based index are almost similar to that ofCVD-based index. This observation can be explained by two facts: a) panorama images can include artificial features produced when stitching similar images geometrically; hence, such features may cause a visual difference between the panorama image and the original ones, and b) panorama generation may fail for some image groups; thus,CVD descriptors were used for such cases instead ofSVD. Moreover, the recall improvement achieved bySVD-based LSH was the highest for theOR dataset, while being the lowest for theMA dataset. This is due to the differences in the spatial density of each dataset (see Table 4.1), particularly the OR dataset is the densest, whileMAistheleastdenseone. Whenthespatialdensityofadatasetincreases, the dataset potentially includes more similar images which are located in the same vicin- ity. In such a dataset, the quality of the selected images for eachG(I R ) increases; hence, betterSVD descriptors are generated. Furthermore, as shown in Figure 4.4b, when changing from LSH to HCT index structures, all of the previous observations still hold. Figure 4.4c shows the evaluation of a similarity search using LSH for variousθ values. Ingeneral, therecallofasearchresultdecreaseswhenincreasingθ; nonetheless, therecall ofSVD-based LSH is better thanCVD-based LSH. When searching using a large θ, the search result should include less-similar images which the index structure potentially fails to retrieve. Given thatSVD-based indexing approach includes changes only to the construction mechanism and there is no change to the search mechanism, the search performance using anSVD-based index orCVD-based index is almost similar. 52 Evaluation of Aggregation Methods. Figure 4.5 shows the evaluation of the pro- posed aggregation methods (i.e.,C,P, andA) for generatingSVDs from their corre- spondingGs obtained using theS ξ selection method 7 by calculating δ (i.e., the variance of anSVD(G(I R )) from its correspondingCVD(I R )). In general, theA method gener- atedSVD descriptors with the least δ values across all datasets. This is due to the fact that theA method generates a descriptor that represents a centroid for the images of G(I R ); thus ensures the similarity betweenSVD andG(I R ) images. Moreover, theSVD descriptors generated by bothP andC obtained the largest δ values because these two aggregation methods depend on generating synthesized images which may include arti- ficial features that decrease the similarity betweenSVD (extracted from the synthesized image) andCVD(I R ). Furthermore, the quality of the generatedSVDs is affected by the spatial density of the dataset. In a dense dataset (e.g., OR), each small geographi- cal region contains more images which are potentially similar; thus the similarity of the selected images in eachG(I R ) is high and consequently the generatedSVD(G(I R )) is potentially more similar to its correspondingCVD(I R )). Therefore, δ values for OR are the least (sinceOR is the densest dataset) whileδ values forMA are the largest because MA is the least dense dataset. Moreover, the aggregation methods are evaluated using their execution time using theOR dataset 8 . 
As shown in Figure 4.5d, theA method obtained the best performance since it does not require any form of image processing but only a simple mathematical operation of the individual descriptors extracted from the images constituting aG(I R ). Meanwhile, both of theC andP methods obtained a worse performance because of the overhead of the image processing techniques used to generate synthesized images. However, P had the highest execution time since it comprises executing a stitching algorithmforpanoramagenerationwhichisacomputing-intensiveprocess. Furthermore, 7 Note that all experimental results reported in Figures 4.4-4.5 are based onGs selected by theS ξ selection method because it enabled selectingGs with the least λ values as will be discussed later. 8 The same observations hold for other datasets. 53 when increasing k, the execution time forC andP also increases noticeably because of the increase of image processing overhead. Evaluation ofG Selection Methods. Figure 4.6 shows the evaluation of the proposed selection methods (i.e.,S ψ ,S ξ ,S Ωs ,S Ωv ) by calculatingλ (i.e., the variance of images in eachG(I R )) for the selectedGs. In general, theS ξ method selected image groups with the least λ values across all datasets and using various sizes ofGs. This indicates the importance of using the spatial range for pruning the search space when selectingGs. With different cases of I R (i.e., whether being an image depicting a common or unique scene), the spatial range enables filtering out the images which are far and then selecting similar images in the neighborhood of I R . On the other hand, the two variants ofS Ω obtained the worst value of λ in all cases. However, the variance ofGs selected byS Ωs andS Ωv is influenced by the characteristics of the dataset and the size ofG because both of them use heuristic functions for performing the spatial-visual kNN query. Such heuristic functions may work differently for different image groups. Moreover,S Ωs is better thanS Ωv in most cases, which implies two observations: a) the images located in the same location are potentially similar, and b) the spatial proximity criteria should be assigned a higher weight than the visual similarity criteria to select an image group, especially for outdoor images depicting street views. In this chapter, the selection methods are implemented straightforwardly by scanning linearly the dataset to calculate the distances (spatial, visual, or hybrid distances) from I R 9 . With such an implementation,S ψ andS ξ require asymptoticallyO(m +m 0 logm 0 ) time where m denotes the number of images in a dataset to perform the first part of the hybrid query (i.e., the execution time of the range search) while m 0 refers to the number of the selected images from the first part. Meanwhile, both variants of Ω require O(m +m logm). Fig 4.6d shows the execution time of the selection methods 10 using all datasets when k = 5. In general, both variants of theS Ω method are slower than 9 The execution of the selection methods can be expedited by using efficient index structures. 10 The same observations hold for other k values 54 S ψ andS ξ , and this observation complies with the asymptotic analysis. Moreover,S ψ obtained a lower performance than that ofS ξ across all datasets because the performance of these two hybrid queries is mainly affected by the performance of the first part of the query (i.e., spatial range and visual range). 
Given that executing a visual range query, which includes the distance calculation between two high-dimensional visual descriptors, is more expensive than executing a spatial range query, S_ψ takes longer than S_ξ.

Chapter 5
Hybrid Indexes for Spatial-Visual Search

In this chapter, we propose a class of index structures for spatial-visual search. First, we introduce a set of preliminary definitions, the problem statement, and background on the state-of-the-art techniques for spatial and visual indexing. Following that, we introduce a number of new index structures for geo-tagged images.

5.1 Preliminaries and Problem Description

Definition 10 (Geo-tagged Image Dataset). A set of n geo-tagged images D = {I_0, I_1, ..., I_{n-1}} stored on disk, where each image I is represented by a pair of spatial (I.s) and visual (I.v) descriptors.

In this chapter, for the spatial descriptor (i.e., I.s) of an image, we use the camera location descriptor for simplicity. Meanwhile, for the visual descriptor (i.e., I.v), we extract a rich vector of CNN features: a CNN is utilized to extract an image feature vector consisting of 4096 dimensions. Due to the high dimensionality, dimension reduction techniques (e.g., principal component analysis (PCA)) can be applied to generate a compact representation of each vector. It was shown experimentally [35] that CNN vectors can be considerably reduced in dimension without significantly degrading retrieval quality, so we represent I.v as a 150-d vector derived from its 4096-d CNN vector. A geo-tagged image dataset is formally defined in Definition 10.

Definition 11 (Spatial-Visual Query). Q = (Q.s, Q.v), where Q.s is the spatial part of the query (e.g., spatial range or nearest neighbor) and Q.v is its visual part (e.g., top-K similar images or similarity search within a threshold σ, given a query image I_Q).

In this chapter, we propose a set of disk-resident index structures which can be utilized for any type of spatial-visual query (see Definition 11); for a focused discussion, however, we concentrate on the spatial-visual range query Q_range. With Q_range, Q.s is represented by a 2D axis-aligned range, while Q.v is represented by I_Q.v ∈ R^d and the visual similarity distance Φ between I_k ∈ D and I_Q, where Φ(I_k.v, I_Q.v) ≤ σ (σ is referred to as the visual range). In this chapter, Φ is defined using the Euclidean distance for simplicity as follows:

\Phi(I_k.v, I_Q.v) = \sqrt{\sum_{j=1}^{d} (I_k.v_j - I_Q.v_j)^2}

I_k is considered an output of Q_range if and only if I_k satisfies both Q.s and Q.v. Nonetheless, due to the inaccuracies associated with spatial-visual search, we categorize the results of Q_range (referred to as RS(Q_range)) into the following three classes (see Definitions 12-14).

Definition 12 (Spatially-Visually Matched and Relevant Image (SV-Match-Rel)). For a given Q_range, ∃ I_k ∈ D | I_k.s ∈ Q.s and Φ(I_k.v, I_Q.v) ≤ σ and I_k ∈ RS(Q_range).

Definition 13 (Spatially Unmatched but Relevant Image (S-UNMatch-Rel)). For a given Q_range, ∃ I_k ∈ D | I_k.s ∉ Q.s and Φ(I_k.v, I_Q.v) ≤ σ and I_k ∉ RS(Q_range). S-UNMatch-Rel is due to spatial inaccuracy, where the image location is outside Q.s but the image still captures a scene inside Q.s.

Definition 14 (Visually Unmatched but Relevant Image (V-UNMatch-Rel)). For a given Q_range, ∃ I_k ∈ D | I_k.s ∈ Q.s and Φ(I_k.v, I_Q.v) ≤ σ and I_k ∉ RS(Q_range). V-UNMatch-Rel is due to dimension reduction by the image retrieval mechanism. A small illustration of the Q_range predicate and this categorization is given below.
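The following is an illustration only (the brute-force evaluation and all names are assumptions, not the disk-resident implementations discussed next) of the Q_range predicate and the three result classes.

    import math

    def phi(v_k, v_q):
        """Euclidean distance between two visual descriptors (the Phi of Section 5.1)."""
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(v_k, v_q)))

    def in_spatial_range(s, q_s):
        """q_s is a 2D axis-aligned range: (min_x, min_y, max_x, max_y)."""
        return q_s[0] <= s[0] <= q_s[2] and q_s[1] <= s[1] <= q_s[3]

    def classify_result(img, q_s, v_q, sigma, returned_ids):
        """Categorize one relevant image w.r.t. Definitions 12-14 (scene coverage not modeled)."""
        if phi(img["v"], v_q) > sigma:
            return None                                   # not visually relevant
        inside = in_spatial_range(img["s"], q_s)
        if inside and img["id"] in returned_ids:
            return "SV-Match-Rel"
        if not inside and img["id"] not in returned_ids:
            return "S-UNMatch-Rel"
        if inside and img["id"] not in returned_ids:
            return "V-UNMatch-Rel"
        return "other"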
5.2 Indexing Approaches
In this section, we first discuss a set of baselines and then our hybrid indexes in terms of their structures and query execution. We also present the space and query-time cost of each structure. The notations used in the cost model are listed in Table 5.1. Eq. 5.1 represents the space cost model, which is the storage sum of both the index entities and the data, while Eq. 5.2 represents the query time model in terms of I/O operations, which is the time sum for loading the index entities and their corresponding data. The components of both models are summarized in Tables 5.2 and 5.3.

SpaceCost = S_R + S_LSH + S_Data   (5.1)

QueryTimeCost = T_disk \cdot (T_R + T_LSH + T_Data)   (5.2)

Table 5.1: Notation Table
M: the number of leaf nodes in a primary R*-tree
M̄: the average number of leaf nodes in a secondary R*-tree
B: the number of buckets in a primary LSH
B̄: the average number of buckets in a secondary LSH
C(b): the size of bucket b in LSH
m_b: a leaf node m in a secondary R*-tree attached to a bucket b of the primary LSH
b_m: a bucket b in a secondary LSH attached to a leaf node m of the primary R*-tree
P_s(x): the size of the spatial data referenced by entity x (i.e., leaf node m or bucket b)
P_v(x): the size of the visual data referenced by entity x (i.e., leaf node m or bucket b)
P_disk: the size of a disk page
T_disk: the time cost of one disk access

Table 5.2: Space Cost of Various Index Structures
Index       | S_R                     | S_LSH                                                | S_Data
DI          | O(M)                    | O(\sum_{b=1}^{B} C(b)/P_disk)                        | O(\sum_{m=1}^{M} P_s(m) + \sum_{b=1}^{B} P_v(b)/P_disk)
Aug R*-tree | O(M)                    | -                                                    | O(\sum_{m=1}^{M} P_s(m) + \sum_{m=1}^{M} P_v(m)/P_disk)
Aug LSH     | -                       | O(\sum_{b=1}^{B} C(b)/P_disk)                        | O(\sum_{b=1}^{B} P_s(b) + \sum_{b=1}^{B} P_v(b)/P_disk)
Aug SFI     | O(M)                    | O(\sum_{m=1}^{M} \sum_{b=1}^{B̄} C(b_m)/P_disk)      | O(\sum_{m=1}^{M} P_s(m) + \sum_{m=1}^{M} \sum_{b=1}^{B̄} P_v(b_m)/P_disk)
Aug VFI     | O(\sum_{b=1}^{B} M̄_b)   | O(1)                                                 | O(\sum_{b=1}^{B} \sum_{m=1}^{M̄} P_s(m_b) + \sum_{b=1}^{B} P_v(b)/P_disk)

Table 5.3: Query I/O Cost of Various Index Structures
Index       | T_R                  | T_LSH                                           | T_Data
DI          | O(M)                 | O(C(b)/P_disk), b ∈ [1, B]                      | O(\sum_{m=1}^{M} P_s(m) + P_v(b)/P_disk), b ∈ [1, B]
Aug R*-tree | O(M)                 | -                                               | O(\sum_{m=1}^{M} P_s(m) + \sum_{m=1}^{M} P_v(m)/P_disk)
Aug LSH     | -                    | O(C(b)/P_disk), b ∈ [1, B]                      | O(\sum_{b=1}^{B} P_s(b) + \sum_{b=1}^{B} P_v(b)/P_disk)
Aug SFI     | O(log M)             | O(\sum_{m=1}^{M} C(b_m)/P_disk), b ∈ [1, B̄]    | O(\sum_{m=1}^{M} P_s(b_m) + \sum_{m=1}^{M} P_v(b_m)/P_disk), b ∈ [1, B̄]
Aug VFI     | O(M̄_b), b ∈ [1, B]   | O(1)                                            | O(\sum_{m=1}^{M̄} P_s(m_b) + \sum_{m=1}^{M̄} P_v(b_m)/P_disk), b ∈ [1, B]

5.2.1 Baseline Index Structures
We consider three baselines. The first baseline, named Double Index (DI) (Figure 5.1), is composed of two separate structures such that, for each I ∈ D, I.s is indexed by an R*-tree and I.v is indexed by an LSH, independently. With the other two baselines, the dataset is organized either spatially or visually by one index structure which contains pointers to both I.s and I.v, termed the Augmented R*-tree (Aug R*-tree) and the Augmented LSH (Aug LSH), respectively.
Query: To execute Q_range, the R*-tree is queried using Q.s and the LSH is queried using Q.v. With DI, an intersection filter is executed on the intermediate results retrieved from both structures. Meanwhile, with Aug R*-tree and Aug LSH, the intermediate results are inspected sequentially to discard every image that does not satisfy Q.v or Q.s, respectively.
Result Accuracy: All the baseline structures can retrieve SV-Match-Rel images but fail to retrieve S-UNMatch-Rel ones. S-UNMatch-Rel images are missed because the spatial filter is treated strictly. LSH causes retrieval inaccuracy due to dimension reduction; hence both DI and Aug LSH miss V-UNMatch-Rel images.
Meanwhile, Aug R*-tree can retrieve V-UNMatch-Rel images because it stores I.v and avoids the LSH inaccuracy. 59 Figure 5.1: Double Index Structure (R*-tree (left), LSH (right)) Index Performance: The main disadvantage of DI is that the intersection filter can be expensive if the size of the intermediate results is large. The extreme case is when the intermediate results do not overlap at all, in which case each index is used to retrieve a set of “useless” intermediate results from the disk. With Aug R*-tree and Aug LSH, the query selectivity affects the performance. In particular, with Aug R*-tree if the spatial selectivity of Q range is low, then the performance of Q range deteriorates because it will retrieve a large number of images that satisfy Q.s but may be discarded later in the visual filtering step. The opposite is true with Aug LSH when the visual selectivity of Q range is low. 5.2.2 Hybrid Index Structures There are two variations of this approach. With the first variation, first, an R*-tree is built on all MBRs covering I.s of all images. Next, all the images in each leaf of R*-tree are indexed by LSH based on their I.v. Consequently, there is one primary R*-tree and a set of secondary LSHs corresponding to R*-tree leaves (Figure 5.2a). The secondary LSHs are augmented with extra pointers to I.s. This structure is referred to as Augmented Spatial First Index (Aug SFI). The second variation (Figure 5.2b) is the oppositewhereLSHistheprimaryindexinwhichallimagesaredistributedintodifferent buckets based on I.v. Each LSH bucket is associated with an R*-tree to organize the images in the bucket based on their I.s. The secondary R*-trees are augmented with 60 extra pointers toI.v. This structure is referred to as Augmented Visual First Index (Aug VFI). Query: When executing a Q range with Aug SFI, Q.s is executed to identify the overlapping leaf nodes without retrieving their corresponding I.s. The secondary LSHs associated with the candidate leaf nodes are queried using Q.v. Since the secondary LSHs are augmented with I.s, they can directly discard results that do not satisfy Q.s. Similarly, with Aug VFI the primary LSH is queried usingQ.v to identify the secondary R*-trees that potentially contain the similar images. Subsequently, R-trees use the augmented I.v to discard the results that do not satisfy Q.v. (a) Spatial First Index (b) Visual First Index Figure 5.2: Two-level Index Structures Explorative Query: To overcome the inherent inaccuracy problem in spatial-visual search, we introduce a new concept of explorative query which can explore extra results in the secondary structures. With Aug SFI, we generate a set of random descriptor I k .v where Φ (I k .v, I Q .v)≤ σ to query more buckets which may contain potentially similar images toI Q .v. The number of the random descriptor is referred to as visual exploration parameter (i.e.,E.v). With this technique, we minimize the effect of LSHś inaccuracy. Meanwhile, with Aug VFI, we expand the Q.s range by a parameter (referred to as spatial exploration parameterE.s=(expanded area - original query area)/original query 61 area) to minimize the spatial inaccuracy. These explorative query variants are referred to as Aug SFI-E and Aug VFI-E, respectively. Result Accuracy: The results of both Aug SFI and Aug VFI are identical to that of DI. They retrieve SV-Match-Rel images but fail to retrieve S-UNMatch-Rel and V- UNMatch-Rel ones. However, Aug SFI-E can retrieve some of V-UNMatch-Rel while Aug VFI-E can retrieve some of S-UNMatch-Rel. 
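To illustrate the two exploration mechanisms just described, the following is a rough sketch; the uniform sampling inside the σ-ball and the square-root-based range expansion are illustrative choices, not the thesis implementation, and all names are assumptions.

    import math
    import random

    def explore_visually(v_q, sigma, e_v):
        """Aug SFI-E: generate E.v random descriptors within visual distance sigma of the query."""
        probes = []
        for _ in range(e_v):
            direction = [random.gauss(0.0, 1.0) for _ in v_q]    # random direction
            norm = math.sqrt(sum(x * x for x in direction)) or 1.0
            radius = sigma * random.random()                     # stay within the sigma ball
            probes.append([q + radius * x / norm for q, x in zip(v_q, direction)])
        return probes

    def explore_spatially(q_s, e_s):
        """Aug VFI-E: expand the 2D range so that (new area - old area) / old area = E.s."""
        min_x, min_y, max_x, max_y = q_s
        scale = math.sqrt(1.0 + e_s)                 # grow each side by sqrt(1 + E.s)
        cx, cy = (min_x + max_x) / 2.0, (min_y + max_y) / 2.0
        half_w = (max_x - min_x) / 2.0 * scale
        half_h = (max_y - min_y) / 2.0 * scale
        return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)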
The size of V-UNMatch-Rel or S- UNMatch-Rel is controlled by the exploration parameters. Index Performance: Both index structures suffer from the impact of query selec- tivity since they are biased to either spatial or visual dominance of the primary index (e.g., with Aug SFI, the performance deteriorates when the spatial selectivity of Q range is low). Nevertheless, they both can outperform the baseline structures because the pri- maryindexdoesnotretrievethedatabutonlyprunesthesearchspaceandthesecondary index retrieves only the subset of the data pointed by the primary structure. Accuracy of the results of all discussed index structures is summarized in Table 5.4. Table 5.4: Result Accuracy of Various Index Structures Index Structure SV-Match-Rel S-UNMatch-Rel V-UNMatch-Rel Aug R*-tree 3 7 3 Aug SFI-E 3 7 3 Aug VFI-E 3 3 7 DI/Aug LSH/Aug SFI/Aug VFI 3 7 3 5.3 Experiments 5.3.1 Experiment Setup Dataset: We used three real-world geo-tagged image datasets: Flickr referred to as Flickr, Google Street View referred to as GSV, and GeoUGV. Flickr contains 49 million imagesobtainedfrom[149]. GSV contains52kGoogleStreetViewimages. GeoUGV isa public user-generated video dataset with which we processed the video set and extracted a representative frame per second which is tagged with spatial metadata. In total, 62 GeoUGV contains 124k geo-tagged images. To evaluate the scalability of our solutions, we generated a synthetic dataset (SYN) based on Flickr. SYN contains 1 billion images where we randomly selected 20% of the Flickr images and replicated each one 90-110 times by distorting their I.s and I.v while maintaining the spatial proximity and visual similarity to the original image. Regarding the image visual representation, for GSV and GeoUGV, we processed each image using the Caffe [81] framework (with default model) to extract 4096-d CNN descriptors while, for Flickr, we used CNN descriptors providedby[24]. Subsequently, weappliedPCAtoreducedimensionsto150-ddescriptor for each image. The sizes of dataset files (including both extracted visual and spatial descriptors) for GSV, GeoUGV, Flickr, and SYN are 145 MB, 357 MB, 69.6 GB, and 1.4 TB, respectively. Index settings: We implemented our disk-resident index structures using Java 1.7. We used 4KB of page size. The fan-out of Aug R*-tree was 85. Meanwhile, LSH had the settings (T=3,H: Euclidean, W=4, and F=7). All experiments were performed on a 3.6 GHz Intel Core i7 machine with 12 GB memory and 4TB 7200RPM disk drive running on Windows 7. Queries and Metrics: Table 5.5 lists all query parameters with the default values underlined. For our experiments, we needed to construct queries to represent different spatial and visual selectivity factors. Towards this end, we first scan each dataset thor- oughly to classify each image into three spatial groups (dense, uniform, sparse) based on their distributions and then re-scan the dataset to classify each image into another three visual groups (dense, uniform, sparse). Next, we merged these groups to generate new five groups: spatial dense visual dense (SD-VD), spatial dense visual sparse (SD-VS), spatial sparse visual dense (SS-VD), spatial sparse visual sparse (SS-VS), and spatial uniform visual uniform (SU-VU). We refer to these groups as Query Selectivity Groups. 
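The grouping step described above can be approximated with the following sketch. The dissertation does not state the exact density thresholds, so tercile-based bucketing and a brute-force neighbor count are assumptions made purely for illustration.

```python
import numpy as np

def density_labels(counts: np.ndarray) -> np.ndarray:
    """Bucket per-image neighbor counts into sparse / uniform / dense by terciles (assumed split)."""
    lo, hi = np.quantile(counts, [1 / 3, 2 / 3])
    return np.where(counts <= lo, "sparse", np.where(counts <= hi, "uniform", "dense"))

def selectivity_groups(locations, descriptors, spatial_radius, visual_radius):
    """Label each image with a (spatial, visual) density pair, e.g. ('dense', 'sparse') -> SD-VS."""
    locations = np.asarray(locations, dtype=float)
    descriptors = np.asarray(descriptors, dtype=float)
    # First scan: neighbor counts within the spatial radius (brute force for illustration).
    d_s = np.linalg.norm(locations[:, None, :] - locations[None, :, :], axis=-1)
    spatial_counts = (d_s <= spatial_radius).sum(axis=1) - 1
    # Second scan: neighbor counts within the visual radius.
    d_v = np.linalg.norm(descriptors[:, None, :] - descriptors[None, :, :], axis=-1)
    visual_counts = (d_v <= visual_radius).sum(axis=1) - 1
    return list(zip(density_labels(spatial_counts), density_labels(visual_counts)))
```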
For each query selectivity group, referring to [64] we randomly selected a set of query images which guarantees 95% confidence interval with 2.5% margin error [6] (e.g., with Flickr we selected 1,534 queries out of 1M image subset of SD-VD group). For example, 63 a query image selected from SD-VS would generate a query with low spatial selectivity and high visual selectivity. We used SU-VU as the default selectivity to avoid selectivity bias. To evaluate the index structures, we report two metrics when executing Q range : a) result accuracy in terms of recall, and b) index performance in terms of the number of accessed pages. As shown in Table 5.4, for each Q range merging the relevant images retrieved from Aug R*-tree and Aug VFI-E forms a full result set; hence this union set is the ground truth for each Q range when calculating recall. Table 5.5: Query Parameters Query Parameter Values Spatial Range h 1.25×1.25, 3.7×3.7, 6.18×6.18, 8.1×8.1i km 2 Visual Range (σ) h25, 30, 35, 40i Spatial Exploration Parameter (E.s) h0.1, 0.3, 0.5, 0.7i Visual Exploration Parameter (E.v) h9, 15, 21, 27i 5.3.2 Result Accuracy Baseline vs. Hybrid: We used a set of queries with the default values to evaluate the recalls of the baseline structures comparing to the hybrid structures with different datasets with SU-VU. Even though our index structures achieve 100% precision, they might observe lower recalls due to the failure of retrieving all relevant images. As shown in Figure 5.3, among the baseline structures, Aug R*-tree achieved the highest recall since it avoids the inaccuracy caused by LSH. The recalls of the other baselines and the two level hybrid structures were identical because they retrieve the same class of images (i.e., SV-Match-Rel). The impact of Query Selectivity on Hybrid Structures: Next, we varied the query selectivity parameter to evaluate its effect on the recalls of different hybrid struc- tures using Flickr . In Figure 5.4, two-level hybrid index structures had an identical recall for a given query selectivity; however, the recall varied across query selectivity fac- tors. In particular, Aug SFI and Aug VFI had the worst recall value (54%) with SD-VD while they achieved the best recall (89%) with SS-VD. With SD-VD, there are many 64 Figure 5.3: Recall of Baseline vs. Hybrid w/ SU-VU Figure 5.4: Impact of Query Selectivity on Hybrid w/ Flickr Figure 5.5: Impact of Exploration on Recall w/ Flickr Figure 5.6: Impact of Exploring Spatially (Bottom X Axis) and Visually (Top X axis) on Recall w/ Flickr Figure 5.7: Impact of Spatial Range (Bottom X Axis) and Visual Range (Top X axis) on Recall w/ Flickr images that are spatially crowded and visually similar, so the probability of having par- tially similar images is high, which increases the LSH inaccuracy and incurs a low recall. Meanwhile, with SS-VD there are a few images, but they are very similar to each other, which decreases the LSH inaccuracy and achieves a high recall. In Figure 5.5, we show the effect of the explorative technique on the result accuracy when varying the query selectivity. In general, we gained a higher recall when exploring visually compared with spatial exploration. Aug SFI-E showed the highest enhancement with SD-VD increas- ing its recall by 54% compared to Aug SFI. As mentioned earlier, with SD-VD the LSH inaccuracy increases and this deficiency can be minimized by Aug SFI-E which retrieves some of the V-UNMatch-Rel images. 
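As an aside on the query-set construction above, the quoted counts (e.g., 1,534 queries for the 1M-image Flickr subset) are consistent with the standard sample-size calculation for a 95% confidence level and a 2.5% margin of error. The sketch below assumes Cochran's formula with a finite-population correction; the exact formula of [6] is not restated in the text.

```python
import math

def sample_size(population: int, z: float = 1.96, margin: float = 0.025, p: float = 0.5) -> int:
    """Cochran's sample size with finite-population correction (assumed formula)."""
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)   # infinite-population sample size (~1537)
    n = n0 / (1 + (n0 - 1) / population)          # finite-population correction
    return math.ceil(n)

# For a 1M-image population this yields roughly 1,535 queries, in line with the
# 1,534 figure quoted above (the off-by-one comes from rounding conventions).
print(sample_size(1_000_000))
```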
The impact of spatial exploration with Aug VFI was the highest with SS-VD increasing its recall by 7%. With SS-VD, there are a few 65 images nearbyI Q , but the number of similar images is potentially high. Hence, exploring spatially with SS-VD might result in more images (i.e., S-UNMatch-Rel) relevant toI Q . The Impact of Exploration Parameter: In Figure 5.6, we varied the visual exploration parameterE.v (the top X-axis) or spatial exploration parameterE.s (the bottom X-axis) to evaluate their effect on the recall of hybrid structures with respect to Aug R*-tree using Flickr. We chose Aug R*-tree as the benchmark because it showed the best recall in the previous experiments. IncreasingE.v improved the recall con- siderably while increasingE.s did not show continuous improvement because the missed S-UNMatch-Rel images might be located near the original spatial range, hence exploring within the near vicinity might be sufficient to retrieve them. The Impact of Visual Range: In Figure 5.7 (consider only the top X-axis with the Y-axis), we varied the visual range (i.e., the top X-axis) to evaluate its effect on the recall of hybrid structures with respect to Aug R*-tree using Flickr. When shrinking the visual range, the recall of hybrid structures considerably improved approaching the recall of Aug R*-tree. Decreasing the visual range restricts the search space only to very similar images that are potentially not affected with the inaccuracy of dimension reduction leading to less retrieval of V-UNMatch-Rel images by Aug R*-tree in favor of retrieving more SV-Match-Rel images. The Impact of Spatial Range: In Figure 5.7 (consider only the bottom X-axis with the Y-axis), we varied the spatial range to evaluate its effect on the recall of hybrid structures with respect to Aug R*-tree using Flickr. When shrinking the spatial range, the recall of hybrid structures improved approaching the recall of Aug R*-tree. Decreas- ing the spatial range restricts the search space to the images which are spatially very close and potentially very similar to I Q ; hence decreases the effect of the visual inaccu- racy. This leads to less retrieval of V-UNMatch-Rel images by Aug R*-tree in favor of retrieving more SV-Match-Rel images. 66 5.3.3 Index Performance Baseline vs. Hybrid: Figure 5.8 (the Y-axis is the number of accessed pages in a base-10logarithmicscale)showstheperformanceofthebaselinescomparedtothehybrid structures across different datasets with SU-VU. Aug R*-tree across all datasets incurred the worst performance. Compared to DI and Aug LSH, Aug R*-tree suffers from lower performance due to two reasons: 1) the large size of the augmented visual descriptors, and 2) potential retrieval of additional images (i.e., V-UNMatch-Rel). Meanwhile, the hybrid structures were superior in performance compared to the baselines for all datasets (e.g., with respect to Aug R*-tree. Figure 5.8: Performance of Baseline vs. Hybrid w/ SU-VU) Figure 5.9: Impact of Query Selectivity on Hybrid w/ Flickr Figure 5.10: Impact of Exploration on Perfor- mance w/ Flickr Figure 5.11: Impact of Exploring Spa- tially/Visually on Performance w/ Flickr The impact of Query Selectivity on Hybrid Structures: We varied the query selectivity parameter to evaluate its effect on the performance of the hybrid structures with Flickr in Figure 5.9 (the Y-axis is the number of accessed pages in a base-10 logarithmic scale). 
Aug VFI outperformed Aug SFI with a speedup factor of 2.1 with the query selectivity SD-VS because the spatial selectivity factor is lower than the visual one. With the other types of query selectivity, Aug SFI was the superior achieving the best speedup factor of 2.6 with SS-VD because the spatial selectivity is the highest. In 67 Figure 5.10 (the Y-axis in a base-10 logarithmic scale), we show the effect of explorative technique on the performance while varying the query selectivity. In general, exploring with Aug SFI causes a higher I/O cost than exploring with Aug VFI because Aug SFI-E explores in the secondary LSHs whose buckets do not have a limit on their size while Aug VFI-E explores the secondary R*-tree whose leaf nodes have limited size. Aug SFI- E showed the least additional I/O cost with SD-VD because there are many relative images with a high probability being scattered in few disjointed buckets. Moreover, with SD-VD Aug VFI-E showed the least additional I/O cost because the potential similar images within the explored spatial area have a higher probability to be stored into least number of leaf nodes. The Impact of Exploration Parameters: In Figure 5.11, we varied the visual explorationparameterE.v orthespatialexplorationparameterE.stoevaluatetheireffect on the performance of hybrid structures with respect to Aug R*-tree using Flickr. We chose Aug R*-tree as the benchmark because it showed the worst performance with the best recall. The performance decrease incurred by the spatial exploration (i.e., Aug VFI- E) is less than that by the visual exploration. In particular, Aug SFI-E outperformed Aug R*-tree with speedup factors of 39 and 7 whenE.v = 9 andE.v = 27, respectively, while the speedup factors of Aug VFI-E were 34 and 13 whenE.s = 0.1 andE.s = 0.7, respectively. The Impact of Spatial and Visual Range: We varied the spatial range to eval- uate its effect on the performance of hybrids with respect to Aug R*-tree using Flickr. The performance of all hybrid structures was directly affected when shrinking the spa- tial range. With the smallest spatial range, Aug SFI outperformed the other structures because it organizes the dataset spatially primarily. On the other hand, varying the visual range did not much affect the performance (i.e., disk accesses) of the two-level structures because the similarity filter is applied after retrieving all visual descriptors from the disk. 68 Chapter 6 A Class of R*-tree Indexes for Spatial-Visual Search of Geo-tagged Street Images Based on our key observation that similar street images are typically in the same spatial locality, index structures for spatial-visual queries can be effectively built on a spatial index (i.e., R*-tree). Therefore, in this chapter, we propose a class of R*-tree indexes, particularly, by associating each node with two separate minimum bounding rectangles (MBR) one for spatial and the other for (dimension-reduced) visual properties oftheircontainedimagesandadaptingtheR*-treeoptimizationcriteriatobothproperty types. Furthermore, due to the fact that the boundary of the visual properties of a set of images (even similar ones) is typically loose, images per node are grouped into clusters such that each cluster has a tighter visual boundary. 6.1 Preliminaries 6.1.1 Image Model In an image database, each image I is represented by two descriptors: spatial (I.s) and visual (I.v) 1 . 1 Note that the dot symbol in I.s and I.v does not denote the dot product operator. 
It denotes the spatial or visual descriptor of an image I.

In this chapter, without loss of generality, we use the scene location as the spatial descriptor (referred to as I.s) of an image; the other spatial descriptors can trivially be indexed by an R-tree as well, either as a point for the camera location or as the MBR of the FOV. Moreover, we use R-MAC [151] as the visual descriptor (referred to as I.v) of an image.

6.1.2 Queries on Geo-tagged Images

Given a dataset D of n geo-tagged images (i.e., D = {I_1, I_2, ..., I_n}), conventional queries on D use either only the spatial metadata of the images (referred to as spatial queries, e.g., a spatial range query) or only their visual features (referred to as visual queries, e.g., query by example). Our interest, however, is in a new form of query, referred to as a spatial-visual query (Q_sv), which uses conjunctive criteria over both the spatial and visual features to retrieve the images of interest. Each of these criteria can be formulated as a different query type, including range and kNN; hence, various types of Q_sv can be defined. Without loss of generality, we focus our discussion on one type of Q_sv, the spatial-visual range query, formally defined in Definition 15. In this chapter, the visual similarity of two images I_i and I_j is measured using the Euclidean distance (see Eq. 6.1) between their visual descriptors, each of which is composed of f features.

Definition 15 (Spatial-Visual Range Query Q_sv(D, I_Q, R(p, σ_s), σ_v)). Q_sv is a conjunctive query whose inputs comprise visual criteria (a query image of interest I_Q and a visual similarity threshold σ_v) and spatial criteria (a circular spatial region R(p, σ_s) defined by a center point p and a radius σ_s), and whose output is defined as follows:
• Q_v(D, I_Q, σ_v) = {∀ I_j ∈ D | d_v(I_j.v, I_Q.v) ≤ σ_v}
• Q_s(D, R(p, σ_s)) = {∀ I_j ∈ D | I_j.s ∩ R(p, σ_s) ≠ φ}
• Q_sv(D, I_Q, R(p, σ_s), σ_v) = Q_v(D, I_Q, σ_v) ∩ Q_s(D, R(p, σ_s))

d_v(I_i, I_j) = \sqrt{\sum_{w=1}^{f} (I_i.v_w - I_j.v_w)^2}    (6.1)

6.2 Spatial-Visual Indexes

In this section, we discuss a class of R*-tree indexes for organizing geo-tagged street images. This class of indexes aims at supporting spatial-visual queries efficiently and accurately. Throughout this section, we use the following running example.

Example: Consider a geo-tagged image dataset that includes fourteen images whose geographic coordinates are visualized in Figure 6.1a. Each image depicts one of the basic geometric shapes (e.g., triangle, square, diamond, hexagon, or circle), and each shape is drawn in one of the colors listed in Figure 6.1c. The spatial and visual descriptors of each image are shown in Figure 6.1d. Assume that the Q_sv illustrated in Figure 6.1b is given. This query is composed of a spatial range and a visual range. For simplicity, each range is specified by a range over every feature (Definition 15 implies that each range is a circle, either in a 2D or a high-dimensional space; however, a rectangular range is used here to simplify the explanation). Therefore, Q_s is specified as {X: [7.5-8.4], Y: [17.77-18.4]} while Q_v is specified as {U: [2-2], V: [0.61-0.90]}. Throughout this example, the class of R*-tree indexes is constructed using the settings (M=4, m=2).

Figure 6.1: An Example Geo-tagged Image Dataset and Spatial-Visual Query ((a) Dataset Visualization; (b) Spatial-Visual Query (Q_sv); (c) Visual Codebook; (d) Descriptors of Images)

6.2.1 Baseline Index Structures

Spatial R*-tree Index (SRI)

SRI is similar to the conventional R*-tree in that the MBR of each node is constituted based on the spatial descriptors of its contained images. The only difference is that the leaf nodes are augmented with both the spatial and visual descriptors.
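Before turning to how the indexes execute this query, the predicate of Definition 15 together with the distance of Eq. 6.1 can be stated index-free as a short sketch; the Image container and the linear scan below are illustrative assumptions only.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Image:
    img_id: int
    s: Tuple[float, float]   # scene location (I.s)
    v: List[float]           # visual descriptor (I.v)

def d_v(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two visual descriptors (Eq. 6.1)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def q_sv(dataset: List[Image], i_q: Image, p: Tuple[float, float],
         sigma_s: float, sigma_v: float) -> List[Image]:
    """Spatial-visual range query of Definition 15, evaluated by a plain scan."""
    result = []
    for img in dataset:
        in_range = math.dist(img.s, p) <= sigma_s   # spatial criterion (Q_s)
        similar = d_v(img.v, i_q.v) <= sigma_v      # visual criterion (Q_v)
        if in_range and similar:
            result.append(img)
    return result
```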
When executing a Q_sv, SRI is traversed recursively using Q_s until reaching a set of leaf nodes. Then, the similar images are selected by inspecting the visual descriptors augmented within the leaf nodes against Q_v.

Note that SRI potentially organizes street images properly because of the street-image visual locality phenomenon (i.e., the locality of similar visual features of street images). Therefore, despite the fact that SRI uses "only" the spatial descriptor for image organization, the images are organized based on their spatial proximity and, implicitly, their visual relevancy. Nonetheless, SRI still suffers from a performance issue because leaf nodes are fetched whenever their MBRs satisfy Q_s, regardless of the visual relevance of their contained images (i.e., some leaf nodes can be false-hits).

Example: As shown in Figure 6.2a (the MBR of each node is provided in the red box beside it), when executing the example Q_sv with SRI, all internal nodes and four leaf nodes (nodes 2-5) are visited because they satisfy Q_s. Thereafter, the ten images contained in the visited leaf nodes are inspected sequentially to discard the images that do not satisfy either Q_s or Q_v. As a result, some nodes are inspected even though they do not contain any relevant images (e.g., node 4 is a false-hit). The final output of Q_sv includes n_4, n_5, n_7, and n_8.

Figure 6.2: Constructing and Querying SRI and VRI using the Image Dataset and Q_sv Defined in Figure 6.1 ((a) Spatial R*-tree Index (SRI); (b) Visual R*-tree Index (VRI))

Visual R*-tree Index (VRI)

In contrast to SRI, VRI organizes images using "only" their visual properties (i.e., the MBRs are constituted of visual descriptors), while the leaf nodes are augmented with both the spatial and visual descriptors of the images. However, to avoid the "curse of dimensionality" (an experimental evaluation conducted by Otterman [124] shows that R*-tree works well up to 20 dimensions [11]; beyond that, it becomes worse than a linear scan for searching), the I.v of each image should first be reduced into a low-dimensional descriptor (referred to as I.v′) using one of the dimension reduction techniques (e.g., principal component analysis (PCA)). Thereafter, the MBRs of VRI are built using I.v′. Nonetheless, the leaf nodes are augmented with I.v along with I.s.

When executing Q_sv, the tree is traversed with respect to Q_v, and then the images contained in the leaf nodes are examined using their augmented descriptors (spatial and visual) with respect to the two parts of Q_sv to discard non-similar or far-away images. Since the retrieved images are evaluated using their high-dimensional visual descriptors, VRI avoids retrieving false-positive images; hence, it always achieves a perfect precision score of 1.0. However, because VRI is traversed through its I.v′-based MBRs, it can potentially miss some similar images (i.e., false-negatives); hence, it potentially obtains a low recall score.
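The dimension-reduction step on which VRI (and the hybrid indexes below) rely can be sketched as follows; scikit-learn's PCA is used here purely for illustration (the implementation described in Section 6.3 uses OpenCV's PCA), and the target dimensionality d is one of the tuning parameters studied later.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_descriptors(descriptors: np.ndarray, d: int = 6):
    """Project high-dimensional I.v descriptors (e.g., 512-d R-MAC) to d-dimensional I.v'.

    Returns the fitted PCA model (needed to project query descriptors at search
    time) and the reduced descriptors used to build the index MBRs.
    """
    pca = PCA(n_components=d)
    reduced = pca.fit_transform(descriptors)
    return pca, reduced

# Example: reduce a batch of stand-in 512-d descriptors, then project a query descriptor.
rng = np.random.default_rng(0)
vectors = rng.random((1000, 512))          # placeholder for R-MAC descriptors
pca, iv_prime = reduce_descriptors(vectors, d=6)
query_prime = pca.transform(rng.random((1, 512)))
```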
In addition, the performance of VRI is potentially low because of the false-hits caused by organizing images using only their visual properties. Moreover, VRI cannot benefit from the street-image visual locality phenomenon because of the use of low-dimensional visual descriptors in the index construction.

Example: As shown in Figure 6.2b, when executing the example Q_sv, all internal nodes and three leaf nodes (nodes 1-3) are visited; however, node 1 is falsely fetched because none of its contained images satisfy Q_s. Nonetheless, VRI is able to retrieve all relevant images because, in this example, (for simplicity) the tree is built using all visual features of the images (i.e., dimension reduction is not used).

6.2.2 Hybrid Index Structures

Plain Spatial-Visual R*-tree (PSV)

PSV is a hybrid index in which images are organized both spatially and visually. In particular, each node of PSV has an MBR built on a concatenation of the I.s and I.v′ descriptors of the contained images. When executing Q_sv, the tree is traversed recursively according to both parts of Q_sv until reaching the leaf nodes. The images contained in the fetched leaf nodes are then inspected to check whether their I.s and I.v satisfy Q_sv. Since PSV avoids organizing images using only their spatial or only their visual properties (as SRI and VRI do), false-hits are minimized; hence, the performance of PSV potentially improves. However, PSV has an accuracy limitation for two reasons: a) it uses I.v′ for constructing the MBRs, and b) it does not consider the domain difference between the spatial and visual features of images and treats them identically; hence, PSV is neither able to fully exploit the street-image visual locality phenomenon nor to support an optimal organization of the images spatially or visually (i.e., the optimization criteria cannot be tuned to favor either).

Figure 6.3: Constructing and Querying PSV and ASV using the Image Dataset and Q_sv Defined in Figure 6.1 ((a) Plain Spatial-Visual R*-tree (PSV); (b) Adaptive Spatial-Visual R*-tree (ASV))

Example: As shown in Figure 6.3a, when executing Q_sv, two internal nodes (i.e., the root and node B) and three leaf nodes (nodes 2, 4, and 5) are visited. Even though the MBR of node 2 overlaps with Q_sv, its contained images are irrelevant both spatially and visually (i.e., node 2 is a false-hit). Nonetheless, PSV is able to retrieve all relevant images because this example does not use a dimension reduction technique.

Adaptive Spatial-Visual R*-tree (ASV)

This tree extends PSV by associating each node with two separate MBRs for the spatial and visual descriptors of the contained images (referred to as the Spatial MBR (MBR_s) and the Visual MBR (MBR_v)). In particular, MBR_s covers the I.s of all of the contained images while MBR_v covers their I.v′. Allocating two MBRs for each node makes it possible to distinguish the spatial properties from the visual ones and to organize the images adaptively by handling the two descriptor types differently. In particular, the conventional optimization criteria (i.e., node margin, area, and overlap) are adapted to treat the two types of properties differently. Thus, each criterion is computed independently for the spatial and visual properties, and the two intermediate results are then combined through a normalized, weighted linear combination, as shown in Eq. 6.2 (α is the weight of the spatial properties while (1−α) is the weight of the visual properties).
Algorithm 2 Insert(R: ASV,E: A new entry) 1: N←R.RootNode 2: N←ChooseSubtree(N,E) 3: AddE to nodeN 4: ifN needs to be split then 5: N 1 ,N 2 ←Split(N,E) 6: ifN =R.RootNode then 7: Create new nodeM; AddN 1 toM; AddN 2 toM;R.RootNode←M 8: end if 9: end if 10: AdjustTree(R) 11: return R Algorithm 3 ChooseSubtree(N: Node,E: A new entry) 1: ifN is Leaf then 2: return N 3: else 4: for eachC i ∈N.Children do {Overlap and Area defined in Eq. 6.2} 5: C i .ΔO = Overlap(MBR(C i ,E)) - Overlap(MBR(C i )) 6: C i .ΔA = Area(MBR(C i ,E)) - Area(MBR(C i )) 7: end for 8: ifN.Children are Leaves then 9: Cx = ArgMin C i ∈N.Children C i .ΔO 10: else 11: Cx = ArgMin C i ∈N.Children C i .ΔA 12: end if 13: ChooseSubtree(Cx,E) 14: end if Algorithm 4 Split(N: Node,E: A new entry) 1: X←ChooseSplitAxis(N,E) 2: index←ChooseSplitIndex(N,E,X) 3: SortN images along theX axis 4: DefineG 1 ← {},G 2 ← {} 5: for eachE i ∈N.Children do 6: if i < index thenG 1 ←E i ∪G 1 elseG 2 ←E i ∪G 2 end if 7: end for 8: AddG 1 intoN 1 ; AddG 2 intoN 2 9: return N 1 ,N 2 Example: Consider another set of images whose spatial and visual descriptors are illustrated in Figure 6.4a. Assume that the visual descriptors of the images are repre- sented in a 5-dimensional space (i.e., five visual features {V 1 ,V 2 ,V 3 ,V 4 ,V 5 }), and these 78 Algorithm 5 ChooseSplitAxis(N: Node,E: A new entry) 1: for each x Axis do 2: SortN.Children by the lower then by the upper value of their MBR S and MBR V 3: for each Sort do {Margin is defined in Eq. 6.2} 4: Determine (M− 2m + 2) distributions of the (M + 1)N.Children 5: for each dist∈ M− 2m + 2 distributions do 6: DefineG 1 dist ← {},G 2 dist ← {} 7: SplitN.Children intoG 1 dist andG 2 dist based on dist 8: Margin f = Margin(MBR(G 1 dist )) + Margin(MBR(G 2 dist )) 9: end for 10: S = P dist∈M−2m+2 Margin f 11: end for 12: end for 13: X = ArgMin x S 14: return X Algorithm 6 ChooseSplitIndex(N: Node,E: A new entry,X: Split Axis) 1: SortN.Children by the lower then by the upper value of their MBR S and MBR V Along theX axis 2: for each Sort do {Overlap is defined in Eq. 6.2} 3: Determine (M− 2m + 2) distributions of the (M + 1)N.Children 4: for each dist∈ M− 2m + 2 distributions do 5: DefineG 1 dist ← {},G 2 dist ← {} 6: SplitN.Children intoG 1 dist andG 2 dist based on dist 7: Overlap ALL = Overlap(MBR(G 1 dist )) + Overlap(MBR(G 2 dist )) 8: end for 9: end for 10: index = ArgMin dist Overlap ALL 11: return index visual representations are reduced into a 1-dimensional space (i.e., one visual feature U). For this dataset, PSV and ASV are constructed using the settings (M=2, m=1, and α=0.6). Using PSV, because I 2 and I 3 are more similar to each other in the low- dimensional space compared to I 1 , I 2 and I 3 are grouped in the same node while I 1 is stored in another node (see Figure 6.4b). Meanwhile, usingASV, becauseI 1 ias spatially closer to I 2 thanI 3 , I 1 andI 2 are grouped in the same node while I 3 is stored in another node (see Figure 6.4c) because ASV prioritizes the spatial features in the organization. When querying PSV usingQ sv , node A is visited and I 1 is retrieved. Note that I 2 is not retrieved by PSV because it is stored in a node whose boundary of low-dimensional visual descriptors does not overlap withQ v . Meanwhile, when querying ASV, the node which includes I 1 and I 2 is visited. When inspecting these two images visually in the high-dimensional space, the visual descriptors of both images in the high-dimensional space overlap withQ v . 
Since the strategy of ASV, here, is to pack more “spatially” 79 related images together in the leaf nodes, ASV is able to retrieve more similar images when fetching leaf nodes; hence, observing better accuracy. (a) Dataset and Spatial-Visual Query (b) PSV (c) ASV Figure 6.4: An Illustration Example for the Inaccuracy Issue of PSV Algorithm 7 ClusterNode(N: Node, k: # of clusters, t: # of iterations) 1: ifN.Children are Leaves then 2: for eachE i ∈N.Children do 3: Matrix = Collect I.v 0 vectors for images contained inE i 4: end for 5: else 6: for eachE i ∈N.Children do 7: Matrix = ClusterNode(E i , k, t) 8: end for 9: end if 10: labels = KMeans(Matrix,k,t) 11: N.visualMBRs = CalculateBoundery(labels,Matrix, k) 12: return Matrix Clustered Adaptive Spatial-Visual R*-tree (CSV) One may think that the boundary of the visual descriptors of a set of similar images is typically tight. Nonetheless, with the variety of small visual differences among similar images (e.g., lighting, shadow), such tight boundaries are not guaranteed, and instead, these boundaries may be loose (i.e., containing dead space). Hence, the visual MBR in 80 each node of ASV may not tightly bound all of the enclosed visual descriptors, espe- cially that internal nodes (located in the higher levels in the tree) potentially contain images from different neighbourhoods (i.e., not necessarily similar). Therefore, we pro- poseanextendedversionofASV, referredtoasClustered Adaptive Spatial-Visual R*-tree (CSV). The main idea of CSV is to partition the images in a node into groups (i.e., clusters) based on their similarity, and then bound each of these groups by a separate visual MBR. This partitioning does not imply any change in the node structure of the tree (i.e., no node split), but it only comprises augmenting multiple visual MBRs into the node alongside its spatial MBR. For partitioning images per node, we used the k-means clustering algorithm. TheCSV “Insert” function is not different from that ofASV. Particularly, the visual MBR partitioning occurs once after the tree is built and may be executed periodically as a fine tuning step after several insertions. This step is performed through traversing the tree using depth-first-search while clustering is carried out per traversed node (see Algorithm 7). For executing aQ sv using CSV, the tree is traversed while performing two checks: a) compareQ s with the spatial MBR associated with each node, and b) compareQ v with the set of visual MBRs associated with each node to check whether any of them overlaps withQ v . Thereafter, the leaf nodes are fetched in a similar way to that of ASV. Compared to ASV, CSV potentially reduces the false-hits (i.e., improving the per- formance) because CSV has multiple visual MBRs that tightly bound their contained images visually. Particularly, these multiple MBRs ensure that the search process will not fetch a node unless it definitely includes similar images. On the other hand, the accuracy of CSV is similar to that of ASV. The asymptotic analysis of PSV, ASV, and CSV is shown in Table 6.1. Asymp- totically, bothPSV andASV have similar costs to that of the conventional R*-tree. In particular, the construction cost of these two trees isO(n log M n) which is the insertion cost (i.e.,O(log M n)) of n images. Meanwhile, for constructing a CSV, an additional 81 cost for performing clustering per node is needed. 
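Before quantifying this additional cost, the per-node clustering step (Algorithm 7) can be sketched as follows; the dict-based node layout and scikit-learn's KMeans are illustrative stand-ins for the C++/OpenCV implementation described in Section 6.3.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_node(node: dict, k: int, t: int) -> np.ndarray:
    """Attach up to k tight visual MBRs to a node (cf. Algorithm 7).

    A node is represented here as a dict with either 'children' (child nodes)
    or 'descriptors' (the reduced I.v' vectors of a leaf's images); this layout
    is an assumption, not the dissertation's node structure. The stacked
    descriptors are returned so that parents can cluster them recursively.
    """
    if "descriptors" in node:                      # leaf: use its images' I.v'
        matrix = np.asarray(node["descriptors"], dtype=float)
    else:                                          # internal node: gather from children (DFS)
        matrix = np.vstack([cluster_node(child, k, t) for child in node["children"]])

    n_clusters = min(k, len(matrix))
    labels = KMeans(n_clusters=n_clusters, max_iter=t, n_init=1).fit_predict(matrix)

    # One tight visual MBR (per-dimension min/max) per cluster.
    node["visual_mbrs"] = [
        (matrix[labels == c].min(axis=0), matrix[labels == c].max(axis=0))
        for c in range(n_clusters)
    ]
    return matrix
```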
Given that a) the maximum number of nodes of a tree isO(fanout height−1 ) which is equivalent toO(M (log M n)−1 ) for R*-tree, and b) the cost of performing k-Means per node containing at maximum M isO(Mkt) wherek is the number of clusters andt is the number of iterations until reaching conver- gence [72], the additional cost of building a CSV isO(Mkt∗ M (log M n)−1 ). Regarding the search cost, both ASV and CSV perform similarly in the worst case (i.e., CSV cannot benefit from the multiple visual MBRs whenQ v overlaps all of the visual MBRs per node). However, on average, the search cost can be reduced k times. Table 6.1: Asymptotic Analysis of the Hybrid Indexes Operation PSV & ASV CSV Construction O(n log M n) O(n log M n + Mkt∗ M (log M n)−1 ) Search O(log M n) O(log M n) Θ(log M n) Θ( 1 k log M n); k≤m, k≤M/2 6.3 Experiments 6.3.1 Experiment Setup Table 6.2: Street Geo-tagged Image Datasets Dataset # of images Descriptors Size (Spatial; Visual) (MB) Spatial Region (Width× Height) Spatial Density (Avg. # images per km 2 ) OR 3, 204 1; 20 2.1×1.2 (km 2 ) 1,271 PT 4, 825 1; 30 1.5×3.6 (km 2 ) 893 MA 17, 825 2; 106 13.7×7.3 (km 2 ) 178 LA 24, 345 3; 140 1.4×1.0 (km 2 ) 17,389 SF 520, 623 51; 3,607 6.0×8.1 (km 2 ) 10,712 Table 6.3: Index Construction Settings Parameter Values Weight of image spatial properties (α) 0.2, 0.4, 0.6, 0.8 # clusters per node (k) 3, 5, 8, 10, 12 Dimensionality (d) of the data space of I.v 0 3, 6, 9, 12 82 Table 6.4: Query Settings σv σs OR 0.94, 1.00, 1.08, 1.15 100m, 200m, 300m, 400m PT 0.92, 0.99, 1.08, 1.15 MA 0.89, 0.94, 1.01, 1.08 LA 0.75, 0.78, 0.81, 0.84 SF 0.85, 0.94, 1.01, 1.08 Table 6.5: Indexes for SF # of Nodes Index Size (KB) SRI 7,613 18,410.6 VRI 7,781 18,629.7 PSV 7,564 21,123.7 ASV 7,540 21,107.9 CSV 7,540 35,245.4 (a) # of Nodes (b) Index Size Figure 6.5: Comparison of Indexes w.r.t. # of Nodes & Index Size (a) Nodes’ Margin (b) Nodes’ Overlap Figure 6.6: Comparison of Indexes w.r.t. Optimization Criteria Datasets. To evaluate our proposed indexes, we conducted several experiments using real street image datasets obtained through Google Street View API [2] for four geo- graphical regions [167, 112]: a) downtown Orlando, Florida (referred to as OR); b) downtown Pittsburgh, Pennsylvania (referred to as PT); c) part of Manhattan, New York City (referred to as MA); and d) the campus of the University of Southern Cali- fornia and its neighborhood in Los Angeles (referred to as LA). In these datasets, each image is tagged with its camera location (i.e., latitude and longitude coordinates), and 83 subsequently, we calculated the scene location (I.s) of each image using the scene estima- tion algorithm [18]. RegardingI.v, for each image, we extracted the R-MAC descriptor, which is a high-dimensional vector composed of 512 dimensions, using the implementa- tion available at https://github.com/noagarcia/keras_rmac. Moreover, to evaluate the scalability of our indexes, we used another real large-scale dataset for the region of San Francisco [49]. Table 6.2 shows statistical information of the five datasets. Index settings. All proposed index structures were implemented in C++, and the source code is derived from that of the conventional R*-tree (https://github.com/ virtuald/r-star-tree). All indexes were built by setting the minimum (m) and maxi- mum (M) fan-out of each node to 50 and 100, respectively. To reduce the dimensionality of the R-MAC descriptors, we used the principal component analysis (PCA) technique (available in OpenCV v3.4.3). 
Moreover, for clustering images at each node in CSV, we used the k-means clustering algorithm (in OpenCV v3.4.3) 8 . All of the parameters used for constructing the indexes throughout our evaluation are shown in Table 6.3 where the default values are underlined. Query settings. We selected randomly a set of query images from each dataset to constructQ sv . To ensure selecting a sufficient number of queries, the number of selected queries was decided based on the statistical analysis on the size of each dataset [95]. Thus, to obtain a confidence score of 5% and a margin error score of 2.5%, the number of queries per dataset is 1,106, 1,221, 1,436, 1445, and 1532 for theOR,PT,MA,LA, and SF datasets, respectively. Thereafter, we constructed spatial and visual ranges based on each selected query image (I Q ) where the values of σ v and σ s were chosen among the ones listed in Table 6.4. The values of σ v were chosen based on the visual distribution of each dataset such that each selected value enables retrieving, on average, the top 10% similar images for each I Q . Meanwhile, the values for σ s were chosen to create a spatial region that covers at least a building with its neighborhood. To evaluate the 8 It was shown experimentally that k-means converges after 20-50 iterations (t) regardless of the dataset size [42]. 84 proposed indexes, we report two metrics: the average recall and the number of accessed nodes for all executedQ sv . The recall score of eachQ sv result is calculated assuming that the result of SRI is the ground truth. Note that the precision of theQ sv with all proposed indexes is always 1.0 because all dimensions of visual features (pointed to by the leaf nodes) are used for final comparisons, and thus there are no false-positives. For the efficiency metric, we report the number of accessed nodes per query rather than the query execution time since the latter varies based on the status of the computing environment (e.g., caching). The number of accessed nodes demonstrates the ability of the index structure to prune the search space when executingQ sv . Index Construction. Figure 6.5 compares the constructed indexes for all datasets in terms of the number of generated nodes and index size. In general, organizing images using only their visual descriptors produces a structure with a large number of nodes (as in VRI), while the image organization using only the spatial descriptors results in a compact structure. VRI has more nodes compared to SRI because the dissimilar- ity of images from different locations is high, and consequently, VRI generates many sparse nodes (i.e., do not utilize the full node capacity). Moreover, the hybrid indexing approaches construct trees with less nodes than either (or both) of the two baselines. Furthermore, the index size of the hybrid indexes is larger than that of the baselines because the nodes in the hybrid indexes comprise MBRs of both types of image descrip- tors, while the nodes of the baselines are associated with MBRs of one type of image descriptors. CSV has the largest index size since it contains multiple visual MBRs per node. The same observations stand for the indexes constructed for the SF dataset, as shown in Table 9 6.5. The construction time of the indexes is omitted because it varies from a run to another. Nonetheless, in summary, the construction time for all indexes for both OR and PT is in the range of (1-3) seconds while for MA and LA the indexes are built within 5 to 12 seconds. 
Finally, the indexes for SF are built within 155 to 9 SF was excluded from Figure 6.5 because the large difference of the values of SF from that of the other datasets increases the scale of the Y-axis; hence making the differences among the various indexes in each dataset not observable. 85 330 seconds 10 . We also observed that the construction time of CSV differs from that of ASV roughly by a 10% increase, which is the cost of the clustering algorithm. Inadditiontoindexsizeandnodes, wealsoreporttheaveragemargin(i.e., perimeter) and overlap of the nodes 11 for all indexes across all datasets, as shown in Figure 6.6. Given that the standard design of R*-tree aims at minimizing the values of node margin and overlap, our proposed R*-tree-based indexes can be evaluated using these metrics. These metrics are computed per node in terms of both spatial and visual descriptors of the contained images, regardless of the index design 12 . Based on the computed values, generally, the hybrid indexes obtained lower margin and overlap values compared to that of the baselines because of having a better image organization; hence, a more compact structure. Moreover, across all datasets, the margin and overlap values of the indexes decrease as the size of the dataset increases. When having a large number of images, the index potentially uses the split algorithm more frequently compared to the situation of having a small number of images. Note that the insert algorithm always executes the ChooseSubtree algorithm, while the split algorithm is executed when needed. Both of these algorithms use the optimization criteria to generate better image organization in the index structure. Therefore, the more the split algorithm is executed, the better the image organization becomes (i.e., attaining better values of the optimization criteria). In addition to the impact of the dataset size, having a dense dataset enables better organization because of the need to execute the split algorithm more. Therefore, when 10 This is the time to build the index in memory and does not include flushing the pages into the disk. This is because, in most cases, there is enough memory to hold the index, and only the pages including the high-dimensional visual features reside on disk. Nevertheless, our reported node accesses include the access of the nodes residing in memory. 11 We did not report the values of nodes’ area because the differences in terms of the area were marginal (around 10 −5 ). 12 Given the design difference in all indexes, to conduct a fair comparison, we perform the computation of these metrics as post-process step by constructing a “virtual” MBR of the spatial and (dimension- reduced) visual properties per node for computation purpose (i.e., no weighting or normalization for the properties). 86 comparing theMA andLA datasets, where both of them have similar sizes, LA achieved better values for margin and overlap because of its high density. (a) Efficiency (b) Effectiveness Figure 6.7: Baseline vs. Hybrid (a) using the OR Dataset (b) using the PT Dataset (c) using the MA Dataset (d) using the LA Dataset Figure 6.8: The Impact of Varying Query Visual Range ofQ sv on the Index Efficiency 87 (a) using the OR Dataset (b) using the PT Dataset (c) using the MA Dataset (d) using the LA Dataset Figure 6.9: The Impact of Varying Query Visual Range ofQ sv on the Index Effectiveness 6.3.2 Experimental Results Baselines vs. 
Hybrid As shown in Figure 6.7a, among the baselines, SRI outperforms VRI generally because of the street-image visual locality phenomenon. However, using the LA dataset, VRI achieved a better performance because of the large size of the spatial range inQ sv compared to the size of the spatial region covered by the LA dataset, as well as, the high spatial density of the dataset. Moreover, all hybrid indexes outperform the base- lines because of using both image properties in pruning the search space in addition to the hybrid organization in these indexes. Having more street images in a local region potentially increases the similarity among the images in that region; hence, increasing the locality of similar visual features. Therefore, across all datasets, the performance of ASV improves when the spatial density of the dataset is higher. With respect to 88 (a) using the OR Dataset (b) using the PT Dataset (c) using the MA Dataset (d) using the LA Dataset Figure 6.10: The Impact of Varying Query Spatial Range ofQ sv on the Index Efficiency SRI, the speedup factors of PSV, ASV, and CSV reached up to 2.1x, 4.8x, and 5.6x, respectively usingLA. Meanwhile, with a low spatially dense dataset (MA), the speedup factors of PSV, ASV, and CSV were 1.2x, 1.6x, and 1.8x, respectively. On the other hand, as shown in Figure 6.7b,SRI always achieved a perfect recall score because it does not use I.v 0 for the image organization, while the other indexes do; hence, obtaining a lower recall. Regarding the hybrid indexes, both ASV and CSV achieved better recall scores (reaching up to 0.94) compared toPSV because of prioritizing packing the images located in the same neighborhood together in the same node. Moreover, CSV utilizes the design of having multiple tight visual MBRs per node to improve its performance compared to that of ASV while still achieving a similar recall score to that of ASV. 89 (a) using the OR Dataset (b) using the PT Dataset (c) using the MA Dataset (d) using the LA Dataset Figure 6.11: The Impact of Varying Query Spatial Range ofQ sv on the Index Effectiveness (a) Efficiency (b) Effectiveness Figure 6.12: The Effect of Varying # of Clusters for CSV Varying Query Visual Range Figures 6.8 and 6.9 show the evaluation of varying the visual range while fixing the spatial range ofQ sv in terms of efficiency and effectiveness, respectively. Among the baselines, the performance of VRI degrades when increasing the visual range because it uses only the visual properties for pruning the search space. Meanwhile, the performance 90 (a) Efficiency (b) Effectiveness Figure 6.13: The Impact of Varying Weight of image spatial properties (α) on ASV (a) Efficiency (b) Effectiveness Figure 6.14: The Impact of Varying Dimensionality of Data Space of I.v 0 on ASV (a) Efficiency (b) Effectiveness Figure 6.15: The Impact of Varying Large-scale Dataset (SF) of the hybrid indexes decreases slightly when increasing the visual range because these indexes enable pruning the search space using both the visual and spatial properties of images. Moreover, the decrease in the performance of ASV is less compared to that of PSV because ASV benefits from the impact of the street-image visual locality phenomenon on the image organization. Regarding the retrieval accuracy, the recall 91 scores of all hybrid indexes as well as VRI decrease when increasing the visual range because of the potential increase in the false-miss rate (i.e., false-negatives) resulted from the use of I.v 0 . 
Varying Query Spatial Range Unlike the case of varying visual range, when increasing the spatial range (see Fig- ure 6.10), the performance of SRI degrades significantly, while the performance of VRI remains steady. Furthermore, the performance of the hybrid indexes (especially ASV and CSV) decreases slightly when increasing the spatial range because of two reasons: a) these indexes use the spatial and visual properties for pruning the search space, and b) in the case of street images the opportunity of finding similar images located outside the spatial proximity of a query image is potentially low, and subsequently, increasing the spatial range while fixing the visual range ofQ sv does not lead to accessing additional nodes. In terms of the retrieval accuracy (see Figure 6.11), all hybrid indexes showed a slight decrease in the recall scores when increasing the spatial range compared to the case of increasing the visual range since the impact of using low-dimensional visual space on the query result is higher when changingQ v rather thanQ s . Varying # of Clusters (k) As shown in Figure 6.12a 13 , the performance of CSV improves when increasing the number of clusters until reaching a steady state in which adding more clusters does not affect the performance. The stability state of CSV depends on the size of the dataset whereCSV can reach its steady state using a smaller number of clusters for a small-scale dataset compared to a large-scale one. Using a large-scale dataset, the structure ofASV (the base of CSV) is potentially deep where more internal nodes are generated. The internal nodes located in the high levels of the tree contain a large number of images; hence, the visual boundary of such nodes tends to be loose unless a large value of k 13 When k=1, CSV is ASV because each node contains one visual MBR. 92 is used to produce multiple tight boundaries. Meanwhile, the recall of CSV decreases slightly when increasing the number of clusters as shown in Figure 6.12b. This decrease is caused by increasing the search space based on dimension-reduced visual properties which is potentially associated with the false-miss issue (i.e., false-negatives). Impact of Varying α on ASV Varying the weighting parameter of ASV enables organizing images in a way analo- gous to three other indexes: SRI (whenα = 1.0),VRI (α = 0.0), andPSV (whenα = 0.5). However, the performance of ASV when α = 1.0 can be better than that of SRI because ASV benefits from the visual MBRs associated with nodes to avoid traversing the nodes which do not satisfyQ v . Similarly, when α = 0.0, ASV can achieve a better performance than that of VRI. Interestingly, when α = 0.5, we observed that ASV shows a better performance than PSV because ASV transforms the calculation of the optimization criteria into another space that unifies the properties of both spatial and visual descriptors using the normalization technique; hence achieving a better organiza- tion. Moreover, as shown in Figure 6.13a, the performance of ASV increases gradually when increasing α because of the increase in utilizing the street-image visual locality phenomenon. However, the effect of the increase of α is inverted at a specific point since increasing it further means disregarding the importance of the visual properties in the image organization. The inverting point depends on the spatial density of the street image dataset of interest. 
In a spatially-dense dataset (e.g.,LA), the inversion may occur after a larger value ofα than that of a spatially-sparse dataset (e.g., MA). On the other hand, the recall of ASV (see Figure 6.13b) increases correspondingly with the increase of α value since it ultimately leads to avoiding the false-miss issue caused by I.v 0 and organizing the images in favor of their spatial properties. 93 Varying Dimensionality of Data Space of I.v 0 As shown in Figure 6.14, when increasing the dimensionality of I.v 0 , the perfor- mance of ASV decreases because of the potential increase in the overlap between the visual MBRs of the tree nodes. Meanwhile, the recall of ASV increases when increas- ing the dimensionality of I.v 0 because of enriching the representation of the generated low-dimensional visual descriptors; hence reducing the likelihood of encountering the false-miss issue. Scalability As shown in Figure 6.15, the hybrid indexes proved to be scalable on a large-scale dataset (e.g.,SF) while preserving the same observations discussed earlier. In particular, with the default query settings, ASV achieved a speed-up factor of 15x (w.r.t. SRI) and a recall score of up to 0.94. Moreover, CSV gained the benefit of the large depth of the constructed tree to obtain additional speed up, reaching up to 25x 14 . 14 The discussion of the impact of varying the spatial range is omitted since it is similar to the previous discussion of Figures 6.10 and 6.11. 94 Chapter 7 Smart-City Applications using Spatial-Visual Search In this chapter, we focus on developing several smart-city applications that can ben- efit from spatial-visual search. Typically, image-based smart-city applications are built using machine learning algorithms that require a trained model on a set of images. Such images should be labeled with annotations related to a specific application. For exam- ple, when developing an application for classifying the level of street cleanliness for a government (e.g., Los Angeles Sanitation Department), the first fundamental problem is to collect a set of images, and then, annotate them into categories based on an image classification (e.g., bulky item, illegal dumping, encampment, overgrown vegetation, and clean) proposed by experts in the field. Subsequently, the labeled images are used to train a machine learning model. Since the common wisdom in machine learning says that “adding more training data makes models better”, the acquisition of more anno- tated images are needed. Since labeling more data is a laborious and time-consuming task, onesolutionforobtainingmoreannotatedimagesistoperformspatial-visualsearch on the massive amounts of geo-tagged images available on the Internet (e.g., Flickr [1] and Google Street View [2]). In particular, spatial-visual search can be used for retriev- ing more similar images and located in the same neighborhood of a query image (i.e., a sample image that is collected previously). When setting the visual similarity threshold too strict, images that potentially have the classification label as the query image are retrieved. In this way, the size of the set of training images is enlarged by including more similar images which are also in the same neighborhood of the original sample images. 95 This mechanism can be considered as a form of image augmentation which aims at gen- erating synthesized images based on an original image using several image processing techniques (e.g., rotation and flipping). 
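As a concrete illustration of enlarging a training set through spatial-visual search (as opposed to purely synthetic augmentation), the label-propagation idea can be sketched as a small loop over the already-labeled seed images; spatial_visual_search is a hypothetical handle to any of the spatial-visual indexes from the previous chapters, and the two threshold values are placeholders rather than tuned settings.

```python
from typing import Callable, Dict, Iterable, List, Tuple

def augment_training_set(
    seeds: Iterable[Tuple[int, str]],                 # (seed image id, class label)
    spatial_visual_search: Callable[[int, float, float], List[int]],
    sigma_v: float = 0.5,                             # strict visual threshold (placeholder)
    sigma_s: float = 100.0,                           # small spatial radius in meters (placeholder)
) -> Dict[int, str]:
    """Propagate each seed's label to nearby, highly similar images.

    spatial_visual_search(image_id, sigma_s, sigma_v) is assumed to run a
    spatial-visual range query over a large geo-tagged corpus (e.g., Flickr or
    Google Street View imagery) and return the ids of the matching images.
    """
    labeled: Dict[int, str] = {}
    for seed_id, label in seeds:
        labeled[seed_id] = label
        for match_id in spatial_visual_search(seed_id, sigma_s, sigma_v):
            # Keep the first (seed-provided) label if an image is reached twice.
            labeled.setdefault(match_id, label)
    return labeled
```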
However, spatial-visual search enables retrieving similar real images that potentially render the scene of the query image but from various viewing directions or different weather conditions (e.g., cloudy, sunny). Therefore, this chapter presents three smart-city applications 1 : street cleanliness classification [14, 39, 100], material recognition [22, 40], and road damage detection [117, 23, 170]. Nonetheless, there are many other smart-city applications (e.g., traffic flow analysis [59, 36, 91] and situation awareness of events [113, 104, 116, 12]) that can benefit from spatial-visual search. 7.1 Image Classification to Determine the Level of Street Cleanliness: A Case Study Urban streets are littered with waste deposits from natural sources and human activi- ties, suchastreeleaves, dumpeditems(e.g., furniture), andscatteredtrash. Accordingto the “Let’s do it” organization 2 , there are 100 million tons of wastes dumped on streets around the world. Waste deposits negatively impact the public health, environment, economy, and tourism. Therefore, many cities devote a large annual budget and effort to enhance the cleanliness of their streets. Different cities have adopted various approaches to enhance street cleanliness. Some are monitoring waste bins using various sensors [118] and image-based techniques [69]. Some others devise a street-rating system based on samples of geo-tagged images col- lected and evaluated by trained observers. An observer labels an image with a category (i.e., manual classification) which reflects the cleanliness level of a street. For example, 1 We proposed a visionary platform for smart city, TVDP [88], that includes spatial-visual indexes for managing geo-tagged images and also a novel crowd-based image learning framework using edge computing [57]. 2 http://test.letsdoitworld.org/about 96 New York adopted a rating system referred to as ScoreCard [134]. Los Angeles developed a smartphone app [97] which is used by city employees to collect street images and assess them manually based on a rating system. Then, street-rating systems enable identifying waste hotspots and help managing cleanup crews. As a result, it was reported that the percentage of unclean streets had been reduced by 82% in Los Angeles [97]. However, manual assessment of images has its limitations due to the cost of human labor and time. Hence, developing automatic classification of street scenes is essential for an effi- cient analysis of geo-tagged images collected by the city employees and, furthermore, scaling up the classification process of large-scale crowdsourced images by the public 3 . The focus of this study is automating the classification of street scenes based on their cleanliness level using a big real dataset of geo-tagged images from Los Angeles Sanitation Department (LASAN). A naive and straightforward approach is to generate one trained model with all images in a dataset. Towards this end, we investigate various image features and classifiers (e.g., SVM) to identify the best features and classifier to label an image based on predefined levels of street cleanliness. Furthermore, we observe that street scenes widely vary depending on their locations across a city (e.g., a street scene in downtown Los Angeles usually includes tall buildings with less vegetation while a street scene in Beverly Hills may include tall trees with no highrises) which might affect classification accuracy. 
Hence, we propose a classification scheme which aims at generating a set of efficient trained models to overcome the complexity of the diverse street views associated with different scene locations, a complexity that degrades the accuracy of the classification mechanism. Our methodology relies on spatial partitioning techniques and utilizes existing machine learning classifiers to construct a local trained model per partition. Our experimental results show that the accuracy of local trained models generated using geo-spatial partitioning techniques outperforms that of one global trained model. These partitioning techniques have a set of parameters which affect the accuracy of the local trained models.
(Footnote 3: The public can participate in collecting geo-tagged images using the spatial crowdsourcing mechanism [86, 13].)
Table 7.1: LASAN's Categories of Street Scenes based on the Cleanliness Level
Bulky Item: There are some big items (e.g., couch, desk, mattress, and tire) thrown on a street.
Illegal Dumping: There is a pile of littered waste.
Encampment: There is a tent inhabited by people living on the street.
Overgrown Vegetation: There is extra vegetation on a street and sidewalk.
Clean: The street is clean.
Figure 7.1: Image Examples for the Categories of Street Scenes based on the Cleanliness Level, with panels (a) Bulky Item, (b) Illegal Dumping, (c) Encampment, (d) Overgrown Vegetation, and (e) Clean.
7.1.1 Image Dataset and Background on Classifiers
Image Dataset
Definition 16 (Street-scene Geo-tagged Image). An image I captures a street scene positioned at a geo-location (i.e., longitude and latitude). Hence, I is represented by two attributes: a visual scene of the image I.v and a spatial location I.s.
Each image I (see Definition 16) is represented by two descriptors: its visual scene I.v and its location I.s. Based on the visual scene of an image I.v, the image is labeled with one of the five predefined image categories in this study: bulky item, illegal dumping, encampment, overgrown vegetation, and clean. LASAN uses these image categories in practice, where each category implies the required resources and equipment to clean the area captured by the image (the categories were defined by experts in the field, i.e., LASAN, and thus can be reused by other cities to develop their own frameworks). Table 7.1 contains the description of each category in detail and Figure 7.1 shows image examples of these categories.
Background on Classifiers
Extracted image features are fed into classifiers to learn models for categorizing street-scene images. Here, we review a set of well-known relevant classifiers. First, the k-Nearest Neighbors (kNN) classifier is the simplest one; it relies on the kNN search over the training dataset, and the image class is identified by the majority vote of its kNN. Second, Naive Bayes is a probabilistic classifier based on Bayes' theorem which calculates a posterior probability for each class at prediction time. Third, the Support Vector Machine (SVM) is designed for binary classification, constructing a hyperplane which divides the two classes with the largest margin. For multi-class classification, SVM is generalized in two schemes, one-versus-all and one-versus-one. Our work considers the one-versus-one scheme, in which each pairwise class group is trained as a two-class SVM. Fourth, the SoftMax classifier is a generalization of the binary Logistic Regression classifier [58]; SoftMax converts the unnormalized outputs of a linear model into normalized class probabilities.
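As a concrete reference for the non-tree classifiers reviewed above (the tree-based ones are described next), a minimal scikit-learn sketch follows; the feature matrix X and label vector y are assumed to be precomputed image descriptors and cleanliness categories.

# A minimal scikit-learn sketch of the classifiers reviewed above.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

classifiers = {
    "kNN": KNeighborsClassifier(n_neighbors=5),                   # majority vote of the k nearest images
    "NaiveBayes": GaussianNB(),                                    # posterior probabilities via Bayes' theorem
    "SVM": SVC(kernel="linear", decision_function_shape="ovo"),    # one-versus-one multi-class SVM
    "SoftMax": LogisticRegression(solver="lbfgs", max_iter=1000),  # multinomial (SoftMax) logistic regression
}

def train_all(X, y):
    """Fit every classifier on the same training features and labels."""
    return {name: clf.fit(X, y) for name, clf in classifiers.items()}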
There are also well-known tree-based classifiers, such as the Decision Tree (DT). DT organizes a series of test conditions in a tree structure [130], where the internal nodes represent criteria on the attributes of each class and the leaf nodes are associated with class labels. Random Forest combines a set of DTs, and the classification output is defined by the majority vote among the trees [135]. Another extension of DT is AdaBoost, which builds a set of DTs adaptively; in the training phase of AdaBoost, the misclassification of an instance is used to build a new, better-optimized tree.
7.1.2 Approaches
This section describes two proposed approaches for street scene classification; the first considers only the visual scenes of the image dataset (i.e., I.v) while the second considers both visual scenes (I.v) and geographical properties (I.s), since street scenes vary across different regions in a city.
Global Classification Scheme (GCS)
Among the classifiers discussed earlier, the GCS approach constructs one single trained model using one of the well-known classifiers. The classifier learns the image features throughout the overall geographical region of a dataset. Thus, this approach does not consider the geo-properties of images and forms the baseline approach in this study. In the experiment section, we discuss the impact of the choice of both image features and the classifier on the street scene classification problem.
Geo-spatial Local Classification Scheme (LCS)
GCS suffers from data noise caused by the variety of street scenes. Hence, constructing a local trained model per sub-region enhances learning the visual characteristics of the surrounding areas in a sub-region. In particular, each local trained model focuses on learning the features of the categories without distraction caused by the features of different street views. Hence, the probability of correct image classification increases. Constructing an efficient local trained model for a sub-region requires addressing two issues: having homogeneous views of streets in a sub-region and assuring a sufficient number of images in each sub-region for training purposes. Here, we explored two partitioning techniques to recognize sub-regions using the geospatial properties of images: Grid and Bucket Quadtree [138]. Using Grid, the entire region of a dataset is divided into fixed equal-sized cells (Size). This mechanism is simple but may result in an imbalanced data distribution. Meanwhile, Quadtree partitions the region adaptively into four sub-regions until no sub-region contains more than a certain fixed number of images (Cap Max). This mechanism produces sub-regions with varying area sizes, and a large sub-region potentially contains heterogeneous street scenes.
Design Approach: Optimally, we could create one local trained model per cell (i.e., sub-region) generated by Grid or Quadtree. However, Grid varies in the number of images per cell and Quadtree varies in the size of its cells. Moreover, to assign a cell to a classifier, the cell should contain a sufficient number of images, and the size of the cell should not be too large, to avoid containing heterogeneous street scenes. Thus, we add constraints on creating a local model. A Grid cell should have at least a certain number of images (Cap Min) and contain images from all street-cleanliness categories; a small sketch of this cell-selection test follows.
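The cell-selection test can be sketched as follows; the constant names and the (image, category) tuple layout are illustrative assumptions.

# A small sketch of the constraint check for giving a partition cell its own local model:
# the cell must hold at least Cap_Min images and cover all five cleanliness categories.
# (Quadtree cells must additionally satisfy the maximum-size constraint described next.)
CATEGORIES = {"bulky item", "illegal dumping", "encampment", "overgrown vegetation", "clean"}

def is_trainable_cell(cell_images, cap_min=1000):
    """cell_images: list of (image_id, category) pairs whose locations fall inside one cell."""
    if len(cell_images) < cap_min:
        return False                         # too little data for a reliable local model
    present = {category for _, category in cell_images}
    return CATEGORIES.issubset(present)      # every cleanliness category must be represented

# Cells that fail this test are pooled together and served by one separate, unified trained model.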
The size of a selected Quadtree cell should not exceed the threshold value (Size Max ) while satisfying the other constraints on Grid cells. The cells which do not satisfy these criteria are assigned to a separate unified trained model. 7.1.3 Experiments (a) W/ Various Classifiers and Image Features (b) W/ and Various Features w.r.t. Image Cat- egories (c) W/ Various Classifiers and CNN Features w.r.t. Various Measures Figure 7.2: The Effectiveness of the Global Classification Scheme (GCS) Approach 101 (a)W/VariousApproachesandImageFeatures (b) W/ Varying the Cell Size of Geo-spatial LCS - Grid (c)W/VaryingtheCellCapacityofGeo-spatial LCS - Quadtree Figure 7.3: The Effectiveness of the Geo-spatial Local Classification Scheme (LCS) using Dataset and Classifiers We received around 22K geo-tagged street-scene images from LASAN. These images were correctly labeled by LASAN experts based on the image categories for street clean- liness described in Table 7.1 and the distribution of images among categories is shown in Table 7.2. Due to the imbalanced distribution across categories, a trained model can be biased to one of the categories. To overcome the problem of an imbalanced dataset in machine learning, researchers investigated various solutions [74]. One solution is a data- level approach by generating extra synthesized dataset through sampling mechanisms, which we adopted. In particular, we balanced only the training subset 5 by generating 5 Balancing only the training dataset for generating non-biased trained model does not distort the nature and the reality of the problem. 102 synthesized images using the Python Augmentor library [41] 6 which applies image pro- cessing techniques (e.g., cropping and rotating) on a subset of images per category to obtain a balanced distribution. Meanwhile, we did not balance the testing subset of the dataset which was used for reporting the F1 score of the classification approaches. Then, all images are processed for feature extraction. For color feature extraction, images were processed in the HSV color space, and the color histogram was divided into 20, 20, and 10 bins in H, S, and V, respectively. For SIFT-BoW, to generate the dictionary of visual words, SIFT key points were extracted from 80% of the dataset and clustered into 1000 clusters (using kMeans). For CNN, the Caffe architecture was fine-tuned using 80% of the dataset. For the classifiers adopted in our approaches, we used the Python scikit-learn [127] library. All classifiers were trained on 80% of the dataset using 10-fold cross-validation. Table 7.3 shows the parameters and constraints of the Geo-spatialLCS approach, where the default values are underlined. Table 7.2: Dataset Distribution among Image Labels Image Label # of images Bulky Item 12,315 Illegal Dumping 1,007 Encampment 886 Extra Vegetation 932 Clean 6,815 Table 7.3: Parameter Values in Geo-spatial LCS Parameter Values Size 2.5*2.5, 5*5, 7.5*7.5, 10*10, 15*15 mi 2 Cap Max 1500, 2000, 2500, 3000, 3500 images Cap Min 1000 images Size Max 10*10 mi 2 6 Our approaches are not restricted to the currently used augmentation technique and it can be easily replaced with other techniques. 103 Evaluation Results Impact of the Choice of Classifier and Image Features Figure 7.2a shows the F1 Scores of GCS for various combinations of classifier and image features. Classification with GCS achieved the best F1 score using the CNN image features due to the rich feature representation provided by CNN. 
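For concreteness, a minimal OpenCV sketch of the color-feature settings listed above (an HSV histogram with 20, 20, and 10 bins for H, S, and V); the function name is illustrative and not taken from the thesis code.

# A minimal sketch of the HSV color-histogram feature used in these experiments.
import cv2
import numpy as np

def hsv_color_histogram(image_bgr, bins=(20, 20, 10)):
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    # OpenCV hue values range over [0, 180); saturation and value over [0, 256).
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins), [0, 180, 0, 256, 0, 256])
    hist = cv2.normalize(hist, hist).flatten()        # normalize and flatten
    return np.asarray(hist, dtype=np.float32)         # 20 * 20 * 10 = 4000 dimensions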
Returning to Figure 7.2a, among the used classifiers, SVM achieved the best F1 score with both SIFT-BoW and CNN, obtaining scores of 0.64 and 0.83, respectively, while AdaBoost achieved the best F1 score using the color histogram with a score of 0.78. Since GCS shows the best F1 score when using SVM with these two types of image features, we further studied its F1 score across all image categories for street cleanliness. As shown in Figure 7.2b, GCS-SVM achieved an F1 score higher than 0.80 in all image categories, obtaining the highest score with the "Overgrown Vegetation" category and the lowest score with the "Encampment" category. Across all image features, GCS obtained its lowest F1 score with the "Encampment" category because "Encampment" images may be confused with "Bulky Item" images. Figure 7.2c provides the detailed effectiveness of the SVM classifier using various metrics (precision, recall, and F1 score) based on the CNN image features. Across all categories, both precision and recall showed good values. Hence, we chose to use the F1 score since it depicts the harmonic mean of the precision and recall values.
The Impact of the Geo-spatial Local Classification Scheme
Figure 7.3a shows the F1 score of the two mechanisms of Geo-spatial LCS compared with GCS, using the SVM classifier while varying the image features. Overall, both variants of Geo-spatial LCS achieved a better F1 score than that of GCS, demonstrating that Geo-spatial LCS generates a set of regions where each region contains street images that share similar visual characteristics; thus it overcomes the impact of scene heterogeneity on street cleanliness classification.
The Impact of Varying the Cell Size of Grid
Figure 7.3b shows the F1 scores of the Geo-spatial LCS - Grid approach using the SVM classifier while varying the cell size (i.e., Size) of the Grid, compared with GCS. In general, Geo-spatial LCS - Grid achieved a better F1 score than that of GCS when a grid cell is small. When creating a local trained model for a small cell (which has sufficient data based on the Cap Min constraint), the trained model can distinguish the homogeneous visual characteristics of a small region, thus increasing the certainty of detecting waste objects. Meanwhile, a grid with large cells generates sub-datasets with heterogeneous street views; hence the F1 score of the Geo-spatial LCS approach decreases. For example, a cell of size 15×15 mi2 covers the area of both downtown Los Angeles and its neighborhood, and thus potentially contains street scenes with various visual characteristics. For our dataset, the approach obtained its best F1 score, reaching 0.90, when Size is 2.5×2.5 mi2.
The Impact of Varying the Cell Capacity of Quadtree
Figure 7.3c shows the F1 scores of the Geo-spatial LCS - Quadtree approach using the SVM classifier while varying the bucket capacity of the Quadtree, compared with GCS. In general, the F1 score of Geo-spatial LCS - Quadtree increases when increasing Cap Max because the number of images available for learning increases for each trained model. In particular, it was not able to achieve better results than GCS with a small bucket capacity (e.g., 1500). However, Geo-spatial LCS - Quadtree obtained a better F1 score than that of GCS, reaching up to 0.88 when Cap Max = 3500.
7.2 Recognizing Material of A Covered Object: A Case Study with Graffiti
Another smart application is to understand the materials of objects in a scene. Material recognition has been widely investigated [102, 76, 129, 56, 40, 159].
However, materials are sometimes covered by incidental "covers" (referred to as covered materials), such as graffiti drawn by people or property damage caused by natural disasters. These covers distort the visual characteristics of the underlying materials and hence make the recognition of covered materials challenging. Nonetheless, covered material recognition is essential for developing efficient management solutions in smart cities (e.g., recognizing the materials of damaged properties for post-disaster management, recognizing materials covered by graffiti for removal planning, and accurately estimating building energy consumption). Technically, covered material recognition comprises two sub-tasks: detecting a cover as well as recognizing its underlying material. Utilizing a special device, such as a hyperspectral imaging camera (based on measuring reflectance), for material recognition [148] can solve the task of covered material recognition accurately; however, it is expensive and impractical for large-scale monitoring. Alternatively, advances in computer vision enable using user-generated images from regular cameras (see Footnote 7). However, the state-of-the-art image-based approaches [40, 159] for plain material recognition may be inefficient for the covered material recognition problem. Therefore, we propose a class of learning approaches. The first approach generates a learned model that treats a cover and its underlying material as a unified object. The second approach cascades two models (cover detection and material recognition) to tackle the two sub-tasks of the problem separately. These two approaches are enhanced with some heuristics. To evaluate our approaches, this section focuses on recognizing materials covered by graffiti as a case study, using a big real image dataset from the Los Angeles Sanitation Department (LASAN). The source code of our approaches is available at https://github.com/dweeptrivedi/covered-material-recognition.
(Footnote 7: Such images can be collected by spatial crowdsourcing mechanisms [13]. User-generated images are usually tagged with their locations (if not, they can be localized [18]). Hence, to measure whether the available visual data is sufficient for analyzing a geographical region, spatial coverage models [16] can be used.)
7.2.1 Problem Definition and Dataset
Figure 7.4: Image Examples for Materials Covered by Graffiti, with panels (a) Stone Wall, (b) Wooden Fence, (c) Stone Road, (d) Metal Box, (e) Wooden Tree, and (f) Glass Window.
In nature, materials can be categorized into various types. For example, FMD [142] uses a material classification of 10 types (i.e., fabric, foliage, glass, leather, metal, paper, plastic, stone, water, and wood) while the MINC classification [40] consists of 23 types. In our work, we consider only 5 types of materials which can be easily found on urban streets: stone, wood, metal, glass, and fabric. One important observation in recognizing materials in real-world scenes is that they may be visually covered by various covers such as graffiti and damage. Such visual covers distort the characteristics of their underlying materials, and hence make their recognition challenging.
Definition 17 (Covered Material). A covered material is a material m that is covered partially with a cover c that distorts the visual appearance of m.
Definition 18 (Covered Material Recognition).
Given an image I, a certain type of cover c k , and t types of materials (i.e., M = {m 1 , m 2 , ..., m t }), the covered material recognition is to detect the region in I which contains a cover of type c k and classify the type of its underlying material m j ∈M. The covered material recognition (see Definitions 17 and 18) can be used for many applications (e.g., energy consumption calculation and inspection of buildings). Our case study is on graffiti removal application to prepare cleanup equipment for the removal process which requires detailed information about the covered material and object before physicallyarrivingonthesite. Forexample,thegraffitiremovalonatreeisquitedifferent from that on a metal door. Therefore, our approach considers some common types of street objects 8 . In particular, given t types of materials (i.e., M = {m 1 , m 2 , ..., m t }) and n types of objects (i.e., O = {o 1 , o 2 , ..., o n }), the extended material classification includes at maximum t×n classes. In this case study, we consider 9 common object types on streets (i.e., box, tree, wall, door, pole, fence, window, stairs, and fire hydrant) along with the previously mentioned 5 material types. In our case study, we focus on 18 classes of the extended material clas- sification since many of the 5×9 classes may be discarded because of being unreasonable (e.g., metal tree) or due to the lack of example images. Some image examples are shown in Figure 7.4. Designing an image-based algorithm for the covered material recognition requires an image dataset as described in Definition 19. The annotations of D follow the extended materials classification. Definition 19 (Covered Material Image Dataset). The dataset D is composed of k images (D ={I 1 ,I 2 ,...I k }) and each image is annotated with at least one region which displays a covered material of type m j ∈M. 8 It was shown by Hu et al. [76] and Zheng et al. [173] that material recognition can be improved by incorporating both material and object recognition. 108 7.2.2 Proposed Approaches Recognizing covered materials is different from the plain material recognition prob- lem. In particular, the covered material recognition is associated with two challenges. First, the visual characteristics of the underlying surface material are partially hidden or distorted due to the existence of a cover. Second, it implies learning the visual char- acteristics of two distinctive units: cover and underlying material. To address these challenges, we propose the following approaches. Straightforward Learning Approach One straightforward approach is to consider a cover with its corresponding material togetherasoneunifiedobject(referredtoas one-phase learning approach (OLA)). Then, each type of unified objects is treated as a unique one. Subsequently, one of the state- of-the-art object detection algorithms (e.g., YOLO [132]) can be trained to learn the combined visual features of both the cover and its underlying material for various image regions displaying covered materials (see Figure 7.5). Another approach is to divide the learning into two distinct phases. The first phase focuses on learning only the characteristics of the cover regardless of the types of under- lying materials, while the second phase includes another model which learns the dis- tinguishing characteristics of each type of covered materials. 
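A minimal sketch of the two-phase cascade just described follows; the detector and classifier arguments are hypothetical wrappers around the trained models rather than the API of a specific library.

# A minimal sketch of the two-phase cascade: detect the cover first, then classify the
# material underneath it. `cover_detector` and `material_classifier` are assumed wrappers
# around the two trained models described in the text.
def recognize_covered_materials(image, cover_detector, material_classifier,
                                min_confidence=0.5):
    """Return (bounding_box, material_label) pairs for a single image."""
    results = []
    for box, score in cover_detector(image):      # phase 1: detect cover (e.g., graffiti) regions
        if score < min_confidence:
            continue                               # discard low-confidence detections
        x1, y1, x2, y2 = box
        crop = image[y1:y2, x1:x2]                 # tightly bounded covered region
        material = material_classifier(crop)       # phase 2: classify the underlying material
        results.append((box, material))
    return results
# Any error in phase 1 propagates to phase 2, which is why the cascade starts with the
# single-cover detector rather than with a recognizer over many material classes.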
This approach is referred to as a two-phase learning approach (TLA) (see Figure 7.6), and it consists of two models: a detection model to detect the portion of an image which potentially contains a cover (e.g., graffiti), and a classification model which classifies the covered material in the detected portion. The proposed sequence of the cascaded models can be reversed (i.e., a model for recognizing materials followed by another model to verify whether each of the recognized materials is covered or not). However, the effectiveness of TLA is mainly affected by the accuracy of the first model due to error propagation. Hence, our proposed sequence for TLA, which starts with a model for detecting one type of cover, is potentially more accurate than the reversed sequence of TLA, which starts with a model for recognizing a large number of materials. Subsequently, we chose the sequence which potentially minimizes the impact of error propagation.
Figure 7.5: One-Phase Learning Approach (OLA)
Figure 7.6: Two-Phase Learning Approach (TLA)
Learning Approach with Heuristic Expansion
A potential limitation of both OLA and TLA is the fact that they are built based on the visual characteristics of a tightly bounded portion of the detected cover, which potentially does not display sufficient visual cues of the covered material. Hence, learning the visual features of the underlying materials may suffer. To overcome this limitation, one can use an enlarged region rather than the exact portion of a cover to better learn the fundamental visual characteristics of the underlying material. This expanded learning can be performed heuristically using two methods: proportional and semantic expansion. The proportional expansion enlarges the learning region (which contains a covered material) by a factor (e.g., 1.3x) referred to as the enlargement factor (γ). Even though this method is simple and straightforward, it may result in noisy learning due to potentially including materials other than the covered one (i.e., the material of interest). Hence, the value of the enlargement factor should be chosen carefully to avoid such a problem. Alternatively, the expansion can be performed utilizing awareness of the visual context (referred to as semantic expansion). The semantic expansion can use a segmentation algorithm [52] that partitions an image into a set of segments, where each segment conveys a meaningful unit that is distinct from the others. Therefore, the learning region is expanded to the segment which contains the region of a covered material. The selected segment potentially conveys similar visual characteristics of the covered material, hence enhancing the learning approach. Both expansion methods can be employed with OLA (referred to as OLA with Proportional Expansion (OLA−PE) and OLA with Semantic Expansion (OLA−SE)). Similarly, these methods are used with TLA too (i.e., TLA−PE and TLA−SE).
Table 7.4: Graffiti Dataset Distribution among the Labels of Covered Materials (# images per label)
Stone Wall 10,153; Wood Wall 167; Metal Wall 122; Wood Fence 270; Metal Fence 261; Fabric Fence 284; Stone Pole 227; Wood Pole 186; Metal Pole 482; Stone Road 3,409; Wood Door 350; Metal Door 580; Metal Sign 989; Metal Box 735; Metal Fire Hydrant 102; Stone Stairs 116; Wood Tree 275; Glass Window 265.
Figure 7.7: Proposed Approaches Evaluation, with panels (a) Baseline vs. Proposed Approaches, (b) OLA−PE w/ Varying γ, and (c) OLA−PE w/ Varying C.
7.2.3 Experiments
Dataset, Settings, and Baseline Approach
Dataset. Our dataset consists of 19,000 graffiti images (D) collected by LASAN. Every image was tagged with one or more ground-truth boxes, where each box is the minimum rectangle surrounding a graffiti spot and is annotated with one of the 18 types of covered materials. These images were correctly labeled by LASAN experts based on the classification of covered materials in order to automate the graffiti removal process. The distribution of images among the classes of covered materials is shown in Table 7.4. Note that the image distribution is imbalanced and hence may result in a biased trained model. Thus, using sampling mechanisms, additional synthesized images were generated and added to overcome the imbalance problem. In particular, we used the Python Augmentor library [41], which applies some image processing techniques (e.g., sharpening, affine transformation, rotation, and contrast normalization) to the categories with low frequency (Footnote 9).
Implementation Settings. To implement our approaches, we created a detection model (for OLA and the first model of TLA) by fine-tuning the darknet53 model using the YOLOv3 framework (specifically, using stochastic gradient descent (SGD) with a batch size of 128 and a learning rate of 0.001). For implementing the classification model in the second phase of TLA, we used the Caffe architecture [81] to fine-tune the GoogLeNet model [147] on our dataset (using SGD with a batch size of 64 and a learning rate of 0.001). Fine-tuning was performed by initializing the network with the weights of the pre-trained model, and then all layers were fine-tuned. Moreover, the models were trained using 90% of the dataset and tested using the remaining subset. For a thorough evaluation of our solution, we conducted several experiments by varying two main parameters: a) the minimum confidence threshold (C), and b) γ. The default values of C and γ were 0.5 and 1.3x, respectively. To evaluate our approaches, we report the mAP (i.e., mean average precision) scores of our results.
(Footnote 9: The augmentation does not change the nature of the problem since it was only applied to the training dataset to improve the trained model.)
Baseline Approach. We implemented a baseline approach adapted from the state-of-the-art material recognition method [40]. For this approach, we fine-tuned the GoogLeNet model on the image segments of D (i.e., trained on our 18 classes of materials) that display "plain" materials (i.e., no graffiti). To obtain image parts of "plain" materials, we asked 8 human subjects to add annotations surrounding the entire objects containing graffiti spots (e.g., the blue box in Figure 7.8) in addition to the annotations of the graffiti spots (e.g., the black box in Figure 7.8).
Figure 7.8: Annotation Example
These annotations enable identifying and cropping the image parts that do not contain the covered portion (e.g., the blue box minus the black box in Figure 7.8), so they can be used for training the baseline approach. Note that the blue box annotations are also used to perform the semantic expansion. Thereafter, the baseline approach is evaluated using only the regions that contain materials covered by graffiti (e.g., the black box in Figure 7.8). The goal is basically to evaluate how well a model that is trained on plain materials can recognize covered materials.
Evaluation Results
Baseline vs. Proposed Approaches. Figure 7.7a shows the mAP scores of the baseline versus the proposed approaches.
In general, it is evident that an approach for plain material recognition (i.e., the baseline approach) was not suitable for recognizing covered materials. In particular, the straightforward approaches (i.e., either OLA or TLA) were able to achieve an improvement factor of 0.75x compared with the baseline. Both OLA and TLA showed similar results with a negligible difference; hence OLA is preferred because it requires less training and prediction time (Footnote 10). Moreover, OLA and TLA can be further improved by utilizing the two heuristics (i.e., proportional or semantic expansion, described in Section 7.2.2). In particular, OLA and TLA with PE and SE were able to improve the mAP score by factors of 1.6x and 1.8x, respectively. The experiments based on SE showed a better mAP score compared to those based on PE because we used the annotations surrounding the entire objects marked with graffiti to obtain an effective semantic expansion (Footnote 11).
Impact of Varying γ. Figure 7.7b shows the mAP scores of OLA−PE (Footnote 12) while varying γ and fixing C at 0.5. In general, PE with a small value of γ enables generating a better learned model for covered material recognition because the enlarged regions can properly depict additional visual cues of the underlying materials. However, as the value of γ grows, the model may gradually introduce more noise by learning the visual cues of other materials in addition to the underlying material of interest, hence degrading the effectiveness of the learned model. In our case study, the mAP scores of the OLA−PE models kept increasing for values of γ up to 1.5.
Impact of Varying C. Figure 7.7c shows the mAP scores of OLA−PE while varying C and fixing γ at 1.3x. In general, decreasing C enables the learned model to report more predictions, hence increasing the chances of reporting a correct prediction. In our study, OLA−PE showed an increasing trend of mAP when decreasing C (i.e., it obtained an mAP of 60% and 49% when using C = 0.01 and 0.6, respectively).
(Footnote 10: OLA obtained on average 16% and 28% improvement percentages w.r.t. TLA in terms of training and prediction times, respectively.)
(Footnote 11: The effectiveness of SE depends on obtaining accurate segmentation when using a segmentation algorithm (e.g., [52]).)
(Footnote 12: Hereafter, we report the results only for OLA−PE as the same trends hold for the other approaches.)
7.3 A Deep Learning Approach for Road Damage Detection
The economy of cities is essentially affected by their public facilities and infrastructures. One fundamental element of such infrastructure is the road network. Many factors (e.g., rain and aging) cause different types of road damage that seriously impact road efficiency, driver safety, and the value of vehicles (see Footnotes 13 and 14). Therefore, countries devote a large annual budget to road maintenance and rehabilitation. For example, according to statistics released by the Federal Highway Administration in the United States in 2013, the road network reached 4.12 million miles and the government allocated $30.1 billion for constructing new roads and maintaining the existing ones [9]. Efficient road maintenance requires a reliable monitoring system, and a straightforward method is human visual inspection; however, it is infeasible due to being expensive, laborious, and time-consuming. Therefore, researchers have developed various solutions for automatic road damage inspection, including vibration-based [164], laser-scanning-based [99], and image-based [168, 170, 85, 117] methods.
While detec- tion by vibration methods is limited to the contacted parts of the road, laser-scanning methods provide accurate information about the status of roads; however, such methods are expensive and require a road closure. Meanwhile, image processing methods are inexpensive but may suffer from a lack of accuracy. In spite of its immaturity, recent advancements in image analysis techniques have been producing impressive results and thus increasing their usages for various applications. Afewresearchersdevelopedimage-basedapproachesforroadsurfaceinspectionusing the state-of-the-art deep learning methods. In particular, some works focus on detecting only the existence of the damage regardless of its type [168]. Other works focus on 13 In Europe, 50 millions people are injured in traffic crashes annually [119]. The bad conditions of roads is a primary factor for traffic cashes. 14 A study from the American Automobile Association (AAA) reports that road damages have cost U.S. drivers about $3 billion annually [7]. 115 classifying the road damages into a few types. For example, Zhang et al. [170] devised an approach for detecting two directional cracks (i.e., horizontal and vertical), while Akarsu et al. [85] developed another approach for detecting three categories of damages, namely horizontal, vertical, and crocodile. Due to the fact that differentiating among damage types is critical for proper road maintenance planning, Maeda et al. [117] have implemented an approach for a thorough classification of road damage types. The focus of this study is to develop a smart city application for automating the detection of different types of road damages (proposed by Maeda et al. [117]) using smartphone images crowdsourced by city crews or the public 15 . Our approach uses one of the state-of-the-art deep learning algorithms (i.e., YOLO [132] 16 ) for an object detection task. The source code of our solution is available at (https://github.com/ dweeptrivedi/road-damage-detection). 7.3.1 Road Damage Detection Solution Table 7.5: Road Damage Types [117] Damage Type Detail Class Name Crack Linear Crack Longitudinal Wheel mark part D00 Construction joint part D01 Lateral Equal interval D10 Construction joint part D11 Alligator Crack Partial pavement, overall pavement D20 Other Corruption Rutting, bump, pothole, separation D40 Crosswalk blur D43 White/Yellow line blur D44 Image Dataset The images provided by the IEEE BigData Cup Challenge capture scenes of urban streets located in seven geographical areas in Japan; Ichihara city, Chiba city, Sumida 15 Spatial crowdsourcing mechanisms [13] can be used for collecting more images at locations which are not sufficiently covered by visual information [16] for roads inspection. 16 YOLO is chosen due to its feasibility to work on both server and edge devices (e.g., detection and classification can be done on a smartphone). 116 (a) D00 (b) D01 (c) D10 (d) D11 (e) D20 (f) D40 (g) D43 (h) D44 Figure 7.9: Image Examples for the Road Damage Types [117] ward, Nagakute city, Adachi city, Muroran city, and Numazu city. Each image is anno- tated by one or more region(s) of interest (referred to as ground truth box) and each box is labeled with one of the road damage classes proposed by Maeda et al. [117] (which is adopted from Japan Road Association [8]). As illustrated in Table 7.5, the classification of road damages includes eight types which can be generalized into two categories: cracks and other corruptions. 
The crack category is either linear or alligator cracks. The linear cracks can be longitudinal and lateral. Meanwhile, the category of the other corruptions include three sub-categories: potholes and rutting, white line blur, and crosswalk blur. Examples of such types of road damages are shown in Figure 7.9. 117 Background on Object Detection Algorithms An object detection algorithm analyzes the visual content of an image to recognize instances of a certain object category, then outputs the category and location of the detected objects. With the emergence of deep convolutional networks, many CNN-based object detection algorithms have been introduced. The first one is the Region of CNN features (R-CNN) method [67] which tackles object detection in two steps: object region proposal and classification. The object region proposal employs a selective search to generate multiple regions. These regions are processed and fed to a CNN classifier. R- CNN is slow due to the repetitive CNN evaluation. Hence, many other algorithms have been proposed to optimize R-CNN (e.g., Fast R-CNN [66]). Other than the R-CNN- based algorithms, The “You Only Look Once” (YOLO) method [132] uses a different approach and basically merges the two steps of the R-CNN algorithm into one step by developing a neural network which internally divides the image into regions and predicts categories and probabilities for each region. Thus, applying the prediction once makes YOLO achieve a significant speedup compared to R-CNN-based algorithms; hence can be used for real-time prediction. Deep Learning Approach To solve the road damage type detection problem, we consider a road damage as a unique object to be detected. In particular, each of the different road damage types is treated as a distinguishable object. Then, we use one of the state-of-the-art object detection algorithms (i.e., YOLO) to be trained on the road damage dataset to learn the visual patterns of each road damage type (see Figure 7.10). 118 Figure 7.10: A Deep Learning Approach for Road Damage Detection and Classification Table 7.6: The Distribution of Training Datasets (Original, Augmented, and Cropped) among the Road Damage Classes Dataset Road Damage Classes D00 D01 D10 D11 D20 D40 D43 D44 D 1747 2856 660 683 1801 371 587 2856 Da 1984 3267 1349 1372 1916 756 613 4060 Dc 1747 2856 660 683 1801 371 587 2856 Table 7.7: Parameter Values for Experiments Parameter Values T 20k, 25k, 30k, 35k, 40k, 45k, 50k, 55k, 60k, 65k, 70k C 0.01, 0.05, 0.1, 0.15, 0.2 NMS 0.45, 0.75, 0.85, 0.95, 0.999 7.3.2 Experiments Dataset and Settings The dataset provided by the BigData Cup Challenge consists of two sets: training images (7,231) and testing images (1,813). Every image in the training set is anno- tated by one or more ground-truth boxes where each corresponds to one of the eight types of road damages. The distribution of the training dataset among the road damage classes is shown in Table 7.6. It is evident that the distribution of the dataset among the road damage classes is not balanced where some classes (i.e., D10, D11, D40, and D43) have smaller numbers of images compared to the other classes. Hence, we used the Python Augmentor library (https://augmentor.readthedocs.io/en/master/) to generate synthesized images for training images that contain such classes 17 . While augmentation, 17 Since original training images may contain multiple ground-truth boxes, the augmented images have changed the number of images for all classes. 
119 we carefully selected the image processing techniques (e.g., brightening, gray-scale) pro- vided by the Augmentor tool to assure that the road damage scenes were not affected 18 . Another technique that we considered in processing the training dataset is cropping. We cropped every image to create a smaller size of the original image while making sure that the cropped image contains the annotated regions. Such a dataset enables the training model to focus on learning the features of regions of interest properly by discarding irrelevant scenes (e.g., sky view). Consequently, we had three training datasets: original (D), augmented (D a ), and cropped (D c ). To create an object detector, we fine-tuned the darknet53 model using the YOLO framework (version 3). The model was trained using the road damage classes. To experiment our solution thoroughly, we generated different versions of the trained models using each dataset (D, D a , and D c ) by varying three parameters: # of iterations for model training (T), minimum confidence threshold (C), and non-maximum suppression (NMS) 19 . Every model was trained up to 70k iterations and we preserved a snapshot of thetrainedmodelatcertainnumbersofiterations. Foreverytestimage,YOLOgenerates a set of predicted boxes and each box is tagged with a prediction confidence score and a predicted label. To avoid reporting the boxes tagged with very low prediction confidence score, we discarded the boxes whose confidence scores were below C. Furthermore, increasing the value of NMS (default value 0.45) increases the number of overlapped predicted boxes; hence increasing the chances of correctly predicting a ground-truth box. The values of these three parameters are listed in Table 7.7. In what follows, we report the F1 score of our results 20 to evaluate our solution. 18 Some image processing techniques, such as rotating, may lead to a confusion (e.g., vertical vs. horizontal cracks) among road damage types. 19 Another parameter is the intersection of union (IOU). However, it was fixed at 0.5 per the challenge rules. 20 Since the provided test images do not have ground-truth boxes, we only reported F1 scores that were calculatedusingthewebsiteoftheroaddamagedetectionchallenge(https://bdc2018.mycityreport.net/). 
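To illustrate the augmentation constraint described in the settings above (photometric operations only, so that damage regions and their ground-truth boxes are not displaced), a small Augmentor sketch follows; the operation names assume a recent Augmentor release and the directory path is illustrative.

# A small Augmentor sketch for balancing the under-represented road damage classes
# (e.g., D10, D11, D40, D43) with photometric operations only; geometric operations
# such as rotation are avoided because they could turn a longitudinal crack into a
# lateral one and thus change the class label.
import Augmentor

pipeline = Augmentor.Pipeline("train/underrepresented_classes")   # illustrative path
pipeline.random_brightness(probability=0.5, min_factor=0.7, max_factor=1.3)
pipeline.random_contrast(probability=0.5, min_factor=0.8, max_factor=1.2)
pipeline.greyscale(probability=0.3)
pipeline.sample(2000)   # number of synthesized images to write to the output folder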
120 Evaluation Results Table 7.8: F1 Scores for the Model Trained using D T C 0.01 0.05 0.1 0.15 0.2 20k 0.53956 0.57488 0.57683 0.57637 0.57750 25k 0.53839 0.56616 0.57343 0.57532 0.57229 30k 0.56553 0.58418 0.58734 0.58585 0.58477 35k 0.56888 0.58526 0.58899 0.58668 0.58703 40k 0.55349 0.57198 0.57502 0.57968 0.58046 45k 0.57658 0.58770 0.59411 0.58977 0.58588 50k 0.57203 0.57988 0.58059 0.58083 0.57342 55k 0.58432 0.59115 0.59297 0.59325 0.59151 60k 0.57205 0.58154 0.58407 0.57814 0.57637 65k 0.56402 0.57580 0.57536 0.57717 0.57653 70k 0.55960 0.57423 0.57286 0.57781 0.57496 Table 7.9: F1 Scores for the Model Trained using D a T C 0.01 0.05 0.1 0.15 0.2 20k 0.54979 0.56940 0.57063 0.57303 0.57393 25k 0.57178 0.58858 0.58433 0.58639 0.58759 30k 0.55135 0.56225 0.57268 0.57013 0.56740 35k 0.56149 0.57861 0.58590 0.58163 0.58304 40k 0.57018 0.58282 0.58410 0.58062 0.57663 45k 0.56432 0.57993 0.58056 0.57528 0.57189 50k 0.57097 0.58674 0.58371 0.57791 0.57927 55k 0.57441 0.58498 0.58734 0.58853 0.58749 60k 0.57993 0.59094 0.59211 0.58587 0.58485 65k 0.57102 0.57671 0.58116 0.58791 0.58644 70k 0.56117 0.56640 0.57049 0.56692 0.56735 Table 7.10: F1 Scores for the Model Trained using D c T C 0.01 0.05 0.1 0.15 0.2 20k 0.52322 0.55720 0.56551 0.56611 0.56354 25k 0.54508 0.56571 0.56494 0.56590 0.56752 30k 0.53448 0.56157 0.56372 0.56292 0.56517 35k 0.55065 0.57252 0.57662 0.57326 0.57088 40k 0.53651 0.55193 0.54901 0.55348 0.55078 45k 0.52906 0.55767 0.55419 0.55642 0.55144 50k 0.54435 0.55956 0.56084 0.55968 0.55880 55k 0.54814 0.56924 0.57683 0.57304 0.57525 60k 0.55776 0.57562 0.57688 0.57339 0.57208 65k 0.54976 0.57010 0.57081 0.56882 0.56877 70k 0.54665 0.56253 0.56369 0.55575 0.55151 Table 7.8 shows the F1 scores of the detection model trained using D by varying the values ofT andC. In general, increasingT (i.e., training the model for a larger number of iterations) orC (i.e., discarding larger number of the predicted boxes with the lowest 121 (a) W/ the Model (D,T = 45K,C = 0.01) (b) W/ the Model (D a ,T = 25K,C = 0.0508) (c) W/ the Model (D c ,T = 55K,C = 0.15) Figure 7.11: The Impact of Varying the Value of NMS using Different Models confidence scores) does not necessarily improve the performance of the model. Using D, the model achieved the best F1 score (i.e., 0.59411) when T = 45k and C = 0.1. Similarly, we evaluated our solution by training another model usingD a . In general, the model using D a (see Table 7.9) was slightly better than the one using D in some cases. However, the best F1 score usingD was higher than the one usingD a . Furthermore, the evaluation of the model trained using D c is shown in Table 7.10. In general, the model trained using D c outperforms neither of the other models. The model using D c was not optimized because any object detection algorithm during the training phase uses only the regions of interests of each training image (i.e., ground-truth boxes) rather than the entire image. The model using D c achieved the best F1 score (i.e., 0.57688) when T = 60k andC = 0.1. We noticed that the training image dataset contains some images that have over- lapped ground-truth boxes with multiple classes as shown in Figure 7.12. Therefore, to enable reporting multiple overlapped predicted boxes, we enlarged the values of the NMS parameter. Experimentally, we noticed that increasing the value of NMS has 122 Figure 7.12: An Image Containing Overlapped Ground-truth Boxes improved the F1 score slightly. 
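For reference, a minimal sketch of the greedy, IoU-based non-maximum suppression step that this NMS threshold controls; a higher threshold keeps more overlapping predicted boxes, which matters for images like the one in Figure 7.12.

# A minimal sketch of greedy non-maximum suppression over predicted boxes.
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, nms_threshold=0.45):
    """Keep a box unless it overlaps an already-kept, higher-scoring box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= nms_threshold for j in kept):
            kept.append(i)
    return kept   # indices of the boxes that survive suppression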
As shown in Figure 7.11, in general, the F1 score of all models whenNMS = 0.999 is the highest. Thus, we exhaustively experimented different combinations of the other parameters usingNMS = 0.999 to get the best F1 score. The detection models trained using D,D a , andD c have achieved F1 scores up to 0.61, 0.62, and 0.60 as shown in Figures 7.11a, 7.11b, and 7.11c, respectively. 123 Chapter 8 Related Work In this chapter, we review the related studies to our work in three different categories; image scene localization, spatially aggregated visual features, and spatial-visual search. 8.1 Related Work for Image Scene Localization Image Localization: The image localization challenge was introduced by Hays et al. [73]. The focus of the localization challenge was to estimate the camera location of an image. Researchers have tackled this problem by investigating several approaches utilizingimageretrieval[73,166,93,139], clustering[31,174], andclassification[158,155] algorithms. Other researchers [114, 161, 125] tackled the problem of estimating the camera direction using computer vision algorithms (e.g., structure-from-motion). To the best of our knowledge, we are the first to consider the scene location for image localization. Furthermore, some researchers exploited other modalities in addition to image content such as user tags [65, 141]. Another perspective of localization is to utilize a reference dataset of satellite images to predict the location of a ground-level image (known as cross-view image localization) [101, 160]. Hierarchical Classification: In the area of data mining, classification algorithms are widely investigated. In particular, they follow either flat or hierarchical paradigm (surveyed in [145]). To follow the hierarchical paradigm, it is critical to construct a hierarchy which depicts the relations between the classes of interest. One approach to construct a hierarchy of classes is using a tree or a directed acyclic graph (DAG). In our work, the hierarchy of classes is implicitly given by the spatial organization of the 124 reference dataset. To the best of our knowledge, we are the first to utilize hierarchical classification for the image localization problem. 8.2 Related work for Spatially Aggregated Visual Features Visual Descriptors for Image Search. Several visual feature types (e.g., per- ceived features such as color, and hand-crafted ones such as SIFT [109]) have been used to represent an image content to enable efficient image search. Recently, it has been shown that aggregating local image features into a global visual descriptor is superior for image search. Some well-known aggregation techniques for local image features are bag-of-words [146], VLAD [79], Fisher Vectors [128], and triangular embedding [80]. Other researchers investigated aggregating different types of image features into a global descriptor for image search [26]. More recently, another global descriptor which is formed of the features extracted from convolutional neural networks (CNN), particularly from the last fully-connected network layer, has emerged as a state-of-the-art generic descrip- tor for image search [143, 35, 156]. However, such a descriptor has some limitations for image scaling, cropping and cluttering. Therefore, several approaches were proposed for improving CNN-based descriptors [34, 151, 83]. 
One of these approaches is regional max- imum activations of convolutions (R-MAC) [151] which aggregates several image regions into a compact feature vector of a fixed length (i.e., 512 dimensions). In this thesis, we use R-MAC as the visual descriptor of a single image (i.e.,CVD) to build our proposed SVD descriptor. Image Search by Query Expansion. Query expansion (QE) is a standard method for improving the accuracy of information retrieval. It was adopted by Chum et al. [54] in image search through performing a set of steps: First, an initial set of retrieved images for a query image is verified to select a subset of images which are geometrically consistent with the query image. Then, the selected images are exploited to construct an expanded query by producing an enriched representation which is re-submitted to 125 the search engine to retrieve additional relevant images. For this sake, several meth- ods[54,82,27,150]havebeeninvestigatedtobuildtheexpandedqueryincludingaverage query expansion [54, 82], classification-based expansion [27], and hamming embedding technique [150]. The similarity betweenSVD-based image search and image search by QE is that both approaches aim at improving search accuracy. However, there are major differences. First, our proposed approach improves searching by re-organizing an image dataset properly in index structures, while the QE method improves the search results by altering the visual descriptor of the query image by injecting additional visual features of other relevant images obtained by the result of an initial search (i.e., iterative search). Second, our approach is performed offline, so it does not affect search performance while the QE method is performed at query time, so its query performance degrades. Third, our approach uses spatial metadata for selecting relevant images for a reference image while the QE method uses the relevance feedback of an initial result set to select a subset of relevant images. 8.3 Related Work for Spatial-Visual Search Spatial-TextualIndexing. Spatial-visualsearchisanalogoustotheproblemofspatial- textual search. However, in our problem, text is replaced with images that are repre- sented by high-dimensional extracted features, while low-dimensional features resulted from dimension reduction techniques are utilized in indexes to trade the search accu- racy for efficiency. Moreover, the location tagged with text may not always refer to the location of the event depicted in the text while a geo-tagged image always depicts a scene in the same image location. Therefore, spatial-visual search can’t be expedited straightforwardly by the proposed indexes proposed for spatial-textual search. These spatial-textual indexes (surveyed by Chen et al. [51]) are categorized in three categories: text-first loose combination [175, 153], spatial-first loose combination [175, 153], and tight combination [62, 87]. The text-first loose index consists of a keyword inverted 126 index file followed by multiple R*-tree or Grid files. On the contrary, the spatial-first loose index uses one spatial index on the top of multiple keyword inverted file indexes. Meanwhile, the tight index utilizes either a spatial index or text index augmented with the text and spatial data (e.g., IR2-tree [62] and SKIF [87]). 
Recently, other hybrid search mechanisms depending on spatial constraints have been studied for geo-social networks [28, 107, 163] and geo-located time-series [47, 48]; how- ever, they are different from spatial-visual search as discussed earlier in the comparison to spatial-textual search. For searching geo-located time series efficiently, Chatzigeor- gakidis et al. [47] have proposed a hybrid index that extends R-tree by maintaining bounds for the time series contained in each node. Subsequently, Chatzigeorgakidis et al. [48] proposed another hybrid index (referred to as geo-iSAX), which extends iSAX index [144] (one of the state-of-the-art indexes for time series) by maintaining bounds for both spatial and temporal data of the geo-located time series indexed in each node. Based on their reported results in [47], geo-iSAX outperform the R-tree-based approach. This is expected as time-series are high-dimensional data, for which R-tree is not appro- priate. However, in our case, due to the locality of visual-features, a single R*-tree is more efficient to index both visual and spatial features as compared to hybrid indexing that combines image indexes (e.g., LSH) with R-tree [21]. Spatial-Visual Search. To the best of our knowledge, we were the first to introduce hybrid indexes for spatial-visual search [21]. We have introduced several index structures that comprise both spatial and visual indexes, in tandem. In particular, the paradigms of these indexes include double index (spatial and visual indexes), two-level index (spatial first or visual first), and single index either spatial or visual one augmented with both data types. Moreover, in our previous work, we have investigated various scenarios of the selectivity for the spatial-visual query and studied their effect on the performance of the proposed indexes. In this work, we focused on proposing hybrid indexes based on R*- tree (well-known spatial index) utilizing the street-image visual locality phenomenon. In addition, we have presented a preliminary design of a hybrid index that combines 127 the spatial and visual data together. On the other hand, there are other works that investigated spatial-visual search; however, their focus was on the ranking [171] or rec- ommendation [103] problems based on geo-tagged images. Moreover, the problem of spatial-visual search has been extended into different forms including temporal spatial- visualsearchandinteractivespatial-visualsearch. Forthetemporalspatial-visualsearch, Zhang et al. [169] proposed multiple local visual indexes for the global partitioning of temporal-then-spatial segments. Meanwhile, the interactive spatial-visual search [108] is basically an iterative process of spatial-visual queries expanded by considering the user preferences from the result of the previous iteration. 8.4 Related Work for Smart-City Applications 8.4.1 Street Cleanliness Classification Systems for Street Cleanliness. The efforts for maintaining streets clean have fol- lowed two directions; waste bin monitoring systems and street-rating systems. Several street-rating systems were designed in various cities (e.g., New York [134] and Los Ange- les [97]). Some of these systems [97] are equipped with smartphone apps for the task of image collection and manual rating. Furthermore, Begur et al. [39] developed a smart- phone app, deployed in the city of San Jose, for not only collecting but also analyzing images using object detection techniques to observe dumped wastes on streets. 
Regard- ing monitoring waste bins, Mahajan et al. [118] proposed an integrated sensor-based system (i.e., RFID and ultrasonic sensors) and Hannan et al. [69] classified the collected images by trash vehicles to identify the level of solid waste in bins. Image-Based Classification. Recently, the advances in computer vision enable learn- ing images more accurately; hence images become a reliable source of information in different domains. Various image-based applications are developed such as searching, object detection, and classification. In particular, image-based classification is adopted in serious problems related to human life and environment such as perceiving the safety 128 levelofastreetview[120],land-usepatterns(e.g.,waterbody)fromsatelliteimages[121], disease diagnosis [50] and observing disaster situations from social networks [12]. Our approachisdifferentfromexistingonessinceweinvestigatetheimage-basedclassification with the geo-spatial information of images. 8.4.2 Covered Material Recognition Material Dataset & Recognition. Various material datasets such as Flickr Material Dataset (FMD) [142] and Material in Context Database (MINC) [40] have been avail- able. To address the material recognition problem, researchers have used image-based classification techniques using various classical image feature vectors (e.g., the approach proposed by Liu et al. [102] use bag-of-words consisting of color, SIFT, and reflectance- based edge features, Hu et al. [76] uses variances of oriented features, and Qi et al. [129] uses pairwise local binary pattern features). Thereafter, due to the advances of con- volutional neural networks, Cimpoi et al. [56] utilized CNN to develop an improved Fisher vector. Bell et al. [40] used the transfer learning mechanism by fine-tuning a pre-trained model (GoogLeNet [147]) on image segments from the MINC dataset while the GoogLeNet model was originally trained for the object recognition task. GraffitiDetection&Retrieval. Manyimage-basedresearchworkshavebeendevoted to graffiti detection and retrieval. Some of them focused on developing algorithms to automatically detect the act of graffiti drawing on a surface using surveillance cameras based on motion changes [60], visual changes due to both brightness and depth [152], or geometry changes [63]. Furthermore, various image-based research systems have been developed for detecting graffiti such as GARI [126], Graffiti Tracker [4], and GRIP [3]. Another body of research work focused on the retrieval of graffiti images from an image database following various paradigms: a) similarity-based retrieval using the SIFT fea- tures [78], b) semantic-based retrieval using optical character recognition (OCR) tech- niques [162], and c) author-based retrieval by analyzing the graffiti style (e.g., shape 129 context and word matching) [140]. However, to the best of our knowledge, there is no existing work focusing on both graffiti detection and underlying material recognition. 130 Chapter 9 Conclusions and Future Work 9.1 Conclusions Image Scene Localization. In Chapter 3, we propose a novel framework for image scene localization using spatial-visual classification. To address scene localization, we use a reference set of scene location-tagged images. To construct such a dataset, the framework estimates scene locations using a geometric approach when rich geo-metadata are available, or a vision-based approach when enough similar GPS-tagged images are available in a region. 
Spatially Aggregated Visual Features. In Chapter 4, we propose a new visual descriptor that represents an image using an expanded feature set obtained from a group of similar images located in its vicinity. This descriptor is referred to as the Spatially Aggregated Visual Feature Descriptor (SVD). To generate the SVD descriptor of each image, we designed a two-step framework: the first step selects a group of images using a spatial-visual query, and the second step aggregates the visual features of the selected images. The main benefit of SVD descriptors is to improve the accuracy of image search when index structures are used to expedite search performance. Using SVD enables a better organization of images in an index structure by ensuring that similar images are stored together; thus it significantly improves the accuracy of image search. We empirically evaluated image search on three big image datasets using two well-known image index structures (i.e., a locality sensitive hashing index and a hierarchical clustering tree). SVD-based index structures increased the recall of image search by 30% to 65%. As part of future work, we plan to investigate the challenge of updating SVD descriptors as the image dataset changes (e.g., for a dynamic dataset).
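The two-step framework described above can be sketched as follows in Python, assuming each image carries a latitude/longitude and an L2-normalized feature vector; the planar distance computation, the similarity threshold, and the use of mean pooling as the aggregation function are illustrative choices of ours rather than the exact configuration evaluated in Chapter 4.

import numpy as np

def svd_descriptor(target, images, spatial_radius_m, visual_threshold):
    """Step 1: select images that are both nearby (spatial range) and visually
    similar to the target; Step 2: aggregate their features (mean pooling here)."""
    neighbors = [target["feat"]]
    for img in images:
        if img is target:
            continue
        # crude planar distance in meters (sufficient for a small neighborhood)
        d_lat = (img["lat"] - target["lat"]) * 111_320.0
        d_lon = (img["lon"] - target["lon"]) * 111_320.0 * np.cos(np.radians(target["lat"]))
        if np.hypot(d_lat, d_lon) > spatial_radius_m:
            continue
        # cosine similarity of L2-normalized features
        if float(np.dot(img["feat"], target["feat"])) < visual_threshold:
            continue
        neighbors.append(img["feat"])
    agg = np.mean(np.stack(neighbors), axis=0)          # aggregation step
    return agg / (np.linalg.norm(agg) + 1e-12)          # re-normalize the descriptor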
Spatial-Visual Search. Geo-tagged street images have a unique characteristic: similar images are co-located in spatial proximity. Consequently, an efficient index structure for spatial-visual queries can be based on a spatial index structure, without the need for high-dimensional image indexing techniques (e.g., LSH). In Chapter 6, we propose a class of R*-tree indexes that are built using both the spatial and the visual features of the contained images. To obtain a better image organization, the optimization criteria used for constructing disjoint nodes in the R*-tree are modified to distinguish between the two types of image descriptors and weight them differently. Furthermore, the images in each node are clustered locally to generate multiple local visual MBRs instead of one loose visual MBR. In an empirical evaluation on several street image datasets, the hybrid indexes showed superior performance compared to the baselines, especially as we increased the spatial or visual range of a spatial-visual query. Moreover, the hybrid indexes showed larger improvements in both performance and accuracy for the spatially dense datasets. Finally, the performance improvement achieved by having multiple visual MBRs per node was verified, especially for large-scale datasets. As part of future work, we plan to extend our proposed indexes to overcome the challenges that arise in a dynamic environment by considering bulk insertion and deletion mechanisms.
9.2 Future Work
In the future, I intend to extend this work in the following directions.
Bulk Insertion and Deletion. As discussed in Chapter 6, the CSV index structure contains multiple MBRs. Therefore, CSV may introduce a significant overhead for insertion or deletion compared to conventional indexes, especially for a large database, and it is essential to devise an efficient bulk insertion and deletion mechanism. Early work on bulk insertion for the R-tree was proposed by Kamel and Faloutsos [84]. In their approach, the data items to be inserted are first sorted by spatial proximity (e.g., the Hilbert value of the center) and then packed into blocks of B rectangles. These blocks are then inserted one at a time using the standard insertion algorithm. Intuitively, the algorithm should offer an insertion speed-up of B (since each insertion places a block of B data items rather than a single item), but it is likely to increase overlap and thus produce a worse index in terms of query performance. Consequently, other approaches have been investigated, including the buffer-tree [154] and the sort-merge-pack strategy [136]. One of the more recent approaches to bulk insertion is the Generalized R-Tree Bulk-Insertion Strategy (GBI) [53]. In GBI, the new incoming data is clustered, and a small R-tree is constructed for each cluster. The algorithm then searches for suitable locations at which to insert the small R-trees into the original (large) R-tree. To adapt GBI to our spatial-visual index structures (ASV and CSV), we need to perform hybrid clustering of the new incoming data to generate proper clusters, where each cluster consists of images that are similar and located in the same neighborhood. Furthermore, finding the locations of the small R-trees in the original tree can follow the modified chooseSubtree algorithm used in our spatial-visual index structures. In particular, the original chooseSubtree algorithm is adapted to choose the node whose spatial and visual coverages (i.e., boundaries) require the least expansion after inserting a new image; to enable bulk insertion, chooseSubtree should be adapted further to insert multiple small R-trees instead of individual objects.
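The intended selection rule can be sketched as follows in Python; the node representation and the equal weighting of the spatial and visual enlargements are assumptions made here only for illustration.

def enlargement(box_lo, box_hi, item_lo, item_hi):
    """Increase in the volume of an axis-aligned box needed to cover an item."""
    def volume(lo, hi):
        v = 1.0
        for l, h in zip(lo, hi):
            v *= max(h - l, 0.0)
        return v
    new_lo = [min(l, il) for l, il in zip(box_lo, item_lo)]
    new_hi = [max(h, ih) for h, ih in zip(box_hi, item_hi)]
    return volume(new_lo, new_hi) - volume(box_lo, box_hi)

def choose_subtree(children, item, w_spatial=0.5, w_visual=0.5):
    """Pick the child whose spatial and visual coverages expand the least
    when the item is added."""
    def cost(child):
        s = enlargement(child["s_lo"], child["s_hi"], item["s_lo"], item["s_hi"])
        v = enlargement(child["v_lo"], child["v_hi"], item["v_lo"], item["v_hi"])
        return w_spatial * s + w_visual * v
    return min(children, key=cost)

The same cost function applies whether the item carries the bounds of a single image or the root bounds of a small R-tree produced by the hybrid clustering step of a GBI-style bulk insertion.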
Spatial Visual Semantic Search. CSV demonstrated that clustering based on visual similarity has the potential to effectively manage groups of overall similar images. However, its clustering is performed without considering the detailed semantics of image content. By generating a new set of classes from image content, such as detected objects, CSV can be effectively extended to accommodate classes of semantic descriptions of images in addition to classes of overall visual similarity. As a case study, we will consider only common objects (e.g., car, person, fire hydrant, street sign, and traffic light) and background (e.g., building, tree, and park) in urban street scenes. Therefore, images will be pre-processed using one of the state-of-the-art object detection algorithms (e.g., SSD [106], YOLO [132], Fast R-CNN [66], Faster R-CNN [133], and Mask R-CNN [75]) and tagged with annotations.
Spatial-Visual Search for Aerial Images. Unmanned aerial vehicles (UAVs), such as drones, are becoming increasingly prevalent in both daily life (e.g., event coverage and tourism) and critical situations (e.g., disaster management and military operations), generating an unprecedented number of aerial images and videos. UAVs are usually equipped with various sensors (e.g., GPS, accelerometers, and gyroscopes), so they provide sufficient spatial metadata to describe the spatial extent (referred to as the aerial field-of-view [105]) of the recorded imagery. Therefore, with the availability of geo-tagged aerial images for a certain geographical area, it is necessary to design a hybrid index for such images, which differ from ground-level geo-tagged street images. In the case of aerial images, the spatial context is modeled in 3D space, and their visual features may not exhibit the spatial locality phenomenon discussed in Chapter 6. A few researchers have investigated indexing aerial images using only their spatial properties [110, 46], and another group of researchers has investigated spatial coverage measurement models for aerial images, again utilizing only their spatial properties [20, 19]. Hence, new hybrid index structures are needed to support spatial-visual queries on aerial images.
Reference List
[1] Flickr API. https://www.flickr.com/services/api/. Accessed: 2018-07-14.
[2] Google Street View API. https://developers.google.com/streetview/. Accessed: 2018-07-14.
[3] Graffiti Reduction and Interception Program (GRIP). http://www.gripsystems.org/.
[4] Graffiti Tracker. http://graffititracker.net/.
[5] OpenSfM. https://github.com/mapillary/OpenSfM/. Accessed: 2015-09-30.
[6] Sample Size Table. http://www.research-advisors.com/tools/SampleSize.htm. Accessed: 2018-07-14.
[7] Study: Pothole damage costs U.S. drivers $3B a year. https://www.insurancejournal.com/magazines/mag-features/2016/03/21/401900.htm, 2016.
[8] Maintenance and repair guide book of the pavement 2013. http://www.road.or.jp/english/publication/index.html, 04 2017.
[9] American Road & Transportation Builders Association. https://www.insurancejournal.com/magazines/mag-features/2016/03/21/401900.htm, 2018.
[10] A. Agarwala, M. Agrawala, M. Cohen, D. Salesin, and R. Szeliski. Photographing long scenes with multi-viewpoint panoramas. ACM Transactions on Graphics (TOG), 25(3):853–861, 2006.
[11] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. In International Conference on Foundations of Data Organization and Algorithms (FODO), pages 69–84. Springer, 1993.
[12] A. Alfarrarjeh, S. Agrawal, S. H. Kim, and C. Shahabi. Geo-spatial multimedia sentiment analysis in disasters. In Proceedings of the 4th IEEE International Conference on Data Science and Advanced Analytics, pages 193–202. IEEE, 2017.
[13] A. Alfarrarjeh, T. Emrich, and C. Shahabi. Scalable spatial crowdsourcing: A study of distributed algorithms. In Proceedings of the 16th IEEE International Conference on Mobile Data Management, volume 1, pages 134–144. IEEE, 2015.
[14] A. Alfarrarjeh, S. H. Kim, S. Agrawal, M. Ashok, S. Y. Kim, and C. Shahabi. Image classification to determine the level of street cleanliness: A case study. In Proceedings of the 2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM), pages 1–5. IEEE, 2018.
[15] A. Alfarrarjeh, S. H. Kim, A. Bright, V. Hegde, Akshansh, and C. Shahabi. Spatial aggregation of visual features for image data search in a large geo-tagged image dataset. In Proceedings of the Fifth IEEE International Conference on Multimedia Big Data (BigMM). IEEE, 2019.
[16] A. Alfarrarjeh, S. H. Kim, A. Deshmukh, S. Rajan, Y. Lu, and C. Shahabi. Spatial coverage measurement of geo-tagged visual data: A database approach.
In Proced- dings of the 2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM), pages 1–8. IEEE, 2018. [17] A.Alfarrarjeh, S.H.Kim, V.Hegde, Akshansh, C.Shahabi, Q.Xie, andS.Ravada. A class of R*-tree indexes for spatial-visual search of geo-tagged street images. Manuscript submitted for publication, 2020. [18] A. Alfarrarjeh, S. H. Kim, S. Rajan, A. Deshmukh, and C. Shahabi. A data- centric approach for image scene localization. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), pages 594–603. IEEE, 2018. [19] A. Alfarrarjeh, Z. Ma, S. H. Kim, Y. Park, and C. Shahabi. A web-based visual- ization tool for 3d spatial coverage measurement of aerial images. Manuscript in press for publication at the 26th International Conference on Multimedia Modeling (MMM), 2020. [20] A.Alfarrarjeh, Z.Ma, S.H.Kim, andC.Shahabi. 3dspatialcoveragemeasurement of aerial images. Manuscript in press for publication at the 26th International Conference on Multimedia Modeling (MMM), 2020. [21] A. Alfarrarjeh, C. Shahabi, and S. H. Kim. Hybrid Indexes for Spatial-Visual Search. In Proceedings of the Thematic Workshops of ACM Multimedia 2017, pages 75–83. ACM, 2017. [22] A. Alfarrarjeh, D. Trivedi, S. H. Kim, H. Park, C. Huang, and C. Shahabi. Rec- ognizing material of a covered object: A case study with graffiti. In Proceedings of the IEEE 26th International Conference on Image Processing (ICIP), pages 2491–2495. IEEE, 2019. [23] A. Alfarrarjeh, D. Trivedi, S. H. Kim, and C. Shahabi. A deep learning approach for road damage detection from smartphone images. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), pages 5201–5204. IEEE, 2018. 137 [24] G. Amato, F. Falchi, C. Gennaro, and F. Rabitti. YFCC100M HybridNet fc6 Deep Features for Content-based Image Retrieval. In Proceedings of the 2016 ACM Workshop on Multimedia COMMONS, pages 11–18. ACM, 2016. [25] A. Andoni and P. Indyk. Near-optimal Hashing Algorithms for Approximate Near- est Neighbor in High Dimensions. In Proceedings of the 47th Annual IEEE Sym- posium on Foundations of Computer Science (FOCS), pages 459–468, Berkeley, California, USA, 21-24 October, 2006. IEEE. [26] P. Androutsos, A. Kushki, K. N. Plataniotis, and A. N. Venetsanopoulos. Aggre- gation of color and shape features for hybrid query generation in content based visual information retrieval. Signal Processing, 85(2):385–393, 2005. [27] R. Arandjelović and A. Zisserman. Three things everyone should know to improve object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2911–2918. IEEE, 2012. [28] N. Armenatzoglou, S. Papadopoulos, and D. Papadias. A general framework for geo-social query processing. Proceedings of the VLDB Endowment, 6(10):913–924, 2013. [29] S. Arslan Ay, L. Zhang, S. H. Kim, M. He, and R. Zimmermann. GRVS: A Georeferenced Video Search Engine. In Proceedings of the 17th ACM International Conference on Multimedia, pages 977–978. ACM, 2009. [30] S.AryaandD.M.Mount. ApproximateNearestNeighborQueriesinFixedDimen- sions. In Proceedings of the 4th Annual ACM-SIAM Symposium on Discrete algo- rithms (SODA), volume 93, pages 271–280, 1993. [31] Y. Avrithis, Y. Kalantidis, G. Tolias, and E. Spyrou. Retrieving Landmark and Non-landmark Images from Community Photo Collections. In Proceedings of the 18th ACM International Conference on Multimedia, pages 153–162. ACM, 2010. [32] S. A. Ay, R. Zimmermann, and S. H. Kim. 
Viewable Scene Modeling for Geospa- tial Video Search. In Proceedings of the 16th ACM International Conference on Multimedia, pages 309–318. ACM, 2008. [33] S. A. Ay, R. Zimmermann, and S. H. Kim. Relevance ranking in georeferenced video search. Multimedia Systems, 16(2):105–125, 2010. [34] A. Babenko and V. Lempitsky. Aggregating deep convolutional features for image retrieval. arXiv, 2015. [35] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural Codes for Image Retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), pages 584–599. Springer, 2014. [36] E. Bas, A. M. Tekalp, and F. S. Salman. Automatic vehicle counting from video for traffic flow analysis. In Proceedings of the 2007 IEEE Intelligent Vehicles Sym- posium (IV), pages 392–397. IEEE, 2007. 138 [37] H. Bay, T. Tuytelaars, and L. Van Gool. Surf: Speeded up robust features. In Proceedings of the 9th European Conference on Computer Vision (ECCV), pages 404–417. Springer, 2006. [38] N.Beckmann,H.-P.Kriegel,R.Schneider,andB.Seeger. TheR*-tree: AnEfficient and Robust Access Method for Points and Rectangles. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, pages 322–331. ACM, 1990. [39] H. Begur, M. Dhawade, N. Gaur, P. Dureja, J. Gao, M. Mahmoud, J. Huang, S. Chen, and X. Ding. An edge-based smart mobile service sys- tem for illegal dumping detection and monitoring in san jose. In Proceed- ings of the IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (Smart- World/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pages 1–6. IEEE, 2017. [40] S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material recognition in the wild with the materials in context database. In Proceedings of the 28th IEEE conference on computer vision and pattern recognition (CVPR), pages 3479–3487, 2015. [41] M. D. Bloice. Augmentor. http://augmentor.readthedocs.io, 2016. [42] A. Broder, L. Garcia-Pueyo, V. Josifovski, S. Vassilvitskii, and S. Venkatesan. Scalable k-means by ranked retrieval. In Proceedings of the 7th ACM international conference on Web search and data mining (WSDM), pages 233–242. ACM, 2014. [43] C. E. Brodley and M. A. Friedl. Identifying Mislabeled Training Data. Journal of Artificial Intelligence Research, 11:131–167, 1999. [44] M. Brown and D. Lowe. Recognising panoramas. In Proceeding of the 9th IEEE International Conference on Computer Vision (ICCV), volume 3, page 1218, 2003. [45] M. Brown and D. G. Lowe. Automatic panoramic image stitching using invariant features. International journal of computer vision (IJCV), 74(1):59–73, 2007. [46] Y.-H. Chang, W.-H. Lee, and T.-F. Ke. Rectangle-based approaches for repre- senting flood data in spatial indices. Journal of Marine Science and Technology, 27(1):35–45, 2019. [47] G. Chatzigeorgakidis, K. Patroumpas, D. Skoutas, S. Athanasiou, and S. Ski- adopoulos. Visual exploration of geolocated time series with hybrid indexing. Big Data Research, 15:12–28, 2019. [48] G. Chatzigeorgakidis, D. Skoutas, K. Patroumpas, S. Athanasiou, and S. Ski- adopoulos. Indexing geolocated time series data. In Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL GIS), page 19. ACM, 2017. 139 [49] D. M. Chen, G. Baatz, K. Köser, S. S. Tsai, R. Vedantham, T. Pylvänäinen, K. Roimela, X. Chen, J. Bach, M. Pollefeys, B. Girod, and R. Grzeszczuk. 
City- scale landmark identification on mobile devices. In Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 737–744. IEEE, 2011. [50] E.-L. Chen, P.-C. Chung, C.-L. Chen, H.-M. Tsai, and C.-I. Chang. An auto- matic diagnostic system for CT liver image classification. IEEE transactions on biomedical engineering, 45(6):783–794, 1998. [51] L. Chen, G. Cong, C. S. Jensen, and D. Wu. Spatial Keyword Query Processing: An Experimental Evaluation. In Proceedings of the International Conference on Very Large Data Bases (PVLDB), volume 6, pages 217–228. VLDB Endowment, 2013. [52] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), volume 11211, pages 833– 851. Springer, 2018. [53] R. Choubey, L. Chen, and E. A. Rundensteiner. GBI: A generalized R-tree bulk- insertionstrategy. In International Symposium on Spatial Databases, pages91–108. Springer, 1999. [54] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic queryexpansionwithagenerativefeaturemodelforobjectretrieval. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision (ICCV), pages 1–8. IEEE, 2007. [55] P.Ciaccia, M.Patella, andP.Zezula. M-tree: AnEfficientAccessMethodforSimi- larity Search in Metric Spaces. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB), pages 426–435, Athens, Greece, August 25-29, 1997. [56] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3606–3613, 2014. [57] G. Constantinou, G. S. Ramachandran, A. Alfarrarjeh, S. Ho, B. Krishnamachari, andC.Shahabi. Acrowd-basedimagelearningframeworkusingedgecomputingfor smart city applications. In Proceedings of the Fifth IEEE International Conference on Multimedia Big Data (BigMM). IEEE, 2019. [58] J. B. Copas. Binary regression models for contaminated data. Journal of the Royal Statistical Society: Series B (Methodological), pages 225–265, 1988. [59] R. Cucchiara, M. Piccardi, and P. Mello. Image analysis and rule-based reasoning for a traffic monitoring system. IEEE transactions on intelligent transportation systems, 1(2):119–130, 2000. 140 [60] R. Cutler and L. S. Davis. Robust real-time periodic motion detection, analysis, andapplications. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 22(8):781–796, 2000. [61] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive Hash- ing Scheme based on P-stable Distributions. In Proceedings of the 20th Annual Symposium on Computational Geometry (SoCG), pages 253–262. ACM, 2004. [62] I. De Felipe, V. Hristidis, and N. Rishe. Keyword Search on Spatial Databases. In Proceedings of the 24th IEEE International Conference on Data Engineering (ICDE), pages 656–665. IEEE, 2008. [63] L. Di Stefano, F. Tombari, A. Lanza, S. Mattoccia, and S. Monti. Graffiti detection using two views. In Proceedings of the Eighth International Workshop on Visual Surveillance (VS2008), 2008. [64] R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middle- ware. Journal of Computer and System Sciences, 66(4):614–656, 2003. [65] A. Gallagher, D. Joshi, J. Yu, and J. Luo. Geo-location Inference from Image Con- tentandUserTags. 
InProceedings of the Computer Vision and Pattern Recognition (CVPR) Workshops, pages 55–62, Miami, FL, USA, 20-25 June, 2009. IEEE. [66] R. Girshick. Fast R-CNN. In Proceedings of the IEEE international conference on computer vision (ICCV), pages 1440–1448, 2015. [67] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 580–587, 2014. [68] A. Guttman. R-trees: A Dynamic Index Structure for Spatial Searching. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, pages 47–57, Boston, Massachusetts, USA, June 18-21, 1984. [69] M. Hannan, W. Zaila, M. Arebey, R. A. Begum, and H. Basri. Feature extrac- tion using hough transform for solid waste bin level detection and classification. Environmental monitoring and assessment, 186(9):5381–5391, 2014. [70] R. Hariharan, B. Hore, C. Li, and S. Mehrotra. Processing spatial-keyword (SK) queries in geographic information retrieval (GIR) systems. In Proceedings of the 19th International Conference on Scientific and Statistical Database Management (SSDBM), pages 16–16. IEEE, 2007. [71] J. Hartigan. Representation of similarity matrices by trees. Journal of the Ameri- can Statistical Association, 62(320):1140–1158, 1967. [72] J.A.HartiganandM.A.Wong. Algorithmas136: Ak-meansclusteringalgorithm. Applied Statistics, 28(1):100–108, 1979. 141 [73] J. Hays and A. A. Efros. IM2GPS: Estimating Geographic Information from A Sin- gle Image. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008. [74] H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9):1263–1284, 2009. [75] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV),pages2961–2969, 2017. [76] D.Hu, L.Bo, andX.Ren. Towardrobustmaterialrecognitionforeverydayobjects. In Proceedings of the 2011 British Machine Vision Conference (BMVC), volume 2, pages 6–16. Citeseer, 2011. [77] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the 13th Annual ACM Symposium on Theory of Computing, pages 604–613. ACM, 1998. [78] A. K. Jain, J. E. Lee, and R. Jin. Graffiti-id: matching and retrieval of graffiti images. In Proceedings of the First ACM workshop on Multimedia in forensics, pages 1–6. ACM, 2009. [79] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In Proceedings of the 2010-23rd IEEE Confer- ence on Computer Vision & Pattern Recognition (CVPR), pages 3304–3311. IEEE Computer Society, 2010. [80] H. Jégou and A. Zisserman. Triangulation embedding and democratic aggregation for image search. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 3310–3317, 2014. [81] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadar- rama, and T. Darrell. Caffe: Convolutional Architecture for Fast Feature Embed- ding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014. [82] A. Joly and O. Buisson. Logo retrieval with a contrario visual query expansion. In Proceedings of the 17th ACM international conference on Multimedia (ACM MM), pages 581–584. ACM, 2009. [83] Y. 
Kalantidis, C. Mellina, and S. Osindero. Cross-dimensional weighting for aggre- gated deep convolutional features. In Proceedings of the European conference on computer vision (ECCV, pages 685–701. Springer, 2016. [84] I. Kamel and C. Faloutsos. On packing R-trees. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), volume 93, pages 490–499. Citeseer, 1993. 142 [85] M. Karaköse, B. Akarsu, K. Parlak, A. Sarimaden, and A. Erhan. A fast and adaptive road defect detection approach using computer vision with real time implementation. International Journal of Applied Mathematics, Electronics and Computers (IJAMEC), 4(Special Issue-1):290–295, 2016. [86] L. Kazemi and C. Shahabi. Geocrowd: enabling query answering with spatial crowdsourcing. In Proceedings of the 20th international conference on advances in geographic information systems (SIGSPATIAL), pages 189–198. ACM, 2012. [87] A. Khodaei, C. Shahabi, and C. Li. Hybrid Indexing and Seamless Ranking of Spatial and Textual Features of Web Documents. In Database and Expert Systems Applications, pages 450–466. Springer, 2010. [88] S. H. Kim, A. Alfarrarjeh, G. Constantinou, and C. Shahabi. TVDP: Translational visual data platform for smart cities. In 2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW), pages 45–52. IEEE, 2019. [89] S. H. Kim, Y. Lu, G. Constantinou, C. Shahabi, G. Wang, and R. Zimmermann. MediaQ: Mobile Multimedia Management System. In Proceedings of the 5th ACM Multimedia Systems Conference (MMSys), pages 224–235. ACM, 2014. [90] S. H. Kim, Y. Lu, J. Shi, A. Alfarrarjeh, C. Shahabi, G. Wang, and R. Zim- mermann. Key frame selection algorithms for automatic generation of panoramic images from crowdsourced geo-tagged videos. In Proceedings of the 13th Interna- tional Symposium on Web and Wireless Geographical Information Systems, pages 67–84. Springer, 2014. [91] S. H. Kim, J. Shi, A. Alfarrarjeh, D. Xu, Y. Tan, and C. Shahabi. Real-time traffic videoanalysisusingintelviewmontcoprocessor. In Proceedings of the International Workshop on Databases in Networked Information Systems (DNIS),pages150–160. Springer, 2013. [92] Y. Kim, J. Kim, and H. Yu. GeoTree: using spatial information for georeferenced video search. Knowledge-based systems, 61:1–12, 2014. [93] J. Knopp, J. Sivic, and T. Pajdla. Avoiding Confusing Features in Place Recog- nition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 748–761. Springer, 2010. [94] J. Kopf, B. Chen, R. Szeliski, and M. F. Cohen. Street slide: browsing street level imagery. ACM Transactions on Graphics (TOG), 29(4):96:1–96:8, 2010. [95] R. V. Krejcie and D. W. Morgan. Determining sample size for research activities. Educational and psychological measurement (EPM), 30(3):607–610, 1970. [96] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Annual conference on Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012. 143 [97] S. Ladin-Sienne. Turning dirty streets clean through comprehensive open data mapping. https://bit.ly/2P8lsCp, 2017. [98] D. Lee, J. Oh, W.-K. Loh, and H. Yu. GeoVideoIndex: indexing for georeferenced videos. Information Sciences, 374:210–223, 2016. [99] Q. Li, M. Yao, X. Yao, and B. Xu. A real-time 3d scanning system for pavement distortion inspection. Measurement Science and Technology, 21(1):015702, 2009. [100] W. Li, B. Bhushan, and J. Gao. 
A mutilple-level assessment system for smart city street cleanliness. In Proceedings of the 30th International Conference on Software Engineering and Knowledge Engineering (SEKE), pages 256–255, 2018. [101] T.-Y. Lin, S. Belongie, and J. Hays. Cross-view Image Geolocalization. In Proceed- ings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 891–898. IEEE, 2013. [102] C. Liu, L. Sharan, E. H. Adelson, and R. Rosenholtz. Exploring features in a bayesian framework for material recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 239–246. IEEE, 2010. [103] D. Liu, M. R. Scott, R. Ji, W. Jiang, H. Yao, and X. Xie. Location Sensitive Indexing for Image-based Advertising. In Proceedings of the 17th ACM Interna- tional Conference on Multimedia, pages 793–796. ACM, 2009. [104] J. Liu, Q. Yu, O. Javed, S. Ali, A. Tamrakar, A. Divakaran, H. Cheng, and H. Sawhney. Video event recognition using concept attributes. In Proceedings of the 2013 IEEE Workshop on Applications of Computer Vision (WACV), pages 339–346. IEEE, 2013. [105] S. Liu, H. Li, Y. Yuan, and W. Ding. A method for uav real-time image simulation based on hierarchical degradation model. In Foundations of Intelligent Systems, pages 221–232. Springer, 2014. [106] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016. [107] W. Liu, W. Sun, C. Chen, Y. Huang, Y. Jing, and K. Chen. Circle of friend query in geo-social networks. In Proceedings of the 17th International Conference on Database Systems for Advanced Applications (DASFAA), pages 126–137. Springer, 2012. [108] J. Long, L. Zhu, C. Zhang, Z. Yang, Y. Lin, and R. Chen. Efficient interactive search for geo-tagged multimedia data. Multimedia Tools and Applications, pages 1–30, 2018. 144 [109] D. G. Lowe. Distinctive Image Features from Scale-invariant Keypoints. Interna- tional Journal of Computer Vision, 60(2):91–110, 2004. [110] Y. Lu and C. Shahabi. Efficient indexing and querying of geo-tagged aerial videos. In Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, page 18. ACM, 2017. [111] Y. Lu, C. Shahabi, and S. H. Kim. Efficient Indexing and Retrieval of Large-scale Geo-tagged Video Databases. GeoInformatica, 20(4):829–857, 2016. [112] Y. Lu, H. To, A. Alfarrarjeh, S. H. Kim, Y. Yin, R. Zimmermann, and C. Sha- habi. GeoUGV: User-generated Mobile Video Dataset with Fine Granularity Spa- tial Metadata. In Proceedings of the 7th International Conference on Multimedia Systems (MMSys), pages 43–48. ACM, 2016. [113] J. Luo, J. Yu, D. Joshi, and W. Hao. Event Recognition: Viewing the World with A Third Eye. In Proceedings of the 16th ACM International Conference on Multimedia, pages 1071–1080. ACM, 2008. [114] Z. Luo, H. Li, J. Tang, R. Hong, and T.-S. Chua. Viewfocus: Explore Places of Interests on Google Maps Using Photos with View Direction Filtering. In Pro- ceedings of the 17th ACM International Conference on Multimedia, pages 963–964. ACM, 2009. [115] H. Ma, S. A. Ay, R. Zimmermann, and S. H. Kim. Large-scale Geo-tagged Video Indexing and Queries. GeoInformatica, 18(4):671–697, 2014. [116] Z. Ma, Y. Yang, N. Sebe, K. Zheng, and A. G. Hauptmann. Multimedia event detectionusingaclassifier-specificintermediaterepresentation. IEEE Transactions on Multimedia, 15(7):1628–1637, 2013. 
[117] H. Maeda, Y. Sekimoto, T. Seto, T. Kashiyama, and H. Omata. Road damage detection and classification using deep neural networks with smartphone images. Computer-Aided Civil and Infrastructure Engineering (CACAIE), 33(12):1127– 1141, 2018. [118] K. Mahajan and J. Chitode. Waste bin monitoring system using integrated tech- nologies. International Journal of Innovative Research in Science, Engineering and Technology (IJIRSET), 3(7):14953–14957, 2014. [119] G. Malkoc. The importance of road maintenance. http://www. worldhighways.com/categories/maintenance-utility/features/ the-importance-of-road-maintenance/, 2015. [120] N. Naik, J. Philipoom, R. Raskar, and C. Hidalgo. Streetscore-predicting the perceived safety of one million streetscapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 779–785, 2014. 145 [121] S. Ning, N. Chang, K. Jeng, and Y. Tseng. Soil erosion and non-point source pollutionimpactsassessmentwiththeaidofmulti-temporalremotesensingimages. J. of Environmental Management, 79(1):88–101, 2006. [122] D. Nister and H. Stewenius. Scalable Recognition with A Vocabulary Tree. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 2161–2168, New York, NY, USA, 17-22 June, 2006. IEEE. [123] M. S. Nixon and A. S. Aguado. Feature Extraction & Image Processing for Com- puter Vision. Academic Press, 2012. [124] M. Otterman. Approximate matching with high dimensionality R-trees. Master’s thesis, Dept. of Computer Science, University of Maryland, College Park, MD, 1992. [125] M. Park, J. Luo, R. T. Collins, and Y. Liu. Estimating the Camera Direction of A Geotagged Image Using Reference Images. Pattern Recognition, 47(9):2880–2893, 2014. [126] A. Parra, M. Boutin, and E. J. Delp. Automatic gang graffiti recognition and interpretation. Journal of Electronic Imaging, 26(5):051409, 2017. [127] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research (JMLR), 12:2825–2830, 2011. [128] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier. Large-scale image retrieval with compressed fisher vectors. In Proceedings of the 23rd IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 3384– 3391. IEEE, 2010. [129] X. Qi, R. Xiao, C.-G. Li, Y. Qiao, J. Guo, and X. Tang. Pairwise rotation invariant co-occurrence local binary pattern. IEEE transactions on pattern analysis and machine intelligence (TPAMI), 36(11):2199–2213, 2014. [130] J. R. Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986. [131] A. Rav-Acha, G. Engel, and S. Peleg. Minimal aspect distortion (MAD) mosaicing of long scenes. International Journal of Computer Vision (IJCV), 78(2-3):187–206, 2008. [132] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 779–788, 2016. 146 [133] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information pro- cessing systems (NIPS), pages 91–99, 2015. [134] L. J. Riccio, J. Miller, and G. Bose. 
Polishing the big appple: Models of how manpower utilization affects street cleanliness in new york city. Waste management & research, 6(1):163–174, 1988. [135] J. J. Rodriguez, L. I. Kuncheva, and C. J. Alonso. Rotation forest: A new classifier ensemble method. IEEE transactions on pattern analysis and machine intelligence (TPAMI), 28(10):1619–1630, 2006. [136] N. Roussopoulos, Y. Kotidis, and M. Roussopoulos. Cubetree: organization of and bulk incremental updates on the data cube. ACM SIGMOD Record, 26(2):89–99, 1997. [137] A. Sameer, S. Noah, S. Ian, S. M. Seitz, and R. Szeliski. Building Rome in A Day. In Proceedings of the IEEE 12th International Conference on Computer Vision (ICCV), pages 72–79, 2009. [138] H. Samet. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006. [139] G. Schindler, M. Brown, and R. Szeliski. City-scale Location Recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–7. IEEE, 2007. [140] P. B. Schwarz. Recognition of Graffiti Tags. Master’s thesis, The University of Western Australia, School of Computer Science and Software Engineering, 2006. [141] P. Serdyukov, V. Murdock, and R. Van Zwol. Placing Flickr Photos on A Map. In Proceedings of the 32nd international ACM SIGIR Conference on Research and Development in Information Retrieval, pages 484–491. ACM, 2009. [142] L. Sharan, R. Rosenholtz, and E. Adelson. Material perception: What can you see in a brief glance? Journal of Vision, 9(8):784–784, 2009. [143] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (CVPR), pages 806–813, 2014. [144] J. Shieh and E. Keogh. iSAX: indexing and mining terabyte sized time series. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (SIGKDD), pages 623–631. ACM, 2008. [145] C. N. Silla and A. A. Freitas. A Survey of Hierarchical Classification Across Differ- ent Application Domains. Data Mining and Knowledge Discovery, 22(1-2):31–72, 2011. 147 [146] J. Sivic and A. Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. In Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV), 14-17 October 2003, Nice, France, pages 1470–1477. IEEE, 2003. [147] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Van- houcke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 1–9, 2015. [148] B. Thai and G. Healey. Invariant subpixel material detection in hyperspec- tral imagery. IEEE Transactions on Geoscience and Remote Sensing (TGRS), 40(3):599–608, 2002. [149] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. YFCC100M: The New Data in Multimedia Research. Communica- tions of the ACM, 59(2):64–73, 2016. [150] G. Tolias and H. Jégou. Local visual query expansion: Exploiting an image collec- tion to refine local descriptors. PhD thesis, INRIA, 2013. [151] G. Tolias, R. Sicre, and H. Jégou. Particular object retrieval with integral max- pooling of CNN activations. In Proceedings of the 4th International Conference on Learning Representations (ICLR), 2016. [152] F. Tombari, L. Di Stefano, S. Mattoccia, and A. Zanetti. Graffiti detection using a time-of-flight camera. 
In Proceedings of the International Conference on Advanced Concepts for Intelligent Vision Systems (ACIVS), pages 645–654. Springer, 2008. [153] S. Vaid, C. Jones, H. Joho, and M. Sanderson. Spatio-textual Indexing for Geo- graphical Search on The Web. Advances in Spatial and Temporal Databases, pages 923–923, 2005. [154] J. Van den Bercken, B. Seeger, and P. Widmayer. A generic approach to bulk loading multidimensional index structures. In Proceedings of 23rd International Conference on Very Large Data Bases (VLDB), volume 97, pages 406–415, 1997. [155] N. Vo, N. Jacobs, and J. Hays. Revisiting IM2GPS in the Deep Learning Era. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2640–2649. IEEE, 2017. [156] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li. Deep Learning for Content-based Image Retrieval: A Comprehensive Study. In Proceedings of the 22nd ACM international conference on Multimedia, pages 157–166. ACM, 2014. [157] G. Wang, Y. Lu, L. Zhang, A. Alfarrarjeh, R. Zimmermann, S. H. Kim, and C. Shahabi. Active key frame selection for 3d model reconstruction from crowd- sourced geo-tagged videos. In Proceeding of the IEEE International Conference on Multimedia & Expo, pages 1–6. IEEE, 2014. 148 [158] T. Weyand, I. Kostrikov, and J. Philbin. Planet-photo Geolocation with Convolu- tional Neural Networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 37–55. Springer, 2016. [159] P. Wieschollek and H. Lensch. Transfer learning for material classification using convolutional networks. arXiv preprint arXiv:1609.06188, 2016. [160] S. Workman, R. Souvenir, and N. Jacobs. Wide-area Image Geolocalization with Aerial Reference Imagery. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3961–3969, 2015. [161] X. Xu, T. Mei, W. Zeng, N. Yu, and J. Luo. Amigo: Accurate Mobile Image Geo- tagging. In Proceedings of the 4th International Conference on Internet Multimedia Computing and Service, pages 11–14. ACM, 2012. [162] C. Yang, P. C. Wong, W. Ribarsky, and J. Fan. Efficient graffiti image retrieval. In Proceedings of the 2nd ACM International Conference on Multimedia Retrieval (ICMR), page 36. ACM, 2012. [163] D.-N. Yang, C.-Y. Shen, W.-C. Lee, and M.-S. Chen. On socio-spatial group query for location-based social networks. In Proceedings of the 18th ACM SIGKDD inter- national conference on Knowledge discovery and data mining (SIGKDD), pages 949–957. ACM, 2012. [164] B. X. Yu and X. Yu. Vibration-based system for pavement condition evaluation. In Applications of Advanced Technology in Transportation (AATT), pages 183–189. American Society of Civil Engineers, 2006. [165] A. R. Zamir, S. Ardeshir, and M. Shah. Gps-tag Refinement Using Random Walks with An Adaptive Damping Factor. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 4280– 4287, 2014. [166] A. R. Zamir and M. Shah. Accurate Image Localization based on Google Maps Street View. In Proceedings of the European Conference on Computer Vision (ECCV), pages 255–268. Springer, 2010. [167] A. R. Zamir and M. Shah. Image Geo-localization based on Multiplenearest Neigh- bor Feature Matching using Generalized Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1546–1558, 2014. [168] A. Zhang, K. C. Wang, B. Li, E. Yang, X. Dai, Y. Peng, Y. Fei, Y. Liu, J. Q. Li, and C. Chen. 
Automated pixel-level pavement crack detection on 3d asphalt surfaces using a deep-learning network. Computer-Aided Civil and Infrastructure Engineering (CACAIE), 32(10):805–819, 2017. [169] C. Zhang, R. Chen, L. Zhu, A. Liu, Y. Lin, and F. Huang. Hierarchical infor- mation quadtree: efficient spatial temporal image search for multimedia stream. Multimedia Tools and Applications, pages 1–23, 2018. 149 [170] L. Zhang, F. Yang, Y. D. Zhang, and Y. J. Zhu. Road crack detection using deep convolutional neural network. In Proceedings of the 23th IEEE International Conference on Image Processing (ICIP), pages 3708–3712. IEEE, 2016. [171] P. Zhao, X. Kuang, V. S. Sheng, J. Xu, J. Wu, and Z. Cui. Scalable Top-k Spatial ImageSearchonRoadNetworks. InProceedings of the International Conference on Database Systems for Advanced Applications (DASFAA), pages 379–396. Springer, 2015. [172] E. Zheng, R. Raguram, P. Fite-Georgel, and J.-M. Frahm. Efficient generation of multi-perspective panoramas. In Proceedings of the International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), pages 86–92. IEEE, 2011. [173] S. Zheng, M.-M. Cheng, J. Warrell, P. Sturgess, V. Vineet, C. Rother, and P. H. Torr. Dense semantic image segmentation with objects and attributes. In Proceed- ings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 3214–3221, 2014. [174] Y.-T.Zheng, M.Zhao, Y.Song, H.Adam, U.Buddemeier, A.Bissacco, F.Brucher, T.-S. Chua, and H. Neven. Tour the World: Building A Web-scale Landmark Recognition Engine. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 1085–1092. IEEE, 2009. [175] Y. Zhou, X. Xie, C. Wang, Y. Gong, and W.-Y. Ma. Hybrid Index Structures for Location-based Web Search. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM), pages 155–162. ACM, 2005. 150
Abstract
Due to continuous advances in camera technologies and camera-enabled devices (e.g., CCTV, smartphones, vehicle blackboxes, and GoPro), urban streets have been documented by massive amounts of images. Moreover, images are nowadays typically tagged with spatial metadata thanks to various sensors (e.g., GPS and digital compass) attached to or embedded in cameras; such images are known as geo-tagged images. The availability of this geographical context enables several image-based smart city applications. Developing such applications requires searching, among the massive amounts of collected images, for images to be used, for example, to train various machine learning algorithms. Thus, there is an immense need for a data management system for geo-tagged images.

Towards this end, it is paramount to build a data management system that organizes images in structures that enable searching and retrieving them efficiently and accurately. On the one hand, the system should overcome the challenge of the missing or inaccurate spatial representation of legacy images collected without spatial metadata, and it should represent the content of an image accurately using an enriched visual descriptor. On the other hand, the system should enable efficient storage of images utilizing both their spatial and visual properties, and thus their retrieval based on spatial-visual queries. To address these challenges, we present a system that includes three integrated modules: a) modeling an image spatially by its scene location using a data-centric approach, b) extending the visual representation of an image with the feature set of multiple similar images located in its vicinity, and c) designing index structures that expedite the evaluation of spatial-visual queries.