LEVERAGING GEOREFERENCED META-DATA FOR THE MANAGEMENT OF LARGE VIDEO COLLECTIONS

by Sakire Arslan Ay

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2010

Copyright 2010 Sakire Arslan Ay

Dedication

This dissertation is dedicated to my lovely daughter Zeynep Ay, my beloved husband and my soul-mate Suat U. Ay, and my beloved parents, Aysel and Sukru Arslan, who gave me unconditional support and love all my life.

Acknowledgements

It is my pleasure to express my gratitude to a large number of people who have contributed, in many different ways. First, I wish to express my deepest gratitude to my supervisor, Prof. Roger Zimmermann. I am totally indebted to his continuous encouragement, tireless efforts, and invaluable guidance. He spent endless hours teaching me how to be a researcher and how to identify research challenges. He travelled to Los Angeles from Singapore two times just to attend my qualifying exam and defence. I was really fortunate to have Dr. Zimmermann as my advisor.

I am also greatly indebted to Prof. Seon Ho Kim, who voluntarily served as my unofficial second advisor. He taught me how to approach a problem, organize my ideas, and present results. This research would never have been possible without him. I will always be grateful to Dr. Kim for the thoughtful discussions during our Skype calls.

My gratitude and appreciation go to my advisory and examining committee, Prof. Cyrus Shahabi and Prof. C.-C. Jay Kuo, for their time and efforts. I greatly appreciate their help and guidance. Special thanks to Prof. Alexander A. Sawchuk and Prof. Viktor K. Prasanna for serving on my qualifying committee.

My sincere gratitude goes to my husband Suat for his unceasing patience when I spent way too much time in the lab and the library. I am totally indebted to his love, caring, and support. Thanks to my daughter Zeynep for letting me know, appreciate, and make the best use of every available time slot. My forever gratitude goes to my parents. Without their unconditional love, support, and encouragement, I would never have made it this far. I also would like to thank Yuling Hsueh for being such a great friend and colleague.

The last three years have been quite challenging for me. I had to pursue my Ph.D. remotely; I didn't have an office to work in, and I didn't have colleagues to chat with and be inspired by. But I managed to learn how to keep my concentration high and establish a working routine within the home environment. I am truly grateful to my advisor Prof. Roger Zimmermann and to Prof. Seon Ho Kim for mentoring my research through the weekly Skype calls.

Table of Contents

Dedication
Acknowledgements
Abstract
Chapter 1: Introduction
  1.1 Background and Motivations
  1.2 Overview of Approach
    1.2.1 Georeferenced Meta-data Acquisition
    1.2.2 Management and Search of Georeferenced Meta-data
    1.2.3 Presentation of Search Results
  1.3 Contributions
  1.4 Organization
Chapter 2: Related Work
  2.1 Augmenting Meta-data to Describe Media Content
  2.2 Searching and Browsing Images using Georeferenced Meta-data
  2.3 Searching Georeferenced Videos
  2.4 Georeferenced Videos and Geographic Information Systems (GIS)
    2.4.1 Exploiting Video Data in GIS Applications
    2.4.2 Ranking Videos based on Geo-spatial Properties
  2.5 Indexing Georeferenced Video Meta-data
Chapter 3: A Framework for Georeferenced Video Search
  3.1 Introduction
  3.2 Data Collecting Device
  3.3 Search Engine
    3.3.1 Application Data Processor
    3.3.2 Database Server
    3.3.3 Media Server
  3.4 User Interface
  3.5 Summary
Chapter 4: Georeferenced Meta-data Acquisition
  4.1 Introduction
  4.2 Viewable Scene Model
  4.3 Collecting Sensor Data
  4.4 Georeferenced Video Recording
    4.4.1 Prototype System for Recording Georeferenced Videos
    4.4.2 Data Collection
  4.5 Summary
Chapter 5: Georeferenced Meta-data Management and Search
  5.1 Introduction
  5.2 Searching Georeferenced Videos
    5.2.1 Searching Videos Using the Viewable Scene Model
    5.2.2 Experimental Evaluation
      5.2.2.1 Methodology
      5.2.2.2 Completeness of Result Video Set
      5.2.2.3 Accuracy of Search Results
  5.3 Vector-based Indexing in Support of Versatile Georeferenced Video Search
    5.3.1 Modeling Viewable Scene using Vector
    5.3.2 Query Processing
      5.3.2.1 Point Query
      5.3.2.2 Point Query with Bounded Distance
      5.3.2.3 Directional Point Query
      5.3.2.4 Directional Point Query with Bounded Distance
      5.3.2.5 Rectangular Range Query
    5.3.3 Implementation
    5.3.4 Experimental Evaluation
      5.3.4.1 Experiments using Real-world Data
      5.3.4.2 Experiments using Synthetic Data
      5.3.4.3 Illustration of Directional Query Results: A Real-world Example
  5.4 Summary
Chapter 6: Relevance Ranking in Georeferenced Video Search
  6.1 Introduction
  6.2 Ranking Georeferenced Video Search Results
    6.2.1 Preliminaries
    6.2.2 Three Metrics to Describe the Relevance of a Video
      6.2.2.1 Total Overlap Area (R_TA)
      6.2.2.2 Overlap Duration (R_D)
      6.2.2.3 Summed Area of Overlap Regions (R_SA)
    6.2.3 Ranking Videos Based on Relevance Scores
    6.2.4 A Histogram Approach for Calculating Relevance Scores
      6.2.4.1 Execution of Geospatial Range Queries Using Histograms
      6.2.4.2 Histogram Based Relevance Scores
  6.3 Experimental Evaluation
    6.3.1 Data Collection and Methodology
      6.3.1.1 Evaluation Metrics
    6.3.2 Comparison of Ranking Accuracy
      6.3.2.1 Comparison of Proposed Ranking Schemes
      6.3.2.2 Comparison with User Feedback
    6.3.3 Evaluating the Computational Performance
    6.3.4 Evaluating Histogram based Ranking
  6.4 Summary
Chapter 7: GRVS: A Georeferenced Video Search Engine
  7.1 Introduction
  7.2 Search Engine
    7.2.1 Database Implementation
    7.2.2 Web User Interface
  7.3 Functionality Illustration
  7.4 Summary
Chapter 8: Summary
References
Appendix A: Generating Synthetic Meta-data for Georeferenced Video Management
  A.1 Introduction
  A.2 Synthetic Video Meta-data Generation Requirements
    A.2.1 Camera Movement Requirements
    A.2.2 Camera Rotation Requirements
  A.3 Video Meta-data Generation
    A.3.1 Generating Camera Movement
      A.3.1.1 Network-based Camera Movement (T_network)
      A.3.1.2 Unconstrained Camera Movement (T_free)
      A.3.1.3 Mixed Camera Movement (T_mixed)
    A.3.2 Generating Camera Rotation
      A.3.2.1 Generating Camera Direction Angles
      A.3.2.2 Calculating Moving Direction
      A.3.2.3 Assigning Random Direction Angles
    A.3.3 Creating Meta-data Output
  A.4 Experimental Evaluation
    A.4.1 Comparison with Real-world Dataset
      A.4.1.1 Datasets and Evaluation Methodology
      A.4.1.2 Comparison of Camera Movement Speed
      A.4.1.3 Comparison of Camera Rotation
    A.4.2 Performance Issues
  A.5 Summary

List of Tables

5.1 Example schema for FOVScene representation.
5.2 Summary of metrics for evaluating accuracy (in terms of number of visible queries).
5.3 Detailed results of point query.
5.4 Results of directional point query with 45°±5°.
5.5 Results of directional point query with r, 45°±5° viewing direction, and δ = 0.3M.
5.6 Results of point query on synthetic meta-data.
5.7 Results of directional point query with 45°±5° on synthetic meta-data.
5.8 Execution times of 1000 point queries on synthetic data without MySQL index.
5.9 Execution times of 1000 point queries on synthetic meta-data with MySQL index.
6.1 Summary of terms.
6.2 Comparison of proposed ranking methods: RL_TA, RL_SA and RL_D.
6.3 The ranked video results and relevance scores obtained for query Q_207.
6.4 Measured computational time per query.
6.5 Rank order comparison between RL^G_TA, RL^G_SA, RL^G_D and RL_TA, RL_SA, RL_D.
7.1 Schema for viewable scene (FOVScene) representation.
A.1 Summary of camera template specification.
A.2 Details of camera template parameters.
A.3 Properties of the synthetic and real-world datasets.
A.4 Characteristics of the camera speed.
A.5 Characteristics of the camera rotation.
A.6 Summary of the time requirements for synthetic data generation.

List of Figures

1.1 Illustration of a generic georeferenced video search system.
1.2 Experimental hardware and software to acquire georeferenced video.
3.1 Framework structure and modules.
3.2 Comparison of two coverage models.
4.1 Illustration of FOVScene model in 2D and 3D.
4.2 Illustration of visibility.
4.3 Camera positions for the collected Moscow, Idaho dataset.
4.4 Visualization of viewable scenes on a map.
5.1 MBR estimations of FOVScenes.
5.2 Number of visible queries per video file.
5.3 Cumulative number of visible queries as a function of number of input videos (40 videos only).
5.4 Comparison of CircleScene and FOVScene coverage.
5.5 Cumulative number of visible queries as a function of number of input videos (whole dataset).
5.6 Cumulative sum of returned video segment lengths as a function of number of input videos (whole dataset).
5.7 Comparison of CircleScene and FOVScene results.
5.8 Effect of query size.
5.9 Illustration of filter and refinement steps.
5.10 FOVScene (F) representation in different spaces.
5.11 Illustration of filter step in point query processing.
5.12 Example of filtering in point query.
5.13 Illustration of the filter step in point query with bounded distance r.
5.14 Illustration of filter step in directional point query with angle β.
5.15 Illustration of filter step in directional point query with β and r.
5.16 Illustration of filter step in range query.
5.17 Problem of single vector model in point query processing.
5.18 Overestimation constant δ.
5.19 Camera positions and query points.
5.20 Retrieved Fs for point query with r = M.
5.21 Query points overlapped with an F (video: 61, F id: 42).
5.22 Retrieved Fs for point query with r = 50 m.
5.23 Recall with varying r.
5.24 Precision with varying r.
5.25 Impacts of bounding distance in video search.
5.26 Retrieved Fs for range query.
5.27 Illustration of directional range query results – filter step.
5.28 Illustration of directional range query results – refinement step.
6.1 Visualization of the overlap regions between query Q_207 and videos V_46 and V_108.
6.2 The overlap between a video FOVScene and a polygon query.
6.3 Grid representation of the overlap polygon.
6.4 Color highlighted visualizations of overlap histograms for videos V_46 and V_108.
6.5 Discounted Cumulated Gain (DCG) curves.
6.6 Processing time per query vs. the number of videos.
6.7 MAP at N for RL^G_TA, RL^G_SA and RL^G_D for varying cell sizes.
6.8 Comparison of precise and histogram based query processing.
6.9 Evaluation of computation time and precision as a function of the grid cell size.
7.1 Georeferenced video search engine web interface.
7.2 Impacts of bounding distance in video search.
7.3 Illustration of directional range query results.
A.1 Analysis of real-world data: average rotation as a function of the camera speed.
A.2 Generator architecture.
A.3 Illustration of camera direction adjustment for vehicle cameras.
A.4 Comparison of camera speed distributions for S_car and RW.
A.5 Illustration of camera movement speed on map.
A.6 Comparison of camera rotation distributions for S_car and RW.
A.7 Comparison of camera rotation distributions for S_pass and RW.
A.8 Illustration of camera rotation on map.
Abstract

The rapid adoption and deployment of ubiquitous video sensors has led to the collection of voluminous amounts of data, and hence there is an increasing need for techniques that manage these collections in a variety of applications, including surveillance and monitoring systems and web-based video search engines, among others. However, the indexing and search of large video databases remains a very challenging task. Current techniques that extract features purely based on the visual signals of a video struggle to achieve good results, particularly at large scale. By considering video-related meta-data, more relevant and precisely delimited search results can be obtained.

Recent technological trends have enabled the cost- and energy-efficient deployment of video cameras together with other sensors (e.g., GPS and compass units). The sensor data acquired automatically in conjunction with videos provides important cues about the video content. In this dissertation we propose to utilize the location and direction meta-data (i.e., georeferenced meta-data) from the GPS and compass sensors to describe the coverage area of mobile video scenes. Specifically, we put forward a viewable scene model which describes video scenes as spatial objects such that large video collections can be organized, indexed and searched effectively using this model. Our work focuses on the following key issues in leveraging georeferenced meta-data for effective search of large video collections:

1) Acquisition of meta-data from sensors. We develop a prototype system to acquire georeferenced meta-data from GPS and compass sensors. The proposed system addresses several challenges in acquiring sensor meta-data, including compatibility among various meta-data streams, synchronization with video frames, etc.

2) Management and search of georeferenced meta-data. We propose a viewable scene model for the automatic annotation of video clips and present the algorithms for searching videos based on this scene model. We also propose a novel vector-based indexing for the efficient search of large video collections.

3) Presentation of search results. We investigate and present three ranking algorithms that use spatial and temporal video properties to effectively rank search results.

Finally, we introduce a prototype of a web-based georeferenced video search engine (GRVS) that utilizes the proposed viewable scene model for efficient video search.

Chapter 1: Introduction

1.1 Background and Motivations

Camera sensors have become a ubiquitous feature in our environment, and more and more video clips are being collected and stored for many purposes such as surveillance, monitoring, reporting, entertainment, or web publishing. Because of the precipitously falling prices of digital cameras, the general public is now generating and sharing its own videos, which are attracting significant interest from users and have resulted in an extensive user-generated online video market catered to by such sites as YouTube. Cisco predicts that, by 2014, global online video will approach 57% of consumer Internet traffic [Cis10]. Companies are developing various business models in this emerging market, with one of the more obvious ones being advertising. In 2008, Forrester Research and eMarketer reported that the global online video advertising market will reach more than US $7.2 billion by 2012 [Wra08]. Many of the end-user cameras are mobile, such as the ones embedded in smartphones.
The collected video clips contain a tremendous amount of visual and contextual information that makes them unlike any other media type. As the acquisition rate increases, effective management of these large collections of digital videos is becoming a critical problem in the user-generated video market. The scope of this issue is illustrated by the fact that video searches on YouTube accounted for 28% of all Google search queries in the U.S. in December of 2009 and that 23% of YouTube's total visits for December originated from Google search [Rob09]. Potential mobile video applications are not limited to user-generated videos. Recently, some service vehicles such as city buses, ambulances, and police cars have been equipped with cameras for various reasons, e.g., to monitor the quality of service provided. These cameras record a vast amount of data that covers large geographic areas. It is often desirable to browse, summarize and search these videos. Better video search has the potential to significantly improve the quality and usability of many services and applications that rely on large repositories of video clips.

While the size of the collected information is likely to exceed even the largest textual databases, video data is immensely difficult to index and search effectively. The barrier in obtaining the high-level semantic meaning that a human observer, or a search system user, would like to attach to a video from the low-level visual features (i.e., the pixel intensities) has not been overcome yet. In the multimedia community this is often called the "semantic gap" [SMW+00]. A significant body of research exists – going back as early as the 1970s – on techniques that attempt to address the semantic gap using content-based methods (refer to [VT02] for a survey of this research area). They extract semantic information from the visual signals of a video. The TREC Video Retrieval Evaluation (TRECVID) [SOK06] benchmarking activity has been promoting progress in content-based retrieval of digital video since 2001. Each year, various feature detection methods from dozens of research groups are tested on hundreds of hours of video [SOK09]. While progress has been very significant in this area, achieving high accuracy with these approaches is often limited to specific domains (e.g., sports, news), and applying them to large-scale video repositories creates significant scalability problems [Zha09, Pav09]. The TRECVID benchmark data was largely focused on broadcast news videos.

A complementary and more tractable approach relies on the processing of meta-data associated with video information. Such auxiliary data may consist of manual, textual annotations (provided by users at social media sites) and/or some automatically generated tags. This method also enables high-level, semantic descriptions of video scenes which are very useful for human users. As a drawback, textual annotations must often be added manually and hence their use is cumbersome for large video collections. Furthermore, text tags can be ambiguous and subjective.

Recent technological trends have opened another avenue to associate more contextual information with videos: the automatic collection of sensor meta-data. A variety of sensors are now cost-effectively available and their data can be recorded together with a video. For example, current smartphones embed GPS, compass, and accelerometer sensors into a small, portable and energy-efficient package (e.g., iPhone 3GS, Nokia 6210 Navigator, etc.).
Also, there exist several GPS-enabled digital cameras which can save the location information with the digital image file as a picture is taken (e.g., Sony GPS-CS1, Ricoh 500SE, Nikon D90 with GP-1). We expect that it will only be a matter of time until many camcorders include GPS and compass sensors. The meta-data generated by such sensors represents a rich source of information that can be mined for relevant search results. A significant benefit is that sensor meta-data can be added automatically and represents objective information (e.g., the position and time).

Location is one of the important cues when people are retrieving relevant videos. A search keyword can often be interpreted as a point or regional location in the geo-space. Some types of video data are naturally tied to geographical locations. For example, video data from traffic monitoring may not have much meaning without its associated location information. Thus, in such applications, one needs a specific location to retrieve the traffic video at that point. Hence, combining video data with its location information can provide an effective way to index and search videos, especially when a database handles an extensive amount of video data.

There are still many open, fundamental research questions in this field. Most videos captured are not panoramic, and as a result the viewing direction becomes very important. GPS data only identifies object locations, and therefore it is imperative to investigate the natural concepts of a viewing direction and a view point. For example, we may be interested in viewing a building only from a specific angle. The question arises whether a video database search can accommodate such human-friendly views. Cameras may also be mobile, and thus the concept of a camera location is extended to a trajectory. Therefore, unlike for still images, a single point location won't be adequate to describe the geographic region covered in the video. The continuous evolution of a camera's location and viewing direction should be modeled and stored in the video database.

A preliminary example of a similar type of work is Google Street View, which implements such a concept with panoramic – but static – still images, from various positions along many streets in the world. In the current implementation, the images need to be taken by specially adapted directional cameras. The system cannot utilize other geographically annotated images (e.g., user images).

The collection and fusion of multiple sensor streams such as the camera location, field-of-view, direction, etc., can provide a comprehensive model of the viewable scene. The objective is then to describe the geographic coverage of the video content using this viewable scene model in order to retrieve the video sections that show a particular point or region. An effective modelling and indexing of the viewable scene meta-data will enable fast and efficient search of large video collections.

Researchers have only recently started to investigate and understand the implications of the trends brought about by technological advances in sensor-rich cameras. Though some approaches have been proposed for managing still images based on location cues [TLR03, NYGMP05, TBC06, KN08], little work has been done regarding the management of videos [KKL+03, LCS05]. There is tremendous potential that has yet to be explored.

1.2 Overview of Approach

In this dissertation we propose the use of the geographical properties of video clips as an effective means to aid in the search of large video archives.
We focus on how to quantify, store and search the viewable scene of captured videos. We model the viewable space of a scene with parameters such as the camera location, the angle of the view, and the camera direction. We refer to these sensor data as georeferenced meta-data. Due to this mobility, as the cameras move or rotate, their viewable scenes dynamically change over time. The georeferenced meta-data streams have to be acquired from sensor-equipped cameras, stored within an appropriate catalog or schema, and indexed for efficient querying and retrieval. The results of a user query should be organized for effective browsing of relevant videos. This automatic and reliable annotation forms the basis of the proposed framework.

In this dissertation we focus on three main issues in georeferenced video management: 1) acquisition of meta-data from sensors, 2) management of georeferenced meta-data, i.e., modeling, indexing and searching of meta-data, and 3) presentation of search results. Figure 1.1 illustrates a generic georeferenced video search system. The components that provide solutions to the above outlined issues are highlighted in the figure. We will next discuss each of these issues in more detail and state the challenges addressed.

Figure 1.1: Illustration of a generic georeferenced video search system.

1.2.1 Georeferenced Meta-data Acquisition

We focus on the very essential geographic properties of the video content captured from a video camera, namely the camera location and camera direction. This information can be obtained from the attached GPS receiver and compass. Video clips can be captured with various camera models. The data collected from these sensors are sampled with different rates and precisions. Their synchronization with video frames should be designed so that the processing accuracy is maximized and the amount of sensor data stored is minimized. Some measurement errors and missing samples should also be considered. Another challenge in data collection is handling a variety of video technologies (and cameras) and sensors. Without any standard tagging method, the compatibility among various meta-data would be a critical problem.

To collect the sensor data, we have constructed a prototype system which includes a GPS sensor, a 3D digital compass and a high-definition video camera. A program was developed to acquire, process, and record the georeferences along with the video streams. Figure 1.2 shows our prototype system for recording georeferenced videos. The recording software addresses the above outlined challenges in meta-data acquisition.

Figure 1.2: Experimental hardware and software to acquire georeferenced video (Pharos iGPS-500 receiver, OceanServer OS5000-US compass, JVC JY-HD10U camera).

1.2.2 Management and Search of Georeferenced Meta-data

The management of meta-data involves the modeling, indexing and searching of meta-data.

The first step is modeling and describing the geographic coverage of the video content using the sensor data. We model the camera's viewable scene as a pie-shaped region which can be estimated using the camera's location, direction and field-of-view angle. We refer to this as the viewable scene model. Using this viewable scene model, videos are stored as spatial objects in the database and searched based on their geographic properties. The scene model is computationally simple and at the same time efficient for indexing and searching the database.
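As a rough illustration of the pie-shaped model described above (the precise FOVScene formulation is given in Chapter 4), the following sketch builds a single viewable scene polygon from a camera position, compass heading, field-of-view angle, and maximum viewable distance. The function name, the local planar coordinate system, and the parameter names are illustrative assumptions, not the system's actual interface.

```python
import math

def fov_scene_polygon(cam_x, cam_y, heading_deg, fov_deg, max_dist, arc_steps=8):
    """Approximate a pie-shaped viewable scene as a polygon.

    cam_x, cam_y : camera position in a local planar coordinate system (meters)
    heading_deg  : compass-style viewing direction (0 = north, clockwise)
    fov_deg      : horizontal field-of-view angle of the lens
    max_dist     : maximum viewable distance from the camera (meters)
    Returns a list of (x, y) vertices: the camera position followed by points
    sampled along the arc that bounds the sector.
    """
    # Convert the compass heading (clockwise from north) to a math angle
    # (counter-clockwise from east) for the trigonometry below.
    center = math.radians(90.0 - heading_deg)
    half = math.radians(fov_deg) / 2.0
    poly = [(cam_x, cam_y)]
    for i in range(arc_steps + 1):
        a = center - half + (2.0 * half) * i / arc_steps
        poly.append((cam_x + max_dist * math.cos(a),
                     cam_y + max_dist * math.sin(a)))
    return poly

# Example: one meta-data sample -- camera at the origin, facing north-east,
# with a 60 degree lens and a 100 m viewable distance.
scene = fov_scene_polygon(0.0, 0.0, heading_deg=45.0, fov_deg=60.0, max_dist=100.0)
```

Each meta-data sample of a moving camera would yield one such polygon, so a video is then represented by the resulting sequence of spatial objects.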
The second step involves indexing the meta-data for efficient and effective search of video viewable scenes. When a large collection of video meta-data is stored in a database, the cost of processing spatial queries may be significant because of the computational complexity of the operations involved, for example, determining the overlap between the viewable scene model and a polygon-shaped query region. Therefore, such queries are typically executed in two steps: a filter step followed by a refinement step [Ore86, BKSS94]. The idea behind the filter step is to approximate the complex spatial shapes with simpler outlines (e.g., a minimum bounding rectangle, MBR [BKSS90]) so that a large number of unrelated objects can be dismissed very quickly, based on their simplified shapes, at the earlier stage of searching. The resulting candidate set from the filter step is then further processed during the refinement step to determine the exact results based on the exact geometric shapes. We devise a novel vector-based approximation of viewable scenes for the filter step. The vector model represents the viewable scene of the camera at a particular time instant using only the camera position and the center vector. The contents of a video are then represented by a series of vectors.

Video searching should be able to take full advantage of the collected meta-data for various application requests. Beyond the importance of the geographic location where a video is taken, there are other obvious advantages in exploiting the spatial properties of video, because the operation of a camera is fundamentally related to geometry. When a user wants to find images of an object captured from a certain viewpoint and from a certain distance, these semantics can be interpreted as geometric relationships between the camera and objects. Such relationships allow for different query types such as, for example, distance-based queries, direction-based queries, or a combination of the two. The overall goal of this process is to extract more meaningful results for end-users with high efficiency that scales to large video archives. For the implementation of the query types, we introduce new database query functionalities. Moreover, we evaluate the proposed vector-based index structure for the new query types and empirically show that it can be employed as an efficient filter step for large-scale georeferenced video search applications.

1.2.3 Presentation of Search Results

In large-scale video archives even a very accurate search may produce a significant number of resulting video segments. Hence the effective presentation of such results is of critical importance; otherwise it may require a time-consuming browsing of all the results by the user. To enhance the effectiveness of the result presentation, a critical approach is to quantify the relevance of resulting videos with respect to the given query and to present the results based on their relevance ranking. In georeferenced video search, we quantify the relevance based on spatial and temporal overlap between the video contents and the query region. We present three ranking algorithms that use spatial and temporal properties of georeferenced videos to effectively rank search results. To allow our techniques to scale to large video databases, we further introduce a histogram-based approach that allows fast online computations.
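To make the ranking idea concrete, the sketch below accumulates two simple relevance measures of one video against a query region: a spatial score (summed area of overlap between its viewable scenes and the query) and a temporal score (the time during which the scenes overlap the query). It is only an illustration of the general idea; the dissertation's actual metrics are defined in Chapter 6, and the use of the shapely geometry library, as well as the names here, are assumptions.

```python
from shapely.geometry import Polygon  # assumed helper library for polygon overlap

def relevance_scores(scenes, query_poly, frame_dt):
    """Illustrative spatial and temporal relevance of one video w.r.t. a query.

    scenes     : list of viewable scene polygons, one per meta-data sample,
                 each given as a list of (x, y) vertices
    query_poly : query region as a list of (x, y) vertices
    frame_dt   : time between consecutive meta-data samples, in seconds
    """
    q = Polygon(query_poly)
    total_overlap_area = 0.0   # spatial relevance: accumulated overlap area
    overlap_duration = 0.0     # temporal relevance: time the query stays in view
    for verts in scenes:
        inter = Polygon(verts).intersection(q)
        if not inter.is_empty:
            total_overlap_area += inter.area
            overlap_duration += frame_dt
    return total_overlap_area, overlap_duration
```

Result videos could then be sorted by either score, or by a weighted combination of the two.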
Additionally, the presentation style of the results, such as a map-based user interface, also affects the user's appreciation of the results. We introduce a web-based search system where users can visually draw the query region and the viewing direction on the map. The result of a query contains a list of the overlapping video segments that show the query region from the query viewpoint. For each returned video segment, we display the corresponding viewable scene regions on the map, and during video playback we highlight the viewable scene region whose timecode is closest to the current video frame. Note that relevance ranking and presentation style are not purely technical issues and may therefore require an intensive user study.

1.3 Contributions

The primary contributions of this thesis are essentially as follows:

• Description of an overall framework for georeferenced video search. We propose a general framework that will serve as a test-bed for georeferenced video search applications. We describe the required components to address the key issues and challenges in managing georeferenced videos. The goal is to enhance video search, especially with very large video collections.

• Automatic annotation of video clips based on the viewable scene model. We propose a viewable scene model derived from the fusion of location and direction sensor information with a video stream. It strikes a balance between the complexity of its analytical description and the efficiency with which it can be used for fast searches.

• Searching videos based on the viewable scene model. We present the algorithms for searching the georeferenced meta-data and retrieving the videos relevant to a user query. We illustrate the benefits of our approach in retrieving the most relevant results through analysis on real-world georeferenced videos.

• Vector-based indexing for the efficient search of videos. We propose a novel vector-based approximation model for efficient indexing and searching of georeferenced videos based on the proposed viewable scene model. The vector model successfully supports new geo-spatial video query features. We evaluate the run-time performance of the vector model using our real-world georeferenced video meta-data as well as a large collection of synthetically generated meta-data.

• Ranking video search results. We introduce three ranking algorithms that consider the spatial, temporal and spatio-temporal properties of georeferenced video clips.

• Implementation of a web-based georeferenced video search system. We present a prototype of a web-based georeferenced video search engine (GRVS) that utilizes the proposed viewable scene model for efficient video search. The map-based interface, enhanced with visual features, provides the user with a clear understanding of the geo-location seen in the video.

• A data generator for producing synthetic georeferenced video meta-data. We propose an approach for generating synthetic video meta-data with realistic geographical properties for mobile video management research. The goal is to generate large collections of georeferenced video meta-data to enable comprehensive performance evaluations of realistic application scenarios at large scale. Users can control the behavior of the proposed generator using various input parameters.

1.4 Organization

The remainder of this dissertation is organized as follows. Chapter 2 provides a survey of the related work. Chapter 3 presents the design of the framework for handling georeferenced videos. Chapter 4 introduces the proposed viewable scene model and explains the meta-data collection process. The management and search of georeferenced meta-data is detailed in Chapter 5. Chapter 6 introduces the video ranking algorithms for result presentation. This is followed by a description of the web-based georeferenced video search engine in Chapter 7.
Chapter 8 concludes with a summary of the proposed research. Finally, Appendix A introduces the georeferenced synthetic meta-data generator.

Chapter 2: Related Work

In this chapter, we provide a survey of the research work on the management and search of multimedia documents (specifically images and videos) using location cues. We start by presenting a thorough classification of meta-data creation for multimedia content. Of special interest is the geographical meta-data. Next, we review the related work that utilizes the associated geographical meta-data for organizing, searching and browsing image and video collections. Our discussion starts with the methods that specifically consider still images and then moves on to videos. The various techniques employed by these systems are orthogonal to our work and can therefore be combined with our proposed system to enhance the search functionality. Finally, we review some prior work from two related research fields, namely Geographical Information Systems (GIS) and spatio-temporal databases. We list some example applications that build video-enabled geographic information systems and discuss the related techniques for indexing and storage of georeferenced video meta-data.

2.1 Augmenting Meta-data to Describe Media Content

Media data, specifically images and videos, are inherently hard to index and search. Current techniques to generate meta-data that describes media content can broadly be classified into three groups:

• Keywords extracted from the visual content of images/videos.
• Manually added captions to images/videos and annotations derived from the context in which they appear.
• Automatically collected meta-data (most noticeably location data) from attached sensor devices.

The techniques in the first group try to extract semantic information from the visual images [VT02]. In recent years much of the research in image and video retrieval based on visual features has been driven largely by the NIST TRECVID evaluations [SOK06]. NIST provides open TRECVID benchmark datasets which are used by researchers to evaluate their techniques and establish comparisons with the related work on identical retrieval tasks [SOK09]. One of the relevant key evaluation tasks is the "semantic indexing" task, which involves the assignment of semantic tags to images/videos for browsing, filtering and searching videos. Searching across TRECVID data collections is done using a few core tools, including text keyword search, query-by-example (finding similar images or video shots), and, more recently, concept-based search that applies machine learning techniques to create visual concept detectors which can be searched with text keywords. In some recent systems, these tools are applied independently and then the results are joined as a weighted sum to create a multi-modal ranking of the retrieved media files. Even though significant progress has been made [DBFF02, BF01] in this field, these techniques are not yet practical for semantic annotation of images and videos at large scale. Recognizing the semantic themes in images and videos from these low-level features still remains a challenging task. Therefore augmenting contextual information with media data is often beneficial [Zha09, Pav09]. The techniques in the second group gather the meta-data from a variety of sources related to the media document. For example, the images and videos published on the web can be annotated using text keywords obtained from filenames or the text around the image on the web page.
In addition, broadcast news videos can be described using text keywords extracted from the closed captions. However, these techniques cannot be utilized when there is no textual information associated with the media file (e.g., personal image and video collections). Recently, manual annotation of photo collections, where users manually assign keywords and tags, has become quite popular [AN07, AHS06, NYGMP05]. These systems provide web-based annotation interfaces allowing users to tag their images with meta-data facets and select the appropriate categories to describe their pictures [NSPGM04a, YSLH03] and videos [HCJL03]. Some of these methods, in particular the labeling techniques, were deployed in commercial systems, the most popular of which are Apple's iPhoto, Google's Picasa, and Yahoo's Flickr for images, and YouTube and MySpace for videos. There has been a major effort from both the research community and the commercial providers to automate and ease the tagging process [SK00, KS06]. However, annotation is still cumbersome and time-consuming. In addition, it can be quite noisy and error prone; therefore search results often suffer from low precision. Recently, some research work proposed to improve the performance of existing image/video tags and annotations by utilizing machine learning and other artificial intelligence technologies [CTH+09]. However, most of these techniques require a large number of balanced, labelled image/video samples.

With the recent advances in technology, we now see a new generation of programmable media capture devices with several attached sensors and network connection capabilities. For example, the location where the shot was taken can automatically be read from a GPS device. The camera heading can be acquired from an attached compass. There has been some work on how to describe still images assuming the location and heading of the camera at the time the photo was taken are available [TBC06]. Many mobile portable devices have embedded GPS and compass sensors (e.g., Nokia 6210 Navigator, Apple iPhone 3GS and 4G). Also, there exist several GPS-enabled digital cameras which can save the location information with the digital image file as a picture is taken (e.g., Sony GPS-CS1, Ricoh 500SE, Jobo Photo GPS, Nikon D90 with GP-1). Very recent models additionally record the current heading (e.g., Ricoh SE-3, Solmeta DP-GPS N2). However, all these cameras support geotagging for still images only. Our recording software automatically records the location and heading updates for every video frame, temporally aligned with the video. As more still cameras (with video mode) and video camcorders incorporate GPS and compass sensors, more location- and direction-tagged videos will be produced and there will be a strong need to perform efficient and effective search on those video data.

2.2 Searching and Browsing Images using Georeferenced Meta-data

There has been significant research on organizing and browsing personal photos according to geographical location. Toyama et al. [TLR03] introduced a meta-data powered image search and built a database, also known as the World Wide Media eXchange (WWMX), which indexes photographs using location coordinates (latitude/longitude) and time. This work specifically explores methods for acquiring location tags, optimizing an image database for efficient geo-tagged image search, and exploiting meta-data in a graphical user interface for browsing. A number of additional techniques in this direction have been proposed [NSPGM04b, PG05, JNTD06, LHCW07].
All these techniques use only the camera geo-coordinates as the reference location in describing images. We instead rely on the field-of-view of the camera to describe the scene. More related to our work, Ephstein et al. [EOWZ07] proposed organizing large collections of images based on scene semantics. The authors related images with their view frustum (viewable scene) and used a scene-centric ranking (termed Geo-Relevance) to generate a hierarchical organization of images. Several additional methods have been proposed for organizing [KT05, MG05, SS08] and browsing [GST+06, TBC06] images based on camera location, direction and additional meta-data. Although these research works are similar to ours in using the camera field-of-view to describe the viewable scene, their main contribution is on image browsing and grouping of similar images together. Some approaches [TBC06, KN08] use location and other meta-data, as well as tags associated with images and the images' visual features, to generate representative images within image clusters. Geo-location is often used as a filtering step. Some techniques [EOWZ07, SS08] solely use the location and orientation of the camera in retrieving the "typical views" of important objects. However, their contribution is then on the segmentation of image scenes and on organizing photos based on image scene similarity. Our work describes a broader scenario that considers mobile cameras capturing geo-tagged videos and the associated view frustum, which is dynamically changing over time. Therefore, unlike for still images, a single point location won't be adequate to describe the geographic region covered in the video. Due to mobility, the concept of a camera location is extended to a trajectory. Furthermore, our searching technique does not target any specific application domain and therefore can easily be applied to any specific application.

Complementary to video search based on geographic properties, visual cues extracted from the video content can be used to improve the search functionality. For example, Zheng et al. [ZZS+09] proposed an earth-scale landmark recognition engine that leverages the multimedia data on the web (i.e., geo-tagged pictures, travel guide articles) with object recognition and clustering techniques. In addition, Simon et al. [SS08] proposed to use the camera field-of-view to identify and segment interesting objects from static 3D scenes. More recently, Crandall et al. [CBHK09] investigated the relationship between location and visual content in large photo collections. Their approach considers the task of estimating where a photo was taken based on its content, using both attributes extracted from the image content and text tags. Similar to these techniques, the search capabilities of our georeferenced search system can be advanced by leveraging visual features in addition to geographic properties.

2.3 Searching Georeferenced Videos

There exist only a few systems that associate videos with their corresponding geo-location. Hwang et al. [HCJL03] and Kim et al. [KKL+03] propose a mapping between the 3D world and the videos by linking the objects to the video frames in which they appear. Such a mapping would enable browsing the 3D world to extract the associated videos, or browsing the video to extract the visible objects within it. Their work mentions using GPS location and camera direction to build links between video frames and world objects. However, they neglect to provide any details on how this is established and how accurate it is.
Furthermore, their work differs from ours by only targeting specific objects within the geo-space – ignoring the time information – whereas our approach keeps track of the viewable region over time. Liu et al. [LCS05] presented a sensor-enhanced video annotation system (referred to as SEVA) which enables searching videos for the appearance of particular objects. SEVA serves as a good example of how a sensor-rich, controlled environment can support interesting applications; however, it does not propose a broadly applicable approach to geo-spatially annotate videos for effective video search. For example, in the context of what is being annotated, SEVA tags video streams with information about what was visible within the video, whereas we aim to annotate the viewable region within the video.

The techniques mentioned above present ideas about how to search videos based on geographic properties, most notably the camera location. However, in addition to the importance of the geographic location where a video is taken, there are other obvious advantages in exploiting the spatial properties of video. For example, the user query might specify a certain view direction and a maximum required distance with respect to the query point or region. Our approach supports such queries by utilizing the camera direction and the distance of the camera location to the query point. Furthermore, the spatio-temporal properties of the overlap between query points and viewable scenes can be used in estimating the relevance of search results. To our knowledge, the proposed techniques in this study are the first to utilize direction- and distance-based queries in georeferenced video search and to address video ranking based on viewable scene cues. We believe that our approach serves as a general-purpose and flexible video management and search mechanism that is applicable to any type of video with associated location and direction tags. Consequently, it can be the basis for a tremendous number of multimedia applications.

2.4 Georeferenced Videos and Geographic Information Systems (GIS)

2.4.1 Exploiting Video Data in GIS Applications

The georeferenced video meta-data contain significant information about the region where the videos were captured and can be effectively utilized in various GIS applications to, for example, visually navigate and query the geo-space. For example, Zeiner et al. [ZKH+03] described a number of use cases and applications of video-assisted GIS in the area of multimedia archives and territorial networks. Recently, Mills et al. [MCK+10] proposed to utilize video for research in geography and described case studies of geo-spatial video applications from post-disaster recovery. Our viewable scene model provides an accurate estimation of the region covered in the videos. The georeferenced meta-data (i.e., scene descriptions) collected and annotated using our recording software can be employed by various GIS systems to build video-enabled geographic information systems.

2.4.2 Ranking Videos based on Geo-spatial Properties

Ranking videos based on geo-spatial properties has not been well studied in the multimedia community. However, some techniques in the GIS community investigate the geo-spatial relevance of documents based on their spatial properties. As an example, Beard et al. [BS97] studied the basic spatial and temporal relevance calculation methods.
More recently, Larson et al. [LF04] provided a comprehensive summary of geospatial ranking techniques, and Gobel and Klein [GK02] proposed a global ranking algorithm based on spatial, temporal and thematic parameters. To quantify the relevance of a video's viewable scene to a given query we leveraged some of these fundamental, pioneering spatial ranking techniques. Although similar ranking schemes have been studied before, our work is novel in applying these techniques to rank video data based on viewable scene descriptions.

2.5 Indexing Georeferenced Video Meta-data

When a large collection of videos is stored in a database, the cost of processing spatial queries on the georeferenced meta-data may be significant because of the computational complexity of the operations involved. There is a strong need for an indexing structure for efficient storage and querying of our georeferenced video annotations. Conventional database and GIS technology can be used to partially manage some non-temporal properties of video objects such as the location of a camera and the trajectory of camera movement. However, we consider not only the movement of a camera but also the rotation of the camera heading direction. In addition to the traditional point and range queries, our proposed framework supports searching for specific camera view points and viewing distances. Some researchers have investigated the relative geographical locations of objects within a scene [TVS96]. However, to the best of our knowledge, there has been no study on modelling the actual geographical information of video scenes and utilizing it for efficient video search.

In our framework, we propose to use a histogram to accumulate the relevance scores for the camera viewable scenes. Data summarization using histograms is a well-studied research problem in the database community. A comprehensive survey of histogram creation techniques can be found in [Ioa03]. Ephstein et al. [EOWZ07] use a grid of voting cells to discover the important parts of an image. Their techniques use only the spatial attributes to discover the relevant segments of the image scene, whereas our ranking methods incorporate both spatial and temporal attributes in calculating relevance.

Chapter 3: A Framework for Georeferenced Video Search

3.1 Introduction

This chapter proposes a general framework for georeferenced video search applications, as shown in Figure 3.1. We emphasize the mobility of cameras in the framework because of the ubiquity of mobile devices and the prominent importance of the geographic properties of moving cameras. We envision that more and more user-generated videos are produced from mobile devices such as cellular phones. To address the issues of georeferenced video search outlined in Chapter 1, our framework consists of three main parts: the data collection with mobile devices, the search engine to store, index, search and retrieve both the meta-data and video contents, and the user interface to provide web-based video search services. The proposed framework provides a test-bed for implementing the video search system illustrated in Figure 1.1. In the following chapters, we will introduce our solutions for the essential modules in implementing the framework. Note that we are not necessarily providing optimal solutions for all modules discussed in this section, but instead intend to provide an example solution for each critical module to demonstrate the feasibility and adequacy of the proposed georeferenced video search framework.

Figure 3.1: Framework structure and modules.
3.2 Data Collecting Device

At the mobile device level, the main objective is to capture the sensor inputs and to fuse them with the video for future storage and retrieval of videos. In the framework, a mobile device can be any camera equipped with various sensors and a communication unit. A good example is a smartphone such as Apple's iPhone 3GS, which includes GPS, a digital compass, an accelerometer, a 5 megapixel camera, a WiFi/broadband data connection, and programming capability. The following issues need to be considered at the device level.

First, the Data Collection Module in Figure 3.1 captures videos with sensor inputs through various sensors including camera, GPS receiver, compass, accelerometer, etc. Sensor signals can be affected by noise, so they might be checked and refined in the Sensor Signal Processing Module. For example, accelerometer input can be filtered for a clearer signal. Sensor measurement errors can be detected and missing sample values can be estimated here. Then, the sampled sensor data should be synchronized and tagged in accordance with the recorded video frames (Format Module). This automatic synchronized annotation forms the basis of the proposed framework. Assuming multiple sensors with different sampling rates and precisions (e.g., for each second, 30 frames of video, 1 GPS location coordinate, and 40 direction vectors), values might be manipulated using numerical methods such as interpolation, averaging, etc. The sensor meta-data can be sampled either periodically, or aperiodically by applying adaptive methods. An adaptive method can be more efficient and desirable in a large-scale application since it can minimize the amount of captured data and can support a more scalable system. The synchronization among sensor inputs and video frames should be designed to maximize processing accuracy and to minimize the amount of meta-data.
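The sketch below illustrates one simple way such synchronization could be done for the example rates above (30 video frames, 1 GPS fix, and roughly 40 compass readings per second): the camera position is linearly interpolated to each frame timestamp and the nearest compass heading is attached. The record layout and function names are assumptions for illustration, not the prototype's actual implementation.

```python
import bisect

def interpolate_position(gps, t):
    """Linearly interpolate (lat, lon) from timestamped GPS samples.
    gps: time-sorted list of (timestamp_sec, lat, lon)."""
    times = [s[0] for s in gps]
    i = bisect.bisect_left(times, t)
    if i == 0:
        return gps[0][1], gps[0][2]
    if i >= len(gps):
        return gps[-1][1], gps[-1][2]
    t0, lat0, lon0 = gps[i - 1]
    t1, lat1, lon1 = gps[i]
    w = (t - t0) / (t1 - t0)
    return lat0 + w * (lat1 - lat0), lon0 + w * (lon1 - lon0)

def nearest_heading(compass, t):
    """Pick the compass heading sampled closest to time t.
    compass: time-sorted list of (timestamp_sec, heading_deg). A nearest-neighbor
    lookup avoids the 359 -> 0 degree wrap-around of naive interpolation."""
    times = [s[0] for s in compass]
    i = bisect.bisect_left(times, t)
    candidates = [c for c in (i - 1, i) if 0 <= c < len(compass)]
    best = min(candidates, key=lambda c: abs(compass[c][0] - t))
    return compass[best][1]

def annotate_frames(n_frames, fps, gps, compass):
    """Produce one georeference record per video frame."""
    records = []
    for k in range(n_frames):
        t = k / fps
        lat, lon = interpolate_position(gps, t)
        records.append({"frame": k, "time": t, "lat": lat, "lon": lon,
                        "heading": nearest_heading(compass, t)})
    return records
```

An adaptive variant could instead emit a record only when the position or heading changes by more than a threshold, which is one way to reduce the amount of stored meta-data.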
However, when we consider a large video data transfer from a mobile device which has expensive communication costs or limited communication capability, this approach is not cost-efficient. Considering the bandwidth and power consumption costs of transmitting large amounts of data such as video content, the immediate transmission of potentially irrelevant data is an inefficient use of resources. The Communication Module defines and controls the data transmission between the mobile device and the server based on a predefined protocol. This module needs to provide versatile ways to accommodate diverse applications and service models.

3.3 Search Engine

The goal of the search engine is to retrieve more meaningful results for end-users in a highly efficient way that scales to large video archives. In our framework, we define the search engine as the collection of all components that store, index, search, and retrieve both the meta-data and video contents. The search engine consists of three components as shown in Figure 3.1: (1) the database server (DB), which manages the spatio-temporal meta-data and performs video search based on them, (2) the media server (MS), which stores and retrieves videos, and (3) the application data processor (AP), which manages the details of application-dependent features such as meta-data modeling, query types, etc. The framework does not assume any specific database management system or media server. However, the AP can be customized for individual online video applications. Note that, in the most general and largest scale applications, multiple database servers and media servers may be distributed across a wide area and they may collaborate with each other.

3.3.1 Application Data Processor

To utilize the captured geographic properties of videos for searching, the framework represents the coverage area of video scenes as spatial objects in a database, i.e., it models the coverage area using the collected meta-data (Modeling Module). This modeling effectively converts the problem of video search into the problem of spatial object selection in a database. The effectiveness of such a model depends on the availability of sensor data. For example, in an application with only camera location data from GPS, the potential coverage area of video frames can be represented as a circle centered at the camera location (Figure 3.2(a)). In another application with extra camera direction data, the coverage area can be more accurately refined like the pie slice shown in Figure 3.2(b) (more details in Figure 4.1). Thus, videos represented by the pie model can be searched more effectively. Modeling is essential for indexing and searching of video contents in a database because the query functionality and performance are greatly impacted by the spatial modeling of videos.

Figure 3.2: Comparison of two coverage models.

Video searching should be able to fully take advantage of the collected meta-data for various requests by applications. Search types exploiting the geographic properties of the video contents may include not only conventional point and range queries (i.e., overlap between the covered area of a video and the query range), but also new types of video specific queries. For example, one might want to retrieve only frames where a certain small object at a specific location appears within a video scene, but with a given minimum size for better visual perception. Usually, when the camera is closer to the query object, the object appears larger in the frame.
Thus, we can devise a new search type with a range restriction for the distance of the camera location from the query point; we term this a distance query. Similarly, the camera view direction can be an important factor for the image perception of an observer. Consider the case where a video search application would like to exploit the collected camera directions for querying, representing a directional query. The Query Processing Module can implement a set of user-defined query types for an application. Note that, for the implementation of the new query types, new indexing techniques or database query functionalities might need to be introduced. Moreover, the evaluation of new query types should be fast enough to be practical, especially for large scale video search. There has been little research on these issues. In Section 5.3.2, our implementation of new query types will be described in detail.

In a large collection of videos, a search usually results in multiple video matches. The challenge is that human visual verification of the video results may take a significant amount of time due to the overall length of videos. To enhance the effectiveness of the result presentation, an approach is to quantify the relevance of the resulting videos with respect to the given query and to present the results based on their relevance ranking. The difficulty lies in the fact that the human appreciation of relevance is very subjective and so it is challenging to quantify. In our framework, the Ranking Module harnesses objective measurements to quantify the relevance between a query and videos in two different ways: (1) spatial relevance: the overlapping coverage area between the query range and a video, and (2) temporal relevance: the overlapping covered time.

The main objective of the framework is to search videos using their spatio-temporal meta-data. However, it is also well known that video search can further be improved by leveraging the features extracted from the visual video content. As a complementary approach to enhance searchability, the framework can be combined with optional video processing modules. For example, the Visual Analysis Module can synergistically enhance the accuracy of the ranking process by accommodating visual features extracted by the Feature Extraction Module. To calculate the relevance based on the visual content, existing video ranking techniques can be adopted [LZLM07]. However, considering that most techniques are proven to work effectively on specific domains, it remains uncertain how well these techniques can perform with unconstrained video datasets in a general video search framework such as the one presented. Some recent work [JB08] proposed to analyze the visual similarities among the resulting images to choose the representative images that answer the search keywords well. The well-connected images that are found to be similar to a majority of the resulting images are returned as the most relevant. Such an approach can be applied in our framework for content based ranking of videos. Since our framework targets general video search while most content based search techniques are domain-specific, any discussion of combining our approach and a content based technique may not be meaningful without a specific application. Consequently, we do not provide further details here.
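To make the two relevance notions of the Ranking Module concrete, the following sketch shows one simple way such scores could be combined. It is an illustration only: the rectangle-based overlap, the equal weighting, and all function names are assumptions for this example rather than the ranking method prescribed by the framework.

# Hedged sketch: spatial relevance as the fraction of the query rectangle covered
# by a video's coverage rectangle, temporal relevance as the fraction of the query
# time window covered by the video. Rectangles are (min_x, min_y, max_x, max_y);
# time intervals are (start, end). The 0.5/0.5 weighting is an arbitrary assumption.

def rect_overlap_area(a, b):
    """Area of intersection of two axis-aligned rectangles (0.0 if disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0

def interval_overlap(a, b):
    """Length of the intersection of two time intervals (0.0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def relevance(query_rect, query_interval, video_rect, video_interval,
              w_spatial=0.5, w_temporal=0.5):
    query_area = (query_rect[2] - query_rect[0]) * (query_rect[3] - query_rect[1])
    query_len = query_interval[1] - query_interval[0]
    spatial = rect_overlap_area(query_rect, video_rect) / query_area
    temporal = interval_overlap(query_interval, video_interval) / query_len
    return w_spatial * spatial + w_temporal * temporal

# Example: a video whose coverage overlaps half of the query region for the
# entire query window receives a score of 0.75 under these assumptions.
print(relevance((0, 0, 100, 100), (0, 60), (50, 0, 200, 100), (0, 120)))

In a real deployment the spatial term would be computed from the exact viewable scene geometry rather than a rectangle, and the weights would likely be tuned per application or learned from user feedback.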
When a user query is received from the AP and translated into a spatio-temporal selective query according to the video scene model, the database is searched using conventional query processing techniques. The database server unit can consist of the following modules:

Database Insertion Module: inserts the spatio-temporal video scene descriptions into the database.

Database Search Module: searches the database based on the query specifications received from the application data processor unit.

Storage Module: the video scene meta-data (based on the model) are stored using appropriate data structures and indexes.

3.3.3 Media Server

The role of the media server is to store the actual video content and to provide a streaming service to users. In general, the media server obtains a list of video segments as query results and transmits them to the user interface in a predefined way. Different user interfaces can present the search results in different ways, so the media server responds to the requests of the user interface.

In a large collection of videos with heterogeneous coding techniques, a video might need to be transcoded by the Transcoding Module when it arrives at the server. For example, a user can collect videos in any format but the application might require certain predefined formats for service. Similarly, when users request the display of the query results, videos can be transcoded to accommodate the different video formats supported by the server and the user devices. The Storage Module stores videos based on the underlying storage system and the media server technology. The query results from the application data processor are analyzed by the Retrieval Scheduler Module to provide the most efficient way to retrieve the requested data. One critical functionality of the media server in our framework is the ability to quickly and randomly access any portion of the stored videos. Since the amount of data in an entire video clip can be very large and the user might be interested in watching only the portion of the video where the query overlaps, random access is very important for humans to verify the results.

3.4 User Interface

The role of the User Interface unit is to provide users with methods to communicate with the search engine from both wired computers and mobile devices. Depending on the type of device, the user interface can be designed in different ways. Our framework assumes a web-based user interface to communicate with the search engine. Depending on the computing power of the user's machine and the availability of other supporting software (e.g., map interface APIs, media player, web browser), the features of the video search applications may change.

Users can search videos in multiple ways (User Interface Module). One intuitive method is submitting a map-based query when users are familiar with the area of interest. Drawing a query region directly on any kind of map (see our implementation examples in Figures 7.2 and 7.3) might provide the most human-friendly and effective interface paradigm. Alternatively, a text-based query can also be effective when users are searching for a known place or object. For example, the user interface can maintain a local database of the mapping between places and their geo-coordinates. Then, the textual query can be converted into a spatial query with exact longitude and latitude input. External geo-coding services can also provide this translation.

The user interface module receives the ranked query results from the search engine.
In addition to the ranking method, the presentation style or format of the results also greatly affects the effectiveness of the presentation. Thus, human friendly presentation methods should be considered, such as using key frames, thumbnail images, textual descriptions, etc. The optional Visual Scene Organizer Module organizes video search results based on spatial and visual scene similarity. Then, the user interface module may implement more effective video browsing tools based on the scene similarity. A map-based user interface for both query input and video output can also be an effective tool by coordinating the vector map and the actual video display. Note that relevance ranking and presentation style are not just technical issues, but may require an intensive user study.

The Media Player software plays out the returned videos. One important aspect of video presentation is the capability to display only the relevant segments of videos at the user side, avoiding unnecessary transmission of video data. Thus, the media player at the user side and the streaming media server are expected to closely collaborate.

3.5 Summary

In this chapter, we presented a framework based on the complementary idea of acquiring sensor streams automatically in conjunction with video content. Of special interest are the geographic properties of mobile videos. The meta-data from sensors can be used to model the coverage area of scenes as spatial objects such that videos can effectively, and on a large scale, be organized, indexed and searched based on their fields-of-view. We explained the concept of managing geo-tagged video and presented an overall framework in support of various applications. In the following chapters we will present our design and implementation ideas to illustrate the real-world feasibility and adequacy of our framework.

Chapter 4
Georeferenced Meta-data Acquisition

4.1 Introduction

This chapter focuses on how to acquire and quantify the georeferenced meta-data to describe the viewable scenes of captured videos. We model the viewable space of a scene with parameters such as the camera location, the angle of the view, and the camera direction. As the camera moves or rotates over time, the viewable scene changes as well. The meta-data that describes this dynamic scene information has to be acquired from the attached sensors (a 3D digital compass and a GPS receiver) and fused with the video stream for future storage and retrieval of videos. We will next introduce our viewable scene model and describe the meta-data acquisition from the attached sensors. Finally, we will present our prototype system for capturing georeferenced videos.

Figure 4.1: Illustration of the FOVScene model (a) in 2D and (b) in 3D. (P: camera location, ⟨longitude, latitude⟩ in 2D and ⟨longitude, latitude, altitude⟩ in 3D; d: camera direction vector; θ: viewable angle; θ, φ: horizontal and vertical viewable angles in 3D; R: visible distance.)

4.2 Viewable Scene Model

A camera positioned at a given point P in geo-space captures a scene which we call the viewable scene of the camera. In the field of computer graphics, this area is referred to as the camera field-of-view (FOV for short). We will use the terms 'viewable scene' and 'field-of-view' interchangeably throughout this manuscript.

The meta-data related to the geographical properties of a camera and its captured scene are the following.
(1) The camera position P consists of the ⟨latitude, longitude⟩ coordinates read from a positioning device (e.g., GPS). (2) The camera direction vector d is obtained based on the orientation angle provided by a digital compass. (3) The camera viewable angle θ describes the angular extent of the scene imaged by the camera. The angle θ is calculated based on the camera and lens properties for the current zoom level [GBB+65]. (4) The far visible distance R is the maximum distance at which a large object within the camera's field-of-view can be recognized.

Based on the availability of the above data, one can model the viewable scene of a camera in different ways. One simple and straightforward approach – using only the GPS location of a camera (P) – is to model the viewable scene as a point. We will refer to this scene description as PointScene(P) throughout the manuscript. In the presence of information about both the GPS location P and the visible distance R, the viewable scene can be modeled as a circle with radius R centered at P. This method covers all the possible area a camera might see when the view direction is not known. We will use the term CircleScene(P, R) to refer to this scene description. If all the meta-data (P, R, d, θ) are available, one can model the viewable area more accurately, which results in more effective indexing and searching. In this study, we propose the FOVScene model, which describes a camera's viewable scene in 2D space by using the four parameters camera location P, camera orientation vector d, viewable angle θ and visible distance R:

    FOVScene(P, d, θ, R)    (4.1)

The symbol 'F' is a short term for FOVScene. We will use the term FOVScene and the symbol F interchangeably throughout the manuscript.

The full field-of-view is obtained with the maximum visual angle, which depends on the lens/image sensor combination used in the camera [Hec01]. Smaller image sensors have a smaller field-of-view than larger image sensors (when used with the same lens). Alternatively, shorter focal-length lenses have a larger field-of-view than longer focal-length lenses (when used with the same image sensor). The viewable angle θ can be obtained via the following formula (Equation 4.2) [Hec01]:

    θ = 2 tan⁻¹( y / (2f) )    (4.2)

where y is the size of the image sensor and f is the focal length of the lens. The relationship between the visible distance R and the viewable angle θ is given in Equation 4.3 [Hec01]. As the camera view is zoomed in or out, the θ and R values will be adjusted accordingly:

    θ = 2 arctan( y (R cos(θ/2) − f) / (2 f R cos(θ/2)) )    (4.3)

We assume that the video content is captured from a sensor-equipped camera (see Figure 1.2), which can accurately estimate its current location P, orientation d and visual angle θ. In 2D space, the field-of-view of the camera at time t, F(P, d, θ, R, t), forms a pie-slice-shaped area as illustrated in Figure 4.1a. Figure 4.1b shows an example camera F volume in 3D space. For a 3D representation of F we would need the altitude of the camera location point and the pitch and roll values to describe the camera heading in the zx and zy planes (i.e., whether the camera is rotated around the back-to-front axis or around the side-to-side axis). We believe that the extension to 3D is straightforward, especially since we already acquire the altitude from the GPS device and the pitch and roll values from the compass. In this study we represent the FOVScene in 2D space only. We plan to work on the 3D extension in our future work.
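As a small illustration of the model, the sketch below computes the viewable angle θ from Equation 4.2 and approximates the pie-slice F(P, d, θ, R) as a polygon. The local planar (x, y) coordinate frame and the arc sampling granularity are simplifying assumptions made only for this example; they are not part of the model definition.

import math

def viewable_angle(sensor_size_mm, focal_length_mm):
    """Equation 4.2: theta = 2 * arctan(y / (2f)), returned in radians."""
    return 2.0 * math.atan(sensor_size_mm / (2.0 * focal_length_mm))

def fov_polygon(p, heading_deg, theta_rad, r, arc_points=16):
    """Approximate the pie-slice FOVScene as a polygon.
    p: camera position (x, y) in a local planar frame (an assumption; the model
       itself uses latitude/longitude coordinates).
    heading_deg: compass heading of the direction vector d, clockwise from north.
    """
    px, py = p
    vertices = [(px, py)]  # the camera position is the apex of the slice
    start = math.radians(heading_deg) - theta_rad / 2.0
    for i in range(arc_points + 1):
        a = start + theta_rad * i / arc_points
        # compass convention: 0 degrees = north (+y), 90 degrees = east (+x)
        vertices.append((px + r * math.sin(a), py + r * math.cos(a)))
    return vertices

# Sensor and lens values quoted later in the text for the JVC JY-HD10U:
# y = 3.6 mm, f = 5.2 mm, giving roughly 38 degrees at full zoom-out.
theta = viewable_angle(3.6, 5.2)
print(math.degrees(theta))
print(len(fov_polygon((0.0, 0.0), 45.0, theta, 250.0)))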
4.3 Collecting Sensor Data

The viewable scene of a camera changes as it moves or alters its orientation in geo-space. In order to keep track of the FOVScene of a moving camera over time, we need to record its location (P), direction (d) and viewable angle (θ) with a certain frequency and produce time-stamped meta-data together with time-stamped video streams. Our meta-data streams are analogous to sequences of ⟨P, d, θ, R, t⟩ quintuples, where t is the time instant at which the FOVScene information is recorded. In large scale applications, there may be thousands of moving cameras with different sensing capabilities. Each camera will record its location and FOVScene description with a different sampling frequency. We do not make any assumptions about how frequently a camera should record its FOVScene coverage.

Recording georeferenced video streams

Our sensor-rich video recording system incorporates three devices: a video camera, a 3D digital compass, and a Global Positioning System (GPS) device. We assume that the optical properties of the camera are known. The digital compass, mounted on the camera, periodically reports the direction in which the camera is pointing. The camera location is read from the GPS device as a ⟨latitude, longitude⟩ pair. Video can be captured with various camera models – we use a high-resolution (HD) camera.

Our custom-written recording software receives direction and location updates from the GPS and compass devices as soon as new values are available and records the updates along with the current computer time and the coordinated universal time (UTC). Each recorded GPS update includes: (1) the current latitude and longitude, (2) the local time when the location update was received (in ms), (3) the satellite time (in UTC) for the location update (in s), (4) the current speed over ground, and (5) the altitude. Each recorded compass update includes: (1) the current heading angle (with respect to North), (2) the local time when the direction update was received (with ms granularity), and (3) the current pitch and roll values. Altitude, pitch and roll measurements are not directly used in the 2D camera viewable scene calculation; however, they would be required to estimate a 3D viewable scene, which defines a cone-shaped volume (see Figure 4.1b). Furthermore, pitch and roll values are useful for analyzing the accuracy of the heading measurements.

Each video data packet received from the camera is processed in real time to extract frame timecodes. Extracted timecodes are recorded along with the local computer time when the frame was received. Since video data is received from the camera as data packet blocks, all frames within a video packet will initially have the same local timestamp. Timestamping accuracy can be improved by adjusting the local time backwards within the data block based on the frame rate. Creating a frame-level time index for the video stream will minimize the synchronization errors that might occur due to clock skew between the camera clock and the computer clock. In addition, such a temporal video index, whose timing is compatible with the other datasets, enables easy and accurate integration with the GPS and compass data. We also keep track of the size of the video data captured since the beginning of the capture and record the byte offset for each video frame.

Calculating viewable angle (θ) and visible distance (R)

Assuming that the optical focal length f and the size of the camera image sensor y are known, the camera viewable angle θ can be calculated through Equation 4.2.
The default focal length for the camera lens is obtained from the camera specifications. However, when there is a change in the camera zoom level, the focal length f and consequently the viewable angle θ will change. To capture the change in θ, the camera should be equipped with a special unit that measures the focal length for different zoom levels. Such functionality is not commonly available in today's off-the-shelf digital cameras and camcorders. To simulate the changes in the viewable angle, we manually recorded the exact video timecodes along with the changes in the zoom level. Using the Camera Calibration Toolbox [Bou] we measured the f value for five different zoom levels (from the minimum to the maximum zoom level). For all other zoom levels, the focal length f is estimated through interpolation. The visible distance R can be obtained from the following equation (Equation 4.4):

    R = f h / y    (4.4)

where f is the lens focal length, y is the image sensor height and h is the height of the target object that will be fully captured within a frame. With regard to the visibility of an object from the current camera position, the size of the object also affects the maximum camera-to-object viewing distance. For large objects (e.g., mountains, high buildings) the visibility distance will be large, whereas for small objects of interest the visibility distance will be small. For simplicity, in our initial setup we assume R to be the maximum visible distance for a fairly large object.

As an example, consider the buildings A and B shown in Figure 4.2a. Both buildings are approximately 8.5 m tall and both are located within the viewable angle of the camera. The distances from buildings A and B to the camera location are measured as 100 m and 300 m, respectively. The frame snapshot for the FOVScene in Figure 4.2a is shown in Figure 4.2b. We assume that, with good lighting conditions and no obstructions, an object can be considered visible within a captured frame if it occupies at least 5% of the full image height. For our JVC JY-HD10U camera, the focal length is f = 5.2 mm and the CCD image sensor height is y = 3.6 mm. Using Equation 4.4, the height of building A is calculated as 12% of the video frame height, so building A is considered visible. However, building B is not visible since it covers only 4% of the image height. Based on the above discussion, the threshold for the far visible distance R for our visible scene model is estimated at around 250 m. We currently target a mid-range far visible distance of 200-300 m. We believe that this range best fits typical applications that would most benefit from our georeferenced video search (e.g., traffic monitoring, surveillance). Close-up and far-distance views will be considered as part of our future research.

Figure 4.2: Illustration of visibility. (a) Building A is 100 m away from the camera and is within the camera FOVScene. Building B is 300 m away from the camera and outside of the FOVScene. (b) Building A covers 12% of the frame height and therefore is assumed to be visible. Building B covers only 4% of the frame height and therefore is assumed to be not visible.

Timing and Synchronization

As described above, all meta-data entries (GPS and compass updates and video frame timecodes) have millisecond-granular timing. In addition to the local time, we record the satellite time (in UTC) that is received along with the GPS location update.
The use of the recorded satellite time can be twofold: (1) it enables synchronizing the current computer time with the satellite time, and (2) it may be used as the time base when executing temporal queries, i.e., by applying the temporal condition of the query to the satellite time. Timestamping the camera viewable scenes with the satellite time ensures a global temporal consistency among all georeferenced video collections.

All three meta-data streams need to be combined and stored as a single stream with an associated common time index. In a sensor-rich system with several attached devices, one challenge is how to synchronize the sensor data read from the attached devices, which have different data output rates. Let f_i be the output rate for the device with ID i. Intuitively, the data frequency of the combined stream will be at most min(f_i), ∀i. A naive way is to create the combined data stream during data collection and record the most up-to-date sensor readings as a single record every 1/min(f_i) s, using the local time as the common timestamp. This method can only consider past data updates from devices. However, it is possible that a new update from a device becomes available immediately after a combined tuple has been generated. Therefore, the naive method does not guarantee that the temporally closest values among all device updates will be matched in the combined stream. A better approach is to record separate data streams for each device, where each data entry is timestamped with the time when the update was received from the device. Later these data streams can be combined with a 2-pass algorithm. Such an algorithm processes data in a sliding time window centered at the current time. It will always match the data entries that have the closest timestamps (past or future). In our setup f_GPS = 1 sample/s, f_compass = 40 samples/s, and f_video = 30 samples/s. Therefore we match each GPS entry with the temporally closest video frame timecode and compass direction. It is further possible to estimate missing data items in low frequency data streams by applying an interpolation technique. Positional interpolation is widely used in location-based services to obtain the entire movement/trajectory of a moving object. In our data collection prototype, we match the temporally closest direction updates and frame timecodes and estimate the location at that timestamp through a positional interpolation technique. The ground speed of the camera, which is available with the GPS location updates, is useful for a better location estimation. Note that since f_video < f_compass there is no need to interpolate compass heading values.

Measurement Errors

The accuracy of the FOVScene calculation is somewhat dependent on the precision of the location and heading measurements obtained from the GPS and compass devices. A typical GPS device is accurate to approximately within 10 meters. In our proposed viewable scene model, the area of the region that a typical HD camera captures (FOVScene) is on the order of tens of thousands of square meters (e.g., at full zoom-out approx. 33,000 m²). Therefore, a difference of 10 m is not very significant compared to the size of the viewable scene we consider. Additionally, missing GPS locations – due to various reasons such as a tunnel traversal – can be recovered through estimation such as interpolation. There exists extensive prior work on estimating moving object trajectories in the presence of missing GPS locations. An error in the compass heading may be more significant.
Many digital compasses ensure an azimuth accuracy of better than 1° (e.g., about 0.6° for the OS5000 digital compass in our system), which has only a minor effect on the viewable scene calculation. However, when mounted on real platforms the accuracy of a digital compass might be affected by local magnetic fields or materials. For our experiments the compass was calibrated within the setup environment to minimize any distortion in the compass heading. For some cameras the cost of add-on sensors may be an issue and manufacturers might use accelerometers plus camera movement to estimate the direction. In that case the accuracy will be lower and needs to be further investigated. It is also worth mentioning that multimedia applications often tolerate some minor errors. If a small object is at the edge of the viewable scene and is not included in the search results, often it will not be recognized by a human observer either.

4.4 Georeferenced Video Recording

4.4.1 Prototype System for Recording Georeferenced Videos

To collect georeferenced video data, we have constructed a prototype system which includes a high-resolution (HD) camera, a 3D compass and a GPS receiver (see Figure 1.2). We used the JVC JY-HD10U camera with a frame size of approximately one megapixel (1280x720 pixels at 30 frames per second). It produces MPEG-2 HD video streams at a rate of slightly more than 20 Mb/s and the video output is available in real time from the built-in FireWire (IEEE 1394) port. To obtain the orientation of the camera, we employed the OS5000-US Solid State Tilt Compensated 3 Axis Digital Compass, which provides very precise tilt compensated heading, roll and pitch information. The compass delivers an output rate of up to 40 samples per second. To acquire the camera location we used the Pharos iGPS-500 GPS receiver. We built software to acquire, process, and record the georeferences along with the MPEG-2 HD video streams. The MPEG-2 video is processed in real time (without decoding the stream) and each video frame is associated with its viewable scene information. Although our sensor-rich video recording system has been tested mainly with cameras that produce MPEG-2 video output, based on its DirectShow filter architecture, different video formats are potentially supported. Both progressive and interlaced video can be processed. With interlaced video the camera location and heading are recorded for each frame consisting of an odd and even field pair.

4.4.2 Data Collection

To obtain some experimental data sets, we mounted the recording system setup on a vehicle and captured video along streets at different speeds (max. 25 MPH). During video capture, we frequently changed the camera view direction. We recorded two sets of video data: (i) in downtown Singapore and (ii) in Moscow, Idaho.

Figure 4.3: Camera positions for the collected Moscow, Idaho dataset.

The Singapore dataset contains 34 video clips, ranging from 180 to 1260 s in duration. Each second, an F (i.e., FOVScene) was collected, resulting in 7,330 Fs in total. The recorded videos in the Idaho dataset cover a 5.5 km by 4.6 km region quite uniformly. The total captured data includes 134 video clips, ranging from 60 to 240 s in duration (10,652 Fs in total). Figure 5.19 shows the distribution of the camera positions of the 10,652 Fs in the Idaho dataset. In Figure 4.4, we show a visualization example of the camera viewable scenes for two video files on a map. For visual clarity, viewable scene regions are drawn every three seconds.

Figure 4.4: Visualization of viewable scenes on a map.
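The per-second F entries in these datasets are assembled from sensor streams with very different rates, as described in the previous section. The sketch below illustrates one plausible way to align them: the 1 Hz GPS positions are linearly interpolated at a frame timestamp and the temporally closest compass heading is selected. The record layouts, function names, and the use of plain linear interpolation are assumptions made for illustration, not the exact algorithm implemented in the prototype.

import bisect

def interpolate_position(gps_samples, t):
    """gps_samples: list of (timestamp_s, lat, lon) sorted by timestamp.
    Returns the linearly interpolated (lat, lon) at time t (clamped at the ends)."""
    times = [s[0] for s in gps_samples]
    i = bisect.bisect_left(times, t)
    if i == 0:
        return gps_samples[0][1:]
    if i == len(gps_samples):
        return gps_samples[-1][1:]
    (t0, lat0, lon0), (t1, lat1, lon1) = gps_samples[i - 1], gps_samples[i]
    w = (t - t0) / (t1 - t0)
    return (lat0 + w * (lat1 - lat0), lon0 + w * (lon1 - lon0))

def closest_heading(compass_samples, t):
    """compass_samples: list of (timestamp_s, heading_deg) sorted by timestamp.
    Returns the heading whose timestamp is closest to t (past or future)."""
    times = [s[0] for s in compass_samples]
    i = bisect.bisect_left(times, t)
    candidates = compass_samples[max(0, i - 1):i + 1]
    return min(candidates, key=lambda s: abs(s[0] - t))[1]

# Hypothetical samples: 1 Hz GPS fixes, 40 Hz compass headings, one frame time.
gps = [(0.0, 46.7300, -117.0000), (1.0, 46.7301, -117.0002)]
compass = [(0.400, 88.0), (0.425, 89.0), (0.450, 91.0)]
frame_time = 0.433
print(interpolate_position(gps, frame_time), closest_heading(compass, frame_time))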
4.5 Summary

In this chapter, we introduced a methodology for the automatic annotation of video clips with a collection of meta-data such as camera location, viewing direction, field-of-view, etc. Such meta-data can provide a comprehensive model to describe the scene a camera captures. We proposed a viewable scene model that strikes a balance between the analytical complexity and the practical applicability of the scene description. We also described our implemented prototype, which demonstrates the feasibility of acquiring meta-data enhanced georeferenced video.

Chapter 5
Georeferenced Meta-data Management and Search

5.1 Introduction

Geographic meta-data acquired from attached sensors allow the consistent and automatic annotation of large amounts of collected video content and thus enable various criteria for versatile video search. In addition, the captured georeferenced meta-data have a significant potential to aid in the indexing and searching of georeferenced video data at the high semantic level preferred by humans. In Section 4.2 we proposed to describe the viewable scene area of video frames using a pie-shaped geometric contour, i.e., a pure spatial object, thus transforming the video search problem into a spatial data selection problem. The objective then is to index these spatial objects (FOVScenes) and to search videos based on their geographic properties.

In this chapter we propose novel approaches for the efficient indexing, searching and retrieval of video clips based on the proposed viewable scene model. We intend to provide an example solution for the application data processor module of the video search framework presented in Section 3.3.1. The proposed techniques implement the major components of the search engine (see Figure 3.1). Our contributions in this chapter are twofold.

In the first section, we describe the modeling and storage of the collected sensor meta-data using the proposed viewable scene model. We rigorously explain the search process and provide the algorithms to search the video scenes. We evaluate the effectiveness and accuracy of the scene based search using a real-world video dataset which was collected using our recording prototype.

In the second section, we introduce a new vector-based approximation model of the video's viewable scene area. We show that the vector model can be used in various ways to enhance the effectiveness of a search filter step so that the expensive and complex refinement step can be performed with far fewer potentially overlapping viewable scenes. We demonstrate how the vector model can provide a unified method to perform traditional overlap queries while also enabling searches that, for example, concentrate on the vicinity of the camera's position or take into account its view direction.

The amount of georeferenced video data available in our real-world dataset is not large enough to evaluate realistic application scenarios using the vector model. To enable comprehensive performance evaluations on a large scale, much bigger datasets are required. However, collecting real-world data requires a considerable amount of time and effort. In Appendix A, we propose an approach for generating synthetic video meta-data with realistic geographical properties. The generated meta-data is based on the definition of our viewable scene model. Using our synthetic data generator, we produced a large repository of georeferenced video meta-data, corresponding to around 180,000 hours of georeferenced videos. We performed extensive analysis on the filtering performance of the vector model using the synthetic dataset.
5.2 Searching Georeferenced Videos

5.2.1 Searching Videos Using the Viewable Scene Model

The next task after collecting georeferenced meta-data is to semantically describe the data so that accurate and efficient analysis of the camera viewable scenes is possible. An intuitive way is to store a separate F quintuple for each video frame. An example schema is given in Table 5.1.

camera id        ID of the camera
video id         ID of the video
frame timecode   Timecode for the current frame in the video (extracted from the video)
location         Current camera position (P) (read from GPS)
visual angle     Angular extent of the camera field-of-view (θ) (calculated based on camera properties)
direction        Current camera direction (d) (read from compass)

Table 5.1: Example schema for the FOVScene representation.

The user query is either a geographical region or some textual or visual input that can be interpreted as a region in geo-space. As an example, a query asking for videos of the "University of Idaho Kibbie Dome" can be translated into the coordinates of the corner points of the rectangular region that approximates the location of the dome. The database is then searched for this spatial region and the video segments that capture the Kibbie Dome are retrieved. If the query specifies a temporal interval, only the videos that were recorded during the specified time window are returned.

The FOVScene coverage of a moving camera over time is analogous to a moving region in the geo-spatial domain; therefore traditional spatio-temporal query types, such as range queries, k nearest neighbor (kNN) queries or spatial joins, can be applied to the F data. The typical task we would like to accomplish is to extract the video segments that capture a given area of interest. As explained in Section 4.2, for each video frame with timecode t, we keep track of the camera location coordinates P, the camera orientation d, and the camera view angle θ. Therefore, we can construct the F(P, d, θ, R, t) description for every video frame. Hence, for a given area of interest Q, we can extract the sequence of video frames whose viewable scenes overlap with Q. Going from most specific to most general, the query region Q can be a point, a line (e.g., a road), a poly-line (e.g., a trajectory between two points), a circular area (e.g., the neighborhood of a point of interest), a rectangular area (e.g., the space delimited by roads) or a polygonal area (e.g., the space delimited by certain buildings, roads and other structures). In our initial setup, we simply return all frames with timecode t for which SceneIntersect(Q, FOVScene(P, d, θ, R, t)) is true. Algorithm 1 formalizes the methodology for checking whether a particular FOVScene area intersects with the query region Q. Algorithm 1 calls the subroutines pointFOVIntersect (given in Algorithm 2) and edgeFOVIntersect (given in Algorithm 3). The computations for the other two scene models, SceneIntersect(Q, CircleScene(P, R, t)) and SceneIntersect(Q, PointScene(P, t)), are trivial and omitted due to space limitations.

One potential problem with the proposed search mechanism built on top of a relational model is the computational overhead. In a typical query, all frames that belong to the query time interval have to be checked for overlaps.
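To illustrate this cost, a minimal sketch of the naive per-frame search is shown below. The record layout mirrors Table 5.1, while scene_intersect stands in for the exact geometric test of Algorithm 1; the function and key names are hypothetical.

# Hedged sketch of the naive search over per-frame FOVScene tuples: every frame
# whose timecode falls inside the query time window is tested individually with
# the exact SceneIntersect predicate, which is what makes this approach expensive
# for large video collections.

def naive_scene_search(fov_tuples, query_region, t_start, t_end, scene_intersect):
    """fov_tuples: iterable of dicts with keys loosely matching Table 5.1
    (video_id, frame_timecode, location, direction, visual_angle, plus R);
    scene_intersect: exact geometric test such as Algorithm 1."""
    hits = []
    for f in fov_tuples:
        if t_start <= f["frame_timecode"] <= t_end:   # temporal condition
            if scene_intersect(query_region, f):       # exact spatial test per frame
                hits.append((f["video_id"], f["frame_timecode"]))
    return hits

Every candidate frame pays the full cost of the geometric test, which motivates the cheaper filter structures discussed next.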
A more appropriate approach is to define the FOVScene in the spatial domain as a pie-slice-shaped area as described in Section 4.2 and then estimate it with a Minimum Bounding Rectangle (MBR) (see Figure 5.1). An intuitive structure to index these MBRs is the R-tree [Gut84] family, one of the most popular index structures for rectangles. In Section 5.3, we elaborate more on the indexing issues and propose a vector based indexing method for the efficient search of camera viewable scenes.

Figure 5.1: MBR estimations of FOVScenes (an FOVScene description is generated every f frames).

Algorithm 1 SceneIntersect(Q, FOVScene(P, d, θ, R, t))
    Q ← given convex polygon shaped query region, Q = ⟨V, E⟩ (V ← set of vertices in Q; E ← set of edges in Q)
    FOVScene(P, d, θ, R, t) ← FOV region for frame t
    pointPolygonIntersect(P, Q) ← returns true if point P is inside Q
    if pointPolygonIntersect(P, Q) is true then
        return true
    end if
    for all v ∈ V do
        if pointFOVIntersect(v, FOVScene(P, d, θ, R, t)) == true then
            return true
        end if
    end for
    for all e ∈ E do
        if edgeFOVIntersect(e, FOVScene(P, d, θ, R, t)) == true then
            return true
        end if
    end for
    return false

Algorithm 2 pointFOVIntersect(q, FOVScene(P, d, θ, R, t))
    q ← given query point
    FOVScene(P, d, θ, R, t) ← FOV region for frame t
    EarthDistance(q, P) ← returns the earth distance between points q and P
    if EarthDistance(q, P) ≤ R then
        α ← angle of d with respect to north
        δ ← angle of vector Pq with respect to north
        if α − θ/2 ≤ δ ≤ α + θ/2 then
            return true
        end if
    end if
    return false

Algorithm 3 edgeFOVIntersect(e, FOVScene(P, d, θ, R, t))
    e ← given query edge
    FOVScene(P, d, θ, R, t) ← FOV region for frame t
    LinePointDistance(P, e) ← returns the shortest earth distance between point P and the closest point on line e
    ⟨p1, p2⟩ ← lineCircleIntersect(e, P, R)  {returns the two points at which line e intersects the circle centered at P with radius R; note that p1 = p2 if e is tangent to the circle}
    if LinePointDistance(P, e) ≤ R then
        α ← angle of d with respect to north
        δ1 ← angle of vector Pp1 with respect to north
        δ2 ← angle of vector Pp2 with respect to north
        if [α − θ/2, α + θ/2] overlaps with [δ1, δ2] then
            return true
        end if
    end if
    return false

In this chapter we restrict our example queries to simple spatio-temporal range searches. However, using the camera view direction (d) in addition to the camera location (P) to describe the camera viewable scene provides a rich information base for answering more complex geospatial queries. For example, if the query asks for the views of an area from a particular angle, more meaningful scene results can be returned to the user. Directional query processing is discussed in Section 5.3.2. Alternatively, the query result set can be presented to the user as distinct groups of resulting video sections such that the videos in each group capture the query region from a different view point. Similarly, when query results are presented to the user, the video segments can be ranked based on how relevant they are to the query requirements and user interests. There are several factors that need to be taken into account in estimating relevance. Among them are the amount of overlap, the closeness of the overlapping area to the camera location, the angle at which the camera is viewing the overlap area, etc. Furthermore, the way an FOVScene overlaps with the query area may indicate the importance of the video clip. In this study we improve the proposed query mechanism and order the search results based on their relevance to the user query.
Video ranking is discussed extensively in Chapter 6.

5.2.2 Experimental Evaluation

5.2.2.1 Methodology

In this section we evaluate the performance of our scene based video search using the real-world dataset that we collected with our prototype video recording system in Moscow, Idaho. The data collection process is explained in Section 4.4.2. The captured data covers a 6 km by 5 km region quite uniformly. The dataset includes 134 videos, ranging from 60 to 240 s in duration. Each second, an F was collected (i.e., one F per 30 frames of video), resulting in 10,652 Fs in total. We selected 250 random query regions (Q), of size 300 m by 300 m, within the 6 km by 5 km area of total video coverage. We then searched the georeferenced video dataset for the videos that captured at least one of the query regions in Q and extracted the video segments that show these regions, i.e., where the viewable scene and the query region intersect. We specifically set out to answer the following questions:

• Does the proposed scene based video search algorithm return all matching videos? I.e., is the returned video result set complete?

• How effectively does the proposed video search algorithm eliminate irrelevant video segments in a result video? I.e., how accurate is the search result?

The former question is hard to verify, since there is no easy way to obtain the "ground truth" for the query result set. One possible way is to place easily recognizable objects in geo-space (e.g., a big green circular sign) to mark the boundaries of the query regions. Then, using a feature extraction algorithm, the video streams can be processed to extract the video frames in which these objects are visible. However, such object recognition algorithms have their own limitations and suffer from low accuracy. Alternatively, a human subject may watch all the clips, confirm whether a query region is visible in a video and mark the frames that show the region. Such manual verification is also prone to errors. However, human perception is still the most reliable source to determine whether an object is visible within a video. Although we could not perform the manual check for the whole dataset, we randomly chose 40 videos and had a student analyze them and mark the query regions that appear in any of these 40 videos.

For the manual check, to eliminate human errors as much as possible, we clearly defined the conditions for concluding whether a query region is visible or not. First, to ease the process we adjusted the query boundaries to the closest road intersections, buildings or signs which are easy to recognize within the video. Second, we ensured that, for the queries which are marked as visible, at least 10% of their region appears in some frame within the video. Also, we made sure that objects within the query region (e.g., buildings) appear fairly clearly within the video. If they are difficult to recognize, which means the query region is far away from the camera, the query is not marked as visible. And third, if a query region is not visible due to an occlusion by a bigger object along the view direction, it is marked as invisible. Unfortunately, manual verification is quite laborious and requires careful scanning of videos. As a first step we removed the videos which are reasonably far away from each query region. In our experiments a query covers on average 1/400th, and each video 1/40th, of the total video coverage. Since the query locations are uniformly distributed throughout the total covered area, such filtering pruned the number of result videos by 70% on average.
As mentioned earlier, the query regions are associated with some landmarks within the geo-space and we were very familiar with this geographic region. For each query Q, we watched through the remaining videos and looked for those landmarks and the region delimited by them. If a query region Q is clearly visible within the video, it is marked as visible by that video. Finally, we recorded the number of queries visible in each video.

To evaluate the accuracy of the query result set, we compared our approach with two other video scene description models as explained in Section 4.2: (1) The CircleScene model – the camera viewable scene is described as a circular region around the camera location with the assumption that the view direction is not known. A query is visible if its region intersects with the circular viewable scene. (2) The PointScene model – here the camera viewable scene is the camera location point. A query is visible if the camera point resides within the query's region.

5.2.2.2 Completeness of Result Video Set

Given 250 random queries, through a manual scan, we created the list of query regions that are visible within each video in the dataset of 40 videos. To verify the completeness, we needed to show that the proposed viewable scene based search algorithm does not miss any videos that actually intersect with Q. For that purpose, we executed our search algorithm on the same data set of 40 videos, using the same query set we used in the manual scan. We then repeated the search using CircleSceneSearch and PointSceneSearch. Figure 5.2 shows the number of queries marked as visible for each video file by (1) manual verification (ManualCheck), (2) the video search algorithm using the proposed FOVScene model (FOVSceneSearch), (3) the video search algorithm using the CircleScene model, ignoring the view direction (CircleSceneSearch), and (4) the video search algorithm using the PointScene model (PointSceneSearch). Note that we used the same far visible distance (R) value for both the CircleSceneSearch and the FOVSceneSearch algorithms.

The results show that FOVSceneSearch correctly returns all videos marked visible through ManualCheck. However, the FOVSceneSearch result set also includes some false positives (i.e., queries returned as intersecting although the scene does not actually show the query region), which might occur for the following reason: within the geo-space the camera view is sometimes occluded by structures, trees, cars, etc. Although the query region is within a fair distance from the camera and the camera points towards it, the query region can still be invisible in the video, and therefore it may not be included in the result set of the manual check. As expected, CircleSceneSearch returns an excessive percentage of extra irrelevant videos and overestimates the manual search results, while PointSceneSearch underestimates the manual search results by returning only a subset of the visible queries.

Let MR(i), FOVR(i), CR(i) and PR(i) be the sets of queries returned as visible for the i-th video file by the algorithms ManualCheck, FOVSceneSearch, CircleSceneSearch, and PointSceneSearch, respectively. To quantify the amount of underestimation and overestimation of each search algorithm with respect to the manual search, we compare FOVR(i), CR(i) and PR(i) to the manual result set MR(i), ∀i, 1 ≤ i ≤ 40. Table 5.2 shows the metrics for the comparisons and summarizes the results. MaxDiff, MinDiff and AvgDiff are the maximum, minimum and average differences between the FOVR, CR and PR result sets and the MR result set, respectively.
Precision and Recall are two metrics widely used in past studies for evaluating the degree of accuracy and comprehensiveness of a result set. Precision is the ratio of retrieved relevant items to all retrieved items. A lower value of precision implies that the search result set contains a large number of invisible query regions. Recall is the ratio of retrieved relevant items to all relevant items. A lower recall value means that more query regions that should have been returned as visible are ignored.

FOVSceneSearch vs. ManualCheck:
    MaxDiff   = max_{1≤i≤40} |FOVR(i) − MR(i)|           = 3.0 (o)
    MinDiff   = min_{1≤i≤40} |FOVR(i) − MR(i)|           = 0.0 (o)
    AvgDiff   = Σ_{1≤i≤40} |FOVR(i) − MR(i)| / |MR(i)|    = 0.27 (o)
    Precision = |FOVR(i) ∩ MR(i)| / |FOVR(i)|            = 0.96
    Recall    = |FOVR(i) ∩ MR(i)| / |MR(i)|              = 1.0

CircleSceneSearch vs. ManualCheck:
    MaxDiff   = max_{1≤i≤40} |CR(i) − MR(i)|             = 6.0 (o)
    MinDiff   = min_{1≤i≤40} |CR(i) − MR(i)|             = 0.0 (o)
    AvgDiff   = Σ_{1≤i≤40} |CR(i) − MR(i)| / |MR(i)|      = 2.10 (o)
    Precision = |CR(i) ∩ MR(i)| / |CR(i)|                = 0.77
    Recall    = |CR(i) ∩ MR(i)| / |MR(i)|                = 1.0

PointSceneSearch vs. ManualCheck:
    MaxDiff   = max_{1≤i≤40} |PR(i) − MR(i)|             = 0.0 (o)
    MinDiff   = min_{1≤i≤40} |PR(i) − MR(i)|             = 9.0 (u)
    AvgDiff   = Σ_{1≤i≤40} |PR(i) − MR(i)| / |MR(i)|      = 3.88 (u)
    Precision = |PR(i) ∩ MR(i)| / |PR(i)|                = 0.28
    Recall    = |PR(i) ∩ MR(i)| / |MR(i)|                = 0.12

(o ← overestimation, u ← underestimation)

Table 5.2: Summary of metrics for evaluating accuracy (in terms of the number of visible queries).

In Figure 5.2, we plot the exact number of intersecting queries for each approach, which results in a jagged graph where the number of intersecting videos changes dramatically from one video file to the next. To illustrate the overall performance of each algorithm better, Figure 5.3 shows the total number of instances¹ of intersecting queries identified and returned by each approach versus the total number of videos queried. Note that, in Figure 5.2, the horizontal axis shows the individual video files whereas in Figure 5.3 it shows the total number of video files included in the search. We observe that CircleSceneSearch returns a large percentage of false positives when compared to ManualCheck, whereas the proposed FOVSceneSearch closely matches the results of the manual check. The third approach, PointSceneSearch, returns only a very small percentage of the correct result set, missing many intersecting videos.

Figure 5.2: Number of visible queries per video file.

We are aware that the completeness of the results from a single dataset may not imply the same for the general case. However, we argue that performing the search on a fairly large georeferenced video dataset acquired within a fairly large real-world geo-space, without making specific assumptions and using a large and random query set, serves as a realistic example which strongly resembles the generalized case.

¹ For example, when a query region intersects with three videos, we counted it as three instances of intersecting queries.

Figure 5.3: Cumulative number of visible queries as a function of the number of input videos (40 videos only).

5.2.2.3 Accuracy of Search Results

Both Figures 5.2 and 5.3 show the superiority of the FOVScene model over the CircleScene and PointScene models in georeferenced video search. The graphs illustrate the results using a subset (40 videos) of the georeferenced video data we collected.
Figure 5.5 shows the cumulative sums of the query results for the whole dataset (134 videos). We used 250 random queries with a size of 300 m by 300 m. The gap between the search algorithms FOVSceneSearch and CircleSceneSearch increases as the input dataset becomes larger.

It is important to note that, although a query region may be marked as visible by both the FOVScene and the CircleScene models, the video segments they report for the appearance of the query region in a single video can be different. For example, as shown in Figure 5.4, when the camera is rotated, although the query region is not visible anymore in the video, CircleSceneSearch will still report that the query intersects with its viewable scene for the following frames. Therefore, the total length of the video segments identified by FOVSceneSearch can be shorter (i.e., more precise with fewer false positives) than that by CircleSceneSearch.

Figure 5.4: Comparison of CircleScene and FOVScene coverage. The square is the query region.

For comparison, in Figure 5.6, we plot the total length of all video segments identified by each search algorithm while varying the number of input video files. Similar to Figure 5.5, we use the cumulative sums to show the overall difference as the input data size grows. When we analyze Figures 5.5 and 5.6, we clearly see the difference in the gap between FOVSceneSearch and CircleSceneSearch. In Figure 5.5, the average percentage gain in accuracy for FOVSceneSearch over CircleSceneSearch is measured as 15.7%, with a maximum difference of 21.4%. On the other hand, Figure 5.6 shows a 56.4% improvement in accuracy for the FOVSceneSearch result set. The maximum percentage difference in Figure 5.6 is 60.7%.

Figure 5.5: Cumulative number of visible queries as a function of the number of input videos (whole dataset).

Figure 5.6: Cumulative sum of returned video segment lengths (in seconds) as a function of the number of input videos (whole dataset).

Figure 5.7: Comparison of CircleScene and FOVScene results.

These results show that the proposed FOVSceneSearch algorithm improves the accuracy of georeferenced video search not only by eliminating the irrelevant videos, but also by filtering out the video segments whose scenes do not contain the query region. For example, in Figure 5.7, FOVSceneSearch returns only subparts of the long video segment returned by CircleSceneSearch, eliminating the frames that do not show the query region. Considering the huge size of video data and the time consuming human verification process for the final results, this significant reduction of false positives can greatly enhance the performance of video search. Therefore we conclude that, by using the camera view direction in addition to the camera location in the viewable scene estimations, the accuracy of
the search results can be improved dramatically while ensuring the completeness of the result set.

To analyze the effect of the query size on the total length of the returned video segments, we repeated the same experiments shown in Figure 5.6 while varying the size of the query regions. Figure 5.8 reports the total length of the returned video segments for all three approaches for query region sizes ranging from 20m×20m to 550m×550m. For smaller query regions we observed bigger differences between the three approaches, i.e., the superiority of FOVSceneSearch is maximized for small query regions. The performance gap among the three approaches shrank as the query size increased. For sizes bigger than 550m×550m, we did not observe dramatic changes in the results.

Figure 5.8: Effect of query size (total length of the returned video segments, in seconds, versus query size, in meters).

5.3 Vector-based Indexing in Support of Versatile Georeferenced Video Search

5.3.1 Modeling Viewable Scene using Vector

When a large collection of videos is stored in a database, the cost of processing spatial queries may be significant because of the computational complexity of the operations involved. Therefore, such queries are typically executed in two steps: a filter step followed by a refinement step [Ore86, BKSS94] (Figure 5.9).

Figure 5.9: Illustration of the filter and refinement steps.

The idea behind the filter step is to approximate the large number of complex spatial shapes (n1 objects in Figure 5.9) with simpler outlines (e.g., a minimum bounding rectangle, MBR [BKSS90]) so that a large number of unrelated objects can be dismissed very quickly based on their simplified shapes. The resulting candidate set (n2 objects) is then further processed during the refinement step to determine the exact results (n3 objects) based on the exact geometric shapes. The rationale of the two step process is that the filter step is computationally far cheaper than the refinement step due to the simple approximations. Overall, the cost of spatial queries is determined by the efficiency of the filter step (many objects, but simple shapes) and the complexity of the refinement step (few objects with complex shapes).

Additionally, in video search applications, the refinement step can be very expensive due to the nature of the processing. Depending on the application, various computer vision and content-based extraction techniques may be applied before presenting the search results. For example, some occlusions may need to be detected based on local geographic information such as the location and size of buildings. Some specific shapes or colors of objects might be analyzed for more accurate results, or the quality of the images, such as brightness and focus, may be considered in determining the relevance ranking of the results. Such extra processing is in general performed during refinement on a per-frame basis and therefore significantly increases the time and execution cost of the refinement step. It is thus critical to minimize the amount of refinement processing for large scale video searches. This, in turn, motivates the use of effective and efficient filtering algorithms which minimize the number of frames that need to be considered in the refinement step.

In traditional spatial data processing, MBR approximations are very effective for the filter step. However, with a bounding rectangle some key properties that are useful in video search applications may be lost. For example, MBRs retain no notion of directionality.
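A minimal sketch of such an MBR-based filter is shown below: it derives a bounding rectangle for each pie-slice F (approximated as a polygon) and keeps only those whose rectangles overlap the query rectangle, leaving the exact intersection test to the refinement step. All names are illustrative assumptions rather than part of a prescribed implementation.

def mbr(vertices):
    """Axis-aligned minimum bounding rectangle of a polygon given as (x, y) vertices."""
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    return (min(xs), min(ys), max(xs), max(ys))

def rects_overlap(a, b):
    """True if two rectangles (min_x, min_y, max_x, max_y) intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def mbr_filter(scenes, query_rect):
    """scenes: iterable of (scene_id, polygon_vertices) pairs, e.g. pie-slice
    outlines approximated as polygons. Returns the candidate ids surviving the
    filter step; the refinement step must still apply the exact intersection test."""
    return [sid for sid, poly in scenes if rects_overlap(mbr(poly), query_rect)]

The rectangle test is very cheap, but, as just noted, the rectangle keeps no trace of where the camera was pointing, which is the limitation the vector model addresses.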
This study advocates a new vector approximation that provides efficiency and processing cost similar to MBR-based methods, but additionally provides better support for the type of searches that a video database may encounter. Thus, the main focus of this work is to provide a novel filter step, called the vector model, as a more efficient and effective filter step for large-scale georeferenced video search applications, and to compare it to a conventional filter step using MBRs. An identical refinement step will be assumed for a fair comparison between the vector and MBR models. In the following sections we will introduce our vector model and illustrate that it is competitive with MBR-based methods where applicable, but also extends to cases that MBRs cannot handle.

Vector Model

Recall that a camera positioned at a given point p in geo-space captures a scene whose covered area is referred to as the camera viewable scene (F). As we explained in Section 4.2, the meta-data related to the geographic properties of a camera and its captured scenes are as follows: 1) the camera position p consists of the latitude and longitude coordinates read from a positioning device (e.g., GPS), 2) the camera direction α is obtained based on the orientation angle (0° ≤ α < 360°) provided by a digital compass, 3) the maximum visible distance from p is R, at which objects in the image can be recognized by observers – since no camera can capture meaningful images at an indefinite distance, R is bounded by M, which is the maximum distance set by an application – and 4) the camera view angle θ describes the angular extent of the scene imaged by the camera. Based on the availability of the sensor input data, we defined the F of a video frame as an area of circular sector shape (or pie-slice shape) in 2D geo-space, as shown in Figure 5.10. Then, an F can be represented as a tuple ⟨T, p, θ, V⟩, with T as the real time when the frame was captured, a position p, an angle θ, and a center vector V. The magnitude of V is the viewable distance R from p, and the direction of V is α.

Figure 5.10: FOVScene (F) representation in different spaces.

For indexing purposes, we propose a vector estimation model that represents an F using only the camera position p and the center vector V. When we project the F onto the x and y axes, a point p is divided into p_x and p_y, and V is divided into V_X and V_Y along the x and y axes, respectively. Then, an F denoted by a point and vector can be represented by a quadruple ⟨p_x, p_y, V_X, V_Y⟩; this can be interpreted as a point in four-dimensional space.

In mathematics, space transformation is an approach to simplify the study of multidimensional problems by reducing them to lower dimensions or by converting them into some other multidimensional space. Using a space transformation, an F ⟨p_x, p_y, V_X, V_Y⟩ can be divided and represented in two 2D subspaces, i.e., p_x−V_X and p_y−V_Y. Then, an F can be represented as two points, each in its own 2D space. For example, Figure 5.10 shows the mapping between an F represented by p1 and V1 in geo-space and two points in two transformed spaces without loss of information. To define the vector direction, let any vector heading towards the right (East in the northern hemisphere) on the x axis have a positive V_X value, and a negative V_X value for the other direction (West). Similarly, any vector heading up (North) on the y axis has a positive V_Y value, and a negative V_Y value for the other direction (South).
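The space transformation can be illustrated with a short sketch. The FOV tuple and function names below are illustrative assumptions (not the format used by our recording prototype); the sketch maps an FOVScene ⟨T, p, θ, V⟩ onto the four-dimensional point ⟨p_x, p_y, V_X, V_Y⟩ following the sign conventions just described.

```python
# Sketch of the point-and-vector representation of an FOVScene; names are
# illustrative assumptions, not part of the actual system.
import math
from typing import NamedTuple

class FOV(NamedTuple):
    t: float      # capture time T
    px: float     # camera position, x (e.g., UTM easting)
    py: float     # camera position, y (e.g., UTM northing)
    alpha: float  # compass bearing of the center vector, degrees clockwise from North
    theta: float  # view angle, degrees
    r: float      # visible distance R (magnitude of the center vector), meters

def to_pv_point(f: FOV):
    """Map an FOVScene to the 4D point <p_x, p_y, V_X, V_Y>.

    With alpha measured clockwise from North, the East (x) component is
    R*sin(alpha) and the North (y) component is R*cos(alpha), which matches
    the sign convention above (East/North positive, West/South negative).
    """
    a = math.radians(f.alpha)
    return (f.px, f.py, f.r * math.sin(a), f.r * math.cos(a))

# Example: a camera at (500000, 5175000) looking Northwest (315 deg) with
# R = 200 m maps to V_X = -141.4 (West) and V_Y = +141.4 (North).
print(to_pv_point(FOV(0.0, 500000.0, 5175000.0, 315.0, 60.0, 200.0)))
```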
Using the proposed model, any single F can be represented as a point in a p−V space. As a result, the problem of searching for F areas in the original space can be converted to the problem of finding F points in the transformed subspace.

Note that the actual F is an area represented by a circular sector, so representing an area using a single vector is incomplete. More precisely, the F can be considered as a collection of vectors starting from p to all the points on the arc. To simplify the discussion, for now we use only one center vector to represent an area as described above. We will relax this simplifying assumption in Section 5.3.3.

5.3.2 Query Processing

We represent video content as a series of Fs. Since each F can be considered a spatial object, the problem of video search is transformed into finding spatial objects in a database. This section describes how the filter step can be performed by using the proposed vector model for some typical spatial query types.

5.3.2.1 Point Query

The assumed query is, "For a given query point q⟨x,y⟩ in 2D geo-space, find all video frames that overlap with q." The filter step can be performed in p−V space by identifying all possible points of Fs that have a potential to overlap with the query point.

Recall that the maximum magnitude of any vector is limited to M, and hence any vector outside of a circle centered at the query point q with radius M cannot reach q in geo-space; see Figure 5.11 for an illustration. Only vectors starting inside the circle (including the circumference of the circle) have the possibility to cross or meet q.

Figure 5.11: Illustration of filter step in point query processing.

Because a query point is not a vector, it is mapped only to the p axis. First, let us consider only the x components of all vectors. In p_x−V_X space, the possible vectors that can cross (or touch) q_x should be in the range [q_x−M, q_x+M]. That is, any vector at p_x is first filtered out if |p_x−q_x| > M. Next, even though a vector is within the circle, it cannot reach q_x if its magnitude is too small. Thus, |p_x−q_x| ≤ |V_X| must be satisfied for V_X to reach q_x. At the same time the vector direction should be towards q_x. For example, when p_x > q_x, any vector with a positive V_X value cannot meet q_x. Hence, in the p−V spaces shown in Figure 5.11, all points (i.e., all vectors) outside of the shaded isosceles right triangle areas will be excluded in the filter step. For example, vector V_1 in geo-space is represented as a point v_1 in p−V space. Now consider all vectors starting from a point on the circumference of the circle towards the center with the maximum magnitude M. All such vectors moving from V_1 to V_4 in a clockwise direction map to the diagonal line starting from v_1 to v_4 in p−V space. The same can be observed for the y components of vectors, i.e., the same shape appears in p_y−V_Y space. The resulting vectors from the filter step should be included in the shaded areas of both the p_x−V_X and p_y−V_Y spaces.
Formally, a vector at p that satisfies the following conditions can be selected in the filter step:

    |p − q| ≤ M
    p_x − q_x ≤ −V_X   if p_x > q_x
    p_y − q_y ≤ −V_Y   if p_y > q_y
    q_x − p_x ≤ V_X    if q_x > p_x
    q_y − p_y ≤ V_Y    if q_y > p_y
    any V_X            if q_x = p_x
    any V_Y            if q_y = p_y        (5.1)

Figure 5.12: Example of filtering in point query.

Figure 5.12 shows five examples of Fs and their mapping between x−y space and the p−V spaces. The starting points of all five vectors are within the circle. However, not all of them pass the filter step. The starting points of three vectors, V_1, V_2, and V_4, are located inside the circle, but their vector direction and/or magnitude do not meet the necessary conditions, so they are filtered out. For example, V_1x is heading in the opposite direction even though its magnitude is large enough. Thus, v_1x is outside of the triangle-shaped search space. Similarly, V_4x is heading in the wrong direction, so v_4x is outside of the search space. V_5 is directly heading towards q and both V_5x and V_5y have a large enough magnitude to reach q. Thus, both v_5x and v_5y are inside the search space, which means the vector should be included in the filter result. V_3 is considered a false positive in the filter result because it satisfies the conditions but does not actually cover q. It will be pruned out in the refinement step.

5.3.2.2 Point Query with Bounded Distance

Figure 5.13: Illustration of the filter step in point query with bounded distance r.

Unlike with a general spatial query, video search may enforce application-specific search parameters. For example, one might want to retrieve only frames where a certain small object at a specific location appears within a video scene, but with a given minimum size for better visual perception. Usually, when the camera is close to the query object, the object appears larger in the frame. Thus, we can devise a search with a range restriction on the distance of the camera locations from the query point, such as "For a given query point q⟨x,y⟩ in 2D geo-space, find all video frames that overlap with q and that were taken within the distance r from q." Because of the distance requirement r, the position of the camera in an F cannot be located outside of the circle centered at q with radius r, where r < M. Thus, the search space can be reduced as shown in Figure 5.13.
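The filter-step conditions of Equation 5.1, together with the bounded-distance restriction just described, can be expressed in a few lines. The sketch below is illustrative only (the names are assumptions); as it is a filter step, it may still admit false positives such as V_3 above, which the refinement step removes.

```python
# Sketch of the filter-step test for a point query in p-V space, following the
# conditions of Equation 5.1. `fov_pv` is the <p_x, p_y, V_X, V_Y> point from
# the transformation above; M is the application-wide maximum visible distance.
# The optional argument r implements the bounded-distance variant (r < M).
import math

def passes_point_filter(fov_pv, qx, qy, M, r=None):
    px, py, vx, vy = fov_pv
    # The camera must lie within the circle of radius M (or r, if given)
    # centered at the query point.
    if math.hypot(px - qx, py - qy) > (M if r is None else r):
        return False
    # Per-axis direction and magnitude tests (Equation 5.1).
    for p, q, v in ((px, qx, vx), (py, qy, vy)):
        if p > q and not (p - q <= -v):  # vector must point back towards q and be long enough
            return False
        if q > p and not (q - p <= v):   # vector must point forward towards q and be long enough
            return False
        # if p == q, any v is acceptable on this axis
    return True
```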
5.3.2.3 Directional Point Query

The camera view direction can be an important factor for the image perception by an observer. Consider the case where a video search application would like to exploit the collected camera directions for querying. An example search is, "For a given query point q⟨x,y⟩ in geo-space, find all video frames taken with the camera pointing in the Northwest direction and overlapping with q." The view direction can be defined as a line of sight from the camera to the query point (i.e., an object or place pictured in the frame). The line of sight can be defined using an angle at the camera location, similar to the camera direction α. Note that the camera orientation always points to the center of an FOVScene, while the view direction can point to any location or object within the scene. A digital compass mounted on a camera will report the camera direction primarily using bearings. A bearing is a horizontal angle measured clockwise from North (either magnetic North or true North) to a specific direction. When we use the bearing as the view direction angle (say β), the Northwest direction is equivalent to 315° (Figure 5.14).

An important observation is that all Fs that cover the query point have their starting points along the same line of sight in order to point towards the requested direction. Thus, the filter step needs to narrow the search to the vectors that satisfy the following conditions: 1) their starting points are on the line of sight, 2) their vector directions are heading towards q, and 3) their vector magnitudes are long enough to reach q.

Figure 5.14: Illustration of filter step in directional point query with angle β.

For a given view direction angle β, we can calculate the maximum possible displacement of a vector starting point from the query point. Because the largest magnitude of any vector is M, the maximum displacement between the query point and the starting point of any possible overlapping vector is −M sinβ on the x axis and −M cosβ on the y axis (note that the sign is naturally decided by β, e.g., sin 315° = −0.71 and cos 315° = 0.71). In other words, as shown in Figure 5.14, any vector starting at a point greater than q_x + (−M sinβ) on the x axis or less than q_y + (−M cosβ) on the y axis cannot touch or cross the query point with the given angle β. Thus, the search area for such vectors can be reduced as illustrated in Figure 5.14.

To meet the view direction request (say, a 315° line of sight), no vector with a positive V_X value can reach q. Therefore, in the filter step the entire search space (i.e., the triangle shape) on the positive V_X side is excluded in the p_x−V_X space. Similarly, no vector with a negative V_Y value can reach q, so the entire search space (the triangle shape) on the negative V_Y side is excluded in the p_y−V_Y space. Next, the size of the remaining search space is reduced because the range of possible V_X and V_Y values is now [0, M sinβ] and [0, M cosβ], respectively.

Using only a single specific view direction value may not be practical in video search because a slight variation in view directions does not significantly alter the human visual perception. Therefore, it is more meaningful when the query is given with a certain range of directions, such as β±ε, e.g., 315°±10°. The extension is straightforward and it will increase the search area in the p−V space.
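The reduced search bounds for a directional point query can be sketched as follows. The code is an illustrative approximation of the filter step (the function name and structure are assumptions, not our MySQL implementation), and in practice a tolerance band β±ε would widen all of the intervals it uses.

```python
# Sketch of the reduced filter-step bounds for a directional point query.
# beta_deg is the requested line-of-sight bearing, measured clockwise from
# North. Names and structure are illustrative assumptions.
import math

def directional_point_filter(fov_pv, qx, qy, M, beta_deg):
    px, py, vx, vy = fov_pv
    b = math.radians(beta_deg)
    sx, sy = math.sin(b), math.cos(b)  # signed, e.g. sin 315 < 0, cos 315 > 0

    def within(value, bound):
        # value must lie in the interval between 0 and the signed bound
        return min(0.0, bound) <= value <= max(0.0, bound)

    # Camera position: at most M "behind" q along the line of sight, i.e. the
    # displacement p - q is bounded by (-M sin(beta), -M cos(beta)).
    if not (within(px - qx, -M * sx) and within(py - qy, -M * sy)):
        return False
    # Vector components: restricted to [0, M sin(beta)] and [0, M cos(beta)]
    # (read as the interval between 0 and the signed bound).
    if not (within(vx, M * sx) and within(vy, M * sy)):
        return False
    # The magnitude/direction conditions of Equation 5.1 still apply on top of
    # these bounds; they are omitted to keep the sketch focused on the
    # directional reduction of the search area.
    return True
```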
5.3.2.4 Directional Point Query with Bounded Distance

Figure 5.15: Illustration of filter step in directional point query with β and r.

This type of query is a hybrid of the previous types. For a very specific search, the user might specify the query position, the view direction from the camera, and the distance between the location of the query and the camera. An example query is, "For a given query point q⟨x,y⟩ in geo-space, find all video frames heading in the Northwest direction, overlapping with q and taken within the distance r from q." The objective of this query is to find frames in which small objects (e.g., a 6 meter-high statue) at the query point appear large in the viewable scenes. Another example query is, "For a given query point q⟨x,y⟩ in geo-space, find all video frames heading in the Northwest direction that overlap with q and that were taken farther than the distance r from q." Now the intention is to find frames where large objects (e.g., a 6 story-tall building) at the query point appear prominently in the frames.

For the former case, the positions of cameras are bounded by −r sinβ from q_x on the p_x axis and −r cosβ from q_y on the p_y axis, respectively. At the same time, the vector is bounded by r sinβ on the V_X axis and r cosβ on the V_Y axis, respectively. Therefore, the grid-patterned triangle area in Figure 5.15 represents the search space. For the latter case, the positions of cameras are bounded within [−r sinβ, −M sinβ] on the p_x axis and [−r cosβ, −M cosβ] on the p_y axis, respectively. Furthermore, the vector is bounded within [r sinβ, M sinβ] on the V_X axis and [r cosβ, M cosβ] on the V_Y axis, respectively. Therefore, the shaded triangle area in Figure 5.15 represents the search space.

5.3.2.5 Rectangular Range Query

The assumed query is, "For a given rectangular query range in geo-space, find all the video frames that overlap with this region." Assume that the rectangular query region q is a collection of points (the rectangular shaded area with a grid pattern in Figure 5.16). When we apply the same space transformation, all points in the query region can be represented as a line interval on the p_x and p_y axes. First, when any vector's starting point falls inside the query region, the vector clearly overlaps with q, so it should be included in the result of the filter step. Next, when we treat any location along the perimeter of q as an independent query point as in Section 5.3.2.1, the starting points of the vectors that can reach that query point are bounded by a circle with radius M. Drawing such circles along all the points on the perimeter forms the shaded region in Figure 5.16.

Figure 5.16: Illustration of filter step in range query.

It follows that any vector with its starting point outside of the shaded region cannot reach any point in q. Only vectors starting inside the region have a possibility to cross q. The search area in the p−V spaces can be defined as shown in Figure 5.16. Again, any vector in the resulting set should be found in both search areas, in the p_x−V_X and p_y−V_Y spaces. When p_x and p_y of a vector fall inside the mid-rectangles (grid pattern), p is inside q, so the vector automatically overlaps with q regardless of its direction or magnitude. However, when p is located outside of q, the vector's direction and magnitude should be considered to determine the overlap.

5.3.3 Implementation

So far, we assumed that an F is represented by a single center vector. However, in reality, an F is a collection of vectors with the following properties: 1) they all start from the same point, 2) they have the same magnitude |V|, and 3) they have different directions to points along the arc of the circular sector. In this work we define an F using ⟨T, p, θ, V⟩, where V is the center vector of an F (i.e., V_C in Figure 5.17). V_C consists of a compass bearing α as direction and the visible distance R as magnitude. When only the single vector V_C represents the entire area of an F, there is a limitation in retrieving all the objects covered by the F. Because V_CX and V_CY are used to represent the F in the p−V spaces as described in Section 5.3.1, this approach underestimates the coverage of the F.
In Figure 5.17, the rectangle with the grid pattern represents the estimation of the F in the filter step using V_C. Only query points inside the rectangle are selected during the filter step. The black dots overlap with the actual F, so they represent the true query results. The white dots overlap with the estimation of the F but do not actually overlap with the F. The single-vector model cannot exclude these points during the filter step, thus they become false positives. The problem is that the white rectangles are filtered out even though they are actually inside the F. They are completely missed during the search.

Figure 5.17: Problem of single vector model in point query processing.

Alternatively, one can use two vectors to represent an F: the leftmost and the rightmost vector (V_L and V_R). Both have the same magnitude but different directions (their calculation from the collected data V_C is straightforward). When we use V_L and V_R to estimate the F, the estimation area is extended by δ_x and δ_y along the x and y axes, respectively (grid rectangle plus shaded areas in Figure 5.17). This approach can encompass the black dots, the white dots, and the white rectangles, which means that it does not miss any query points within the F. However, the number of false positives may also increase due to the bigger estimation area. The triangular points in the figure become false positives, which are filtered out in the single-vector model. Note that the two-vector model now has an estimation size identical to that of the MBR model. The more serious problem is that the two-vector model makes the use of the p−V space as a search region more complex, because we cannot use a simple point query in p−V space. This is because an F is represented by a line interval bounded by the two vectors, as seen in Figure 5.17. The exact boundary is related to the camera direction α, the angle θ, and the visible distance R.

Figure 5.18: Overestimation constant δ.

This problem can be resolved when we introduce an overestimation constant δ in defining the search area in p−V space. The overestimation constant is a generalization of the errors incurred by using a single-vector model, i.e., δ_x and δ_y. As shown in Figure 5.18, a single vector V1 represents an F, F1. This vector covers the query point in the middle of an F and so it can be searched without any problem using the triangular-shaped search space as originally described in Section 5.3.2.1. However, the other vector V2, representing F2, is not included in the search space. Because the query point is located at the leftmost corner of F2, v_2y covers q_y but v_2x falls outside of the triangle. V2 is not considered as overlapping, so F2 is missed. However, if the search space is extended by δ along the V axis (the parallelogram-shaped shaded area), v_2x becomes included in the search space and V2 can be selected in the filter step. Note that, in Figure 5.18, δ is applied in one direction only, because the other direction already reaches the maximal value M.

The next question is how to define the overestimation constant δ. The overestimation constant can be determined by the tolerable error between the magnitude of the center vector and the leftmost (or rightmost) vector as explained above. Assuming a regular (non-panoramic) camera, the camera angle θ can be 180° in the worst case, which results in the maximum difference M. This maximum value, i.e., δ = M, significantly increases the search area in p−V space. As θ becomes smaller, the extended search area decreases. The range of the overestimation constant is 0 ≤ δ ≤ M. However, note that normal camera lenses generally cover between 25° and 60°, and wide-angle lenses cover between 60° and 100°. Only ultra wide-angle lenses capture up to 180°.

An interesting observation is that the overestimation constant can be an important parameter of georeferenced video search. First, the angle θ is related to the zoom level of the camera and the visible distance R. For a certain angle θ, the overestimation constant is limited to M sin(θ/2) for 100% coverage. In our experiments, the widest measured angle was 60° and the maximum visible distance was 259 m. In this case the worst-case overestimation constant will be 259 × sin(60°/2) = 129.5 m. Another important observation in video search is that small objects which cannot be easily perceived by humans may sometimes be ignored even though they actually appear in FOVScenes. For example, if an object appears in the far left corner of an F and occupies only a very small portion of the frame, users may not be interested in such results. Moreover, if an object is located very far from the camera location (i.e., near the arc in our proposed model), it might be blocked by some nearer objects. Different applications (or users) might require different levels of accuracy in search results. So the overestimation constant provides a tradeoff between the performance and the accuracy of video search. Note that a smaller overestimation constant finds Fs where the query point appears in the center part of the frames and effectively discriminates against other frames where the query point appears towards the far left or far right side of the frames.
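The relationship between the lens angle and the overestimation constant required for full coverage can be illustrated numerically; the helper name below is an assumption made for illustration.

```python
# Numeric illustration of the overestimation constant: for a camera with view
# angle theta, extending the search space by delta = M*sin(theta/2) is enough
# to recover every query point inside the FOVScene (100% coverage).
import math

def full_coverage_delta(M: float, theta_deg: float) -> float:
    return M * math.sin(math.radians(theta_deg) / 2.0)

M = 259.0  # maximum visible distance observed in our experiments, in meters
for theta in (25, 60, 100, 180):
    print(f"theta = {theta:3d} deg  ->  delta = {full_coverage_delta(M, theta):6.1f} m")
# theta = 60 deg gives delta = 129.5 m, i.e. 0.5M, matching the worst case above.
```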
5.3.4 Experimental Evaluation

5.3.4.1 Experiments using Real-world Data

Methodology

In this section we evaluate the proposed vector estimation model using our real-world dataset collected in Moscow, Idaho. The data collection process is explained in Section 4.4.2. The captured data covers a 6km by 5km region quite uniformly. The test dataset includes 134 videos, ranging from 60 to 240 s in duration. Each second, an F was collected (i.e., one F per 30 frames of video), resulting in 10,652 Fs in total. We generated 1,000 point queries which were randomly distributed within the region. Figure 5.19 shows the distribution of the camera positions of the 10,652 Fs and the 1,000 query points. For each query, we searched the georeferenced meta-data to find the Fs that overlap with that query.

Figure 5.19: Camera positions and query points.

For all experiments we constructed a local MySQL database that stored all the FOVScene meta-data and their approximations (both MBRs and vectors). We used MySQL Server 5.1 installed on a 2.33 GHz Intel Core2 Duo Windows PC. For each query type described in Section 5.3.2, with the MBR and the vector approximation, we created a MySQL user-defined function (UDF) to search through the Fs in the database. We also implemented a UDF for the refinement step which returns the actual overlap instances between a query and an F.
We used the Universal Transverse Mercator coordinates for all comparisons. For the evaluation of the search results with different approaches, we computed the recall and precision metrics for the filter step. The recall is defined as the number of overlapping Fs found in the filter step by an approach over the actual number of overlapping Fs. Note that the actual number of overlapping Fs is obtained after the refinement step from the exact geometric calculation using the circular sectors. The precision is defined as the number of actually overlapping Fs found over the total number of Fs returned in the filter step.

Results

We set the distance M to the maximum viewable distance among all recorded R values of Fs, so M equaled 259 m. The widest camera angle recorded was 60°. Thus, in all experiments we set the maximum overestimation constant to sin 30° × 259, i.e., 0.5M.

Figure 5.20: Retrieved Fs for point query with r = M.

Point query: After executing 1,000 random point queries with 10,652 Fs in the database, the number of actual overlap instances between a query point and an F was 17,203. This number was obtained and verified by geometric computation after the refinement step, i.e., it represents the ground truth. The point query results from the MySQL implementation are summarized in Table 5.3 and Figure 5.20.

                                   MBR      0.5M     0.4M     0.3M     0.2M     0.1M     0.0M
Fs returned                        30,491   32,535   28,302   23,843   19,268   14,762   10,360
Fs actually matched                17,203   17,197   16,620   15,390   13,686   11,488    8,493
Recall                             1.00     1.00     0.966    0.895    0.796    0.668    0.494
Precision                          0.564    0.529    0.587    0.645    0.710    0.778    0.820
Exec. time of 1,000 queries (s)    8.5      10.5     10.5     10.0     10.0     9.7      9.7
Table 5.3: Detailed results of point query. (Columns 0.5M–0.0M are the vector model with different overestimation constants δ.)

The MBR approach returned 30,491 potentially overlapping Fs in the filter step and found all 17,203 actually overlapping Fs at the refinement step. The vector model was applied with varying δ values. As expected, with the maximum overestimation constant δ = 0.5M the vector model showed results almost identical to those of the MBR model (the size of the approximation is slightly bigger than with an MBR). However, when we decreased the value of δ, the vector model returned a smaller number of actually overlapping Fs as well as a smaller number of potentially overlapping Fs at the filter step. This is because the vector model discriminates more against overlapping objects at the sides of scenes as the value of δ decreases.

Figure 5.21 provides an example of how the different approaches perform the filter step. The MBR for the 42nd F of video 61 overlapped with 7 query points, while only 6 points actually overlapped with the F. The vector model found different numbers of query points as δ varied. As shown in Figure 5.21a, query points A, B, and G were located closer to the center vector of the F, so they were found by all approaches, even when δ = 0.0M. However, D and F were very far from the center vector, so they were only found when δ became larger. The vector model with a reduced δ found a smaller number of Fs. The query points closer to the sides (i.e., those that may not be well perceived by humans) were effectively excluded. Overall, as δ grows the recall increases and the precision decreases.

Figure 5.21: Query points overlapped with an F (video: 61, F id: 42). Panel (b), query point overlaps using different approaches:

Label   Query ID   MBR   Actual   0.0M   0.1M   0.2M   0.3M   0.4M   0.5M
A       63         X     X        X      X      X      X      X      X
B       185        X     X        X      X      X      X      X      X
C       317        X     X        –      –      X      X      X      X
D       394        X     X        –      –      –      –      –      X
E       465        X     –        –      –      –      –      –      X
F       740        X     X        –      –      –      –      X      X
G       761        X     X        X      X      X      X      X      X
(X: found, –: not found; the last six columns are the vector model with different overestimation constants.)
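For reference, the recall and precision values reported in this section can be computed from the filter output and the ground truth with a few lines of code; the sketch below assumes each F is identified by a (video id, F id) pair, which is an illustrative convention rather than our schema.

```python
# Sketch of the filter-step recall/precision computation used in this section.
def filter_recall_precision(returned: set, ground_truth: set):
    """returned: Fs selected by the filter step; ground_truth: Fs that truly overlap."""
    hits = returned & ground_truth
    recall = len(hits) / len(ground_truth) if ground_truth else 1.0
    precision = len(hits) / len(returned) if returned else 1.0
    return recall, precision

# Example with the aggregate counts from Table 5.3 (MBR column): 30,491 Fs
# returned, 17,203 of which truly overlap, gives recall 1.00 and precision ~0.564.
```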
We measured the time to execute the 1,000 point queries with MySQL using the various approaches. The bottom row of Table 5.3 shows the total amount of time in seconds reported by MySQL. On average, the vector models took 14–19% more time than the MBR model. We did not use indexes in the search, so the results reflect the computational time for table scans. In practice, indexes such as B-trees or R-trees are used for a more efficient spatial search over a larger set of data, and the execution time of the filter step depends on the performance of the indexes. Note that the reported time is for the filter step, since the refinement step was implemented as a separate program operating on the results from MySQL. The focus of this study is not on the speedup of the filter step itself but on the overall query processing, and on the effectiveness of the filter step in supporting versatile search using the characteristics of video. Even though the execution time of the vector-based filter step is a little longer than that of the MBR-based one, the number of objects selected by the vector model can be far smaller than that of the MBR model, as shown in Tables 5.3, 5.4 and 5.5, which results in a significant speedup of the overall query processing by minimizing the workload of the time-consuming refinement step.

Figure 5.22: Retrieved Fs for point query with r = 50m.

Point query with bounding distance r: Figure 5.22 shows the results of point queries with a bounding distance r between the camera position and the query point. When r was 50m, the number of matching Fs for the 1,000 queries was 649. Note that 50m is approximately one fifth of the maximum viewable distance, which means that overlapping query points must be contained in 1/25 of the original F size. Thus, the number of overlap instances is greatly reduced. The MBR model returned the same 30,491 Fs but, for example, the vector model with δ = 0.5M returned only 1,908 (a 94% reduction) with a recall of 1.0. As δ decreased the recall diminished as well and the precision increased. This trend is analogous to the one observed in the results of the point queries without a bounding distance. We repeated the same experiments while varying r from 50m to 200m. The results all exhibited the same trend, as shown in Figures 5.23 and 5.24.

Figure 5.23: Recall with varying r.

Figure 5.24: Precision with varying r.

Figure 5.25 illustrates the effects of the r value on the search. When we searched for query point F (the Pizza Hut building in the scenes) without r (i.e., r = M), both frames shown in Figures 5.25a and 5.25c were returned because they contain the query point. However, the building appears very small (and is difficult to recognize by humans) in Figure 5.25c, since it was located far from the camera. Note that the same building is easily recognizable in 5.25a, where the camera was closer to the object. We can effectively exclude 5.25c using an appropriate r value. Figures 5.25b and 5.25d show the corresponding satellite images for 5.25a and 5.25c, respectively.

Figure 5.25: Impacts of bounding distance in video search.
Directional point query: Using the same 1,000 query points, we searched for all Fs that overlap with the query points while varying the viewing direction from the camera to the query point. We used a ±5° error margin for the viewing direction in all experiments. Table 5.4 shows the results of point queries with a 45° viewing direction. The MBR approach has no information about the direction in its estimation, so it resulted in the same number of 30,491 Fs which must be processed in the refinement step. When the overestimation constant is not too small (δ ≥ 0.3M), the vector model resulted in an approximately 90% reduction in the number of selected Fs in the filter step compared to the MBR method, while providing a recall value of over 0.9. Significantly – as shown in Figure 5.21 – the missing Fs mostly contained query points at the far sides of the viewable scene area. For different viewing directions, similar results were observed.

                        MBR      0.5M    0.3M    0.1M
Fs returned             30,491   3,858   2,972   720
Fs actually matched     402      389     381     134
Recall                  1.000    0.968   0.948   0.333
Precision               0.013    0.101   0.128   0.186
Table 5.4: Results of directional point query with 45°±5°. The actual number of matched Fs was 402. (Columns 0.5M–0.1M are the vector model with different overestimation constants δ.)

Directional point query with bounding distance r: Table 5.5 shows the results of a very specific point query case, i.e., one that considers both the viewing direction and the bounding distance. The vector model effectively excludes non-overlapping Fs in the filter step. For example, with a 45° viewing direction and r = 50m there were only 13 overlapping instances, which is a very small number with respect to the 1,000 queries and 10,652 Fs. The vector model returned 374 Fs, including the 13 matched ones. Note that the MBR model returned 30,491 Fs. We repeated the same experiments while varying δ and observed that the vector model provided the best balance between recall and precision with a value of δ = 0.3M.

Vector                  r=50    r=100   r=150   r=200
Fs returned             374     1,006   2,124   2,972
Fs actually matched     13      93      151     264
Recall                  1.000   1.000   1.000   1.000
Precision               0.035   0.092   0.071   0.089
Table 5.5: Results of directional point query with r, a 45°±5° viewing direction, and δ = 0.3M.

Range query: We generated 1,000 random queries with an identical query region size of 100m by 100m, but different locations. For each range query, we checked the overlap between the query area and the Fs. Figure 5.26 summarizes the results, which show a trend similar to that observed with point queries. The vector model with δ = 0.5M provided almost perfect recall, namely 0.998, with a slightly higher number of Fs returned in the filter step. As δ diminishes the recall decreases and the precision increases. The chance of overlap between a given query range and any F increases as the size of the query range becomes larger. When we increased the query size to 300m by 300m, the recall of all approaches (even with δ = 0.0M) became 1.0. At the same time, as the size of an approximation becomes larger, the number of false positives rises. When the size of the query range becomes smaller, the results approach those of the point queries in Table 5.3.

Figure 5.26: Retrieved Fs for range query.

5.3.4.2 Experiments using Synthetic Data

Methodology

In this section we evaluate the proposed vector estimation model using synthetically generated data. The synthetic data generation is explained extensively in Appendix A.
The generated data covers a 54km by 65km region quite uniformly. The test dataset includes georeferenced meta-data for 16,530 videos, ranging from 20 to 900 s in duration. The average duration of the videos was around 650 s. In the synthetic data an F was generated every second, i.e., one F per 30 frames of video. The total duration of all videos is around 3,000 h, i.e., 10.8 million Fs in total.

The generated meta-data collection included two groups of synthetic data. For the first group, we assumed that a pedestrian is capturing videos using a hand-held camera while walking on an unconstrained path. We simulated the movement and rotation of that camera by selecting the pedestrian camera template in the data generator (see Section A.2). For this dataset group the camera follows a free path with free camera rotations. In a real-world scenario, the distribution of the initial locations of such videos is not totally random, but concentrated around some attraction points in geo-space. To simulate this property we generated 1,000 uniformly distributed attraction points within the 54km by 65km region, and around each attraction point we generated 10 to 30 normally distributed videos. The second synthetic dataset assumed that videos are captured by a passenger on a tourist bus using a hand-held camera. To generate the georeferenced meta-data for this dataset, we used the passenger camera template in the data generator (see Section A.2). Therefore the camera location points follow a road network while the camera can be rotated freely.

We generated 1,000 point queries which were randomly distributed within the region. For each query, we searched the georeferenced meta-data to find the Fs that overlap with that query. Similar to the experimental setup in Section 5.3.4.1, we constructed a MySQL database that stored all the meta-data and their approximations (both MBRs and vectors). We used MySQL Server 5.1 installed on a 2.33 GHz Intel Core2 Duo Windows PC. For each query type described in Section 5.3.2, with the MBR and the vector approximation, we created MySQL statements to search through the Fs in the database. We also implemented a user-defined function (UDF) for the refinement step which returns the actual overlap instances between a query and an F. We used the Universal Transverse Mercator coordinates for all comparisons. For the evaluation of the search results with synthetic data we repeated some of the experiments reported in Section 5.3.4.1. We used the recall and precision metrics for the filter step.

Results

The results in Section 5.3.4.1 show that the vector model provided the best balance between recall and precision with a value of δ = 0.3M. Therefore, for the experiments with synthetic data we set the overestimation constant δ to 0.3M. In addition, we set the distance M to 259m, which is the maximum viewable distance among all recorded R values of Fs.

                        MBR        Vector (with δ=0.3M)
Fs returned             225,148    164,824
Fs actually matched     121,354    110,432
Recall                  1.00       0.91
Precision               0.539      0.67
Table 5.6: Results of point query on synthetic meta-data.

We first ran 1,000 random point queries against the 10.8 million Fs in the database without creating any MySQL index on the meta-data tables. The point query results from the MySQL implementation are summarized in Table 5.6. The number of actual overlap instances between the query points and all Fs was 121,354. The MBR approach returned 225,148 potentially overlapping Fs in the filter step and found all 121,354 actually overlapping Fs at the refinement step.
The filter step using the vector model returned 164,824 Fs with overestimation constant δ = 0.3M; 110,432 of these were actually overlapping. Similar to the results in Table 5.3, the vector model achieved better precision compared to the MBR model; on the other hand, its recall value was 9% lower than that of the MBR model, whose recall was 1.0.

                        MBR        Vector (with δ=0.3M)
Fs returned             225,148    27,325
Fs actually matched     3,132      3,014
Recall                  1.00       0.962
Precision               0.014      0.110
Table 5.7: Results of directional point query with 45°±5° on synthetic meta-data. The actual number of matched Fs was 3,132.

Using the same 1,000 query points, we searched for all Fs that overlap with the query points with a 45° viewing direction from the camera to the query point. We used a ±5° error margin for the viewing direction. Table 5.7 shows the results of point queries with a 45°±5° viewing direction. The results are similar to those obtained using the real-world dataset (see Table 5.4). Recall that the MBR approach has no information about the direction in its estimation, so it resulted in the same number of 225,148 Fs which must be processed in the refinement step. When the overestimation constant was 0.3M, the vector model resulted in an approximately 88% reduction in the number of selected Fs in the filter step compared to the MBR method, while providing a recall value of over 0.96.

The execution of the refinement step for 1,000 point queries without applying any filter step took around 45,000 s (≈ 12.5 h). The MBR-based filter step completed in 29,415 s. The vector-based filter step without any directionality condition and with δ = 0.3M took 34,203 s, and the directional point query with direction 45°±5° and δ = 0.3M took 35,649 s. Since the MBR-based filter step returns a larger set of Fs, its refinement step took longer compared to the vector-based filtering steps. The query execution times without using any built-in MySQL index are given in Table 5.8.

                            MBR       Vector (δ=0.3M,     Vector (δ=0.3M,
                                      no direction)       direction 45°±5°)
Filter step (s)             29,415    34,203              35,649
Refinement step (s)         497       364                 58
Total execution time (s)    29,912    34,567              36,018
Table 5.8: Execution times of 1,000 point queries on synthetic data without a MySQL index.

We repeated the above experiments using the same 1,000 queries, with the MySQL tables indexed on the fields that are accessed during query execution. Without applying any filter step, the refinement step took 32,205 s for the 1,000 queries. The execution times for the vector model based filter steps without a direction condition and with direction 45°±5° were 1,903 s and 1,956 s, respectively. MBR-based filtering took 1,822 s (see Table 5.9). Note that the execution times for the vector-based and MBR-based filter steps improved dramatically with the introduction of MySQL indexes. With the improved execution time in the filter step, overall the vector-based model achieves better execution performance compared to the MBR model.

                            MBR       Vector (δ=0.3M,     Vector (δ=0.3M,
                                      no direction)       direction 45°±5°)
Filter step (s)             1,822     1,903               1,956
Refinement step (s)         460       340                 51
Total execution time (s)    2,282     2,243               2,007
Table 5.9: Execution times of 1,000 point queries on synthetic meta-data with a MySQL index.

5.3.4.3 Illustration of Directional Query Results: A Real-world Example

We developed a web-based search system to demonstrate the feasibility and applicability of our concept of georeferenced video search (see Section 7).
The search engine currently supports both directional and non-directional spatial range queries. A map-based query interface allows users to visually draw the query region and indicate the direction. The results of a query contain a list of the overlapping video segments. For each returned video segment, we display the corresponding Fs on the map, and during video playback we highlight the F region whose timecode is closest to the current video frame.

In Figures 5.27 and 5.28 we illustrate the filter and refinement steps for an example directional query applied to our real-world georeferenced video data. We would like to retrieve the video segments that overlap with the given rectangular region while the camera was pointing in the North direction. Figure 5.27a shows the video segments returned by the filter step using the MBR model. Recall that the MBR model retains no notion of directionality. Figure 5.27b shows the results of the filter step using the vector model with input direction 0° (i.e., North) and δ = 0.3M. Figures 5.28a and 5.28b show the results from the refinement step with error margins ±5° and ±25° with respect to the given direction 0°, respectively. Note that we applied the refinement step to the output of the MBR-based filter step shown in Figure 5.27a.

Using the MBR model, the filter step returns videos with an aggregated duration of 775 s, whereas the vector model based filter step returns only 98 s of video. The refinement steps shown in Figures 5.28a and 5.28b return 9 s and 65 s of video for viewing directions 0°±5° and 0°±25°, respectively.

Figure 5.27: Illustration of directional range query results – filter step. (a) Results of the filter step using the MBR model (no direction). (b) Results of the filter step using the vector model (viewing direction = 0° and δ = 0.3M). (The F region of the current frame is highlighted.)

Figure 5.28: Illustration of directional range query results – refinement step. (a) Results of the refinement step (viewing direction 0°±5°). (b) Results of the refinement step (viewing direction 0°±25°). (The F region of the current frame is highlighted.)

Figures 5.27 and 5.28 also display the F visualizations for the corresponding video segments on the map. Note that a red F region represents the frame currently being played in the video. As described in Section 5.3.1, the vector model is introduced as a fast filter step to quickly dismiss the unrelated videos and video segments. Figure 5.27 illustrates an example query where the vector model successfully eliminates most of the unrelated video segments, minimizing the amount of refinement processing.

5.4 Summary

In this chapter we focused on the management (i.e., indexing and search) of the georeferenced video meta-data. In the first section, we described the modeling and storage of the collected sensor meta-data using the proposed viewable scene model. The video scenes (FOVScenes) are stored as spatial objects in the database. A spatio-temporal query specifying a region of interest is executed to retrieve the videos whose viewable scenes show that geographical region. We rigorously explained the search process and provided the algorithms to search the video scenes. We evaluated the effectiveness of the scene-based search through a user study using real-world data collected with our recording prototype. We further compared the accuracy of our FOVScene-based search with two other basic scene models, namely PointScene and CircleScene.
In the second section, we proposed a novel vector-based estimation model of a camera's viewable scene area. We showed that the vector model can be used in various ways to enhance the effectiveness of a search filter step so that the expensive and complex refinement step can be performed with far fewer potentially overlapping FOVScenes. The vector model successfully supports new geospatial video query features, such as a directional point query with a given viewing direction from camera to object and a point query with a bounded distance between camera and object. We also demonstrated the immediate applicability of our proposed model to a common database by constructing an actual MySQL database using our real-world georeferenced video dataset and performing extensive experiments with the database. In order to evaluate the filtering performance of the vector model at large scale, we tested the search system using a large repository of synthetic data. The results demonstrate the effectiveness of the proposed vector-based indexing at large scale. The vector-based filtering achieves better precision compared to the MBR approach – especially for directional and bounded-distance queries – while its run-time performance is comparable to that of the MBR approach.

Chapter 6

Relevance Ranking in Georeferenced Video Search

6.1 Introduction

Analogous to a textual search with a web search engine, a georeferenced video search will generally retrieve multiple video segments that may have different relevance to the query posed. When results are returned to a user, the most related videos should be presented first, since manual verification (watching videos) can be very time-consuming. Therefore, a very interesting and challenging question is how to rank the search results such that a) the automatic ranking closely corresponds to what a human user might expect and b) the ranking algorithm performs efficiently even for very large video databases.

In ranking video search results, it is essential to question the relevance of each video with respect to the user query and to provide an ordering based on estimated relevance. This chapter introduces three ranking methods in the following subsections based on two relevant dimensions for calculating video relevance with respect to a query, i.e., its spatial and temporal overlap. We further present a histogram-based approach that relies on a pre-processing step to significantly improve the response time during subsequent searches.

6.2 Ranking Georeferenced Video Search Results

In video search, when results are returned to a user, it is critical to present the most related videos first, since manual verification (viewing videos) can be very time-consuming. This can be accomplished by creating an order which ranks the videos from the most relevant to the least relevant. Otherwise, although a video clip may completely capture the query region, it may be listed last within the query results. It is essential to question the relevance of each video with respect to the user query and to provide an ordering based on estimated relevance.

Two pertinent dimensions for calculating video relevance with respect to a range query are its spatial and temporal overlap. Analyzing how the FOVScene descriptions of a video overlap with a query region gives clues for calculating its relevance with respect to the given query. A natural and intuitive metric to measure spatial relevance is the extent of region overlap. The greater the overlap between F and the query region, the higher the video relevance.
It is also useful to differentiate between videos which overlap with the query region for time intervals of different lengths. A video which captures the query region for a longer period will probably include more details about the region of interest and can therefore be more interesting to the user.

Note that during the overlap period the amount of spatial overlap at successive time instances changes dynamically for each video. Among two videos whose total overlap amounts are comparable, one may cover a small portion of the query region for a long time and the rest of the overlap area only for a short time, whereas another video may cover a large portion of the query region for a longer time period. Figures 6.1a and 6.1b illustrate the overlap between the query Q_207 and the videos V_46 and V_108, respectively. Although the actual overlapped area of the query is similar for both videos, the coverage of V_108 is much denser. Consequently, among the two videos, V_108's relevance is higher.

Figure 6.1: Visualization of the overlap regions between query Q_207 and videos (a) V_46 and (b) V_108.

In the following sections, we will explain how we define the overlap between the video FOVScenes and the query regions and propose three basic metrics for ranking video search results. We provide a summary of the symbolic notation used in our discussion in Table 6.1.

Term                  Description
F                     the short notation for FOVScene
P                     camera location point
Q                     a query region
V_k                   a video clip k
V_k^F                 a video clip k represented by a set of FOVScenes
V_k^F(t_i)            a polygon-shaped FOVScene at time t_i, a set of corner points
Q                     a polygon query region represented by a set of corner points
O(V_k^F(t_i), Q)      overlap region between V_k^F and Q at t_i, a set of corner points
R_TA                  relevance score with Total Overlap Area
R_D                   relevance score with Overlap Duration
R_SA                  relevance score with Summed Area of Overlap Regions
Grid                  M×N cells covering the universe
V_k^G(t_i)            a FOVScene at time t_i represented by the set of grid cells where Grid and V_k^F(t_i) overlap
V_k^G                 a video clip k represented by a set of V_k^G(t_i)
Q^G                   a polygon query region represented by a set of grid cells
O^G(V_k^G(t_i), Q)    overlap region between V_k^G and Q at t_i, a set of grid cells
R_TA^G                relevance score using the grid, extension of R_TA
R_D^G                 relevance score using the grid, extension of R_D
R_SA^G                relevance score using the grid, extension of R_SA
Table 6.1: Summary of terms

6.2.1 Preliminaries

Let Q be a polygon-shaped query region given by an ordered list of its polygon corners:

Q = {(lon_j, lat_j), 1 ≤ j ≤ m}

where (lon_j, lat_j) is the longitude and latitude coordinate of the j-th corner point of Q and m is the number of corners in Q. Suppose that a video clip V_k consists of n F regions. t_s and t_e are the start time and end time of video V_k, respectively. The sampling time of the i-th F is denoted as t_i. The starting time of a video, t_s, is defined as t_1. The i-th F represents the video segment between t_i and t_{i+1}, and the n-th F, which is the last FOVScene, represents the segment between t_n and t_e (for convenience, say t_e = t_{n+1}). The set of FOVScene descriptions for V_k is given by V_k^F = { F_{V_k}(t_i, P, d⃗, θ, R) | 1 ≤ i ≤ n }. Similarly, the F at time t_i is denoted as V_k^F(t_i). If Q is viewable by V_k, then the set of Fs that capture Q is given by

SceneOverlap(V_k^F, Q) = { V_k^F(t_i) | for all i (1 ≤ i ≤ n) where V_k^F(t_i) overlaps with Q }

Note that FOVScenes can be collected at any time and timestamped.
However, in our experiments, we collected FOVScenes at a fixed interval of one second.

The overlap between V_k^F and Q at time t_i forms a polygon-shaped region, as shown in Figure 6.2. Let O(V_k^F(t_i), Q) denote the overlapping region between video V_k^F and query Q at time t_i. We define it as an ordered list of the corner points that form the overlap polygon. Therefore,

O(V_k^F(t_i), Q) = OverlapBoundary(V_k^F(t_i), Q) = { (lon_j^{t_i}, lat_j^{t_i}), 1 ≤ j ≤ m }    (6.1)

where m is the number of corner points in O(V_k^F(t_i), Q). The function OverlapBoundary returns the overlap polygon which encloses the overlap region. In Figure 6.2, the corner points of the overlap polygon are labeled P1 through P9.

Figure 6.2: The overlap between a video FOVScene and a polygon query.

Practically, when a pie-shaped F and a polygon-shaped Q intersect, the resulting overlap region does not always form a polygon. If the arc of F resides inside Q, part of the overlap region will be enclosed by an arc rather than a line. Handling such irregular shapes is usually impractical. Therefore we estimate the part of the arc that resides within the query region Q with a piece-wise linear approximation consisting of a series of points on the arc, such that each point is 5° apart from the previous and next point with respect to the camera location. OverlapBoundary computes the corner points of the overlap polygon where: (i) a corner of the query polygon Q is enclosed within F, or (ii) the camera location point is enclosed within Q, or (iii) an edge of the query polygon Q crosses the sides of the F, or (iv) part of the F arc is enclosed within Q (the intersecting section of the arc is estimated with a series of points). Further details about the implementation of the OverlapBoundary algorithm can be found in Section 5.2.1.
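The overlap computation can be sketched as follows. The use of the shapely package is an assumption made purely for illustration; our implementation computes the overlap boundary directly as outlined above (Section 5.2.1). The sketch approximates the pie-slice F by sampling its arc every 5° and then intersects it with the query polygon.

```python
# Sketch: piece-wise linear approximation of the FOVScene circular sector and
# its overlap with a query polygon. Assumes the shapely package is available;
# names are illustrative, not the dissertation's implementation.
import math
from shapely.geometry import Polygon

def fov_polygon(px, py, alpha_deg, theta_deg, R, step_deg=5.0):
    """Approximate the pie-slice F: camera point plus arc points 5 degrees apart."""
    pts = [(px, py)]
    a = alpha_deg - theta_deg / 2.0
    while a <= alpha_deg + theta_deg / 2.0 + 1e-9:
        rad = math.radians(a)
        # bearing convention: x grows East (sin), y grows North (cos)
        pts.append((px + R * math.sin(rad), py + R * math.cos(rad)))
        a += step_deg
    return Polygon(pts)

def overlap_region(fov_poly: Polygon, query_poly: Polygon):
    """Return the overlap geometry O(F, Q), or None when F does not cover Q."""
    inter = fov_poly.intersection(query_poly)
    return None if inter.is_empty else inter
```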
6.2.2 Three Metrics to Describe the Relevance of a Video

We propose three fundamental metrics to describe the relevance of a video V_k with respect to a user query Q as follows:

1. Total Overlap Area (R_TA). The area of the region formed by the intersection of Q and V_k^F. This quantifies what portion of Q is covered by V_k^F, emphasizing spatial relevance.

2. Overlap Duration (R_D). The time duration of the overlap between Q and V_k^F in seconds. This quantifies how long V_k^F overlaps with Q, emphasizing temporal relevance.

3. Summed Area of Overlap Regions (R_SA). The summation of the overlap areas of the intersecting FOVScenes during the overlap interval. This strikes a balance between spatial and temporal relevance.

6.2.2.1 Total Overlap Area (R_TA)

The total overlap area of O(V_k^F, Q) is given by the smallest convex polygon which covers all overlap regions formed between V_k^F and Q. This boundary polygon can be obtained by constructing the convex envelope enclosing all corner points of the overlap regions. Equation 6.2 formulates the computation of the total overlap coverage. The function ConvexHull provides a tight and fast approximation of the total overlap coverage. It approximates the boundary polygon by constructing the convex hull of the polygon corner points, where each point is represented as a ⟨latitude, longitude⟩ pair. Figure 6.1 shows examples of the total overlap coverage between the query Q_207 and videos V_46 and V_108. The total overlap area is calculated as follows.

O(V_k^F, Q) = ConvexHull( ⋃_{i=1}^{n} { O(V_k^F(t_i), Q) } )
            = ConvexHull( ⋃_{i=1}^{n} ⋃_{j=1}^{|O(V_k^F(t_i),Q)|} { (lon_j^{t_i}, lat_j^{t_i}) } )    (6.2)

Subsequently, the Relevance using Total Overlap Area (R_TA) is given by the area of the overlap boundary polygon O(V_k^F, Q), computed as:

R_TA(V_k^F, Q) = Area(O(V_k^F, Q))    (6.3)

where the function Area returns the area of the overlap polygon O(V_k^F, Q). A higher R_TA value implies that a video captures a larger portion of the query region Q and therefore its relevance with respect to Q can be higher.

6.2.2.2 Overlap Duration (R_D)

The Relevance using Overlap Duration (R_D) is given by the total time in seconds that V_k^F overlaps with query Q. Equation 6.4 formulates the computation of R_D.

R_D(V_k^F, Q) = ∑_{i=1}^{n} (t_{i+1} − t_i)   for i when O(V_k^F(t_i), Q) ≠ ∅    (6.4)

R_D is obtained by summing the overlap time of each F in V_k^F with Q. We estimate the overlap time of each F as the difference between the timestamps of two sequential Fs. When the duration of the overlap is long, the video will capture more of the query region and therefore its relevance will be higher. For example, a camera may not move for a while; the spatial query overlap will not change, but the video will most likely be very relevant.

6.2.2.3 Summed Area of Overlap Regions (R_SA)

Total Overlap Area and Overlap Duration capture the spatial and temporal extent of the overlap, respectively. However, both relevance metrics express only the properties of the overall overlap and do not describe how individual FOVScenes overlap with the query region. For example, in Figure 6.1, for videos V_46 and V_108, although R_TA(V_46^F, Q_207) ≅ R_TA(V_108^F, Q_207) and R_D(V_46^F, Q_207) ≅ R_D(V_108^F, Q_207), V_108^F overlaps with around 80% of the query region Q_207 during the whole overlap interval, whereas V_46^F overlaps with only 25% of Q_207 for most of its overlap interval and overlaps with 80% of Q_207 only for the last few FOVScenes. In order to differentiate between such videos, we propose the Relevance using Summed Overlap Area (R_SA) as the summation of the areas of all overlap regions during the overlap interval. Equation 6.5 formalizes the computation of R_SA for video V_k^F and query Q.

R_SA(V_k^F, Q) = ∑_{i=1}^{n} ( Area(O(V_k^F(t_i), Q)) · (t_{i+1} − t_i) )    (6.5)

Here, the function Area returns the area of the overlap polygon O(V_k^F(t_i), Q). The summed overlap area for a single F is obtained by multiplying its overlap area by its overlap time. Recall that the overlap time of each F is estimated as the difference between the timestamps of two sequential Fs. The summation of the summed overlap areas of all overlapping Fs provides the R_SA score for the video V_k^F.
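Given the per-F overlap regions, the three scores can be sketched compactly (again assuming the shapely package; the names and the input format are illustrative assumptions).

```python
# Compact sketch of the three relevance scores; `overlaps` is a list of
# (t_i, t_next, overlap_geometry) triples, one per F, with overlap_geometry
# set to None when F(t_i) does not cover Q.
from shapely.ops import unary_union

def relevance_scores(overlaps):
    covered = [(t1, t2, g) for (t1, t2, g) in overlaps if g is not None]
    if not covered:
        return 0.0, 0.0, 0.0
    # R_TA: area of the convex hull enclosing all overlap regions (Eqs. 6.2, 6.3).
    r_ta = unary_union([g for (_, _, g) in covered]).convex_hull.area
    # R_D: total overlap duration in seconds (Eq. 6.4).
    r_d = sum(t2 - t1 for (t1, t2, _) in covered)
    # R_SA: per-F overlap area weighted by that F's duration (Eq. 6.5).
    r_sa = sum(g.area * (t2 - t1) for (t1, t2, g) in covered)
    return r_ta, r_d, r_sa
```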
6.2.3 Ranking Videos Based on Relevance Scores

Algorithm 4: [R_TA, R_SA, R_D] = CalculateRankScores(k, Q)
 1: Q <- given convex polygon-shaped query region
 2: k <- video id
 3: VF_k = Load(V_k)  {load FOVScene descriptions from disk}
 4: n = |VF_k|  {n is the number of FOVScenes in VF_k}
 5: M = U_{i=1..n} MBR(VF_k(t_i))  {M is the MBR that encapsulates the whole video file}
 6: if RectIntersect(M, Q) is true then  {filter step 1}
 7:   for i <- 0 to (n-1) do
 8:     M1 = MBR(VF_k(t_i))
 9:     if RectIntersect(M1, Q) is true then  {filter step 2}
10:       if SceneIntersect(Q, VF_k(t_i)) is true then  {filter step 3}
11:         Opoly = OverlapBoundary(VF_k(t_i), Q)
12:         R_TApoly = R_TApoly U Opoly
13:         R_SA += Area(Opoly) * (t_{i+1} - t_i)
14:         R_D += t_{i+1} - t_i
15:       end if
16:     end if
17:   end for
18: end if
19: R_TA = Area(ConvexHull(R_TApoly))

Algorithm 4 outlines the computation of the proposed relevance metrics R_TA, R_SA, and R_D for a given video V_k and query Q. Note that the relevance score of V_k is computed only when VF_k overlaps with Q. In Algorithm 4, we apply a tri-level filtering step (lines 6, 9, and 10) to effectively eliminate irrelevant videos and video segments. First, we check whether query Q overlaps with the MBR enclosing all of VF_k. If so, we look for the F regions whose MBRs overlap with Q. Finally, we further refine the overlapping F regions by checking the overlap between query Q and the actual VF_k. Such a filtering process improves computational efficiency by gradually eliminating the majority of the irrelevant video sections (see Section 6.3.3). Algorithm 4 calls the subroutine MBR, which computes the minimum bounding rectangle of a given F region. The functions RectIntersect(M, Q) and SceneIntersect(Q, VF_k(t_i)) return true if the given query Q overlaps with the rectangle M or the FOVScene VF_k(t_i), respectively. A detailed outline of SceneIntersect can be found in Section 5.2.1.
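For illustration, the filter-and-refine flow of Algorithm 4 could be prototyped as below. The sketch assumes FOV polygons such as those produced by the earlier fov_polygon example (Shapely, planar coordinates); it folds filter step 3 and the exact overlap computation into a single intersection call, and it is not our optimized implementation.

```python
from shapely.geometry import MultiPoint, box

def calculate_rank_scores(fovs, Q):
    """fovs: time-ordered list of (timestamp, fov_polygon) for one video.
    Q: convex query polygon. Returns (R_TA, R_SA, R_D) following Algorithm 4."""
    corner_pts, r_sa, r_d = [], 0.0, 0.0
    all_bounds = [f.bounds for _, f in fovs]
    video_mbr = box(min(b[0] for b in all_bounds), min(b[1] for b in all_bounds),
                    max(b[2] for b in all_bounds), max(b[3] for b in all_bounds))
    if not video_mbr.intersects(Q):              # filter step 1: whole-video MBR
        return 0.0, 0.0, 0.0
    for i in range(len(fovs) - 1):
        (t_i, f), (t_next, _) = fovs[i], fovs[i + 1]
        if not box(*f.bounds).intersects(Q):     # filter step 2: per-FOVScene MBR
            continue
        opoly = f.intersection(Q)                # filter step 3 plus refinement
        if opoly.is_empty or opoly.geom_type != "Polygon":
            continue
        corner_pts.extend(opoly.exterior.coords)
        r_sa += opoly.area * (t_next - t_i)
        r_d += t_next - t_i
    r_ta = MultiPoint(corner_pts).convex_hull.area if corner_pts else 0.0
    return r_ta, r_sa, r_d
```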
These proposed metrics describe the most basic relevance criteria that a typical user will be interested in. R_TA defines relevance based on the area of the covered region in query Q, whereas R_D defines relevance based on the length of the video section that captures Q. R_SA includes both the area and the duration of the overlap in the relevance calculation, i.e., the larger the overlap, the bigger the R_SA score; similarly, the longer the overlap duration, the more overlap polygons are included in the summation. Since each metric bases its relevance definition on a different criterion, we cannot expect to obtain an identical ranking from all three metrics. Furthermore, without feedback from users it is difficult to ascertain whether one of them is superior to the others. However, a particular metric can provide the best ranking when the query is specific about the properties of the videos that the user is looking for. For example, in video surveillance systems the videos that give the maximum coverage extent within the query region will be more relevant, so metric R_TA will provide the most accurate ranking. In real estate applications, users often would like to see as much detail as possible about the property; both the extent and the time of coverage are important, so metric R_SA will provide a good ranking. And in traffic monitoring systems, where the cameras are mostly stationary, the duration of the video that captures an accident event is more significant in calculating relevance, so metric R_D will produce the best ranking.

Based on the query specification, either a single metric or a combination of the three can be used to obtain the video ranking. Calculating the weighted sum of several relevance metrics (Equation 6.6) is a common technique to obtain an ensemble ranking scheme:

Relevance(VF_k, Q) = w_1 R_TA(VF_k, Q) + w_2 R_D(VF_k, Q) + w_3 R_SA(VF_k, Q)   (6.6)

To obtain the optimal values of the weights w_1, w_2 and w_3 we need a training dataset which provides an optimized ranking based on several metrics. However, constructing a reliable training dataset for georeferenced videos is not trivial and requires careful and tedious manual work. There is extensive research on content-based classification and ranking of videos using Support Vector Machines (SVM) and other classifiers, which train their models using publicly available evaluation data (for example, the TRECVID benchmark dataset). There is a need for a similar effort to create public training data for georeferenced videos. In Section 6.3 we will present results obtained by applying the individual metrics to calculate the relevance score of a video.
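A minimal sketch of such an ensemble score following Equation 6.6 is shown below; the weights and the min-max normalization (needed because the three metrics have different units) are illustrative choices, not values learned from training data.

```python
def normalize(scores):
    """Min-max normalize a {video_id: score} map to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {v: (s - lo) / span for v, s in scores.items()}

def combined_relevance(r_ta, r_d, r_sa, w=(0.4, 0.3, 0.3)):
    """Weighted sum of the three (normalized) metrics, following Eq. 6.6."""
    n_ta, n_d, n_sa = normalize(r_ta), normalize(r_d), normalize(r_sa)
    return {v: w[0] * n_ta[v] + w[1] * n_d[v] + w[2] * n_sa[v] for v in r_ta}

# Example using the Q_207 scores of three videos from Table 6.3.
r_ta = {46: 0.087, 108: 0.084, 43: 0.063}
r_d  = {46: 61, 108: 65, 43: 42}
r_sa = {46: 0.558, 108: 1.726, 43: 0.813}
print(sorted(combined_relevance(r_ta, r_d, r_sa).items(), key=lambda kv: -kv[1]))
```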
The visual content of the videos can also be leveraged in the ranking process to improve accuracy. For example, for the Kibbie Dome query, the video segments in the search results might be analyzed to check whether the view of the camera is occluded by objects such as trees, cars, etc. We can adopt state-of-the-art concept detectors [YCKH07] to identify such objects within the video content. The video frames where the camera view is occluded can then be weighted less when calculating the spatial and temporal overlap for the metrics R_TA, R_SA and R_D. In addition to content-based features, text labels extracted from video filenames, surrounding text, and social tags can be useful in video ranking. We plan to elaborate on customized multi-modal ranking schemes for georeferenced video data as part of our future research work.

6.2.4 A Histogram Approach for Calculating Relevance Scores

The ranking methods introduced in Section 6.2.2 calculate the overlap regions of every overlapping FOVScene to obtain the video relevance scores. Since the overlap region computation is computationally expensive, these techniques are often not practical for large-scale applications. Thus, we also introduce several histogram-based ranking techniques that provide comparable ranking results while dramatically improving the query response time. Using a predefined grid structure, a histogram pre-computes and stores the amount of overlap between a video's FOVScenes and the grid cells. During query execution only the histogram data is accessed and queried.

Histogram-based ranking techniques not only enable faster query computation, but also provide additional information about how densely the video overlaps with the query. Although the exact shape of the overlap region is calculated for each individual F in the previous section, the computed relevance scores do not give the distribution of the overlap throughout the query region, i.e., which parts of the query region are captured frequently in the video and which parts are captured only in a few frames. The distribution of the overlap density can be meaningful in gauging a video's relevance with respect to a query and in answering user-customized queries, and therefore should be stored.

This section describes how we build the overlap histograms (OH) for videos and then presents our histogram-based ranking algorithms that rank video search results using the histogram data. The histogram-based relevance scores are analogous to the precise relevance scores, except that the precise ranking techniques calculate the overlap amounts for every user query, whereas the histogram-based techniques use the pre-computed overlap information to obtain the rankings.

We first partition the whole geospace into disjoint grid cells such that their union covers the entire service space. Let Grid = \{ c_{i,j} : 1 \le i \le M and 1 \le j \le N \} be the set of cells of the M x N grid covering the space. Given the F descriptions VF_k of video V_k, the set of grid cells that intersect with a particular VF_k(t_i) can be identified as:

VG_k(t_i) = \{ c_{m,n} : c_{m,n} overlaps with VF_k(t_i) and c_{m,n} \in Grid \}   (6.7)

VG_k(t_i) is the set of grid cells overlapping with VF_k(t_i) at time t_i, i.e., a grid representation of an F. To obtain VG_k(t_i), we search for the cells that overlap with the borderline of VF_k(t_i) and then include all other cells enclosed between the border cells. Then, VG_k is a grid representation of VF_k, which is the collection of VG_k(t_i), 1 \le i \le n.

The histogram for VG_k, denoted as OH_k, consists of the grid cells C_k = \bigcup_{i=1}^{n} VG_k(t_i). For each cell c_j in C_k, OverlapHist counts the number of F samples that c_j overlaps with. In other words, it calculates the appearance frequency f_j of c_j in VG_k (Equation 6.8):

f_j = \mathrm{OverlapHist}(c_j, VG_k) = \mathrm{Count}\big( c_j, \{ VG_k(t_i) : \text{for all } i, 1 \le i \le n \} \big)   (6.8)

The function Count calculates the number of VG_k(t_i) sets that cell c_j appears in. Note that OverlapHist describes only the spatial overlap between the grid and the video FOVScenes. In order to calculate the time-based relevance scores we also need a histogram that summarizes the overlap durations. OverlapHistTime constructs the set of time intervals during which c_j overlaps with VG_k; a set I_j holds these overlap intervals as <starting time, overlap duration> pairs. Then, the histogram of VF_k, i.e., OH_k, consists of grid cells, each with an attached appearance frequency value and a set of overlap intervals.

Example 1: The histogram of video clip V_k is constructed as follows:
OH_k = { <c_1, f_1, I_1>, <c_2, f_2, I_2>, <c_3, f_3, I_3> }
     = { <(2,3), 3, {<2,8>, <10,7>, <20,5>}>, <(3,3), 1, {<10,7>}>, <(4,3), 1, {<10,7>}> }.
This histogram consists of three grid cells c_1, c_2, and c_3, appearing 3, 1, and 1 times in VG_k, respectively. Cell c_1 appears in three video segments: one starts at 2 and lasts for 8 seconds, another starts at 10 and lasts for 7 seconds, and the third starts at 20 and lasts for 5 seconds. Cell c_2 appears once, starting at 10 and lasting for 7 seconds, as does c_3. Figure 6.4 shows two example histograms, where different frequency values within the histograms are visualized with varying color intensities.
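The construction of OH_k can be sketched as follows. The code assumes that the per-FOVScene grid-cell sets VG_k(t_i) have already been derived (e.g., by rasterizing the FOV polygons onto the grid) and that consecutive samples are one second apart; it is an illustration of the idea rather than our implementation.

```python
from collections import defaultdict

def build_overlap_histogram(vg_samples, dt=1.0):
    """vg_samples: time-ordered list of (t_i, set_of_cells) = VG_k(t_i).
    Returns OH_k as {cell: (frequency, [(start_time, duration), ...])},
    where frequency counts the F samples a cell appears in (Eq. 6.8) and the
    interval list is the <start, duration> set built by OverlapHistTime."""
    freq = defaultdict(int)
    intervals = defaultdict(list)
    for t_i, cells in vg_samples:
        for c in cells:
            freq[c] += 1
            if intervals[c] and intervals[c][-1][0] + intervals[c][-1][1] >= t_i:
                start, _ = intervals[c][-1]            # contiguous: extend interval
                intervals[c][-1] = (start, (t_i + dt) - start)
            else:
                intervals[c].append((t_i, dt))         # gap: open a new interval
    return {c: (freq[c], intervals[c]) for c in freq}

# A cell seen at t = 0, 1, 2 and again at t = 10, 11 (one-second sampling).
samples = [(t, {(2, 3)}) for t in (0, 1, 2, 10, 11)]
print(build_overlap_histogram(samples))   # {(2, 3): (5, [(0, 3.0), (10, 2.0)])}
```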
Our histogram-based implementation counts the number of overlaps between the Fs and the grid cells, and therefore the histogram bins can only take integer values. Alternatively, for the histogram cells that only partially overlap with the Fs, we might use floating point values that quantify the amount of overlap. Allowing floating point histogram bins would improve the precision of the R^G_SA metric by assigning lower relevance scores to videos that partially overlap with the query region compared to those that fully overlap with it. However, the storage and indexing of floating point numbers might introduce additional computational overhead when the histogram is fairly large. Also note that the gain in precision from floating point histogram bins is highly dependent on the size of the histogram cells. The tradeoff between precision and performance should be explored through careful analysis, and to obtain reliable results the performance evaluations should be done using a large video dataset.

6.2.4.1 Execution of Geospatial Range Queries Using Histograms

Given a polygon-shaped query region Q, we first represent Q as a group of grid cells in geospace:

Q^G = \{ \text{all grid cells that overlap with } Q \}   (6.9)

We refine the definition of the overlap region as the set of overlapping grid cells (O^G) between VG_k and Q^G. Using the histogram of VG_k (i.e., OH_k), the overlapping grid cell set can be defined as:

O^G(VG_k, Q^G) = \{ (C_k \text{ of } OH_k) \cap Q^G \}   (6.10)

Figure 6.3 shows the grid representation of an overlap polygon, i.e., O^G. Note that the grid cells in O^G inherit their corresponding frequencies and intervals from OH_k. Let Q^G be a query region that consists of four grid cells, Q^G = {<2,2>, <2,3>, <3,2>, <3,3>}. Then, the overlapping cells with the video in Example 1 become:
O^G(VG_k, Q^G) = { <(2,3), 3, {<2,8>, <10,7>, <20,5>}>, <(3,3), 1, {<10,7>}> }.

Figure 6.3: Grid representation of the overlap polygon

6.2.4.2 Histogram Based Relevance Scores

Using the grid-based overlap region O^G, we redefine the three relevance metrics proposed in Section 6.2.2.

Total Overlap Cells (R^G_TA): R^G_TA is the extent of the overlap region on Q^G, i.e., how many cells in Q^G overlap with VG_k. Thus, R^G_TA is simply the cardinality of the overlapping set O^G(VG_k, Q^G). In Example 1, R^G_TA = 2.

Overlap Duration (R^G_D): The duration of the overlap between a query Q^G and VG_k can be calculated directly from the interval sets in OH_k built by OverlapHistTime:

R^G_D(VG_k, Q^G) = \mathrm{CombineIntervals}(OH_k)   (6.11)

The function CombineIntervals combines the intervals in the histogram. Note that there may be time gaps when the intervals of some cells are disjoint, and there are also overlapping time durations across cells. In Example 1, R^G_D = 20 seconds.

Summed Number of Overlapping Cells (R^G_SA): R^G_SA is the total time of cell overlap occurrences between VG_k and Q^G, and is therefore a measure of how many cells in Q^G are covered by video VG_k and for how long each overlapping cell is covered. Since the histogram of a video already holds the appearance frequencies f of all overlapping cells, R^G_SA becomes:

R^G_SA(VG_k, Q^G) = \sum_{i=1}^{|O^G(VG_k, Q^G)|} \sum_{j=1}^{f_i} \mathrm{SumIntervals}(c_i, I_j)   (6.12)

where SumIntervals adds up all intervals of an overlapping grid cell c_i. In Example 1, R^G_SA = 27.

As mentioned in the previous sections, a histogram gives the overlap distribution within the query region with discrete numbers. Knowing the overlap distribution is helpful for interactive video search applications where a user might further refine the search criteria and narrow down the search results.
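The three histogram-based scores can be computed from OH_k without touching the original FOVScene geometry, as in the sketch below. The histogram layout follows the earlier build_overlap_histogram example, and the interval-union logic stands in for CombineIntervals; this is illustrative only.

```python
def histogram_scores(oh_k, query_cells):
    """oh_k: {cell: (frequency, [(start, duration), ...])}; query_cells: cells of Q^G.
    Returns (R_G_TA, R_G_D, R_G_SA) as defined in Section 6.2.4.2."""
    overlap = {c: oh_k[c] for c in oh_k if c in query_cells}    # O^G(VG_k, Q^G)
    r_g_ta = len(overlap)                                       # number of cells
    r_g_sa = sum(d for _, ivs in overlap.values() for _, d in ivs)
    # R_G_D: length of the union of all intervals (CombineIntervals)
    spans = sorted((s, s + d) for _, ivs in overlap.values() for s, d in ivs)
    r_g_d, cur_start, cur_end = 0, None, None
    for s, e in spans:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                r_g_d += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        r_g_d += cur_end - cur_start
    return r_g_ta, r_g_d, r_g_sa

# Example 1 restricted to Q^G = {(2,2),(2,3),(3,2),(3,3)} yields (2, 20, 27).
oh = {(2, 3): (3, [(2, 8), (10, 7), (20, 5)]),
      (3, 3): (1, [(10, 7)]), (4, 3): (1, [(10, 7)])}
print(histogram_scores(oh, {(2, 2), (2, 3), (3, 2), (3, 3)}))
```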
6.3 Experimental Evaluation

6.3.1 Data Collection and Methodology

In this section we evaluate the proposed ranking algorithms using our real-world dataset collected in Moscow, Idaho. The data collection process is explained in Section 4.4.2. The captured data covers a 6 km by 5 km region quite uniformly. The test dataset includes 134 videos, ranging from 60 to 240 s in duration. An F was collected every second (i.e., one F per 30 frames of video), resulting in 10,652 Fs in total. We generated 250 random spatial range queries with a fixed query rectangle of 300 m by 300 m within this 6 km by 5 km region. Although our system also supports temporal queries, we used only spatial queries in the experiments. The videos in our dataset were captured using a single camera on different days. Therefore, temporal queries return only a few results unless the query time interval is very large, and no time overlap exists. We found such queries not very illustrative for demonstrating the benefits of the proposed ranking algorithms.

For each spatial range query, we searched the georeferenced meta-data to find the videos that overlap with that query (lines 6, 9, and 10 in Algorithm 4). We then calculated the relevance scores based on the three metrics proposed in Section 6.2.2. The rank lists RL_TA, RL_SA and RL_D are constructed from the relevance metrics R_TA, R_SA and R_D, respectively. A rank list is a list of video clips sorted in descending order of their relevance scores.

In order to evaluate the accuracy of the rankings produced by our proposed schemes, one needs a "ground truth" rank order for comparison. Unfortunately, there exists no well-defined, publicly available georeferenced video dataset (similar to the TRECVID benchmark evaluation data [SOK06]) that can be referenced for comparison. Therefore, we first analyzed and compared the rankings from the proposed schemes with each other. Next, we independently conducted experiments in which human judges ranked the results. Finally, by comparing the results of the user study with those of the proposed schemes, we evaluated the accuracy of our ranking schemes.

We conducted two sets of experiments to evaluate the ranking accuracy of the proposed methods:

1. Experiment 1: We compared the rankings RL_TA, RL_SA and RL_D with each other across the whole set of 250 queries.

2. Experiment 2: Among the 250 random queries we selected 25 easily recognizable query regions and asked human judges to rate each video file on a four-point scale ranging from "3 - highly relevant" down to "0 - irrelevant." We compared our results to the user-provided relevance labels over these 25 queries.

6.3.1.1 Evaluation Metrics

Since each ranking scheme interprets relevance in a different way, the schemes are not expected to produce identical ranking orders. However, we conjecture that they all should contain similar sets of video clips within the top N of their rank lists (for some N). A similar result from all three ranking algorithms would indicate that the resulting videos are the most interesting to the user. To compare the accuracy of the results, we adopted the Precision at N (P(N)) metric, a popular measure that describes the fraction of relevant videos ranked in the top N results. We redefine P(N) as the fraction of common videos ranked within the top N results of more than one rank list. Note that the exact rank of videos within the top N is irrelevant in this metric. P(N) only shows the precision of a single query. Therefore, to measure the average precision over multiple queries, we use the Mean Average Precision (MAP), which is the mean of the P(N) values from multiple queries. We evaluated the results of Experiment 1 with MAP scores.
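A small sketch of this redefined P(N) and of its average over queries is given below; the function names and the toy rank lists are ours for illustration.

```python
def precision_at_n(rank_lists, n):
    """Fraction of the top-n videos that are common to all given rank lists."""
    top_sets = [set(rl[:n]) for rl in rank_lists]
    common = set.intersection(*top_sets)
    return len(common) / float(n)

def mean_average_precision(per_query_rank_lists, n):
    """MAP at n: the mean of P(n) over all queries."""
    scores = [precision_at_n(rls, n) for rls in per_query_rank_lists]
    return sum(scores) / len(scores)

# Two toy queries, each with its RL_TA, RL_SA and RL_D lists of video ids.
q1 = [[46, 108, 43, 107], [108, 43, 46, 42], [108, 46, 43, 107]]
q2 = [[12, 7, 3, 9], [12, 3, 7, 1], [7, 12, 3, 9]]
print(mean_average_precision([q1, q2], n=3))   # 1.0: the top-3 sets agree in both queries
```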
For Experiment 2, which involves human judgement, a second evaluation metric termed Discounted Cumulated Gain (DCG) was used in addition to MAP [JK02]. DCG systematically combines the video rank order and the degree of relevance. The discounted cumulated gain vector DCG is defined as

DCG[i] = G[1]                              if i = 1
DCG[i] = DCG[i-1] + G[i] / \log_e i        otherwise

where G is the gain vector which contains the gain values of the ranked videos in order. The gain values correspond to the user-assigned relevance labels ranging from 0 to 3. Note that, because of the decaying denominator log_e i, a video with a high relevance label listed at a top rank will dramatically increase the DCG sum, whereas a video with a high relevance label listed lower in the rank list will not contribute much. The idea is to favor the top-positioned videos, as they should be the most relevant for a user. An optimal ordering, where all highly relevant videos are ranked at the top and less relevant clips are listed lower in the rank list, produces the ideal DCG vector. The Normalized DCG (NDCG) is the final DCG sum normalized by the DCG of the ideal ordering. The higher the NDCG of a given ranking, the more accurate it is.

6.3.2 Comparison of Ranking Accuracy

6.3.2.1 Comparison of Proposed Ranking Schemes

We compare the ranking accuracy of RL_TA, RL_SA and RL_D using MAP scores. In Table 6.2, the first row reports the MAP values as the average fraction of videos that are common to all three rank lists within the top 1, 2, 5, 10, 15 and 20 ranked results over all 250 queries. The second, third and fourth rows display the MAP scores pair-wise for two methods each: RL_TA and RL_SA, RL_TA and RL_D, and RL_SA and RL_D.

Table 6.2: Comparison of the proposed ranking methods RL_TA, RL_SA and RL_D. Each entry is the MAP at N, i.e., |topN(.) intersect topN(.)| / N averaged over the 250 queries.

Compared rank lists        N=1    N=2    N=5    N=10   N=15   N=20
RL_TA, RL_D and RL_SA      0.60   0.789  0.918  0.993  0.999  1.0
RL_TA and RL_SA            0.727  0.839  0.961  0.993  1.0    1.0
RL_TA and RL_D             0.677  0.842  0.933  0.987  0.999  1.0
RL_SA and RL_D             0.745  0.885  0.947  0.987  1.0    1.0

The results show that the precision increases as N grows and reaches a nearly perfect score beyond N = 10. Note that the precision is very high even at N = 5. This implies that all three proposed schemes identify the most relevant videos similarly. Table 6.3 displays RL_TA, RL_SA and RL_D for a specific query, Q_207.

The rank differences between RL_TA, RL_SA and RL_D are mainly due to their different interpretations of relevance. To further understand the differences and similarities between the rankings, consider V_46 and V_108 in Table 6.3. Both videos overlap with Q_207 for almost the same duration and both cover almost the whole query region, so the R_TA and R_D scores of the two videos are very close. However, R_SA for V_108 is much higher than for V_46. To investigate the difference, we built the overlap histograms OH_46 and OH_108 and extracted the cells that overlap with query Q_207. Color-highlighted visualizations of O^G(VG_46, Q^G) and O^G(VG_108, Q^G) are shown in Figure 6.4. The higher color intensity in the center of Figure 6.4b shows that V_108 covers the middle part of Q_207 intensely.

Table 6.3: The ranked video results and relevance scores obtained for query Q_207

Rank   RL_TA   R_TA (km^2)   RL_SA   R_SA (km^2*s)   RL_D   R_D (s)
1      46      0.087         108     1.726           108    65
2      108     0.084         43      0.813           46     61
3      43      0.063         46      0.558           43     42
4      107     0.055         42      0.359           107    38
5      42      0.052         107     0.338           133    31
6      131     0.045         131     0.291           131    25
7      132     0.045         133     0.135           42     18
8      133     0.038         132     0.087           106    16
9      109     0.022         109     0.073           118    11
10     118     0.018         118     0.045           109    10
11     47      0.004         106     0.025           44     6
12     106     0.004         44      0.008           132    5
13     44      0.001         47      0.004           47     1
14     65      0.001         65      0.001           65     1

Figure 6.4: Color-highlighted visualizations of the overlap histograms for videos V_46 (a) and V_108 (b)

Even though the results among the ranking methods vary somewhat, at this point we do not favor any specific approach.
We believe that each ranking scheme emphasizes a different aspect of relevance; therefore, query results should be customized based on user preferences and application requirements.

6.3.2.2 Comparison with User Feedback

This set of experiments aims to evaluate the accuracy of our ranking methods by comparing the algorithmic results with user-provided relevance feedback. Our main intention in performing the user study is to check whether the results of the proposed ranking metrics make sense to people. Our methodology therefore does not live up to the rigorous process usually attributed to scientific user studies. Relevance judgements were made by a student familiar with the region where the videos were captured. We selected a subset of 25 query regions from the total of 250 queries, specifically those that identified relatively prominent geographical features. Although the queries were given to the user as latitude/longitude coordinates, she had a good idea of what visual content to expect because she was very familiar with the given query regions. She also used other visual tools, such as Google Maps Street View [goob], to familiarize herself with the regions that the queries cover. The selected 25 queries returned a total of 103 different videos; each query returned 14 videos on average. The user manually analyzed all 103 videos in random order and evaluated their relevance for each of the 25 queries. The user was asked to rate the relevance on a four-point scale: "3 - highly relevant", "2 - relevant", "1 - somewhat relevant" and "0 - irrelevant". The trajectories of the camera movements were displayed on a map for these 103 videos to aid the user in the evaluation. Finally, the user created a rank list for every query.

We compared the rankings RL_TA, RL_SA and RL_D to the user rankings using the DCG and NDCG metrics for the 25 selected queries. The average of the DCG vectors from those 25 queries was used for the comparisons. Figure 6.5 shows the DCG curves for the rank lists RL_TA, RL_SA, RL_D and the USER curve for ranks 1 through 16. The USER curve corresponds to the DCG vector based on the user rankings. Clearly, the DCG curves of the proposed schemes closely match the USER DCG evaluation.

Figure 6.5: Discounted Cumulated Gain (DCG) curves (DCG vs. rank) for RL_TA, RL_SA, RL_D and the user ranking

Next, the NDCG scores with respect to the user results were calculated. The NDCG scores of RL_TA, RL_SA and RL_D were 0.975, 0.951 and 0.921, respectively. All scores are close to 1, which implies that all three schemes are highly successful in ranking the most relevant videos at the top, similar to human judgement. We observed that rank differences mostly occurred in the ratings of less relevant videos.
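For reference, the DCG and NDCG values used in this comparison can be computed as in the following sketch; the labels and the ranked list are illustrative only.

```python
import math

def dcg(gains):
    """Discounted cumulated gain vector for user-assigned gains (labels 0-3)."""
    out = []
    for i, g in enumerate(gains, start=1):
        out.append(g if i == 1 else out[-1] + g / math.log(i))
    return out

def ndcg(gains):
    """Final DCG normalized by the DCG of the ideal (descending-label) ordering."""
    return dcg(gains)[-1] / dcg(sorted(gains, reverse=True))[-1]

# Relevance labels of the videos in the order produced by one ranking, for one query.
labels = [3, 2, 3, 0, 1, 2]
print(dcg(labels)[-1], ndcg(labels))
```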
Recall that DCG and NDCG reward relevant videos in the top ranked results more heavily than those closer to the bottom. The high NDCG scores further lend credibility to the claim that the proposed ranking methods successfully identify the most relevant videos. Among the proposed schemes, the highest accuracy was obtained consistently by RL_TA at all levels. We conjecture that this is because human perception of relevance is related more to how clearly one can see an object (i.e., spatial perception) than to how long one sees it (i.e., temporal perception). Note that all three ranking schemes describe different properties of a video clip, and result ranking is complicated by the fact that importance is subjective to users and may be application dependent as well. However, within our methodology, Figure 6.5 clearly shows that RL_TA achieves the best overall accuracy among the three with respect to the user ranking.

We are aware that user judgement can be subjective and may be prone to errors. The methodology used might not satisfy all the requirements of an intensive user study; therefore, the results only provide an indication of what one might expect. More conclusive results could be obtained by performing an intensive user study with a far higher number of human judges, videos and queries. Such an extensive study is out of the scope of this dissertation.

6.3.3 Evaluating the Computational Performance

This section evaluates the computational cost of the proposed schemes. In Algorithm 4, four steps account for most of the computational cost: (1) loading the georeferenced data, i.e., the FOVScene descriptions, from storage; (2) filtering to exclude videos with no overlap; (3) calculating the area of overlap between the FOVScenes that pass the filter step and the query region; and (4) computing the relevance scores for the three rank lists. For R_TA and R_SA, step 3 consumes approximately 60% of the execution time. In step 4, R_SA only needs to sum the areas of the overlapping polygons, whereas R_TA needs to compute the extent of all overlapping polygons; therefore, R_TA takes longer to construct its rank list. R_D is computationally the most efficient since it only extracts the time overlap and skips the overlap area calculation.

Using a 2.33 GHz Intel Core2 Duo computer, we measured the processing time of each scheme when evaluating the same 250 queries as in Section 6.3.2.1. The test data included 134 videos with a total duration of 175 minutes. An F was recorded for every second of video, for a total of 10,500 F representations used in the calculations.

The detailed processing time measurements for the major steps of the ranking schemes are summarized in Table 6.4. Steps 1 and 2 were required for all queries, and all 134 videos were processed per query. It is important to note that during query processing the vast majority of the videos were filtered out in the filter step. As shown in Table 6.4, for each query on average only 8.46 out of 134 videos were actually processed in steps 3 and 4; all other videos were excluded by the filter step. The details of the filtering are explained in Section 6.2.3.

Table 6.4: Measured computational time per query (average number of videos processed and average time per step, in seconds)

Step                                       RL_TA             RL_SA             RL_D
                                           videos  time (s)  videos  time (s)  videos  time (s)
1. Load FOVScene descriptions from disk    134     0.523     134     0.523     134     0.523
2. Filter step                             134     0.016     134     0.016     134     0.016
3. Calculate the area of overlap polygons  8.46    1.176     8.46    1.176     8.46    0.527
4. Calculate the relevance scores          8.46    0.367     8.46    0.097     8.46    0.100
Total time (s)                                     2.082             1.812             1.166
For a particular query, the average processing time required to construct RL_TA, RL_SA and RL_D was approximately 2.082, 1.812 and 1.166 seconds, respectively.

The query execution time depends on the number of FOVScenes to be processed, which also varies by query. Thus, we next examined how the processing time changes as the number of videos increases. Figure 6.6 shows that the processing time grows linearly as a function of the number of videos for the three rankings, i.e., as a function of the number of Fs. The small fluctuations were caused by differences in the number of Fs per video (i.e., |VF_k|) and by variations in the number of overlapping Fs processed after the filter step for a specific query. Consequently, we can compute the average time to process a single F per query as follows. With a processing time of 2.082 seconds per query over 134 videos (175 minutes in total, 10,500 Fs) for RL_TA, the average execution time per F per query was 0.198 ms. Similarly, it was 0.172 and 0.110 ms for RL_SA and RL_D, respectively. These numbers can be used to estimate the query processing time for a larger dataset. For example, when the size of the query range is the same but the number of FOVScenes increases to 100,000 (note that there is no direct relation between the number of Fs and the length of a video, because Fs can be sampled at various intervals), we can estimate the average query processing time for RL_TA as 0.198 ms x 100,000 = 19.8 s.

Figure 6.6: Processing time per query vs. the number of videos

In this study we focus on the ranking algorithms and do not aim to provide efficient methods for the retrieval and indexing of FOVScene descriptions from storage. The methods presented so far are not optimized for computational efficiency. It is worth noting that calculating the area of overlap between a pie-shaped FOVScene and a polygon-shaped query is computationally expensive and might not be practical for real-time applications. To address this challenge, we introduced the histogram-based ranking approach of Section 6.2.4.2, which can move most of the costly computation overhead to an offline pre-processing step, resulting in simpler and faster query processing. Next, we present our findings on the accuracy and efficiency of histogram-based ranking.

6.3.4 Evaluating Histogram based Ranking

We built the overlap histograms (OH_1 through OH_134) for all 134 videos as described in Section 6.2.4. The same 250 queries were processed using these histograms, and the relevance scores of the returned videos were calculated based on the metrics proposed in Section 6.2.4.2. Let RL^G_TA, RL^G_SA and RL^G_D be the rankings obtained from the relevance metrics R^G_TA, R^G_SA and R^G_D, respectively.

First, in order to evaluate the accuracy of RL^G_TA, RL^G_SA and RL^G_D, we compared them to the precise rankings RL_TA, RL_SA and RL_D and investigated the precision for various cell sizes. Recall that we use the exact area of the polygon overlap when calculating the relevance scores for the precise rankings, whereas the histogram approximates the overlap region with grid cells. Therefore, we use the rank lists RL_TA, RL_SA and RL_D as baselines for comparison. Figure 6.7 shows the results using the MAP metric for grid cell sizes varying from 25 m by 25 m to 200 m by 200 m.
Note that the size of the query range was 300 m by 300 m. The MAP results were averaged across all queries in the test. The results show that the precision of all three histogram-based rankings decreases roughly linearly as the cell size increases. This is expected, as a larger cell size means a coarser representation of the overlapping area. However, the degradation of the precision was not significant (especially considering the performance gain explained later) when the cell size is small and N is large. For example, for cell sizes smaller than 100 m by 100 m and N greater than two, the MAP score is greater than 0.9 in Figure 6.7b.

Figure 6.7: MAP at N (N = 1, 2, 3, 5, 7, 10) for (a) RL^G_TA, (b) RL^G_SA and (c) RL^G_D for cell sizes varying from 25 m to 200 m

Recall that the precision only represents the fraction of common videos and ignores the differences between the actual rank orders. Therefore, we also compared the order of the videos between the histogram rankings and the precise rankings. For each query, we obtained the videos that appear in both RL^G_TA and RL_TA and computed the mean absolute difference between their rank orders, ROdiff(q, i) = |RO(RL_TA(q, i)) - RO(RL^G_TA(q, i))|, where RO denotes the rank order. The results were further averaged over all 250 queries. We repeated the same procedure for RL^G_SA versus RL_SA and RL^G_D versus RL_D. Table 6.5 reports the results for different cell sizes. Note that false positives that might appear in the histogram ranking were not included in the rank order comparison. In Table 6.5, when the grid cell size is small, the mean order difference is as low as 0.2. Even for large cell sizes the mean order difference is around 1.2, which implies that on average each video was displaced by only about one position in the rank list.

Table 6.5: Rank order comparison between the histogram-based and the precise rank lists: average rank order difference (1/|Q|) \sum_q (1/|RL(q)|) \sum_i ROdiff(q, i), where |Q| is the number of queries

Rank lists compared       25m x 25m   50m x 50m   100m x 100m   200m x 200m
RL^G_TA vs. RL_TA         0.2243      0.3191      0.7508        1.1973
RL^G_SA vs. RL_SA         0.2345      0.4069      0.7655        1.1116
RL^G_D vs. RL_D           0.4378      0.7822      1.0366        1.5415
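The rank order comparison of Table 6.5 boils down to a mean absolute rank displacement over the common videos, which can be sketched as follows (an illustrative helper with our own naming):

```python
def mean_rank_displacement(precise, approx):
    """Mean absolute difference in rank position for the videos that appear
    in both the precise and the histogram-based rank list."""
    pos_p = {v: i for i, v in enumerate(precise)}
    pos_a = {v: i for i, v in enumerate(approx)}
    common = [v for v in precise if v in pos_a]
    return sum(abs(pos_p[v] - pos_a[v]) for v in common) / float(len(common))

# Toy example: one pair of videos is swapped between the two lists.
print(mean_rank_displacement([46, 108, 43, 107], [108, 46, 43, 107]))   # 0.5
```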
Next, we measured the query processing time of the histogram-based rankings and compared it with that of the precise rankings. Figure 6.8 illustrates the processing times per query with respect to the number of videos for both R_TA and R^G_TA when the cell size was 50 m by 50 m. Clearly, the histogram ranking is far superior to R_TA. This is because most of the costly overlap computations are performed while the OH is being built, as a pre-processing step (e.g., when the video is first loaded into the system). The OHs of all videos are constructed just once and shared by all queries, resulting in a short online query processing time. For example, the average query processing time was just 5% of that of R_TA, as shown in Figure 6.8. Similar results were obtained for the other ranking schemes.

Figure 6.8: Comparison of precise and histogram-based query processing (cumulative processing time per query vs. the number of videos)

As we have shown, the accuracy of the histogram ranking is highly dependent on the cell size. The smaller the grid cell size, the better an estimation the histogram OH achieves. However, the time to build the OH increases as the cell size shrinks. To understand this tradeoff, we investigated the precision of the rankings versus the computational cost of building the OHs while varying the cell size. We calculated the MAP scores at N = 10 for R^G_TA and recorded the processing times to build the OHs in seconds. Figure 6.9 shows the change in both precision and average processing time to build an OH for different cell sizes. As the cell size increases, the precision decreases linearly while the CPU time decreases exponentially. When the cell size exceeds 75 m x 75 m, the CPU time decreases only gradually, while the precision continues to drop steadily. Thus, we conclude that in our experiments a cell size between 50 m by 50 m and 75 m by 75 m provides a good tradeoff between the accuracy and the build overhead of the histograms.

Figure 6.9: Evaluation of computation time and precision as a function of the grid cell size

In our experiments we used a simple data structure to store the histogram data, which may not be optimal for searching large video collections. In a production environment, where hundreds of concurrent users might access the search engine, large amounts of video data will be added to or deleted from the system. In that case there would be a need for a storage and index structure that can efficiently handle updates to the histograms while ensuring short query processing times. Histogram-based indexing techniques have been well studied in database research [Ioa03], and hence the performance of our histogram-based ranking approach can be further improved by adopting an existing indexing technique.

One important and unique advantage of the histogram approach is that it describes both the extent and the density of the overlap between the video FOVScenes and the query region. By analyzing the overlap distribution in a histogram, users can further understand the results. Such information can also be quite useful in interactive video search, where the overlap density across the query region can guide the user to drill down to more specific queries. For example, a visualization of the histogram data similar to Figure 6.4 can be provided to the user for the top-ranked videos so that the user can interactively customize the query and easily access the information he or she is looking for.

6.4 Summary

In this chapter we investigated the challenging and important problem of ranking video search results based on the videos' spatial and temporal properties. We introduced three ranking algorithms that consider the spatial, temporal and combined spatio-temporal properties of georeferenced video clips. Our experimental results show that the rankings from the proposed R_TA, R_D and R_SA methods are very similar to results based on user feedback, which demonstrates the practical usability of the proposed relevance schemes. One drawback is that their demanding runtime computations may be a concern in large-scale applications. To improve the efficiency of our approach, we also proposed a histogram-based approach that dramatically improves the query response time by moving costly computations to a one-time pre-processing step.
The obtained results illustrate how the use of histograms provides high-quality ranking results while greatly reducing the execution time.

Chapter 7

GRVS: A Georeferenced Video Search Engine

7.1 Introduction

Sensors attached to cameras, such as GPS and digital compass devices, allow users to collect the geographic properties of camera scenes while video is being recorded. The captured geographic meta-data have significant potential to aid in the indexing and searching of georeferenced video data, especially in location-aware video applications.

Our system implementation demonstrates a prototype of a georeferenced video search engine (GRVS) that utilizes the viewable scene model for efficient video search. For video acquisition, we use our automated annotation software (see Figure 1.2), which captures videos and their respective viewable scenes (FOVScenes). We built a MySQL database with our real-world video datasets and developed a web-based search system to demonstrate the feasibility and applicability of the concept of georeferenced video search.

7.2 Search Engine

GRVS is a web-based video search engine that allows georeferenced videos to be searched by specifying geographic regions of interest. In a typical scenario, a user marks a query region on a map, and the search engine retrieves the video segments whose viewable scenes overlap with the user query area. In our current implementation the query is a rectangle (i.e., a range query). The search engine comprises three main components: (1) a database that stores the collected meta-data, (2) a media server that manages the videos, and (3) a web-based interface that allows the user to specify a query input region and then displays the query results. We implemented the engine using the following software: the LAMP stack (i.e., Linux, Apache, MySQL, PHP), the Wowza Media Server, and the Flowplayer [flo].

7.2.1 Database Implementation

When a user uploads videos into the GRVS system, the video meta-data is processed automatically and the viewable scene information is stored in a MySQL database. Each individual FOVScene is represented as a tuple based on the schema given in Table 7.1.

Table 7.1: Schema for the viewable scene (FOVScene) representation

Field          Description
Filename       Uploaded video file
FOVScene id    ID of the FOVScene
<Plat, Plng>   Latitude, longitude coordinates of the camera location (read from GPS)
theta          Camera view direction (read from compass)
R              Viewable distance
alpha          Angular extent of the camera field-of-view
ltime          Local time of the FOVScene
timecode       Timecode of the FOVScene in the video (extracted from the video)

We uploaded both of our real-world datasets, i.e., the videos captured in Moscow, Idaho and in downtown Singapore, to the search engine. The total duration of the two datasets is 17,982 s (about 300 min).

Once a query is issued, the video search algorithm scans the FOVScene tables to retrieve the video segments that overlap with the user-specified region of interest. Because of the irregular shapes of FOVScenes, we implemented several special-purpose MySQL User Defined Functions (UDFs) to find the relevant data; a separate UDF is implemented for each query type. Our initial search engine prototype supports spatial range queries (the query is a rectangular region). The user might specify additional query conditions, such as the view direction of the camera and the distance between the query location and the camera (directional and bounded distance queries).
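As an illustration of how such a range query could be issued against the schema of Table 7.1, the snippet below builds a statement that delegates the exact viewable-scene test to a UDF. The table name, the UDF name (scene_overlaps_rect), and the column names are illustrative placeholders rather than the exact identifiers of our implementation, and executing the statement would require a live MySQL connection.

```python
# Hypothetical range query: find FOVScenes whose viewable scene overlaps the
# rectangle (min_lat, min_lng, max_lat, max_lng). scene_overlaps_rect stands in
# for one of the special-purpose UDFs described above.
RANGE_QUERY = """
SELECT filename, timecode
FROM fovscene
WHERE scene_overlaps_rect(Plat, Plng, theta, R, alpha, %s, %s, %s, %s)
ORDER BY filename, timecode
"""

def build_range_query(min_lat, min_lng, max_lat, max_lng):
    """Return the SQL string and its parameters for a rectangular range query."""
    return RANGE_QUERY, (min_lat, min_lng, max_lat, max_lng)

sql, params = build_range_query(46.725, -117.015, 46.735, -117.000)
print(sql.strip(), params)
```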
The system architecture is flexible, so that we can enhance the search mechanism and add support for other query types in the future. The video search algorithms are explained extensively in Chapter 5.

7.2.2 Web User Interface

A map-based query interface allows users to draw the query region visually. The result of a query contains a list of the overlapping video segments. For each returned video segment, we display the corresponding FOVScenes on the map; to reduce clutter, we draw the FOVScenes every 2 seconds. The user can browse through the resulting video segments and interactively play the videos. Note that the video server streams precisely the video section that overlaps the query region, not the complete video file. During video playback, the FOVScene whose timecode is closest to the current video frame is highlighted on the map. Each FOVScene is associated with a video frame timecode, which ensures tight synchronization between the video playback and the FOVScene visualization. A sample screenshot of the web interface is shown in Figure 7.1.

Figure 7.1: Georeferenced video search engine web interface

We implemented the web interface using JavaScript and the Google Maps API [gooa]. Ajax techniques were used to send the query window to the MySQL database and to retrieve the query results; with Ajax, web applications can obtain data from the server asynchronously in the background without interfering with the display. The communication with the MySQL database and the UDFs was provided via PHP. For video playback we used the Flowplayer [flo], an open source Flash media player. The video files were transcoded into the H.264 format. Note that our search engine implementation is platform independent; we successfully deployed it on both Linux and Windows servers.

7.3 Functionality Illustration

In Section 5.3.2 we introduced some of the spatial query types that can be applied in georeferenced video search to express application-specific search criteria. In this section we provide examples of directional and bounded distance queries, illustrated through screenshots from our web-based video search interface.

Query with Bounded Distance: Figure 7.2 illustrates the results of a bounded distance query on our real-world video data. We searched for the video segments that show the Pizza Hut building in their scenes. The query returns 12 video segments (120 s of video in total). Two of the resulting video segments are shown in Figures 7.2(a) and 7.2(b). The Pizza Hut building appears very small (and is difficult for humans to recognize) in the second figure, since it was located far from the camera, whereas the same building is easily recognizable in the first figure, where the camera was closer to the object. We can effectively exclude the video segment shown in Figure 7.2(b) by using an appropriate bounded distance value (e.g., 100 m) in the query. The camera viewable scenes of the video segments are illustrated on the map, and the viewable scenes that correspond to the current video frames are highlighted in both images.

Figure 7.2: Impacts of the bounding distance in video search; the viewable scenes of the current video frames are highlighted on the map. (a) The object is close to the camera (30 m away) and can therefore be easily recognized in the video. (b) The object is far from the camera (130 m away) and therefore appears very small in the video.
Directional Query: In Figure 7.3 we illustrate an example of a directional query. We would like to retrieve the video segments that overlap with the given query region (the University of Idaho Kibbie Dome in the scenes) while the camera was pointing in the North direction. Figure 7.3(a) shows the video segments returned by the range query without the notion of directionality, and Figure 7.3(b) shows the results of the directional range query with input direction 0 degrees (i.e., North). Without the direction condition, the Kibbie Dome query returns a total of 250 s of video, whereas the directional query returns only 65 s of video. As shown in Figure 7.3(b), the directional query precisely returns the related video segments and eliminates the unwanted videos and video sections.

Figure 7.3: Illustration of directional range query results; the viewable scenes of the current video frames are highlighted on the map. (a) Search results for the range query (no direction specified). (b) Search results for the directional range query (viewing direction 0 degrees).

For a specific application, the bounded distance query, the directional query, or a combination of the two can be used to effectively retrieve the related video segments. Based on the application's requirements, the query location can be specified as a point or as a range. In these scenarios, our search mechanism effectively and efficiently reduces the amount of data returned to the user and therefore minimizes the user's browsing time.

7.4 Summary

We have implemented a web-based video search engine – GRVS – to query a database of georeferenced videos. Using GRVS, users can search for the videos that capture a particular region of interest. Our novel search technique, which is based on our viewable scene model, ensures highly accurate search results. The map-based interface, enhanced with visual features, provides the user with a clear understanding of the geo-location seen in the video. The demonstration system is available online at http://eiger.ddns.comp.nus.edu.sg/geo/.

Chapter 8

Summary

In this dissertation, we proposed the use of the geographic properties of videos as a means to manage large-scale video content. We proposed to utilize georeferenced meta-data, automatically acquired from attached sensors, to describe the coverage areas of mobile video scenes as spatial objects, so that large video collections can be organized, indexed and searched effectively. We can summarize our contributions in this dissertation as follows:

First, we presented a framework for the management of georeferenced videos, which might serve as a test-bed for various video search applications. The proposed framework consists of three main parts: the acquisition of georeferenced meta-data and video streams, the indexing and searching of both meta-data and video contents, and the presentation of search results. The chapters of this dissertation provide efficient solutions for the major issues highlighted in the framework description.

Second, we introduced a methodology for the automatic annotation of video clips with a collection of meta-data such as the camera location, viewing direction, and field-of-view. Such meta-data can provide a comprehensive model to describe the scene a camera captures. We put forward a viewable scene model which describes video scenes as spatial objects. We introduced our prototype video recording system, which demonstrates the feasibility of the acquisition and automatic annotation of georeferenced videos based on the proposed viewable scene model. We collected a sufficiently large set of georeferenced videos using our prototype system.
Third, we proposed novel approaches for the efficient indexing, searching and retrieval of video clips based on the proposed viewable scene model. We rigorously explained the search process and provided the algorithms that search the video scenes as spatial objects. We evaluated the effectiveness and accuracy of the scene-based search using our real-world video dataset. We also introduced a vector-based approximation model of a video's viewable scene area, which can be utilized in a filter step to avoid the costly overlap computations in searching viewable scenes. Unlike other spatial approximation models, such as the MBR approximation, the vector model retains the exact camera location and direction information, which is beneficial for answering queries that ask for specific viewing directions or distances.

Fourth, we investigated the challenging and important problem of ranking video search results based on the videos' spatial and temporal properties. We introduced three ranking algorithms that consider the spatial, temporal and combined spatio-temporal properties of georeferenced video clips. To improve the efficiency of our approach, we also proposed a histogram-based approach that dramatically improves the query response time by moving costly computations to a one-time pre-processing step.

Fifth, we developed a web-based georeferenced video search engine that utilizes the viewable scene model for efficient video search. It provides a user interface where users can search for the videos that capture a particular region of interest. The map-based visual features provide the user with a clear understanding of the geo-location seen in the video.

Lastly, we developed a data generator for producing synthetic video meta-data with realistic geographical properties for mobile video management research. The generated meta-data is based on the definition of our viewable scene model. Users can control the behavior of the proposed generator using various input parameters.

Our research has shown that by using location and viewing direction information, coupled with timestamps, efficient video search systems can be developed that alleviate the "semantic gap" inherent in managing large video collections. Taking a broad perspective, we envision two main future directions for pursuing our research. First, our search system can be enhanced further by incorporating existing or emerging content-based tools. As content-based methods improve and more semantic information becomes available for describing video content, video management applications could do much more for users than they do now by leveraging content-based cues together with automatically collected sensor meta-data. We would then get closer to the ultimate solution for video search, where video annotations are generated without user input and yet the search performance is comparable to that of textual search. Second, in addition to GPS and compass, other sensor devices can be attached to cameras to collect additional meta-data that can be used to enhance the search functionality. For example, compact and portable distance sensors can be attached to cameras to estimate the distance to large objects in front of the camera; in our current viewable scene model we assume that no objects in geo-space block the camera view.

References

[AHS06] Melanie Aurnhammer, Peter Hanappe, and Luc Steels. Integrating Collaborative Tagging and Emergent Semantics for Image Retrieval. In International World Wide Web Conference (WWW), Collaborative Web Tagging Workshop, May 2006.

[AN07] Morgan Ames and Mor Naaman.
Why we Tag: Motivations for Annotation in Mobile and Online Media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), 2007.

[AZK08] Sakire Arslan Ay, Roger Zimmermann, and Seon Ho Kim. Viewable Scene Modeling for Geospatial Video Search. In Proceedings of the ACM International Conference on Multimedia (ACM MM), pages 309–318, 2008.

[BF01] Kobus Barnard and David Forsyth. Learning the Semantics of Words and Pictures. IEEE International Conference on Computer Vision, July 2001.

[BKSS90] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 1990.

[BKSS94] Thomas Brinkhoff, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. Multi-step Processing of Spatial Joins. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 1994.

[Bou] Jean-Yves Bouguet. Camera Calibration Toolbox for Matlab. http://www.vision.caltech.edu/bouguetj/calib_doc/.

[BS97] Kate Beard and Vyjayanti Sharma. Multidimensional Ranking for Data in Digital Spatial Libraries. International Journal on Digital Libraries, 1(2):153–160, 1997.

[BS02] Thomas Brinkhoff and Ofener Str. A Framework for Generating Network-Based Moving Objects. Geoinformatica, 6, 2002.

[CBHK09] David J. Crandall, Lars Backstrom, Daniel Huttenlocher, and Jon Kleinberg. Mapping the World's Photos. In Proceedings of the International World Wide Web Conference (WWW), 2009.

[Cis10] Cisco Systems, Inc. Cisco Visual Networking Index: Forecast and Methodology, 2009-2014. White Paper, 2010.

[CTH+09] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yantao Zheng. NUS-WIDE: a Real-world Web Image Database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval (CIVR), 2009.

[DBFF02] P. Duygulu, Kobus Barnard, J. F. G. de Freitas, and David A. Forsyth. Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary. In Proceedings of the European Conference on Computer Vision (ECCV), 2002.

[DBG09] Christian Düntgen, Thomas Behr, and Ralf Hartmut Güting. BerlinMOD: A Benchmark for Moving Object Databases. The VLDB Journal, 18(6):1335–1368, 2009.

[EOWZ07] Boris Epshtein, Eyal Ofek, Yonatan Wexler, and Pusheng Zhang. Hierarchical Photo Organization Using Geo-Relevance. In Proceedings of the ACM International Symposium on Advances in Geographic Information Systems (ACM GIS), pages 1–7, 2007.

[flo] Flowplayer - Flash Video Player for the Web. http://flowplayer.org.

[GBB+65] Clarence H. Graham, Neil R. Bartlett, John Lott Brown, Yun Hsia, Conrad C. Mueller, and Lorrin A. Riggs. Vision and Visual Perception. John Wiley & Sons, Inc., 1965.

[GK02] Stefan Gobel and Peter Klein. Ranking Mechanisms in Meta-data Information Systems for Geo-spatial Data. In EOGEO Technical Workshop, 2002.

[gooa] Google Maps API. http://code.google.com/apis/maps/index.html.

[goob] Google Maps Street View. http://maps.google.com/help/maps/streetview/.

[GST+06] Shantanu Gautam, Gabi Sarkis, Edwin Tjandranegara, Evan Zelkowitz, Yung-Hsiang Lu, and Edward J. Delp. Multimedia for Mobile Environment: Image Enhanced Navigation. Volume 6073, page 60730F. SPIE, 2006.

[Gut84] Antonin Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching.
In SIGMOD, Proceedings of Annual Meeting, Boston, Massachusetts, pages 47–57, 1984.

[HCJL03] Tae-Hyun Hwang, Kyoung-Ho Choi, In-Hak Joo, and Jong-Hun Lee. MPEG-7 Metadata for Video-Based GIS Applications. In Geoscience and Remote Sensing Symposium, pages 3641–3643, vol. 6, 2003.

[Hec01] Eugene Hecht. Optics. Addison-Wesley Publishing Company, 4th edition, August 2001.

[Ioa03] Yannis Ioannidis. The History of Histograms (abridged). In Proceedings of the International Conference on Very Large Databases (VLDB), 2003.

[JB08] Yushi Jing and Shumeet Baluja. VisualRank: Applying PageRank to Large-Scale Image Search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30:1877–1890, 2008.

[JK02] Kalervo Järvelin and Jaana Kekäläinen. Cumulated Gain-based Evaluation of IR Techniques. ACM Transactions on Information Systems, 20(4):422–446, 2002.

[JNTD06] Alexander Jaffe, Mor Naaman, Tamir Tassa, and Marc Davis. Generating Summaries for Large Collections of Geo-referenced Photographs. In Proceedings of the International World Wide Web Conference (WWW), pages 853–854, 2006.

[KKL+03] Kyong-Ho Kim, Sung-Soo Kim, Sung-Ho Lee, Jong-Hyun Park, and Jong-Hyun Lee. The Interactive Geographic Video. In Geoscience and Remote Sensing Symposium, pages 59–61, vol. 1, 2003.

[KN08] Lyndon S. Kennedy and Mor Naaman. Generating Diverse and Representative Image Search Results for Landmarks. In Proceedings of the International World Wide Web Conference (WWW), pages 297–306, New York, NY, USA, 2008. ACM.

[KS06] Hyunmo Kang and Ben Shneiderman. Exploring Personal Media: A Spatial Interface Supporting User-defined Semantic Regions. Journal of Visual Languages & Computing, 17(3):254–283, 2006.

[KT05] Rieko Kadobayashi and Katsumi Tanaka. 3D Viewpoint-Based Photo Search and Information Browsing. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 621–622, 2005.

[LCS05] Xiaotao Liu, Mark Corner, and Prashant Shenoy. SEVA: Sensor-Enhanced Video Annotation. In Proceedings of the ACM International Conference on Multimedia (ACM MM), pages 618–627, 2005.

[LF04] Ray R. Larson and Patricia Frontiera. Geographic Information Retrieval (GIR) Ranking Methods for Digital Libraries. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), pages 415–415, New York, NY, USA, 2004. ACM.

[LHCW07] Chih-Chieh Liu, Chun-Hsiang Huang, Wei-Ta Chu, and Ja-Ling Wu. ITEMS: Intelligent Travel Experience Management System. In Proceedings of the ACM International Workshop on Multimedia Information Retrieval (MIR), pages 291–298, 2007.

[LZLM07] Ying Liu, Dengsheng Zhang, Guojun Lu, and Wei-Ying Ma. A Survey of Content-based Image Retrieval with High-level Semantics. Pattern Recognition, 40(1):262–282, 2007.

[MCK+10] Jacqueline W. Mills, Andrew Curtis, Barrett Kennedy, S. Wright Kennedy, and Jay D. Edwards. Geospatial Video for Field Data Collection. Applied Geography, 2010. doi:10.1016/j.apgeog.2010.03.008.

[MG05] Neil J. McCurdy and William G. Griswold. A Systems Architecture for Ubiquitous Video. In Proceedings of the International Conference on Mobile Systems, Applications, and Services (MobiSys), pages 1–14, 2005.

[NSPGM04a] Mor Naaman, Yee Jiun Song, Andreas Paepcke, and Hector Garcia-Molina. Automatically Generating Metadata for Digital Photographs with Geographic Coordinates. In Proceedings of the International World Wide Web Conference (WWW), 2004.

[NSPGM04b] Mor Naaman, Yee Jiun Song, Andreas Paepcke, and Hector Garcia-Molina.
[NSPGM04b] Mor Naaman, Yee Jiun Song, Andreas Paepcke, and Hector Garcia-Molina. Automatic Organization for Digital Photographs with Geographic Coordinates. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, pages 53–62, 2004.
[NYGMP05] Mor Naaman, Ron B. Yeh, Hector Garcia-Molina, and Andreas Paepcke. Leveraging Context to Resolve Identity in Photo Albums. In Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), pages 178–187, 2005.
[Ore86] A. Orenstein. Spatial Query Processing in an Object-Oriented Database System. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 1986.
[Pav09] Theo Pavlidis. Why Meaningful Automatic Tagging of Images is Very Hard. In Proceedings of the IEEE International Conference on Multimedia & Expo (IEEE ICME), pages 1432–1435, 2009.
[PG05] A. Pigeau and M. Gelgon. Building and Tracking Hierarchical Geographical & Temporal Partitions for Image Collection Management on Mobile Devices. In Proceedings of the ACM International Conference on Multimedia (ACM MM), 2005.
[PT03] Dieter Pfoser and Yannis Theodoridis. Generating Semantics-Based Trajectories of Moving Objects. Computers, Environment and Urban Systems, 27(3):243–263, 2003.
[Rob09] Mark R. Robertson. ReelSEO. http://www.reelseo.com/youtube-search-december-2009/, 2009.
[SK00] Ben Shneiderman and Hyunmo Kang. Direct Annotation: A Drag-and-Drop Strategy for Labeling Photos. In Proceedings of the International Conference on Information Visualisation (IV), 2000.
[SMW+00] Arnold W. M. Smeulders, Marcel Worring, Simone Santini, Amarnath Gupta, and Ramesh Jain. Content-based Image Retrieval at the End of the Early Years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:1349–1380, 2000.
[SOK06] Alan F. Smeaton, Paul Over, and Wessel Kraaij. Evaluation Campaigns and TRECVid. In Proceedings of the ACM International Workshop on Multimedia Information Retrieval (MIR), pages 321–330, 2006.
[SOK09] Alan F. Smeaton, Paul Over, and Wessel Kraaij. High-Level Feature Detection from Video in TRECVid: a 5-Year Retrospective of Achievements. In Ajay Divakaran, editor, Multimedia Content Analysis, Theory and Applications, pages 151–174. Springer Verlag, 2009.
[SS08] Ian Simon and Steven M. Seitz. Scene Segmentation Using the Wisdom of Crowds. In Proceedings of the European Conference on Computer Vision (ECCV), 2008.
[TBC06] Carlo Torniai, Steve Battle, and Steve Cayzer. Sharing, Discovering and Browsing Geotagged Pictures on the Web. Springer, 2006.
[TLR03] Kentaro Toyama, Ron Logan, and Asta Roseway. Geographic Location Tags on Digital Images. In Proceedings of the ACM International Conference on Multimedia (ACM MM), pages 156–166, 2003.
[TVS96] Yannis Theodoridis, Michael Vazirgiannis, and Timos Sellis. Spatio-Temporal Indexing for Large Multimedia Applications. In IEEE International Conference on Multimedia Systems, 1996.
[U.S] U.S. Government Information and Maps Department. The Map Collection - TIGER/Line Files. http://www.lib.ucdavis.edu/govdoc/MapCollection/tiger.html.
[VT02] Remco C. Veltkamp and Mirela Tanase. Content-based Image Retrieval Systems: A Survey. Technical report, Department of Computing Science, Utrecht University, October 2002.
[Wra08] Richard Wray. Online Video Ads Put Message into the Medium. In The Guardian, 2008. URL http://www.guardian.co.uk/media/2008/dec/29/blinkx-internet-video-advertising.
[YCKH07] Akira Yanagawa, Shih-Fu Chang, Lyndon Kennedy, and Winston Hsu. Distributed Policy Management and Comprehension with Classified Advertisements. Technical Report 222-2006-8, Columbia University ADVENT, March 2007.
[YSLH03] Ka-Ping Yee, Kirsten Swearingen, Kevin Li, and Marti Hearst. Faceted Metadata for Image Search and Browsing. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), 2003.
[Zha09] HongJiang Zhang. Multimedia Content Analysis and Search: New Perspectives and Approaches. In Proceedings of the ACM International Conference on Multimedia (ACM MM), pages 1–2. ACM, 2009.
[ZKH+03] Herwig Zeiner, Gert Kienast, Michael Hausenblas, Christian Derler, and Werner Haas. Video Assisted Geographical Information Systems. Kluwer International Series in Engineering and Computer Science, (740):205–216, 2003.
[ZZS+09] Yan-Tao Zheng, Ming Zhao, Yang Song, H. Adam, U. Buddemeier, A. Bissacco, F. Brucher, Tat-Seng Chua, and H. Neven. Tour the World: Building a Web-scale Landmark Recognition Engine. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 0:1085–1092, 2009.

Appendix A
Generating Synthetic Meta-data for Georeferenced Video Management

A.1 Introduction

In this dissertation we propose the use of geographical properties of video scenes as an effective means to aid in the search of large video archives. Subsequently, we introduced a camera prototype to capture videos along with location and direction meta-data, automatically annotated from sensors. We collected many hours of georeferenced videos using our recording system for the evaluation of our mobile video management techniques. Some other studies in the multimedia community have generated their own georeferenced meta-data [LCS05, KKL+03, EOWZ07]. Such real-world datasets are essential for the evaluation and comparison of various mobile video management techniques. However, these datasets were all generated within controlled environments, so the capture behavior reflects the subjective choices of the camera operators. Moreover, due to the highly labor-intensive collection process, the amount of georeferenced real-world video data available has not been large enough to evaluate realistic application scenarios. To enable comprehensive performance evaluations at a large scale, much bigger datasets are required. Consequently, one important and urgent requirement to facilitate future research activities is the availability of well-described georeferenced video meta-data, enabling a systematic and extensive comparison of mobile video management techniques. However, collecting real-world data requires a considerable amount of time and effort. While there is a need for a collaborative effort to build a large repository of georeferenced videos, a complementary solution to fulfill the urgent requirement of the research community is to synthetically generate georeferenced video meta-data.

In this chapter, we propose an approach for generating synthetic video meta-data with realistic geographical properties for mobile video management research. We provide our observations about the behavior of mobile cameras and describe how to emulate such behavior when generating synthetic data. Note that the generated meta-data is based on the definition of our viewable scene model. Users can control the behavior of the proposed generator through various input parameters. Our contributions in this chapter include:

• Identification and classification of the requirements for generating synthetic georeferenced video meta-data.
• Design of algorithms to synthetically generate practical video meta-data with requested geo-spatial properties, in particular the generation of video meta-data with realistic camera movements and camera rotations.

• Customization of the generation process based on user parameters for broader applications.

• Comparison and evaluation of the characteristics of the real-world and synthetic video meta-data.

A.2 Synthetic Video Meta-data Generation Requirements

We next review the requirements for the generation of georeferenced video meta-data. We specifically investigate the behavior of mobile cameras and how to simulate such behaviors in the synthetic data. Our goal is not to provide a case-study generator that fulfills the requirements of one specific application. Instead, we present a customizable system which the user can configure according to the required meta-data properties for more general applications. The highlighted requirements are based on our analysis of the collected real-world data.

Table A.1: Summary of camera template specification.
  Trajectory pattern: the type of trajectory that the camera follows.
  Rotation pattern: the property that describes the behavior of the camera rotation.
  Visual angle: the visual angle of the camera field-of-view.
  Visibility range: the visible distance at which a large object within the camera's field-of-view can still be recognized.

In an actual video capture, the behavior of a camera (i.e., its movements and rotations) depends on the occasion and purpose of the video recording. Such camera behavior can be described by the pattern of movements and rotations, which we refer to as a camera template in this chapter.

A camera template overview is given in Table A.1. In the template definition, the trajectory pattern property identifies the movement pattern of the camera trajectory. Example patterns include network-based, when a camera moves along a predefined network such as roads; free movement, when a camera has no restriction in moving directions; or a combination of the two. The rotational behavior is described by the rotation pattern property. If camera rotation is not allowed, this property is set to fixed; if the camera can be freely rotated (e.g., by a human user), this property is set to random. The visual angle and visibility range correspond to the parameters θ and R in the scene model, respectively. For a particular camera, we assume that all georeferenced meta-data streams have identical θ and R values. Consequently, in generating synthetic meta-data, the generator produces meta-data only for the camera trajectory and rotation. Table A.2 provides the detailed list of the camera template parameters, their grouping, and their corresponding notations. Below we provide three typical example templates for mobile videos (a sketch of how such templates could be encoded as a data structure is given after Table A.2):

1. Vehicle camera. The camera is mounted on a vehicle and moves along road networks. Examples include cameras equipped on city buses, police cars or ambulances. The trajectory pattern is network-based. The movement speed is assumed mostly fast and steady. The camera heading with respect to the direction of vehicle movement is fixed (i.e., the rotation pattern is fixed). Thus, the camera direction changes only when the vehicle makes turns.

2. Passenger camera. A passenger traveling in a vehicle operates the camera. Examples include hand-held cameras used to capture scenery, buildings, landmarks, etc. The camera trajectory pattern is set to network-based.
The camera bearing changes when the moving direction changes and when the user rotates the camera, i.e., the rotation pattern is specified as random.

3. Pedestrian camera. A walking pedestrian holds the camera. Examples include hand-held cameras used to capture tourist attractions, landmarks, and special events. The camera mostly follows a random trajectory, analogous to a walking path. However, it might also follow a predefined network, such as sidewalks or hiking trails. The trajectory pattern is a combination of network-based and free movement. The camera can be rotated freely, therefore the rotation pattern is set to random.

Table A.2: Details of camera template parameters.
  Camera template:
    Trajectory pattern: 1) T_network(V_m^road[], P_e, t_s, n, p, A), 2) T_free(V_m^walk, U), 3) T_mixed(V_m^walk, P_e, t_s, D_rand, N_rand)
    Visual angle: θ
    Visibility range: R
    Rotation pattern: 1) fixed, 2) random(offset, ω_max, α_max)
  Parameters:
    V_m^road[]: array of speed limits (in km/h) for the different road types on the road network.
    V_m^walk: maximum walking speed (in km/h).
    P_e: the probability of having a stop event.
    t_s: the duration of a stop event (in s).
    n, p: parameters of the Binomial distribution used in modeling the camera deceleration.
    A: camera acceleration rate (in m/s^2).
    U: distribution of initial camera locations.
    D_rand, N_rand: distribution and percentage of the random movement sections in T_mixed.
    offset: the angle offset (in degrees) with respect to the moving direction.
    ω_max: the maximum allowed rotation per time unit (in degrees/s).
    α_max: the maximum allowed rotation (in degrees) with respect to offset (i.e., the camera direction is limited to the range offset ∓ α_max).
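To make the template structure concrete, the following sketch shows one possible way to encode Table A.2 as a data structure, together with the three example templates above. The class and field names, as well as the placeholder θ and R values, are illustrative assumptions and not part of the generator's actual interface; the numeric defaults (speed limits, P_e, t_s, n, p, A, ω_max) are taken from the values used later in this appendix.

from dataclasses import dataclass, field
from typing import List

# Illustrative encoding of the camera template of Table A.2 (names are ours).
@dataclass
class RotationSpec:
    pattern: str = "fixed"        # "fixed" or "random"
    offset: float = 0.0           # initial bearing offset w.r.t. moving direction (degrees)
    omega_max: float = 30.0       # maximum rotation per time unit (degrees/s)
    alpha_max: float = 180.0      # maximum rotation around offset (degrees); placeholder value

@dataclass
class CameraTemplate:
    trajectory_pattern: str       # "T_network", "T_free", or "T_mixed"
    theta: float                  # visual angle of the field-of-view (degrees); placeholder value
    R: float                      # visibility range (meters); placeholder value
    rotation: RotationSpec = field(default_factory=RotationSpec)
    v_road_max: List[float] = field(default_factory=lambda: [88.0, 56.0, 40.0])  # km/h per road class
    v_walk_max: float = 5.43      # km/h
    p_stop: float = 0.3           # probability of a stop event (P_e)
    t_stop: float = 3.0           # stop duration t_s (s)
    n: int = 20                   # Binomial parameters for the deceleration model
    p: float = 0.5
    accel: float = 12.0           # acceleration rate A (m/s^2)

# The three example templates of Section A.2, expressed with the sketch above.
vehicle_cam    = CameraTemplate("T_network", theta=60.0, R=250.0, rotation=RotationSpec("fixed"))
passenger_cam  = CameraTemplate("T_network", theta=60.0, R=250.0,
                                rotation=RotationSpec("random", omega_max=60.0, alpha_max=90.0))
pedestrian_cam = CameraTemplate("T_mixed", theta=60.0, R=250.0, rotation=RotationSpec("random"))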
A.2.1 Camera Movement Requirements

For the camera templates described above, there are three major behavioral patterns (i.e., trajectory patterns) for the camera movement:

1. T_network: moving on a given road network.
2. T_free: moving on an unrestricted path.
3. T_mixed: partly following a network and partly moving on a random path.

In order to realistically simulate the movements, some other practical properties are required in addition to the trajectory patterns, such as: 1) the maximum speed (V_m^road[], V_m^walk), 2) the frequency and duration of stops (P_e, t_s), and 3) the camera acceleration and deceleration (n, p, A). Table A.2 lists the camera template parameters required to configure the camera movements of the trajectory types.

A mobile camera is an example of a moving object whose positions are sampled at points in time. In the field of spatio-temporal databases, there exists an extensive collection of research on generating synthetic moving object test data for evaluating spatio-temporal access methods. We adopt and revise some well-known moving object data generation techniques for the proposed synthetic data generation. Details on the generation of camera movements are described in Section A.3.1.

A.2.2 Camera Rotation Requirements

There can be certain restrictions for a practical simulation of rotational behaviors, even for freely rotating camera directions. These restrictions are enforced by assigning appropriate values to certain camera template parameters (see Table A.2). For example, the offset parameter defines the initial bearing of the camera. If the camera direction is fixed, then the camera direction is analogous to the moving direction, maintaining the angle offset with respect to the moving direction. Often, to ensure visual quality in the captured videos, the rotation speed is limited to a certain threshold. The ω_max parameter defines the maximum allowable rotation per time unit. For the random rotation pattern, the user might want to restrict the maximum rotation with respect to the offset. The α_max parameter defines the maximum allowed rotation around the offset.

In collecting real-world georeferenced meta-data, the camera location and direction values are acquired from two independent sensors. However, based on our analysis of the real data, we observed that the movement and rotation of a camera are correlated. For example, cameras mostly point towards the movement direction (with some varying offset). In addition, when the camera trajectory makes a turn at a corner or a curve, the camera direction also changes. For fast moving cameras (e.g., a camera mounted on a car) there is a close correlation between the movement speed and the amount of rotation. Such relationships are caused by human behavior in controlling the camera rotation. The faster the camera moves the less it rotates, mainly because human operators try to avoid fast horizontal panning so as to cautiously maintain the quality of the videos. Figure A.1 illustrates this observation based on the real-world data. The average rotation of a camera linearly decreases as its movement speed increases.

Figure A.1: Analysis of real-world data: average rotation (degrees/s) as a function of the camera speed (km/h).

The aforementioned properties surely cannot cover all possible aspects of mobile video recording. The characteristics of georeferenced video data may vary widely depending on when, where, by whom, and for what purpose the video is captured. Users may want to create and customize new camera templates by defining different movement and rotation specifications. Hence, a synthetic data generator needs to be able to fulfill these requirements. By giving users the flexibility to customize the camera behavior, our data generator can provide synthetic data applicable to a wide range of application scenarios.
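As a concrete illustration of the rotation restrictions of Section A.2.2, the sketch below clamps a proposed camera direction so that neither the per-step limit ω_max nor the offset ∓ α_max band around the moving direction is violated. The function and helper names are ours, and the two constraints are simply applied in sequence; this is an illustration of the idea, not the generator's implementation.

def _wrap(angle: float) -> float:
    """Normalize an angle to the range (-180, 180] degrees."""
    angle = angle % 360.0
    return angle - 360.0 if angle > 180.0 else angle

def clamp_direction(prev_dir: float, proposed_dir: float, moving_dir: float,
                    offset: float, omega_max: float, alpha_max: float) -> float:
    """Illustrative enforcement of the A.2.2 rotation constraints (names are assumptions).

    prev_dir / proposed_dir: camera directions (degrees from North) at consecutive samples.
    moving_dir: current moving direction; offset, omega_max, alpha_max as in Table A.2.
    """
    # 1) Limit the rotation per time unit to omega_max.
    step = _wrap(proposed_dir - prev_dir)
    step = max(-omega_max, min(omega_max, step))
    new_dir = prev_dir + step
    # 2) Keep the direction within offset -/+ alpha_max around the moving direction.
    deviation = _wrap(new_dir - (moving_dir + offset))
    deviation = max(-alpha_max, min(alpha_max, deviation))
    return (moving_dir + offset + deviation) % 360.0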
A.3 Video Meta-data Generation

Our methodology for generating synthetic georeferenced meta-data is based on the observations stated in Section A.2. We generate the synthetic data in two major steps (see Figure A.2): first we create the camera movement trajectories, and then in the second step we estimate the camera direction for each location on the movement path.

Figure A.2: Generator architecture.

A.3.1 Generating Camera Movement

To create the camera trajectories, we adopted two widely accepted spatio-temporal data generators for moving objects: the Brinkhoff generator [BS02] and the GSTD generator [PT03]. The former is used to generate the trajectories of a camera that moves on a road network (T_network), while the latter is used to simulate the trajectories of a pedestrian (T_free). The basic behaviors of the generators are controlled by a set of parameters; however, the georeferenced video management research required additional, more sophisticated changes. In order to ensure that the camera trajectories generated by the Brinkhoff and GSTD algorithms have the required specifications, we modified both software packages and added some features.

A.3.1.1 Network-based Camera Movement (T_network)

We generate network-based trajectories using the Brinkhoff algorithm with the required maximum speeds (V_m^road[]), given a camera template. The array V_m^road[] defines the maximum allowed speeds for traveling on freeways, main roads and side roads. The Brinkhoff generator adjusts the speed of a moving object based on the maximum speed allowed on the current edge (i.e., road) as well as the load on that edge (the current number of objects using it). The Brinkhoff generator uses TIGER/Line files [U.S] to build the road network. The routes that the moving objects follow are determined by selecting the best network edges between the starting and destination points. The objects always try to move at the maximum possible speed and their movement is continuous unless there is congestion on the current edge. The Brinkhoff algorithm does not introduce any stops at intersections (e.g., at traffic lights and stop signs). In addition, there is no support for modeling the acceleration and deceleration of the objects. Thus, we modified the original algorithm and inserted stops and decelerations at some road crossings and transitions. The probability of having a stop event at an intersection is given by P_e. In a deceleration event, at every time unit the camera's current speed υ is reduced to υ × B(n,p)/n, where B(n,p) is a binomially distributed random variable (n is the number of experiments and p is the probability of a positive outcome). In a typical run, we use n=20 and p=0.5, which results in an expected 50% speed reduction at each time unit [DBG09]. In a stop event, the camera decelerates and reduces its speed to 0 km/h. After a full stop, it stays steady for a duration of t_s seconds. An acceleration event occurs when the available speed limit on the current edge increases or when the camera starts to move after a stop event. The cameras accelerate with a constant acceleration rate of A m/s^2 [DBG09]. All V_m^road[], P_e, t_s, n, p, and A values are obtained from the camera template definition. The output for a T_network trajectory includes the timestamped camera location updates with a sampling rate of f_sample. The generator creates N trajectories with a maximum trajectory length of L; the N and L values are input parameters to the generator.
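The sketch below illustrates the speed modification just described: a binomial deceleration toward a full stop, a pause of t_s seconds, and constant-rate re-acceleration. The function and parameter names are ours and the defaults mirror the values quoted above (n=20, p=0.5, t_s=3 s, A=12 m/s^2); this is not the modified Brinkhoff code itself.

import random

def stop_event_speed_profile(v0_kmh: float, n: int = 20, p: float = 0.5,
                             t_stop: float = 3.0, accel: float = 12.0,
                             v_resume_kmh: float = 40.0) -> list:
    """Return a per-second speed profile (km/h) around one stop event.

    Sketch of the A.3.1.1 deceleration/stop/acceleration behavior under assumed defaults.
    """
    profile, v = [], v0_kmh
    # Deceleration: each time unit the speed is scaled by B(n, p) / n.
    while v > 1.0:                                        # treat < 1 km/h as stopped
        b = sum(random.random() < p for _ in range(n))    # draw a Binomial(n, p) sample
        v = v * b / n
        profile.append(v)
    # Full stop for t_stop seconds.
    profile.extend([0.0] * int(t_stop))
    # Constant acceleration (A in m/s^2, converted to km/h per second) until resuming.
    v = 0.0
    while v < v_resume_kmh:
        v = min(v_resume_kmh, v + accel * 3.6)
        profile.append(v)
    return profile

# Example: speed trace of a camera that approaches an intersection at 40 km/h.
print(stop_event_speed_profile(40.0))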
A.3.1.2 Unconstrained Camera Movement (T_free)

We use the GSTD algorithm to generate the camera trajectories with unconstrained movement. GSTD was one of the first generators for creating moving object data. The behavior of the movements can be customized by adjusting the domain of displacements Δc[], i.e., the minimum and maximum possible movements on the x and y axes. The object movement can either be random or partly restricted on one axis (i.e., the objects are forced to move towards a particular direction). To ensure a certain speed level for the camera trajectories, we modified the algorithm and added a speed control mechanism. The user provides the desired maximum speed (V_m^walk, in km/h) in the camera template specification. The generator then calculates the object speed at every location point on the trajectory and ensures that, overall, the user's speed requirements are fulfilled. To simulate the movement of a pedestrian, the maximum walking speed is set to 5.43 km/h. The initial locations of the cameras are calculated based on the distribution function U, which is given in the camera template. It is important to note that GSTD represents object locations as relative distances with respect to the mid-point of the region. However, our georeferenced video model [AZK08] requires the location meta-data in the geographic coordinate system (i.e., as latitude/longitude coordinates). Therefore, we also modified the trajectory representation of GSTD and generated object movements with latitude/longitude coordinates. Such a representation enables the monitoring of the object speed, given that the speed requirements are defined in common metric units (i.e., km/h). In addition, unlike the original GSTD algorithm, we report location updates in the trajectory periodically with a parameterized sampling rate (f_sample).

A.3.1.3 Mixed Camera Movement (T_mixed)

The trajectories in this category are combinations of sub-trajectories from T_network and T_free. Cameras sometimes follow the network and sometimes move randomly on an unconstrained path. Algorithm 5 formalizes the trajectory generation of T_mixed. We first generate a T_network trajectory (line 2), then we randomly select n point pairs S = {<p_i,1, p_i,2>, (1 ≤ i ≤ n)} on the trajectory, where each point pair S_i represents a segment on the trajectory. The duration of each segment (|S_i|) can vary between 0 and |T_init|/4, where |T_init| is the length of the original T_network trajectory. The function GetLineSegments() randomly selects n subsegments in T_init such that N_rand = (Σ_{i=1..n} |S_i|) / |T_init|. Recall that N_rand is the camera template argument that specifies the percentage of random movement in the generated T_mixed. The distribution of the segments on the trajectory is given by the camera template argument D_rand. It can be either uniform, Gaussian (around the middle), or skewed (towards the beginning or end of the trajectory). Each of these segments (S(i)) is removed from the trajectory and replaced by a new sub-trajectory T_rand(i) with random movement that starts at p_i,1 and ends at p_i,2 (lines 4-7). The function CreateRandTraj() implements a restricted version of GSTD, where the starting and end points are pre-defined. Note that if p_i,1 = p_i,2, nothing is removed. One important issue in creating a T_mixed trajectory was merging the trajectories from T_network and T_free together. We made sure that the trajectories from both groups have the same output format. Also, it is not necessarily true that the removed and inserted segments are of equal time length (i.e., |S(i)| = |T_rand(i)|). Therefore, there might be overlaps and/or gaps in the timestamps of the combined trajectories. Algorithm 5 revises the timestamps in T_mixed (function AdjustTimestamps()) and makes sure that the timing within the final trajectory is consistent.
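Algorithm 5 relies on GetLineSegments() to pick the sub-segments that are replaced by free movement. The following sketch shows one way such a selection could work for a uniform D_rand; the sampling strategy, names, and disjointness handling are our assumptions rather than the exact implementation.

import random

def get_line_segments(traj_len: int, n_segments: int, n_rand: float):
    """Pick up to n_segments disjoint index ranges covering roughly n_rand of the trajectory.

    traj_len: number of samples in T_init; n_rand: target fraction of random movement.
    Returns a sorted list of (start, end) index pairs. A uniform D_rand is assumed.
    """
    target = int(n_rand * traj_len)
    max_len = traj_len // 4                  # |S_i| is bounded by |T_init| / 4
    segments, covered, attempts = [], 0, 0
    while covered < target and len(segments) < n_segments and attempts < 1000:
        attempts += 1
        length = random.randint(0, min(max_len, target - covered))
        start = random.randint(0, traj_len - length - 1)
        end = start + length
        # Keep segments disjoint so each replacement sub-trajectory is well defined.
        if any(not (end < s or start > e) for s, e in segments):
            continue
        segments.append((start, end))
        covered += length
    return sorted(segments)

# Example: choose up to 5 segments covering roughly 30% of a 600-sample trajectory.
print(get_line_segments(600, 5, 0.30))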
Algorithm 5 MixedMovement(netT_arg, randT_arg, CAM)
1  CAM ← camera template specification, netT_arg ← arguments for creating T_network, randT_arg ← arguments for creating T_free, Returns: T_mixed trajectory
2  T_init = CreateNetTraj(netT_arg, CAM)  {Generate a T_network trajectory}
3  S[] = GetLineSegments(|T_init|, CAM);
4  for all S(i) = <p_i,1, p_i,2>, (1 ≤ i ≤ n) do
5    T_rand(i) = CreateRandTraj(randT_arg, p_i,1, p_i,2);  {Create a T_free trajectory starting at p_i,1 and ending at p_i,2}
6    T_mixed = T_init.Replace(p_i,1, p_i,2, T_rand(i));
7  end for
8  return AdjustTimestamps(T_mixed);

Algorithm 6 M_k = AssignDirections(T_k, CAM)
1  T_k ← the trajectory array for V_k, CAM ← camera template specification, Returns: M_k ← the meta-data array for V_k
2  Copy T_k to M_k;  {Initially set the direction for each data point to "moving direction + CAM.offset"}
3  for i = 1 to |T_k| do
4    M_k(i).dir = GetMovingDirection(T_k, i, CAM);
5  end for
6  AdjustRotations(M_k, CAM);
7  if CAM.rotation_pattern is "random" then
8    RandomizeDirections(M_k, CAM);
9  end if
10 return M_k;

A.3.2 Generating Camera Rotation

An important difference between traditional moving object data generators and our work is the computation of the camera direction. Assigning meaningful camera direction angles to the location points along the trajectory is one of the novel features of the proposed data generator. The assignment of direction angles is customized following the specifications provided by users (see Table A.2).

A.3.2.1 Generating Camera Direction Angles

As shown in Figure A.2, after the generator creates the camera trajectories, it assigns appropriate direction angles to all sampled points on the trajectories. Algorithm 6 formalizes the assignment of camera directions. The camera movement trajectory T_k and the camera template specification CAM are given to the algorithm as input. First, the algorithm assigns the camera direction to be the moving direction and stores the camera trajectory with moving directions in array M_k. Next, it adjusts the rotations that exceed the CAM.ω_max threshold (lines 3-6). If the camera can be freely rotated (i.e., random rotation), the algorithm assigns an appropriate random direction angle to each point in M_k (function RandomizeDirections()). The amount of rotation is computed with regard to the current camera speed and the specifications defined in CAM.

A.3.2.2 Calculating Moving Direction

The moving direction of the camera is estimated as the direction vector between two consecutive trajectory points. Given a camera trajectory T_k, the moving direction at sample point t is given by the vector m = T_k(t-1)T_k(t), pointing from T_k(t-1) to T_k(t). The camera direction angle (with respect to North) at sample point t can then be obtained as:

T_k(t).dir = cos^-1( m.y / |m| ) + CAM.offset
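For reference, a minimal sketch of this computation on latitude/longitude samples is shown below. It uses the standard atan2 form of the compass bearing, which covers the full 0-360 degree range (the arccos expression above captures only the north/south component of the heading); the helper name and example coordinates are ours.

import math

def moving_direction(lat1: float, lon1: float, lat2: float, lon2: float,
                     offset: float = 0.0) -> float:
    """Compass bearing (degrees from North) from sample t-1 to sample t, plus CAM.offset.

    A sketch of the A.3.2.2 computation using the great-circle bearing formula.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    bearing = math.degrees(math.atan2(x, y))        # 0 degrees = North, clockwise positive
    return (bearing + offset) % 360.0

# Example: direction between two consecutive GPS samples (due North movement).
print(moving_direction(46.7324, -117.0002, 46.7330, -117.0002))   # approximately 0 degrees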
In Algorithm 6, the function GetMovingDirection() calculates the moving direction angle at all trajectory points and initializes the camera direction to the moving direction plus the given offset (CAM.offset). Next, the function AdjustRotations() checks whether the rotations due to changes in the moving direction stay below the required threshold CAM.ω_max. Synthetically generated camera trajectories often have sharp turns, which result in abrupt rotations of the camera moving direction; this is unusual in reality. Our observation of real-world data shows that the change in the moving direction is much smoother. As an example, for vehicle and passenger cameras, when the vehicles turn at corners and steep curves on the road network, the moving direction changes gradually and it takes several seconds to complete the turn.

Figure A.3: Illustration of camera direction adjustment for vehicle cameras. (a) Real-world data, (b) synthetic data before direction adjustment, and (c) synthetic data after direction adjustment.

Figure A.3a shows an example trajectory from the real data where the car makes a left turn. The change in speed is illustrated with different color codes, i.e., red points mark the slowest and blue points the fastest speeds. The arrows show the moving directions. Figure A.3b illustrates a trajectory synthetically generated for the same road intersection. When the point moving direction is estimated from the movement vector, the rotation angle within a single time unit can be as large as a right angle (the rotation at the corner point is ≥ 90° in Figure A.3b). In order to simulate the direction change during a turn, the generator distributes the total rotation amount among several trajectory points before and after the corner point. As seen in Figure A.3c, with this method the moving direction changes gradually, and the rotation between two consecutive points is guaranteed to be under CAM.ω_max. The camera locations are also updated based on the adjusted moving directions.

Algorithm 7 AdjustRotations(M_k, CAM)
1  M_k ← the meta-data array for V_k, CAM ← camera template specification, Returns: M_k with revised directions
2  for i = 2 to |M_k| do
3    {If the difference between two consecutive angles is greater than CAM.ω_max}
4    if AngleDiff(M_k(i).dir, M_k(i-1).dir) > CAM.ω_max then
5      Δ_fwd = 1; Δ_bwd = 0;  {Scan backwards and forwards to find two data points such that the rotation between them is possible without exceeding CAM.ω_max}
6      repeat
7        dir_change = AngleDiff(M_k(i+Δ_fwd).dir, M_k(i-Δ_bwd).dir)
8        if Δ_fwd > Δ_bwd then
9          increase(Δ_bwd);
10       else
11         increase(Δ_fwd)
12       end if
13     until (dir_change ≤ ω_max * (Δ_fwd + Δ_bwd));
14     {Adjust all directions between (i-Δ_bwd) and (i+Δ_fwd) such that the rotation amount does not exceed CAM.ω_max and the direction stays within CAM.offset ∓ α_max}
15     for j = i-Δ_bwd to (i+Δ_fwd) do
16       M_k(j).dir = InterpolateAngle(j, M_k(i-Δ_bwd).dir, M_k(i+Δ_fwd).dir, CAM.ω_max, CAM.α_max)
17     end for
18   end if
19 end for
20 return M_k;

The function AdjustRotations() checks the rotation angle at each sample point and smoothes sharp turns by interpolating the rotation angle among several neighboring sample points. Algorithm 7 formalizes the adjustment of rotations. The meta-data array M_k with moving directions is given to the algorithm as input. The algorithm checks the rotation amounts between all consecutive trajectory points in M_k. If the rotation amount exceeds CAM.ω_max at point M_k(i), then the algorithm scans the trajectory backwards and forwards until it finds two trajectory points M_k(i-Δ_bwd) and M_k(i+Δ_fwd) such that the camera can be safely rotated from M_k(i-Δ_bwd) to M_k(i+Δ_fwd) without violating the CAM requirements. When such points are found, the directions between points M_k(i-Δ_bwd) and M_k(i+Δ_fwd) are interpolated (line 16). Given a sample point j, the function InterpolateAngle() returns the j-th interpolated angle, i.e.,

M_k(i-Δ_bwd).dir + j * ( (M_k(i+Δ_fwd).dir - M_k(i-Δ_bwd).dir) / (Δ_bwd + Δ_fwd) )

It guarantees that the returned direction is between CAM.offset - CAM.α_max and CAM.offset + CAM.α_max.
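A minimal sketch of such an interpolation step is given below. It linearly spreads the turn across the window found by Algorithm 7 and clips each result to the offset ∓ α_max band; it illustrates the idea and is not the generator's InterpolateAngle() itself.

def interpolate_angles(dir_start: float, dir_end: float, window: int,
                       offset: float, alpha_max: float) -> list:
    """Spread the rotation from dir_start to dir_end over `window` steps.

    Returns the interpolated directions (degrees), clipped to offset -/+ alpha_max.
    Sketch only; assumes the angles are already unwrapped (no 359 -> 0 jump).
    """
    step = (dir_end - dir_start) / window
    directions = []
    for j in range(1, window + 1):
        d = dir_start + j * step
        # Keep the interpolated direction within the allowed band around offset.
        d = max(offset - alpha_max, min(offset + alpha_max, d))
        directions.append(d % 360.0)
    return directions

# Example: a 90-degree turn smoothed over 6 samples (15 degrees per sample).
print(interpolate_angles(0.0, 90.0, 6, offset=0.0, alpha_max=180.0))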
A.3.2.3 Assigning Random Direction Angles

In Algorithm 6, if the camera's bearing with respect to its moving direction is fixed, the function returns the meta-data array with the adjusted moving directions. However, if camera rotation is allowed, then the algorithm randomly rotates the directions at each sample point towards the left or right. Given the initial camera directions based on the moving direction, the function RandomizeDirections() updates the direction angles to simulate the rotation of the camera by a user. Algorithm 8 details the randomization of the directions. It generates a random rotation amount which is inversely proportional to the current speed level (line 4): the faster a camera moves, the less it rotates. In addition, the algorithm makes sure that the rotation amount is less than CAM.ω_max and that the assigned camera direction angles are between CAM.offset - CAM.α_max and CAM.offset + CAM.α_max (line 5). The algorithm returns the meta-data array with the updated directions.

Algorithm 8 RandomizeDirections(M_k, CAM)
1  M_k ← the meta-data array for V_k, CAM ← camera template specification provided by the user, Returns: M_k with revised directions
2  for i = 2 to |M_k| do
3    repeat
4      rotate = (getRandNum() * getRandSign()) ÷ M_k(i).speed;
5    until (|rotate| ≤ CAM.ω_max) and (CAM.offset - CAM.α_max ≤ M_k(i).dir + rotate ≤ CAM.offset + CAM.α_max)
6    M_k(i).dir = M_k(i).dir + rotate;
7  end for
8  return M_k;

A.3.3 Creating Meta-data Output

Our generator can either create the georeferenced meta-data output in text format or directly insert the meta-data tuples into a MySQL database. The output is the list of the sampled camera location (P) and direction (d) updates. Following the input sampling rate (f_sample), each meta-data item is assigned a timestamp. For timestamps we use the Coordinated Universal Time (UTC) standard format. The first data point in a meta-data stream is assigned the current UTC time. The remaining timestamps are spaced according to the time granularity implied by f_sample.
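The sketch below shows what such an output step could look like when writing to text: each tuple carries a UTC timestamp derived from f_sample. The record layout, file name, and function name are our assumptions; the generator's actual text format and MySQL schema may differ.

import csv
from datetime import datetime, timedelta, timezone

def write_metadata(points, f_sample: float, path: str = "metadata.txt") -> None:
    """Write (lat, lon, direction) tuples with UTC timestamps spaced by 1/f_sample seconds.

    `points` is an iterable of (lat, lon, dir_degrees). The layout is illustrative only.
    """
    start = datetime.now(timezone.utc)            # the first sample gets the current UTC time
    step = timedelta(seconds=1.0 / f_sample)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp_utc", "latitude", "longitude", "direction_deg"])
        for i, (lat, lon, direction) in enumerate(points):
            ts = (start + i * step).strftime("%Y-%m-%d %H:%M:%S")
            writer.writerow([ts, f"{lat:.6f}", f"{lon:.6f}", f"{direction:.1f}"])

# Example: three samples at f_sample = 1 sample/s.
write_metadata([(46.7324, -117.0002, 0.0),
                (46.7330, -117.0002, 2.5),
                (46.7336, -117.0001, 5.0)], f_sample=1.0)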
A.4 Experimental Evaluation

A.4.1 Comparison with Real-world Dataset

In this section we empirically examine some high-level properties of the synthetic georeferenced video meta-data. We analyze the movements and rotations of both the real-world and synthetic datasets. The effectiveness of the synthetic data generation approach is evaluated through a high-level comparison between the real-world and synthetic data obtained for the same geographical region.

A.4.1.1 Datasets and Evaluation Methodology

Using the proposed georeferenced video meta-data generator, we produced two groups of synthetic data, namely S_car using the vehicle camera template and S_pass using the passenger camera template. For each synthetic data group, we generated 10 datasets. For the first group (S_car), the cameras cannot be rotated, therefore the camera orientation is the same as the moving direction. For the second group (S_pass) the camera can be rotated freely. Both synthetic data groups were created based on the road network of Moscow, Idaho, where the real-world data were collected. The maximum speed and object class parameters for both S_car and S_pass were adjusted to make the maximum camera speed comparable to the real-world data. The average length of the video files (i.e., L) in the synthetic datasets was also similar to that of the real-world video files. The output sampling rate f_sample was set to 1 sample/s. Table A.3 summarizes the properties of the generated data. The values from the 10 datasets in each of the groups S_car and S_pass were averaged and we only report these averaged values in Table A.3.

Table A.3: Properties of the synthetic and real-world datasets.
  Synthetic data with fixed camera (S_car): 40 video files, 11,849 trajectory points, 299 s average video length, 87.91 km/h maximum speed.
  Synthetic data with free camera rotation (S_pass): 40 video files, 11,946 trajectory points, 300 s average video length, 87.28 km/h maximum speed.
  Real-world data (RW): 25 video files, 10,170 trajectory points, 426 s average video length, 83.50 km/h maximum speed.

To evaluate the applicability of the synthetic datasets, we are interested in analyzing the important characteristics of the synthetic datasets and comparing them to those of the real-world dataset (RW). We compare S_car and S_pass with RW in terms of (1) the speed of the cameras and (2) the rotation of the cameras. We report the average and maximum values for speed and rotation, and the frequency distribution of the different speed and rotation levels within the datasets. Note that the properties of the real-world dataset are highly subjective and depend on how the user moves and rotates the camera. During real-world data collection, we rotated the camera as needed to capture important scenes and avoided fast movements and rotations.

A.4.1.2 Comparison of Camera Movement Speed

First, we analyze how fast and how rapidly the cameras move. When we generated S_car and S_pass, we set the maximum allowable speed parameter V_m^road based on the speed limits of the different types of roads in the region (i.e., 88 km/h for highways, 56 km/h for main roads, and 40 km/h for side roads). Recall that the generator calculates the maximum allowable speed on different edges of the road network based on this parameter. The other camera template parameter values are A=12 m/s^2 [DBG09], P_e=0.3, and t_s=3 s. For all 10 datasets in S_car and S_pass, we calculated the absolute speed of the camera at each trajectory point.

Table A.4: Characteristics of the camera speed.
  Synthetic data with fixed camera (S_car): maximum 87.91 km/h, average 27.14 km/h, standard deviation 12.82 km/h.
  Synthetic data with free camera rotation (S_pass): maximum 87.28 km/h, average 27.32 km/h, standard deviation 13.01 km/h.
  Real-world data (RW): maximum 83.50 km/h, average 27.03 km/h, standard deviation 13.68 km/h.

Table A.4 reports the maximum and average values and the standard deviation of the camera speed. The values are averaged over the 10 datasets in S_car and S_pass. We observe that the average and standard deviation values for both S_car and S_pass are very close to the values of RW. This implies that the point speed distributions of the synthetic and real-world datasets are quite similar. To further investigate this, we constructed the histogram for different speed ranges in S_car and RW.

Figure A.4: Comparison of camera speed distributions for S_car and RW (percentage of trajectory points per 5 km/h speed range).

Figure A.4 shows the percentages of the trajectory points for various speed ranges. While S_car and RW have similar trends, for S_car the majority of the trajectory points have speed values around the average speed, whereas for RW the distribution of the camera point speed is more uniform. In dataset S_car, cameras always try to travel at the maximum allowed speed. They only slow down or stop when they are required to do so by traffic lights, stop signs, etc. The cameras always accelerate and decelerate at a fixed rate.
On the other hand, in RW cameras usually speed up and slow down with a lower acceleration/deceleration rate, resulting in smoother speed transitions. This is due to the fact that the human operators were very cautious in driving a car during the real-time video capture. Note that the trend in the speed histogram might change for different real-world datasets. The average speed limit for the road network used in the data generation was around 40 km/h (≈ 25 mph). This explains the dense population of trajectory points between the speed levels of 20-40 km/h.

Figure A.5: Illustration of camera movement speed on the map. (a) Real-world data RW, (b) synthetic data S_car.

Figure A.5 illustrates the camera trajectories of the real-world dataset and a sample synthetic dataset. The color highlights on the trajectories show how fast the cameras are moving on the road network. The blue color indicates the fastest speed while the red color marks the slowest speed. Note that, overall, the movement behaviors of the cameras in both datasets are similar. The camera movements of both the S_car and S_pass datasets are based on road networks, therefore their trajectory generation is the same. We obtained similar results for the comparison of S_pass with RW as we did for S_car and RW. We omit these results due to space considerations.

A.4.1.3 Comparison of Camera Rotation

Next, our objective is to examine the properties of the camera rotations in the synthetic datasets and to analyze the practicality of the data generator by comparing the rotation characteristics of S_car and S_pass with RW. We calculated the absolute rotation (in degrees/s) of the camera at each trajectory point for all 10 datasets in S_car and S_pass. In Table A.5, we report the maximum and average rotations as well as the standard deviation of rotations within each dataset. Again, the values were averaged over the 10 datasets in S_car and S_pass.

Table A.5: Characteristics of the camera rotation.
  Synthetic data with fixed camera (S_car): maximum 32.33 degrees/s, average 4.64 degrees/s, standard deviation 7.24 degrees/s.
  Synthetic data with free camera rotation (S_pass): maximum 55.27 degrees/s, average 12.59 degrees/s, standard deviation 9.35 degrees/s.
  Real-world data (RW): maximum 107.30 degrees/s, average 11.53 degrees/s, standard deviation 14.02 degrees/s.

The maximum rotations obtained from S_car and S_pass were limited by the rotation thresholds provided to the data generator as inputs (i.e., 30 and 60 degrees/s, respectively). For RW, the maximum rotation was relatively large. Although not frequent, there were a few fast rotations in the real-world data, resulting in larger values for the maximum and standard deviation measurements. For S_car, since the cameras mostly travel on straight roads, the average rotation is quite small (≤ 5 degrees/s). On the other hand, for S_pass, where cameras can be rotated freely, the average rotation is very close to RW.

Figures A.6 and A.7 show the percentage distributions of camera rotations for both the real and synthetic data. In Figure A.6, around 88% of the trajectory points rotate only 10 degrees or less per second. The remaining 12% correspond to the parts of the trajectories where the cameras make turns on the road network.

Figure A.6: Comparison of camera rotation distributions for S_car and RW (percentage of trajectory points per 5 degrees/s rotation range).
Figure A.7: Comparison of camera rotation distributions for S_pass and RW (percentage of trajectory points per 5 degrees/s rotation range).

As shown in Figure A.7, the distribution of rotations for S_pass is more uniform and the distribution trend is similar to RW. One important observation is that RW exhibits higher percentages at lower rotation values and lower percentages at higher rotations, while the percentages of trajectory points for S_pass are more uniform among the different rotation levels. This is due to the fact that in the real-world data, for some video sections the camera was rotated more frequently and more rapidly to capture certain attractions/landmarks, and it was not rotated when there was no specific target to capture. As we mentioned earlier, such behavior is subjective. One possible improvement for our generator is to simulate different user behaviors for capturing the points of interest in the region.

Figure A.8: Illustration of camera rotation on the map. (a) Real-world data (RW), (b) synthetic data (S_pass).

Figures A.8(a) and (b) visualize the rotation levels along the camera trajectories for RW and S_pass, respectively. In the figure, different colors show different levels of camera rotation: the red, green and blue highlights show small, medium and large rotations, respectively.

A.4.2 Performance Issues

Here we provide a high-level analysis of the run-time requirements of the generator. Our objective is to show that the generator can create synthetic datasets in a reasonable amount of time with off-the-shelf computational resources. Recall that the generator implements two major components: generating camera movements and assigning camera directions. For the trajectory generation, the performance of the proposed generator depends on the performance of the two underlying algorithms, Brinkhoff and GSTD, whose run-time requirements have been examined before and are well known. Therefore, we focus on the run time of the direction assignment algorithm (i.e., Algorithm 6).

We generated several synthetic datasets with different sizes and measured the computation time to produce each dataset. All measurements were performed on a 2.33 GHz Intel Core2 Duo Windows PC. Table A.6 summarizes the measured times for different types of datasets and parameter settings. All data were generated for the same geographical region (i.e., Moscow, ID). In Table A.6, the times for generating trajectories and assigning camera directions are listed separately. Overall, the generator can produce very large datasets (containing millions of tuples) in a reasonable amount of time, i.e., within the order of minutes.

Table A.6: Summary of the time requirements for synthetic data generation.
  Vehicle camera (T_network, fixed camera): 2,980 videos, 2,772,220 points; trajectory generation 124 s, direction assignment 39 s, total 163 s.
  Passenger camera (T_network, random rotation): 2,980 videos, 2,744,973 points; trajectory generation 115 s, direction assignment 201 s, total 316 s.
  Pedestrian camera (T_free, random rotation): 2,979 videos, 2,703,542 points; trajectory generation 32 s, direction assignment 263 s, total 255 s.
  Pedestrian camera (T_mixed, random rotation): 2,970 videos, 2,654,875 points; trajectory generation 271 s, direction assignment 215 s, total 486 s.

A.5 Summary

We presented an approach for generating synthetic video meta-data with realistic geospatial properties, based on the behavioral patterns of mobile cameras when they move and rotate.
The requirements of the synthetic data generation were identified from examples of real-world camera behaviors. Subsequently, we explained the data generation process and devised the algorithms used in meta-data creation. The high-level properties of the synthetically generated data and those of real-world georeferenced video data were compared. We concluded that the synthetic meta-data exhibit equivalent characteristics to the real data, and hence can be used in a variety of mobile video management research. We plan to make the generator software sources available to the research community.