USC Computer Science Technical Reports, no. 911 (2009)
Relevance Ranking in Georeferenced Video Search

Sakire Arslan Ay, Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA
Roger Zimmermann, School of Computing, National University of Singapore, Singapore 117417, Republic of Singapore
Seon Ho Kim, Department of Computer Science & Information Technology, University of the District of Columbia, Washington, DC 20008, USA

October 30, 2009

Abstract

The rapid adoption and deployment of ubiquitous video sensors has led to the collection of voluminous amounts of data. However, the indexing and search of large video databases remains a very challenging task. Augmenting media clips with meta-data descriptions is a very useful technique in this context. In our earlier work we proposed the notion of a viewable scene model derived from the fusion of location and direction sensor information with a video stream. Such georeferenced media streams are useful in many applications and, very importantly, they can effectively be searched. The result of a georeferenced video query will in general consist of a number of video segments that satisfy the query conditions, but with more or less relevance. For example, a building of interest may appear in a video segment, but may only be visible in a corner. Therefore, an essential and integral part of a video query is the ranking of the result set according to the relevance of each clip. An effective result ranking is even more important for video than it is for text search, since the browsing of results can only be achieved by viewing each clip, which is very time consuming. In this study we investigate and present three ranking algorithms that use spatial and temporal video properties to effectively rank search results. To allow our techniques to scale to large video databases, we further introduce a histogram-based approach that allows fast online computations. An experimental evaluation demonstrates the utility of the proposed methods.

1 Introduction

The ubiquitous availability of digital video sensors has resulted in diverse applications ranging from casual videography to professional multi-camera observation systems. As a result, an increasing number of video clips are being collected and this is creating complex data handling challenges. While the size of the collected information is poised to exceed even the largest textual databases, video data is tremendously difficult to index and search effectively. To fully realize the potential of large video collections, fast and effective search techniques are an essential part of managing such archives. Current query techniques can be largely grouped into two clusters: (a) content-based retrieval and (b) meta-data assisted methods. Tremendous progress has been achieved in techniques that rely on content-based extraction. However, broadening those methods beyond specific domains (e.g., news or sports) is proving to be very challenging and the results are still not always satisfactory. A computationally more tractable approach relies on the processing of meta-data associated with video information. Such auxiliary data may consist of manual, textual annotations and/or some automatically generated tags. This method also enables high-level, semantic descriptions of video scenes which are very useful for human users. As a drawback, textual annotations often need to be added by hand, which is a laborious and error-prone process.
In our previous work we have proposed the use of the geographic location properties of certain video clips as an effective means to aid in the search of large video archives [5]. Specifically, we have put forward a viewable scene model that is comprised of a camera's position in conjunction with its view direction, distance and zoom level. There are several compelling features to this approach. First, the meta-data for the viewable scene model can be collected automatically by attaching various small devices to a camera, such as a global positioning system (GPS) sensor and a compass (see Figure 1). This eliminates manual work and allows this method to be accurate and scalable. Second, queries can be processed effectively and efficiently in geo-space. In this work we specifically target video search applications where the query is specified as either (i) a geographical region or (ii) some textual or visual input that can be interpreted as a region in geo-space.

Figure 1: Experimental hardware and software to acquire georeferenced video (Pharos iGPS-500 GPS receiver, OceanServer OS5000-US compass, JVC JY-HD10U camera).

In our initial prototype we focused on a proof-of-concept to illustrate how the viewable scene model can provide an effective way to search large video archives. We investigated the challenges of fusing multiple sensor data streams (e.g., camera location, field-of-view, direction) and the handling of camera mobility. One aspect that was not addressed in our prior work, and which we present in this study, is the question of georeferenced video result ranking. Analogous to a textual search with a web search engine, a geo-spatial video search will generally retrieve multiple video segments that may have different relevance to the query posed. Therefore, a very interesting and challenging question is how to rank the search results such that (a) the automatic ranking closely corresponds to what a human expert might expect and (b) the ranking algorithm performs efficiently even for very large video databases. In this manuscript we introduce three ranking algorithms that consider the spatial, temporal and spatio-temporal properties of georeferenced video clips. We further present a histogram-based approach that relies on a pre-processing step to dramatically improve the response time during subsequent searches.

Before elaborating on our approach in detail, Section 2 contains a survey of the related work. In Section 3 we briefly review the main concepts related to georeferenced videos and our viewable scene model. We then introduce the proposed ranking algorithms in Section 4. This is followed by a presentation of results based on a real-world data set in Section 5. Further discussion is provided in Section 6, which also concludes with several open issues and future research directions.

2 Related Work

Associating GPS coordinates with digital media (images and videos) has become an active area of research [24]. In this section, we review the existing work related to search techniques in georeferenced media retrieval and ranking. We will start our survey with methods that specifically consider still images and then move on to videos. We will also briefly describe some prior work in the area of indexing and storage. Lastly, we will mention a few commercial GPS-enabled cameras that produce georeferenced still images.

Techniques for Images. There has been significant research on organizing and browsing personal photos according to location and time. Toyama et al.
[31] introduced a meta-data powered image search and built a database, also known as the World Wide Media eXchange (WWMX), which indexes photographs using location coordinates (latitude/longitude) and time. A number of additional techniques in this direction have been proposed [22, 23]. There are also several commercial web sites [2, 3, 4] that allow the upload and navigation of georeferenced photos. All these techniques use only the camera geo-coordinates as the reference location in describing images. We instead rely on the field-of-view of the camera to describe the scene. More related to our work, Ephstein et al. [8] proposed to relate images with their view frustum (viewable scene) and used a scene-centric ranking to generate a hierarchical organization of images. Several additional methods have been proposed for organizing [26, 16] and browsing [9, 30] images based on camera location, direction and additional meta-data. Although these research efforts are similar to ours in using the camera field-of-view to describe the viewable scene, their main contribution is in image browsing and the grouping of similar images together. Some approaches [30, 17] use location and other metadata, as well as tags associated with images and the images' visual features, to generate representative images within image clusters. Geo-location is often used as a filtering step. Some techniques [8, 26] solely use the location and orientation of the camera to retrieve the "typical views" of important objects; there, however, the contribution is the segmentation of image scenes and the organization of photos based on image scene similarity. Our work describes a broader scenario that considers mobile cameras capturing geo-tagged videos and the associated view frustum, which changes dynamically over time. Furthermore, our ranking technique does not target any specific application domain and can therefore easily be applied to any specific application.

Techniques for Video. There exist only a few systems that associate videos with their corresponding geo-location. Hwang et al. [13] and Kim et al. [18] propose a mapping between the 3D world and the videos by linking the objects to the video frames in which they appear. However, their work neglects to provide any details on how to use camera location and direction to build links between video frames and world objects. Liu et al. [21] presented a sensor-enhanced video annotation system (referred to as SEVA) which enables searching videos for the appearance of particular objects. SEVA serves as a good example of how a sensor-rich, controlled environment can support interesting applications; however, it does not propose a broadly applicable approach to geo-spatially annotate videos for effective video search. In our prior work [5] we have extensively investigated these issues and proposed the use of videos' geographical properties (such as camera location and camera heading) to enable effective search of large video collections. We introduced a viewable scene model to describe the video content and demonstrated how this model enhances the search performance. All techniques mentioned above present ideas about how to search georeferenced video collections but do not provide any solutions for analyzing the relevance of search results. To our knowledge, the proposed techniques in this study are the first to address video ranking based on viewable scene cues.
We believe that our approach, when enhanced with an efficient spatio-temporal storage and indexing mechanism, will serve as a general-purpose and flexible video search and ranking mechanism that is applicable to any type of video with associated location and direction tags. Consequently, it can be the basis for a tremendous number of multimedia applications. Beyond georeferenced video ranking, the topic of content-based video retrieval and ranking has been studied extensively. The TREC Video Retrieval Evaluation (TRECVID) [27] benchmarking activity has been promoting progress in content-based retrieval of digital video since 2001. Each year, various feature detection methods from dozens of research groups are tested on hundreds of hours of video [28]. Unlike the research activities within the TRECVID benchmark, our focus is solely on high-level descriptions of videos using georeferenced meta-data rather than visual features. The two methods are orthogonal to each other and could be combined to create powerful solutions, depending on application needs.

Geospatial Search and Ranking Methods. Although ranking videos based on geospatial properties has not been well studied, there have been several ranking techniques developed for Geographic Information Retrieval (GIR) systems. Most of these studies compute spatial similarity measures based on the overlap between the query region and the spatial description of documents using the associated meta-data. Some earlier work [6] studied basic spatial and temporal relevance calculation methods. More recently, Larson et al. [19] provided a comprehensive summary of geospatial ranking techniques and Gobel and Klein [10] proposed a global ranking algorithm based on spatial, temporal and thematic parameters. To quantify the relevance of a video's viewable scene to a given query we leveraged some of these fundamental, pioneering spatial ranking techniques. Although similar ranking schemes have been studied before, our work is novel in applying these techniques to rank video data based on viewable scene descriptions.

Indexing and Storage. In our work we propose to use a histogram to accumulate the relevance scores for the camera viewable scenes. Data summarization using histograms is a well-studied research problem in the database community. A comprehensive survey of histogram creation techniques can be found in [14]. In [8] the authors use a grid of voting cells to discover the important parts of an image. Their techniques use only the spatial attributes to discover the relevant segments of the image scene, whereas our ranking methods incorporate both spatial and temporal attributes in calculating relevance.

Commercial products. There exist several GPS-enabled digital cameras which can save the location information with the digital image file as a picture is taken (e.g., Sony GPS-CS1, Ricoh 500SE, Jobo Photo GPS, Nikon D90 with GP-1). Very recent models additionally record the current heading (e.g., Ricoh SE-3, Solmeta DP-GPS N2). All current cameras support geotagging for still images only. We believe that, as more still cameras (with video mode) and video camcorders incorporate GPS and compass sensors, more location- and direction-tagged videos will be produced and there will be a strong need to perform efficient and effective search on those video data.

3 Searching Georeferenced Videos

The basis for our ranking algorithms is a detailed description of the video content based on the geospatial properties of the region it covers, so that large video collections can be indexed and searched effectively.
We refer to this space as the viewable space of the video scene. In this section, we provide a brief summary of our prior work [5] for completeness of the discussion and describe how to quantify, store and query the viewable scene of captured videos. In Section 4 we introduce several new methods to discover the most relevant videos based on the video scene's spatial similarity to the user query.

We model the viewable space of a scene with parameters such as the camera location, the angle of the view, and the camera direction. The camera's viewable scene changes when the camera moves or rotates. This dynamic scene information has to be acquired from sensor-equipped cameras, stored within an appropriate catalog or schema and indexed for efficient querying and retrieval. Our proposed approach consists of four components: 1) modeling of the viewable scene, 2) data acquisition, 3) indexing and querying, and 4) ranking search results. We will now describe the first three components in turn. Ranking retrieved videos is our novel contribution and will be discussed extensively in Section 4.

3.1 Georeferenced Video Annotations through Viewable Scene Modeling

We define the scene that a camera captures as the viewable scene of the camera. In the field of computer graphics, this area is referred to as the camera field-of-view (FOV for short). We describe a camera's viewable scene in 2D space using the following four parameters: (1) the camera position P, consisting of the ⟨latitude, longitude⟩ coordinate read from a positioning device (e.g., GPS); (2) the camera direction vector d, which is obtained from the orientation angle provided by a digital compass; (3) the camera viewable angle θ, which describes the angular extent of the scene imaged by the camera [11]; and (4) the far visible distance R, which is the maximum distance at which a large object within the camera's field-of-view can be recognized. The camera viewable scene is denoted by FOVScene(P, d, θ, R), or F(P, d, θ, R) in short. We will use the term 'FOVScene' and the symbol 'F' interchangeably throughout this manuscript.

The full field-of-view is obtained with the maximum visual angle, which depends on the lens/image sensor combination used in the camera [12]. Smaller image sensors have a smaller field-of-view than larger image sensors (when used with the same lens). Alternatively, shorter focal-length lenses have a larger field-of-view than longer focal-length lenses (when used with the same image sensor). The viewable angle θ can be obtained via the following formula [12]:

θ = 2 tan⁻¹( y / (2f) )   (1)

where y is the size of the image sensor and f is the focal length of the lens. The relationship between the visible distance R and the viewable angle θ is given in Equation 2 [12]. As the camera view is zoomed in or out, the θ and R values are adjusted accordingly.

θ = 2 arctan( y (R cos(θ/2) − f) / (2 f R cos(θ/2)) )   (2)

Our approach utilizes georeferenced video content captured from a sensor-equipped camera (see Figure 1), which can accurately estimate its current location P, orientation d and visual angle θ. In 2D space, the viewable scene of the camera at time t, F(P, d, θ, R, t), forms a pie-slice-shaped area as illustrated in Figure 2a. Figure 2b shows an example camera F volume in 3D space. For a 3D representation of F, we would need the altitude of the camera location point and the pitch and roll values to describe the camera heading on the zx and zy planes (i.e., whether the camera is directed upwards or downwards). We believe that the extension to 3D is straightforward, especially since we already acquire the altitude level from the GPS device and the pitch and roll values from the compass. In this study we represent the FOVScene in 2D space only.

Figure 2: Illustration of the FOVScene model (a) in 2D and (b) in 3D. In 2D, P ⟨longitude, latitude⟩ is the camera location, d the camera direction vector, θ the viewable angle and R the visible distance; in 3D, P additionally includes altitude and θ, φ denote the horizontal and vertical viewable angles.
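As a concrete illustration (not part of the original system), the short sketch below evaluates Equation 1; the sensor height and focal length used are the JVC JY-HD10U values quoted later in Section 3.2 (y = 3.6 mm, f = 5.2 mm) and serve only as an example.

import math

def viewable_angle(sensor_size_mm: float, focal_length_mm: float) -> float:
    """Viewable angle theta (radians) from Equation 1: theta = 2 * arctan(y / (2f))."""
    return 2.0 * math.atan(sensor_size_mm / (2.0 * focal_length_mm))

# Example with the sensor height and focal length reported for the JVC JY-HD10U
# (y = 3.6 mm, f = 5.2 mm); any other lens/sensor combination can be substituted.
theta = viewable_angle(3.6, 5.2)
print(math.degrees(theta))  # roughly 38 degrees for this combination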
3.2 Georeferenced Meta-data Acquisition

The viewable scene of a camera changes as it moves or changes its orientation in geo-space. In order to keep track of what the camera sees over time, we need to record the FOVScene descriptions with a certain frequency and produce time-stamped meta-data together with time-stamped video streams. Our meta-data streams are analogous to sequences of ⟨P, d, θ, R, t⟩ quintuples, where t is the time instant at which the FOVScene information is recorded. In this section, we first describe how we record the location and direction meta-data and briefly explain the viewable angle and visible distance calculations using camera properties. We then discuss some timing and synchronization issues, and finally provide a brief discussion of the accuracy of the location and heading measurements.

Recording georeferenced video streams. Our sensor-rich video recording system incorporates three devices: a video camera, a 3D digital compass, and a Global Positioning System (GPS) device. We assume that the optical properties of the camera are known. The digital compass, mounted on the camera, periodically reports the direction in which the camera is pointing. The camera location is read from the GPS device as a ⟨latitude, longitude⟩ pair. Video can be captured with various camera models – we use a high-resolution (HD) camera. Our custom-written recording software receives direction and location updates from the GPS and compass devices as soon as new values are available and records the updates along with the current computer time and coordinated universal time (UTC). Video data is received from the camera as data packet blocks. Each video data packet is processed in real time to extract frame timecodes, and these extracted timecodes are recorded along with the local computer time at which the frame was received. Creating a frame-level time index for the video stream minimizes the synchronization errors that might occur due to clock skew between the camera clock and the computer clock. In addition, such a temporal video index, whose timing is compatible with the other datasets, enables easy and accurate integration with the GPS and compass data.
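As a small illustration of the meta-data stream just described, one possible record layout for the ⟨P, d, θ, R, t⟩ quintuples is sketched below; the field names and types are our own assumptions rather than the schema used by the recording software.

from dataclasses import dataclass

@dataclass
class FOVSceneSample:
    """One timestamped FOVScene sample <P, d, theta, R, t> (2D model)."""
    t: float            # sample time (e.g., UTC seconds)
    lat: float          # camera position P: latitude
    lon: float          # camera position P: longitude
    heading_deg: float  # direction vector d, stored as a compass heading in degrees
    theta_rad: float    # viewable angle theta
    r_m: float          # far visible distance R in meters

# A video clip V_k is then simply the ordered list of its per-second samples.
clip_vk: list[FOVSceneSample] = []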
Calculating the viewable angle (θ) and visible distance (R). Assuming that the optical focal length f and the size of the camera image sensor y are known, the camera viewable angle θ can be calculated through Equation 1. The default focal length for the camera lens is obtained from the camera specifications. However, when there is a change in the camera zoom level, the focal length f and consequently the viewable angle θ will change. To capture the change in θ, the camera should be equipped with a special unit that measures the focal length for different zoom levels. Such functionality is not commonly available in today's off-the-shelf digital cameras and camcorders. To simulate the changes in the viewable angle, we have manually recorded the exact video timecodes along with the changes in the zoom level. Using the Camera Calibration Toolbox [1] we have measured the f value for five different zoom levels (from the minimum to the maximum zoom level). For all other zoom levels, the focal length f is estimated through interpolation.

The visible distance R can be obtained from the equation

R = f h / y   (3)

where f is the lens focal length, y is the image sensor height and h is the height of the target object that will be fully captured within a frame. With regard to the visibility of an object from the current camera position, the size of the object also affects the maximum camera-to-object viewing distance. For large objects (e.g., mountains, high buildings) the visibility distance will be large, whereas for small objects of interest the visibility distance will be small. For simplicity, in our initial setup we assume R to be the maximum visible distance for a fairly large object. As an example, consider the buildings A and B shown in Figure 3a. Both buildings are approximately 8.5 m tall and both are located within the viewable angle of the camera. The distances from buildings A and B to the camera location are measured as 100 m and 300 m, respectively. The frame snapshot for the FOVScene in Figure 3a is shown in Figure 3b. We assume that, with good lighting conditions and no obstructions, an object can be considered visible within a captured frame if it occupies at least 5% of the full image height. For our JVC JY-HD10U camera, the focal length is f = 5.2 mm and the CCD image sensor height is y = 3.6 mm. Using Equation 3, the height of building A is calculated as 12% of the video frame height, and the building is therefore considered visible. However, building B is not visible since it covers only 4% of the image height. Based on the above discussion, the threshold for the far visible distance R for our visible scene model is estimated at around 250 m. We currently target a mid-range far visible distance of 200–300 m. We believe that this range best fits the typical applications that would most benefit from our georeferenced video search (e.g., traffic monitoring, surveillance). Close-up and far-distance settings will be considered as part of our future research.

Figure 3: Illustration of visibility. (a) Building A is 100 m away from the camera and is within the camera FOVScene; building B is 300 m away from the camera and outside of the FOVScene. (b) Building A covers 12% of the frame height and is therefore assumed to be visible; building B covers only 4% of the frame height and is therefore assumed not to be visible.
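The visibility test in the building example can be written down directly. The following sketch (an illustration, not the authors' code) computes the fraction of the image height an object occupies and applies the 5% threshold, using the JVC JY-HD10U values from the text.

def image_height_fraction(object_height_m, distance_m, focal_mm=5.2, sensor_height_mm=3.6):
    """Fraction of the frame height covered by an object of the given height
    at the given distance, following the pinhole relation behind Equation 3."""
    projected_mm = focal_mm * (object_height_m * 1000.0) / (distance_m * 1000.0)
    return projected_mm / sensor_height_mm

def is_visible(object_height_m, distance_m, threshold=0.05):
    """An object is treated as visible if it covers at least 5% of the image height."""
    return image_height_fraction(object_height_m, distance_m) >= threshold

# Buildings A and B from Figure 3: both 8.5 m tall, at 100 m and 300 m.
print(round(image_height_fraction(8.5, 100), 2), is_visible(8.5, 100))  # ~0.12, True
print(round(image_height_fraction(8.5, 300), 2), is_visible(8.5, 300))  # ~0.04, False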
Timing and synchronization. The meta-data entries for compass updates and video frame timecodes have millisecond-granular timing. However, GPS location updates are available only every second. In order to calculate the camera FOVScenes, all three meta-data streams need to be combined and stored as a single stream with an associated common time index. Ideally, each camera would store the F coverage for each individual video frame. However, in large-scale applications there may be thousands of moving cameras with different sensing capabilities. In a sensor-rich system with several attached devices, one challenge is the synchronization of the sensor data read from the attached devices, which have different data output rates. Our recording software creates separate data streams for each device, where each meta-data entry is timestamped with the time when the update was received from the device. Later these data streams are combined with a 2-pass algorithm. Such an algorithm processes data in a sliding time window centered at the current time. It always matches the data entries that have the closest timestamps (past or future). In our setup the meta-data output rates for GPS, compass and video are 1, 40, and 30 samples/sec, respectively. Therefore, we match each GPS entry with the temporally closest video frame timecode and compass direction. For each meta-data entry, in addition to the local time we record the satellite time (in UTC) that is received along with the GPS location update. The use of the recorded satellite time can be twofold: (1) it enables synchronizing the current computer time with the satellite time; (2) it may be used as the time base when executing temporal queries, i.e., by applying the temporal condition of the query to the satellite time. Timestamping the FOVScene entries with the satellite time ensures global temporal consistency among all georeferenced video collections.
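A minimal sketch of the closest-timestamp matching described above is shown below. It assumes each stream is a time-sorted list of (timestamp, value) pairs; the function names and data layout are illustrative, not taken from the authors' recording software.

import bisect

def closest(entries, t):
    """Return the entry whose timestamp is closest to t (past or future).
    `entries` is a list of (timestamp, value) tuples sorted by timestamp."""
    times = [ts for ts, _ in entries]
    i = bisect.bisect_left(times, t)
    candidates = entries[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda e: abs(e[0] - t))

def fuse(gps_entries, compass_entries, frame_entries):
    """Build one FOVScene sample per GPS update (1 Hz) by attaching the
    temporally closest compass heading and video frame timecode."""
    fused = []
    for t, location in gps_entries:
        _, heading = closest(compass_entries, t)
        _, timecode = closest(frame_entries, t)
        fused.append((t, location, heading, timecode))
    return fused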
Measurement errors. The accuracy of the FOVScene calculation is somewhat dependent on the precision of the location and heading measurements obtained from the GPS and compass devices. A typical GPS device is accurate to approximately within 10 meters. In our proposed viewable scene model, the area of the region that a typical HD camera captures (FOVScene) is on the order of tens of thousands of square meters (e.g., at full zoom-out approximately 33,000 m²). Therefore, a difference of 10 m is not very significant compared to the size of the viewable scene we consider. Additionally, missing GPS locations – due to various reasons such as a tunnel traversal – can be recovered through estimation such as interpolation. There exists extensive prior work on estimating moving object trajectories in the presence of missing GPS locations. An error in the compass heading may be more significant. Many digital compasses ensure an azimuth accuracy of better than 1° (e.g., about 0.6° for the OS5000 digital compass in our system), which has only a minor effect on the viewable scene calculation. However, when mounted on real platforms the accuracy of a digital compass might be affected by local magnetic fields or materials. For our experiments the compass was calibrated within the setup environment to minimize any distortion in the compass heading. For some cameras the cost of add-on sensors may be an issue and manufacturers might use accelerometers plus camera movement to estimate the direction. In that case the accuracy will be lower and needs to be further investigated. It is also worth mentioning that multimedia applications often tolerate some minor errors. If a small object is at the edge of the viewable scene and is not included in the search results, often it will not be recognized by a human observer either.

3.3 Querying Georeferenced Videos

Figure 4 sketches the proposed video search architecture. Users collect georeferenced videos using the acquisition software and upload them into the search system. The uploaded video meta-data is processed and the viewable scene descriptions are stored in a spatio-temporal database. An intuitive way is to store a separate F quintuple including the camera id, video id, frame timecode, camera location, visual angle and camera heading for each video frame. The user query is either a geographical region or some textual or visual input that can be interpreted as a region in geo-space. The Query Interpreter in Figure 4 translates the user query into a spatio-temporal query. As an example, the query "Find the videos of the University of Idaho Kibbie Dome" is translated into the coordinates of the corner points of the rectangular region that approximates the location of the dome. The database is then searched for this spatial region and the video segments that capture the Kibbie Dome are retrieved. If the query specifies a temporal interval, only the videos that were recorded during the specified time window are returned.

Figure 4: Georeferenced video search architecture.

The FOVScene coverage of a moving camera over time is analogous to a moving region in the geo-spatial domain; therefore traditional spatio-temporal query types, such as range queries, k nearest neighbor (kNN) queries or spatial joins, can be applied to the F data. In our initial work, we limit our discussion to spatio-temporal range queries. The typical task we would like to accomplish is to extract the video segments that capture a given region of interest during a given time interval. As explained in Section 3.2, we can construct the F(t, P, d, θ, R) description for every time instance. Hence, for a given region and time of interest Q, we can extract the sequence of video frames whose viewable scenes overlap with the region of Q and whose timestamps overlap with the time interval of Q. Going from most specific to most general, the region of query Q can be a point, a line (e.g., a road), a poly-line (e.g., a trajectory between two points), a circular area (e.g., the neighborhood of a point of interest), a rectangular area (e.g., the space delimited by roads) or a polygon area (e.g., the space delimited by certain buildings, roads and other structures). Details of range query processing can be found in our prior work [5].

The query processing mechanism presented in [5] does not differentiate between highly relevant and irrelevant data and presents all results to a user in random order. In many applications this will not be acceptable. In this study we improve the proposed query mechanism and propose techniques to rank the search results based on their relevance to the user query. The Video Ranking module in Figure 4 rates search results according to the spatio-temporal overlap properties, i.e., how much and how long the resulting video segments overlap with the query region. Although the objective of our study is to rank results based on the queries' spatio-temporal attributes, for some applications video ranking accuracy can further be improved by leveraging features extracted from the visual video content. In Figure 4, the Concept Detection module provides information about the semantic content of the video segments to aid the ranking process. A detailed discussion of content-based video search and ranking techniques is out of the scope of our paper. A review of state-of-the-art solutions can be found in the literature [20, 7, 29]. In Section 4.3 we provide some example queries for which the ranking accuracy can be improved by leveraging visual features in addition to spatio-temporal overlap properties. Note that content-based feature extraction is not implemented in our prototype system.

Another potential problem with the prior search mechanism built on top of a relational model is the computational overhead. When video search is performed on large video collections, effective storage and indexing of georeferenced video meta-data becomes extremely important. In a typical query, all frames that belong to the query time interval have to be checked for overlaps.
Because of the computational complexity of the operations involved, the cost of processing spatial queries and obtaining relevance scores may be significant. Computational efficiency can be improved by adopting an index structure to store and query FOVScene descriptions.

Table 1: Summary of terms
F — short notation for FOVScene
P — camera location point
Q — a query region
V_k — a video clip k
V_k^F — video clip k represented by a set of FOVScenes
V_k^F(t_i) — a polygon-shaped FOVScene at time t_i, a set of corner points
Q — a polygon query region represented by a set of corner points
O(V_k^F(t_i), Q) — overlap region between V_k^F and Q at t_i, a set of corner points
R_TA — relevance score with TotalOverlapArea
R_D — relevance score with OverlapDuration
R_SA — relevance score with SummedAreaOfOverlapRegions
Grid — M × N cells covering the universe
V_k^G(t_i) — a FOVScene at time t_i represented by the set of grid cells of Grid that overlap with V_k^F(t_i)
V_k^G — video clip k represented by a set of V_k^G(t_i)
Q^G — a polygon query region represented by a set of grid cells
O^G(V_k^G(t_i), Q) — overlap region between V_k^G and Q at t_i, a set of grid cells
R_TA^G — relevance score using the grid, extension of R_TA
R_D^G — relevance score using the grid, extension of R_D
R_SA^G — relevance score using the grid, extension of R_SA

In this study we restrict our example queries to simple spatio-temporal range searches. However, using the camera view direction d in addition to the camera location P to describe the camera viewable scene provides a rich information base for answering more complex geospatial queries. For example, if the query asks for the views of an area from a particular angle, more meaningful scene results can be returned to the user. Alternatively, the query result set can be presented to the user as distinct groups of resulting video sections such that the videos in each group capture the query region from a different view point. Further aspects of a complete system to query georeferenced videos – such as indexing and query optimization – will be explored as part of our future work.

4 Ranking Georeferenced Video Search Results

In video search, when results are returned to a user, it is critical to present the most related videos first since manual verification (viewing videos) can be very time-consuming. This can be accomplished by creating an order which ranks the videos from the most relevant to the least relevant. Otherwise, even though a video clip completely captures the query region, it may be listed last within the query results. It is essential to question the relevance of each video with respect to the user query and to provide an ordering based on estimated relevance.

Two relevant dimensions for calculating video relevance with respect to a spatio-temporal query are its spatial and temporal overlap. Analyzing how the FOVScene descriptions of a video overlap with a query region gives clues for calculating its relevance with respect to the given query. A natural and intuitive metric to measure spatial relevance is the extent of the region overlap: the greater the overlap between F and the query region, the higher the video relevance. It is also useful to differentiate between videos which overlap with the query region for intervals of different length. A video which captures the query region for a longer period will probably include more details about the region of interest and therefore can be more interesting to the user.
Note that during the overlap period the amount of overlap at each time instant changes dynamically for each video. Among two videos whose total overlap amounts are comparable, one may cover a small portion of the query region for a long time and the rest of the overlap area only for a short time, whereas another video may cover a large portion of the query region for a longer time period. Figures 5a and 5b illustrate the overlap between the query Q_207 and the videos V_46 and V_108, respectively. Although the actual overlapped area of the query is similar for both videos, the coverage of V_108 is much denser. Consequently, among the two videos, V_108's relevance is higher.

Figure 5: Visualization of the overlap regions between query Q_207 and videos (a) V_46 and (b) V_108.

In the following sections, we explain how we define the overlap between the video FOVScenes and query regions and propose three basic metrics for ranking video search results. We provide a summary of the symbolic notation used in our discussion in Table 1.

4.1 Preliminaries

Let Q be a polygon-shaped query region given by an ordered list of its polygon corners: Q = {(lon_j, lat_j), 1 ≤ j ≤ m}, where (lon_j, lat_j) is the longitude and latitude coordinate of the j-th corner point of Q and m is the number of corners in Q. Suppose that a video clip V_k consists of n F regions and that t_s and t_e are the start time and end time of video V_k, respectively. The sampling time of the i-th F is denoted as t_i.¹ The starting time of a video, t_s, is defined as t_1. The i-th F represents the video segment between t_i and t_{i+1}, and the n-th F, which is the last FOVScene, represents the segment between t_n and t_e (for convenience, let t_e = t_{n+1}). The set of FOVScene descriptions for V_k is given by V_k^F = { F_{V_k}(t_i, P, d, θ, R) | 1 ≤ i ≤ n }. Similarly, the F at time t_i is denoted as V_k^F(t_i). If Q is viewable by V_k, then the set of F that capture Q is given by

SceneOverlap(V_k^F, Q) = { V_k^F(t_i) | for all i (1 ≤ i ≤ n) where V_k^F(t_i) overlaps with Q }

¹ FOVScenes can be collected at any time and timestamped. However, in our experiments, we collected FOVScenes at a fixed interval of one second.

The overlap between V_k^F and Q at time t_i forms a polygon-shaped region, as shown in Figure 6. Let O(V_k^F(t_i), Q) denote the overlapping region between video V_k^F and query Q at time t_i. We define it as an ordered list of corner points that form the overlap polygon. Therefore,

O(V_k^F(t_i), Q) = OverlapBoundary(V_k^F(t_i), Q) = { (lon_j^{t_i}, lat_j^{t_i}), 1 ≤ j ≤ m }   (4)

where m is the number of corner points in O(V_k^F(t_i), Q). The function OverlapBoundary returns the overlap polygon which encloses the overlap region. In Figure 6, these corner points are shown with labels P1 through P9.

Figure 6: The overlap between a video FOVScene and a polygon query.

Practically, when a pie-shaped F and a polygon-shaped Q intersect, the resulting overlap region does not always form a polygon. If the arc of F resides inside Q, part of the overlap region will be enclosed by an arc rather than a line. Handling such irregular shapes is usually impractical. Therefore we estimate the part of the arc that resides within the query region Q with a piece-wise linear approximation consisting of a series of points on the arc, such that each point is 5° apart from the previous and next point with respect to the camera location. The implementation of the function OverlapBoundary is given in Algorithm 1.
Note that OverlapBoundary computes the corner points that enclose the overlap polygon where:

• a corner of the query polygon Q is enclosed within F (lines 7 through 11),
• a corner point of F (i.e., the camera location point or the starting or ending point of the arc) is enclosed within Q (lines 12 through 16),
• a side of the query polygon Q crosses a side of F (lines 17 through 22), and
• part of the F arc is enclosed within Q; the intersecting section of the arc is estimated with a series of points (lines 23 through 37).

A summary of the subroutines that Algorithm 1 calls is given in Table 2. The algorithm for the function pointFOVIntersect can be found in [5]. The computation of the other subroutines is trivial and omitted due to space limitations.

Algorithm 1 OverlapPoly = OverlapBoundary(V_k^F(t_i), Q)
1: Q ← given convex polygon-shaped query region
2: Q = ⟨N_Q, E_Q⟩ (N_Q ← set of vertices in Q; E_Q ← set of edges in Q)
3: V_k^F(t_i) ← FOVScene region for frame t_i
4: N_F ← set of corner points in V_k^F(t_i), including the camera location point and the starting and ending points of the arc in V_k^F(t_i)
5: E_F ← the two edges of the pie-shaped V_k^F(t_i)
6: P, d, R, θ ← camera location point, direction vector, visible distance, and visual angle parameters of V_k^F(t_i)
   {Check whether any of the points in N_Q are within V_k^F(t_i); if so, add them to Opoly}
7: for j ← 1 to |N_Q| do
8:   if pointFOVIntersect(N_Q(j), V_k^F(t_i)) is true then
9:     Opoly ← Opoly ∪ {N_Q(j)}
10:  end if
11: end for
    {Check whether any of the points in N_F are within Q; if so, add them to Opoly}
12: for j ← 1 to |N_F| do
13:   if pointPolygonIntersect(N_F(j), Q) is true then
14:     Opoly ← Opoly ∪ {N_F(j)}
15:   end if
16: end for
    {Check whether any of the edges in E_Q intersect the edges in E_F; if so, add the intersection points to Opoly}
17: for j ← 1 to |E_Q| and k ← 1 to |E_F| do
18:   X = LineIntersect(E_Q(j), E_F(k))
19:   if X ≠ NULL then
20:     Opoly ← Opoly ∪ {X}
21:   end if
22: end for
    {Check whether any of the edges in E_Q intersect the arc of V_k^F(t_i); if so, estimate the intersecting section of the arc as a poly-line and add its points to Opoly}
23: for j ← 1 to |E_Q| do
24:   ⟨p_1, p_2⟩ = lineCircleIntersect(E_Q(j), P, R)  {returns the points at which edge E_Q(j) intersects the circle centered at P with radius R}
25:   if ⟨p_1, p_2⟩ is not NULL then
26:     if IsWithinAngle(p_1, P, θ, d) and IsWithinAngle(p_2, P, θ, d) then
27:       A = EstimateArc(p_1, p_2, P)
28:     else if IsWithinAngle(p_1, P, θ, d) then
29:       A = EstimateArc(p_1, n_F, P)  {one of the endpoints of the arc must be in Q; let n_F be that point}
30:     else if IsWithinAngle(p_2, P, θ, d) then
31:       A = EstimateArc(p_2, n_F, P)
32:     end if
33:     for k ← 1 to |A| do
34:       Opoly ← Opoly ∪ {A(k)}
35:     end for
36:   end if
37: end for
38: return OverlapPoly = convhull(Opoly)

Table 2: Subroutines called by the function OverlapBoundary in Algorithm 1
pointFOVIntersect(p, V_k^F(t_i)) — returns true if point p overlaps with V_k^F(t_i)
pointPolygonIntersect(p_1, Q) — returns true if point p_1 overlaps with polygon Q
LineIntersect(ℓ_1, ℓ_2) — if line segments ℓ_1 and ℓ_2 intersect, returns the intersection point; otherwise returns NULL
lineCircleIntersect(ℓ, P, R) — if line segment ℓ intersects the circle centered at P with radius R, returns the intersection point(s); otherwise returns NULL
IsWithinAngle(p_1, P, θ, d) — returns true if point p_1 resides within the circle sector with center P, direction d, and angle θ
EstimateArc(p_1, p_2, P) — estimates the arc centered at P delimited by points p_1 and p_2 with a polyline; returns the sequence of points that form the polyline
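For readers who prefer working code over pseudocode, the overlap region can also be obtained with an off-the-shelf geometry library. The sketch below is not the authors' implementation; it approximates the pie-slice F with a polygon whose arc is sampled roughly every 5° (as described above) and lets the shapely library compute the intersection with Q. A projected coordinate system in meters, rather than raw latitude/longitude, is assumed, and all numeric values are illustrative.

import math
from shapely.geometry import Polygon

def fov_polygon(px, py, heading_deg, theta_deg, r, step_deg=5.0):
    """Approximate the pie-slice FOVScene F(P, d, theta, R) as a polygon.
    The arc is replaced by points spaced roughly step_deg apart (piece-wise linear).
    Headings are compass-style: 0 = north (+y), 90 = east (+x)."""
    n = max(2, int(math.ceil(theta_deg / step_deg)) + 1)
    start = heading_deg - theta_deg / 2.0
    pts = [(px, py)]
    for i in range(n):
        a = math.radians(start + theta_deg * i / (n - 1))
        pts.append((px + r * math.sin(a), py + r * math.cos(a)))
    return Polygon(pts)

def overlap_region(fov_poly, query_poly):
    """Counterpart of OverlapBoundary: the overlap region between F and Q."""
    return fov_poly.intersection(query_poly)

# Example: a camera at the origin heading north with a 38-degree viewable angle and
# R = 250 m, intersected with a 300 m x 300 m query region.
f = fov_polygon(0, 0, heading_deg=0, theta_deg=38, r=250)
q = Polygon([(-150, 50), (150, 50), (150, 350), (-150, 350)])
print(overlap_region(f, q).area)

The result plays the role of O(V_k^F(t_i), Q) in the metrics introduced in Section 4.2.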
4.2 Three Metrics to Describe the Relevance of a Video

We propose three fundamental metrics to describe the relevance (R) of a video V_k with respect to a user query Q, as follows:

1. Total Overlap Area (R_TA). The area of the region formed by the intersection of Q and V_k^F. This quantifies what portion of Q is covered by V_k^F, emphasizing spatial relevance.

2. Overlap Duration (R_D). The time duration of the overlap between Q and V_k^F in seconds. This quantifies how long V_k^F overlaps with Q, emphasizing temporal relevance.

3. Summed Area of Overlap Regions (R_SA). The summation of the overlap areas of the intersecting FOVScenes during the overlap interval. This strikes a balance between spatial and temporal relevance.

4.2.1 Total Overlap Area (R_TA)

The total overlap area of O(V_k^F, Q) is given by the smallest convex polygon which covers all overlap regions formed between V_k^F and Q. This boundary polygon can be obtained by constructing the convex envelope enclosing all corner points of the overlap regions. Equation 5 formulates the computation of the total overlap coverage. The function ConvexHull provides a tight and fast approximation of the total overlap coverage. It approximates the boundary polygon by constructing the convex hull of the polygon corner points, where each point is represented as a ⟨longitude, latitude⟩ pair. Figure 5 shows examples of the total overlap coverage between the query Q_207 and videos V_46 and V_108. The total overlap area is calculated as follows:

O(V_k^F, Q) = ConvexHull( ∪_{i=1}^{n} { O(V_k^F(t_i), Q) } ) = ConvexHull( ∪_{i=1}^{n} ∪_{j=1}^{|O(V_k^F(t_i),Q)|} { (lon_j^{t_i}, lat_j^{t_i}) } )   (5)

Subsequently, the Relevance using Total Overlap Area (R_TA) is given by the area of the overlap boundary polygon O(V_k^F, Q), computed as

R_TA(V_k^F, Q) = Area( O(V_k^F, Q) )   (6)

where the function Area returns the area of the overlap polygon O(V_k^F, Q). A higher R_TA value implies that a video captures a larger portion of the query region Q and therefore its relevance with respect to Q can be higher.

4.2.2 Overlap Duration (R_D)

The Relevance using Overlap Duration (R_D) is given by the total time in seconds that V_k^F overlaps with query Q. Equation 7 formulates the computation of R_D:

R_D(V_k^F, Q) = Σ_{i=1}^{n} (t_{i+1} − t_i)   for i when O(V_k^F(t_i), Q) ≠ ∅   (7)

R_D is obtained by summing the overlap time of each F in V_k^F with Q. We estimate the overlap time of each F as the difference between the timestamps of two sequential Fs. When the duration of the overlap is long, the video captures more of the query region and therefore its relevance will be higher. For example, a camera may not move for a while; the spatial query overlap will then not change, but the video will most likely be very relevant.
4.2.3 Summed Area of Overlap Regions (R_SA)

Total Overlap Area and Overlap Duration capture the spatial and temporal extent of the overlap, respectively. However, both relevance metrics express only the properties of the overall overlap and do not describe how individual FOVScenes overlap with the query region. For example, in Figure 5, for videos V_46 and V_108, although R_TA(V_46^F, Q_207) ≅ R_TA(V_108^F, Q_207) and R_D(V_46^F, Q_207) ≅ R_D(V_108^F, Q_207), V_108^F overlaps with around 80% of the query region Q_207 during the whole overlap interval, whereas V_46^F overlaps with only 25% of Q_207 for most of its overlap interval and overlaps with 80% of Q_207 only for the last few FOVScenes. In order to differentiate between such videos, we propose the Relevance using Summed Overlap Area (R_SA) as the summation of the areas of all overlap regions during the overlap interval. Equation 8 formalizes the computation of R_SA for video V_k^F and query Q:

R_SA(V_k^F, Q) = Σ_{i=1}^{n} ( Area( O(V_k^F(t_i), Q) ) · (t_{i+1} − t_i) )   (8)

Here, the function Area returns the area of the overlap polygon O(V_k^F(t_i), Q). The summed overlap area for a single F is obtained by multiplying its overlap area by its overlap time. Recall that the overlap time of each F is estimated as the difference between the timestamps of two sequential Fs. The summation of these summed overlap areas over all overlapping Fs provides the R_SA score for the video V_k^F.
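As an illustrative sketch (again relying on the shapely library, an assumption rather than the authors' code), the three relevance scores of Equations 6–8 can be computed from the per-sample FOVScene polygons, for instance those built with the fov_polygon helper sketched earlier.

from shapely.ops import unary_union

def relevance_scores(fov_polys, timestamps, query_poly):
    """Compute (R_TA, R_D, R_SA) for one video.
    fov_polys[i] is the polygon of the i-th FOVScene, sampled at timestamps[i];
    the last sample is assumed to last as long as the preceding interval."""
    overlaps, durations = [], []
    for i, f in enumerate(fov_polys):
        o = f.intersection(query_poly)
        if i + 1 < len(timestamps):
            dt = timestamps[i + 1] - timestamps[i]
        else:
            dt = durations[-1] if durations else 1.0
        overlaps.append(o)
        durations.append(dt)

    r_d = sum(dt for o, dt in zip(overlaps, durations) if not o.is_empty)   # Eq. 7
    r_sa = sum(o.area * dt for o, dt in zip(overlaps, durations))           # Eq. 8
    nonempty = [o for o in overlaps if not o.is_empty]
    r_ta = unary_union(nonempty).convex_hull.area if nonempty else 0.0      # Eqs. 5-6
    return r_ta, r_d, r_sa

Sorting videos by any of these scores in descending order then yields the rank lists RL_TA, RL_D and RL_SA used in Section 5.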
4.3 Ranking Videos Based on Relevance Scores

Algorithm 2 [R_TA, R_SA, R_D] = CalculateRankScores(k, Q)
1: Q ← given convex polygon-shaped query region
2: k ← video id
   {Load FOVScene descriptions from disk}
3: V_k^F = Load(V_k)
4: n = |V_k^F|
5: M = ∪_{i=1}^{n} MBR(V_k^F(i))  {M is the MBR that encapsulates the whole video file}
6: if RectIntersect(M, Q) is true then  {filter step 1}
7:   for i ← 0 to (n − 1) do
8:     M_1 = MBR(V_k^F(i))
9:     if RectIntersect(M_1, Q) is true then  {filter step 2}
10:      if SceneIntersect(Q, V_k^F(i)) is true then  {filter step 3}
11:        Opoly = OverlapBoundary(V_k^F(i), Q)
12:        R_TApoly ← R_TApoly ∪ Opoly
13:        R_SA += Area(Opoly) · (t_{i+1} − t_i)
14:        R_D += t_{i+1} − t_i
15:      end if
16:    end if
17:  end for
18: end if
19: R_TA = Area(convexhull(R_TApoly))

Algorithm 2 outlines the computation of the proposed relevance metrics R_TA, R_SA, and R_D for a given video V_k and query Q. Note that the relevance score for V_k is computed only when V_k^F overlaps with Q. In Algorithm 2, we apply a tri-level filtering step (lines 6, 9 and 10) to eliminate irrelevant videos and video segments. First, we check whether query Q overlaps with the MBR enclosing all of V_k^F. If so, we look for the F regions whose MBRs overlap with Q. Finally, we further refine the overlapping F regions by checking the overlap between query Q and the actual V_k^F. Such a filtering process improves computational efficiency by gradually eliminating the majority of the irrelevant video sections (see Section 5.3). Algorithm 2 calls the subroutine MBR, which computes the minimum bounding rectangle for a given F region. The functions RectIntersect(R, Q) and SceneIntersect(Q, V_k^F(t_i)) return true if the given query Q overlaps with the rectangle R or the FOVScene V_k^F(t_i), respectively. A detailed outline of SceneIntersect can be found in [5].

These proposed metrics describe the most basic relevance criteria that a typical user will be interested in. R_TA defines relevance based on the area of the covered region in query Q, whereas R_D defines relevance based on the length of the video section that captures Q. R_SA includes both the area and the duration of the overlap in the relevance calculation, i.e., the larger the overlap is, the bigger the R_SA score will be. Similarly, the longer the overlap duration, the more overlap polygons will be included in the summation. Since each metric bases its relevance definition on a different criterion, we may not expect to obtain a unique ranking from all three metrics. Furthermore, without feedback from users it is difficult to ascertain whether one of them is superior to the others. However, we can claim that a certain metric provides the best ranking when the query is specific in describing the properties of the videos that the user is looking for. As an example, in video surveillance systems, the videos that give the maximum coverage extent within the query region will be more relevant; metric R_TA will then provide the most accurate ranking. In real estate applications, users often would like to see as much detail as possible about the property, and therefore both the extent and the time of coverage are important. In such applications metric R_SA will provide a good ranking. And in traffic monitoring systems, where the cameras are mostly stationary, the duration of the video that captures an accident event will be more significant in calculating relevance; metric R_D will therefore produce the best ranking.

Based on the query specification, either a single metric or a combination of the three can be used to obtain the video ranking. Calculating the weighted sum of several relevance metrics (Equation 9) is a common technique to obtain an ensemble ranking scheme:

Relevance(V_k^F, Q) = w_1 R_TA(V_k^F, Q) + w_2 R_D(V_k^F, Q) + w_3 R_SA(V_k^F, Q)   (9)

To obtain the optimal values for the weights w_1, w_2 and w_3 we need a training data set which provides an optimized ranking based on several metrics. However, constructing reliable training data for georeferenced videos is not trivial and requires careful and tedious manual work. There is extensive research on content-based classification and ranking of videos using Support Vector Machines (SVM) and other classifiers, which are trained using publicly available evaluation data (for example the TRECVID benchmark dataset). There is a need for a similar effort to create public training data for georeferenced videos. In Section 5 we present results obtained by applying the individual metrics to calculate the relevance score of a video.

As described in Section 3.3, the visual content of the videos can be leveraged in the ranking process to improve the ranking accuracy. For example, for the Kibbie Dome query, the video segments in the search results might be analyzed to check whether the view of the camera is occluded by objects such as trees, cars, etc. We can adopt state-of-the-art concept detectors [32] to identify such objects within the video content. The video frames where the camera view is occluded can then be weighted less when calculating the spatial and temporal overlap for the metrics R_TA, R_SA and R_D. In addition to content-based features, text labels extracted from video file names, surrounding text and social tags can be useful in video ranking. We plan to elaborate on customized multi-modal ranking schemes for georeferenced video data as part of our future research work.
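A weighted combination as in Equation 9 is straightforward to apply once the individual scores are available. The sketch below normalizes each score across the candidate videos before mixing; the normalization is our own assumption (the raw scores live on very different scales), not a step specified in the paper.

def ensemble_rank(scores, w_ta=1.0, w_d=1.0, w_sa=1.0):
    """scores: dict video_id -> (r_ta, r_d, r_sa). Returns video ids sorted by the
    weighted sum of per-metric, max-normalized scores (Equation 9)."""
    maxima = [max(s[i] for s in scores.values()) or 1.0 for i in range(3)]
    def combined(vid):
        r_ta, r_d, r_sa = scores[vid]
        return w_ta * r_ta / maxima[0] + w_d * r_d / maxima[1] + w_sa * r_sa / maxima[2]
    return sorted(scores, key=combined, reverse=True)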
4.4 A Histogram Approach for Calculating Relevance Scores

The ranking methods introduced in Section 4.2 calculate the precise overlap regions for every overlapping FOVScene to obtain the video relevance scores. Since the precise overlap region computation is computationally expensive, these techniques are often not practical for large-scale applications. In this paper, we introduce several histogram-based ranking techniques that provide comparable rank results but at the same time dramatically improve the query response time. A histogram pre-computes and stores the amount of overlap between a video's FOVScenes and the grid cells. During query execution only the histogram data is accessed and queried.

Histogram-based ranking techniques not only enable faster query computation, but also provide additional information about how densely the video overlaps with the query. For example, although the exact shape of the overlap region is calculated for each individual F in the previous section, the computed relevance scores do not give the distribution of the overlap throughout the query region, i.e., which parts of the query region are more frequently captured in the video and which parts are captured only in a few frames. The distribution of the density of overlap can be meaningful in gauging a video's relevance with respect to a query and in answering user-customized queries, and therefore should be stored. In this section we first describe how we build the overlap histograms (OH) for videos, and then we present our histogram-based ranking algorithms that rank video search results using the histogram data. The histogram-based relevance scores are analogous to the precise relevance scores, except that the precise ranking techniques calculate the overlap amounts for every user query, whereas the histogram-based techniques use the pre-computed overlap information to obtain the rankings.

We first partition the whole geospace into disjoint grid cells such that their union covers the entire service space. Let Grid = {c_{i,j} : 1 ≤ i ≤ M and 1 ≤ j ≤ N} be the set of cells of the M × N grid covering the space. Given the F descriptions V_k^F of video V_k, the set of grid cells that intersect with a particular V_k^F(t_i) can be identified as

V_k^G(t_i) = GridFOVOverlap(V_k^F(t_i)) = { c_{m,n} : c_{m,n} overlaps with V_k^F(t_i) and c_{m,n} ∈ Grid }   (10)

V_k^G(t_i) is the set of grid cells that overlap with V_k^F(t_i) at time t_i, i.e., a grid representation of an F. Then V_k^G is a grid representation of V_k^F, which is the collection of V_k^G(t_i), 1 ≤ i ≤ n. The histogram for V_k^G, denoted as OH_k, consists of the grid cells C_k = ∪_{i=1}^{n} V_k^G(t_i). The function GridFOVOverlap given in Algorithm 3 determines these overlapping cells. In Algorithm 3, we first search for the cells that overlap with the borderline of V_k^F(t_i) (lines 4 through 9) and then include all other cells enclosed between the border cells (line 10) (see Figure 7). Algorithm 3 calls the subroutines DivideLineSeg and DivideArc, which divide the given line segment or arc into sub-segments of length half of the grid cell size and return the set of points L_i that define those segments. The function GridCellsbyPoint retrieves all the grid cells that overlap with L_i. We adopted the "Line-Edge Intersection" algorithm [25] to retrieve the cells that overlap with a given line or arc segment. These cells cover only the borderline of V_k^G(t_i). We add the cells enclosed between them by calling the subroutine AddEnclosedCells. Details are omitted due to space limitations.
Algorithm 3 V_k^G(t_i) = GridFOVOverlap(V_k^F(t_i))
1: V_k^F(t_i) ← given FOVScene description
2: Pa_1, Pa_2 ← start and end points of the arc in V_k^F(t_i)
3: h ← the grid cell size
   {Retrieve the cells that overlap with the borderline of V_k^F(t_i)}
4: L_1 = DivideLineSeg(h, P, Pa_1)  {L_1 is the set of points on the line segment ℓ_1 = P Pa_1, each h/2 apart from the next}
5: L_2 = DivideLineSeg(h, P, Pa_2)  {L_2 is the set of points on the line segment ℓ_2 = P Pa_2, each h/2 apart from the next}
6: L_3 = DivideArc(h, Pa_1, Pa_2)  {L_3 is the set of points on the arc from Pa_1 to Pa_2, each h/2 apart from the next}
7: V_k^G(t_i) ← V_k^G(t_i) ∪ GridCellsbyPoint(L_1)
8: V_k^G(t_i) ← V_k^G(t_i) ∪ GridCellsbyPoint(L_2)
9: V_k^G(t_i) ← V_k^G(t_i) ∪ GridCellsbyPoint(L_3)
   {Include all Grid cells enclosed between the border cells}
10: V_k^G(t_i) ← AddEnclosedCells(V_k^G(t_i))

For each cell c_j in C_k, OverlapHist counts the number of F samples that c_j overlaps with. In other words, it calculates the appearance frequency f_j of c_j in V_k^G (Equation 11):

f_j = OverlapHist(c_j, V_k^G) = Count( c_j, { V_k^G(t_i) : for all i, 1 ≤ i ≤ n } )   (11)

The function Count calculates the number of V_k^G(t_i) that cell c_j appears in. Note that OverlapHist describes only the spatial overlap between the grid and the video FOVScenes. However, in order to calculate the time-based relevance scores we also need to create a histogram that summarizes the overlap durations. OverlapHistTime constructs the set of time intervals during which c_j overlaps with V_k^G. A set I_j holds the overlap intervals of cell c_j with V_k^G as pairs of ⟨starting time, overlap duration⟩.

Figure 7: Grid representation of an overlap polygon.

The histogram for V_k^F, i.e., OH_k, thus consists of grid cells, each associated with an appearance frequency value and a set of overlap intervals.

Example 1: The histogram of video clip V_k is constructed as follows:
OH_k = {⟨c_1, f_1, I_1⟩, ⟨c_2, f_2, I_2⟩, ⟨c_3, f_3, I_3⟩}
     = {⟨(2,3), 3, {⟨2,8⟩, ⟨10,7⟩, ⟨20,5⟩}⟩, ⟨(3,3), 1, {⟨10,7⟩}⟩, ⟨(4,3), 1, {⟨10,7⟩}⟩}.
This histogram consists of three grid cells c_1, c_2, and c_3 appearing 3, 1, and 1 times in V_k^G, respectively. c_1 appears in three video segments: one starts at 2 and lasts for 8 seconds, another starts at 10 and lasts for 7 seconds, and the third starts at 20 and lasts for 5 seconds. c_2 appears once, starting at 10 and lasting for 7 seconds. c_3 appears once, starting at 10 and lasting for 7 seconds.

Our histogram-based implementation quantizes the number of overlaps between the Fs and the grid cells; therefore the histogram bins can only have integer values. Alternatively, for the histogram cells that partially overlap with the Fs, we might use floating point values that quantify the amount of overlap. Allowing floating point histogram bins would improve the precision of the R_SA^G metric by assigning lower relevance scores to videos that partially overlap with the query region than to those that fully overlap with the query. However, the storage and indexing of floating point numbers can be problematic when the size of the histogram is fairly large. Therefore, the tradeoff between precision and performance should be explored through careful analysis. Also note that the gain in precision from allowing floating point histogram bins is highly dependent on the size of the histogram cells. The effect of the cell size should also be investigated in such an analysis. Due to space limitations we could not include further discussions of possible extensions of the histogram approach.
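The overlap histogram OH_k can be represented compactly as a mapping from grid cell to an appearance count and a list of ⟨start, duration⟩ intervals. The sketch below builds such a structure from per-second grid-cell sets (the output of GridFOVOverlap), producing entries of the same ⟨cell, frequency, intervals⟩ shape as in Example 1; the data layout and the interval-merging rule are our own illustrative assumptions.

def build_overlap_histogram(cells_per_second):
    """cells_per_second: list of (timestamp, set_of_grid_cells), one entry per F sample
    (roughly 1-second sampling, as in the paper).
    Returns {cell: (frequency, [(start, duration), ...])}, merging consecutive samples
    of the same cell into a single interval."""
    hist = {}
    open_intervals = {}  # cell -> [start, duration] of the interval currently being extended
    prev_t = None
    for t, cells in cells_per_second:
        dt = 1 if prev_t is None else t - prev_t
        for c in cells:
            freq_intervals = hist.setdefault(c, [0, []])
            freq_intervals[0] += 1  # appearance frequency f_j (Equation 11)
            if c in open_intervals and open_intervals[c][0] + open_intervals[c][1] == t:
                open_intervals[c][1] += dt      # contiguous sample: extend the open interval
            else:
                open_intervals[c] = [t, dt]     # gap: start a new interval for this cell
                freq_intervals[1].append(open_intervals[c])
        prev_t = t
    return {c: (f, [tuple(iv) for iv in ivs]) for c, (f, ivs) in hist.items()}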
Figure 8 shows two example histograms, where different frequency values within the histograms are visualized with varying color intensities. Note that Algorithm 3 enables us to differentiate between the cells that are fully contained and the cells that are partially contained within the F region. Such a distinction is useful for a more accurate estimation of the overlap region; however, it is out of the scope of this paper and will be elaborated on as part of our future work.

4.4.1 Execution of Geospatial Range Queries Using the Histogram

Given a polygon shaped query region Q, we first represent Q as a group of grid cells in geospace:

$$Q^G = \{\text{all grid cells that overlap with } Q\} \qquad (12)$$

We refine the definition of the overlap region as the set of overlapping grid cells $O^G$ between $V^G_k$ and $Q^G$. Using the histogram of $V^G_k$ (i.e., $OH_k$), the overlapping grid cell set can be defined as:

$$O^G(V^G_k, Q^G) = (C_k \text{ of } OH_k) \cap Q^G \qquad (13)$$

Note that the grid cells in $O^G$ inherit their corresponding frequencies and intervals from $OH_k$. Let $Q^G$ be a query region that consists of four grid cells, $Q^G = \{(2,2), (2,3), (3,2), (3,3)\}$. Then, the cells overlapping with the video in Example 1 become:
$O^G(V^G_k, Q^G) = \{ <(2,3), 3, \{<2,8>, <10,7>, <20,5>\}>, <(3,3), 1, \{<10,7>\}> \}$.

4.4.2 Histogram Based Relevance Scores

Using the grid-based overlap region $O^G$, we redefine the three relevance metrics proposed in Section 4.2.

Total Overlap Cells ($R^G_{TA}$): $R^G_{TA}$ is the extent of the overlap region on $Q^G$, i.e., how many cells in $Q^G$ overlap with $V^G_k$. Thus, $R^G_{TA}$ is simply the cardinality of the overlapping set $O^G(V^G_k, Q^G)$. In Example 1, $R^G_{TA} = 2$.

Overlap Duration ($R^G_D$): The duration of overlap between a query $Q^G$ and $V^G_k$ can easily be calculated using the interval sets in $OH_k$ (i.e., the output of OverlapHistTime):

$$R^G_D\big(V^G_k, Q^G\big) = \mathrm{CombineIntervals}(OH_k) \qquad (14)$$

Function CombineIntervals combines the intervals in the histogram. Note that there may be time gaps when the intervals for some of the cells are disjoint; there are also overlapping time durations across cells. In Example 1, $R^G_D = 20$ seconds.

Summed Number of Overlapping Cells ($R^G_{SA}$): $R^G_{SA}$ is the total duration of cell overlap occurrences between $V^G_k$ and $Q^G$, and is therefore a measure of both how many cells in $Q^G$ are covered by video $V^G_k$ and how long each overlapping cell is covered. Since the histogram of a video already holds the appearance frequencies $f$ and interval sets of all overlapping cells, $R^G_{SA}$ becomes

$$R^G_{SA}\big(V^G_k, Q^G\big) = \sum_{i=1}^{|O^G(V^G_k, Q^G)|} \mathrm{SumIntervals}(c_i, I_i) \qquad (15)$$

where SumIntervals adds up the durations of all $f_i$ intervals in $I_i$ of an overlapping grid cell $c_i$. In Example 1, $R^G_{SA} = 27$.

As mentioned in the previous sections, a histogram gives the overlap distribution within the query region with discrete numbers. Knowing the overlap distribution is helpful for interactive video search applications where a user might further refine the search criteria and narrow down the search results.

5 Experimental Evaluation

5.1 Data Collection and Methodology

5.1.1 Data Collection

To collect georeferenced video data, we constructed a prototype system which includes a camera, a 3D compass and a GPS receiver. We used the JVC JY-HD10U camera with a frame size of approximately one megapixel (1280×720 pixels) at 30 frames per second. It produces MPEG-2 HD video streams at a rate of slightly more than 20 Mb/s, and video output is available in real time from the built-in FireWire (IEEE 1394) port.
To obtain the orientation of the camera, we employed the OS5000-US Solid State Tilt Compensated 3 Axis Digital Compass, which provides precise tilt compensated headings with roll and pitch data. To acquire the camera location, the Pharos iGPS-500 GPS receiver was used. A program was developed to acquire, process, and record the georeferences along with the MPEG-2 HD video streams. The system can process MPEG-2 video in real time (without decoding the stream) and each video frame can be associated with its viewable scene information. In all of our experiments, an FOVScene was constructed every second, i.e., one F per 30 frames of video. More details on acquisition and synchronization issues are provided in Section 3.2. Although our sensor-rich video recording system has been tested mainly with a camera that produces MPEG-2 video output, with little effort it can be configured to support any digital camera producing compressed or uncompressed video streams.

Figure 1 shows the setup of our recording prototype. We mounted the recording system on a pickup truck and captured video outdoors in Moscow, Idaho, traveling at different speeds (max. 25 MPH). During video capture, we frequently changed the camera view direction. The captured video covered a 6 kilometer by 5 kilometer region quite uniformly. However, for a few popular locations we shot several videos, each viewing the same location from a different direction. The total captured data includes 134 video clips, ranging from 60 to 240 seconds in duration. Figure 5 shows the viewable scene visualizations for two of these video clips on a map. For visual clarity, viewable scene regions are drawn every three seconds. Due to space limitations we cannot include more example visualizations. Further samples can be found at http://eiger.ddns.comp.nus.edu.sg/geospatialvideo/ex.html.

5.1.2 Methodology

The collected 134 georeferenced video files cover a 6 kilometer by 5 kilometer region. We generated 250 random spatial range queries with a fixed query range of 300 meter by 300 meter within this region. Although our system also supports temporal queries, we only used spatial queries in the experiments. The videos in our dataset were captured using a single camera on different days. Therefore, temporal queries return only a few results unless the query time interval is very large, since there is little time overlap among the videos. We found such queries not very illustrative in demonstrating the benefits of the proposed ranking algorithms. For each spatial range query, we searched the georeferenced meta-data to find the videos that overlap with that query (lines 6, 9 and 10 in Algorithm 2). We then calculated the relevance scores based on the three metrics proposed in Section 4.2. The rank lists RL_TA, RL_SA and RL_D are constructed from the relevance metrics R_TA, R_SA and R_D, respectively. A rank list is a sorted list of video clips in descending order of their relevance scores.

In order to evaluate the accuracy of the rankings from our proposed schemes, one needs a "ground truth" rank order for comparison. Unfortunately, there exists no well-defined, publicly available georeferenced video dataset (similar to the TRECVID benchmark evaluation data for still images [27]) that can be referenced for comparison. Therefore, we first analyzed and compared the rankings from the proposed schemes with each other. Next, we independently conducted experiments to rank the results by human judges. Finally, by comparing the results from the user study with those from the proposed schemes, we evaluated the accuracy of our ranking schemes.
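As a rough illustration of the rank list construction described above (a sketch only, not our actual implementation), the snippet below assumes hypothetical scoring functions r_ta, r_sa and r_d that return the R_TA, R_SA and R_D scores of Section 4.2 for a single video and query, and sorts the overlapping videos in descending score order.

def rank_list(videos, query, score_fn):
    # Score each candidate video and keep only those with a non-zero relevance,
    # i.e., those that actually overlap with the query.
    scored = [(vid, score_fn(vid, query)) for vid in videos]
    scored = [(vid, s) for vid, s in scored if s > 0]
    # A rank list is sorted in descending order of the relevance score.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# rl_ta = rank_list(all_videos, q, r_ta)   # r_ta, r_sa, r_d are placeholders
# rl_sa = rank_list(all_videos, q, r_sa)
# rl_d  = rank_list(all_videos, q, r_d)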
We conducted two sets of experiments to evaluate the ranking accuracy of the proposed methods:

1. Experiment 1: We compared the rankings RL_TA, RL_SA and RL_D with each other across the whole set of 250 queries.

2. Experiment 2: Among the 250 random queries we selected 25 easily recognizable query regions and asked human judges to rate each video file using a four-point scale ranging from "3 - highly relevant" down to "0 - irrelevant." We compared our results to the user provided feedback labels over these 25 random queries.

5.1.3 Evaluation Metrics

Since each ranking scheme interprets relevance in a different way, they are not expected to produce an identical ranking order across all schemes. However, we conjecture that they should all contain similar sets of video clips within the top N of their rank lists (for some N). A similar result from all three ranking algorithms would indicate that the resulting videos are the most interesting to the user. To compare the accuracy of the results, we adopted the Precision at N (P(N)) metric, which is a popular method that describes the fraction of relevant videos ranked in the top N results. We redefine P(N) as the fraction of common videos ranked within the top N results of more than one rank list. Note that the exact rank of videos within the top N is irrelevant in this metric. P(N) only shows the precision of a single query. Therefore, to measure the average precision over multiple queries, we use the Mean Average Precision (MAP), which is the mean of several P(N) values from multiple queries. We evaluated the results of Experiment 1 with MAP scores. For Experiment 2, which includes human judgement, a second evaluation metric, termed Discounted Cumulated Gain (DCG), was used in addition to MAP scores [15]. DCG systematically combines the video rank order and the degree of relevance. The discounted cumulated gain vector $\vec{DCG}$ is defined as

$$DCG[i] = \begin{cases} G[1] & \text{if } i = 1 \\ DCG[i-1] + G[i]/\log_e i & \text{otherwise} \end{cases}$$

where $\vec{G}$ is the gain vector which contains the gain values for the ranked videos in order. The gain values correspond to the user assigned relevance labels ranging from 0 to 3. Note that, because of the decaying denominator $\log_e i$, a video with a high relevance label listed at a top rank will dramatically increase the DCG sum, but a video with a high relevance label listed lower in the rank list will not contribute much to the sum. The idea is to favor the top positioned videos as they should be the most relevant for a user. An optimal ordering, where all highly relevant videos are ranked at the top and less relevant clips are listed lower in the rank list, produces the ideal DCG vector. The Normalized DCG (NDCG) is the final DCG sum normalized by the DCG of the ideal ordering. The higher the NDCG of a given ranking, the more accurate it is.

5.2 Comparison of Ranking Accuracy

5.2.1 Comparison of Proposed Ranking Schemes

We compare the ranking accuracy of RL_TA, RL_SA and RL_D using MAP scores. In Table 3, the first row calculates the MAP values as the average ratio of the videos that are common to all three rank lists within the top 1, 2, 5, 10, 15 and 20 ranked results over all 250 queries. The second, third and fourth rows display the MAP scores pair-wise for two methods each: RL_TA and RL_SA, RL_TA and RL_D, and RL_D and RL_SA. The results show that the precision increases as N grows and achieves a close to perfect score beyond N = 10. Note that the precision is very high even at N = 5. This implies that all three proposed schemes similarly identify the most relevant videos.
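The metrics above can be summarized in a short sketch (our reading of the definitions; the helper names are ours). P(N) is shown for a pair of rank lists (the three-way variant used in the first row of Table 3 simply intersects all three top-N sets), and DCG uses the natural-logarithm discount given above.

import math

def precision_at_n(rank_list_a, rank_list_b, n):
    # Fraction of videos shared by the top-N results of two rank lists.
    top_a, top_b = set(rank_list_a[:n]), set(rank_list_b[:n])
    return len(top_a & top_b) / float(n)

def mean_average_precision(list_pairs, n):
    # Mean of P(N) over several queries; list_pairs holds (rank_list_a, rank_list_b).
    return sum(precision_at_n(a, b, n) for a, b in list_pairs) / len(list_pairs)

def dcg(gains):
    # Discounted cumulated gain curve with the log_e discount defined above.
    total, curve = 0.0, []
    for i, g in enumerate(gains, start=1):
        total = g if i == 1 else total + g / math.log(i)
        curve.append(total)
    return curve

def ndcg(gains, ideal_gains):
    # Final DCG sum normalized by the DCG of the ideal (sorted) ordering.
    return dcg(gains)[-1] / dcg(ideal_gains)[-1]

# Example: user labels (0-3) of the videos in ranked order vs. the ideal order.
labels_in_rank_order = [3, 2, 3, 0, 1]
ideal_order = sorted(labels_in_rank_order, reverse=True)
print(ndcg(labels_in_rank_order, ideal_order))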
Table 4 lists the rankings RL_TA, RL_SA and RL_D for a specific query, Q_207. The rank differences in RL_TA, RL_SA and RL_D are mainly due to their different interpretations of relevance. To further understand the differences and similarities between the rankings, consider V_46 and V_108 in Table 4.

Table 3: Comparison of the proposed ranking methods RL_TA, RL_SA and RL_D (MAP at N)

Compared rank lists                                           N=1     N=2     N=5     N=10    N=15    N=20
All three: |topN(RL_TA) ∩ topN(RL_D) ∩ topN(RL_SA)| / N       0.60    0.789   0.918   0.993   0.999   1.0
RL_TA, RL_SA: |topN(RL_TA) ∩ topN(RL_SA)| / N                 0.727   0.839   0.961   0.993   1.0     1.0
RL_TA, RL_D:  |topN(RL_TA) ∩ topN(RL_D)| / N                  0.677   0.842   0.933   0.987   0.999   1.0
RL_SA, RL_D:  |topN(RL_SA) ∩ topN(RL_D)| / N                  0.745   0.885   0.947   0.987   1.0     1.0

Table 4: The ranked video results and relevance scores obtained for query Q_207

Rank   RL_TA   R_TA score (km^2)   RL_SA   R_SA score (km^2)   RL_D   R_D score (secs)
1      46      0.087               108     1.726               108    65
2      108     0.084               43      0.813               46     61
3      43      0.063               46      0.558               43     42
4      107     0.055               42      0.359               107    38
5      42      0.052               107     0.338               133    31
6      131     0.045               131     0.291               131    25
7      132     0.045               133     0.135               42     18
8      133     0.038               132     0.087               106    16
9      109     0.022               109     0.073               118    11
10     118     0.018               118     0.045               109    10
11     47      0.004               106     0.025               44     6
12     106     0.004               44      0.008               132    5
13     44      0.001               47      0.004               47     1
14     65      0.001               65      0.001               65     1

Both videos overlap with Q_207 for almost the same duration and they both cover almost the whole query region. Thus, the R_TA and R_D scores for both videos are very close. However, R_SA for V_108 is much higher than for V_46. To investigate the difference, we built the overlap histograms OH_46 and OH_108 and extracted the cells that overlap with query Q_207. Color highlighted visualizations of O^G(V^G_46, Q^G) and O^G(V^G_108, Q^G) are shown in Figure 8. Figure 8(b)'s higher color intensity in the center shows that V_108 covers the middle part of Q_207 much more intensively. Even though the results among the ranking methods vary somewhat, at this point we do not favor any specific approach. We believe that each ranking scheme emphasizes a different aspect of relevance; therefore, query results should be customized based on user preferences and application requirements.

[Figure 8: Color highlighted visualizations of the overlap histograms for videos (a) V_46 and (b) V_108]

5.2.2 Comparison with User Feedback

This set of experiments aims to evaluate the accuracy of our ranking methods by comparing the algorithmic results with user provided relevance feedback. Our main intention in performing the user study is to check whether the results of the proposed ranking metrics make sense to people. Our methodology therefore does not live up to the rigorous process usually attributed to scientific user studies. Relevance judgements were made by a student familiar with the region where the videos were captured. We selected a subset of 25 query regions from the total of 250 queries, specifically those that identified relatively prominent geographical features. Although the queries were given to the user as latitude/longitude coordinates, she had a good idea of what visual content to expect because she was very familiar with the query regions. She also used other visual tools, such as Google Maps Street View (http://maps.google.com/help/maps/streetview), to familiarize herself with the regions that the queries cover. The selected 25 queries returned a total set of 103 different videos; each query returned 14 videos on average.
The user manually analyzed all 103 videos in random order and evaluated the relevance of these videos for each of the 25 queries. The user was asked to rate the relevance on a four-point scale: "3 - highly relevant", "2 - relevant", "1 - somewhat relevant" and "0 - irrelevant". Trajectories of the camera movements were displayed on a map for these 103 videos to aid the user in the evaluation. Finally, the user created a rank list for every query.

[Figure 9: Discounted Cumulated Gain (DCG) curves (USER, R_TA, R_SA and R_D) as a function of rank]

We compared the rankings RL_TA, RL_SA and RL_D to the user rankings using the DCG and NDCG metrics for the 25 selected queries. The averages of the DCG vectors from those 25 queries were used for the comparisons. Figure 9 shows the DCG curves for the rank lists RL_TA, RL_SA, RL_D and the USER curve for ranks 1 through 16. The USER curve corresponds to the DCG vector based on the user rankings. Clearly, the DCG curves for the proposed schemes closely match the USER DCG evaluation. Next, the NDCG scores with respect to the user results were calculated. The NDCG scores of RL_TA, RL_SA and RL_D were 0.975, 0.951 and 0.921, respectively. All scores are close to 1, which implies that all three are highly successful in ranking the most relevant videos at the top, similar to human judgement. We observed that rank differences mostly occurred in the ratings of less relevant videos. Recall that DCG and NDCG reward relevant videos in the top ranked results more heavily than those closer to the bottom. The high NDCG scores further lend credibility to the claim that the proposed ranking methods successfully identify the most relevant videos.

Among the proposed schemes, the highest precision was consistently obtained by RL_TA at all levels. We conjecture that this is the case because human perception of relevance is related more to how clearly one can see an object (i.e., spatial perception) than to how long one sees it (i.e., temporal perception). Note that all three ranking schemes describe different properties of a video clip, and result ranking is complicated by the fact that importance is subjective to users and may be application dependent as well. However, within our methodology, Figure 9 clearly shows that RL_TA overall achieves the best accuracy among the three with respect to the user ranking.

We are aware that user judgement can be subjective and may be prone to errors. The methodology used might not qualify under all the requirements of an intensive user study. Therefore, the results only provide an indication of what one might expect. More conclusive results can be obtained by performing an intensive user study with a far higher number of human judges, videos and queries. Such an extensive study is out of the scope of this paper and will be part of our future work.

5.3 Evaluating the Computational Performance

This section evaluates the computational cost of the proposed schemes. In Algorithm 2, four steps account for most of the computational cost: 1) loading the georeferenced data, i.e., loading the FOVScene descriptions from storage; 2) filtering to exclude videos with no overlap; 3) calculating the area of overlap between the FOVScenes (the results of the filter step) and the query regions; and 4) computing the relevance scores for the three rank lists. For R_TA and R_SA, step 3 consumes approximately 60% of the execution time.
In step 4, R_SA just needs to sum the areas of the overlapping polygons, whereas R_TA needs to compute the extent of all overlapping polygons. Therefore, R_TA takes longer to construct the rank list. R_D is computationally the most efficient since it only extracts the time overlap and skips the overlap area calculation.

Using a 2.33 GHz Intel Core2 Duo computer, we measured the processing time of each scheme for evaluating the same 250 queries as in Section 5.2.1. The test data included 134 videos with a total duration of 175 minutes. An F was recorded for every second of video, for a total of 10,500 F representations used in the calculations. The detailed processing time measurements for running the major steps of the ranking schemes are summarized in Table 5.

Table 5: Measured computational time per query

                                           Calculating RL_TA      Calculating RL_SA      Calculating RL_D
Step                                       Avg #V     Avg time    Avg #V     Avg time    Avg #V     Avg time
                                           processed  (secs)      processed  (secs)      processed  (secs)
1. Load FOVScene descriptions from disk    134        0.523       134        0.523       134        0.523
2. Filter step                             134        0.016       134        0.016       134        0.016
3. Calculate the area of overlap polygons  8.46       1.176       8.46       1.176       8.46       0.527
4. Calculate the relevance scores          8.46       0.367       8.46       0.097       8.46       0.100
Total time (secs)                                     2.082                  1.812                  1.166

Steps 1 and 2 were required for all queries and all 134 videos were processed per query. It is important to note that during query processing the vast majority of the videos were filtered out in the filter step. As shown in Table 5, for each query on average only 8.46 out of the 134 videos were actually processed in steps 3 and 4; all other videos were excluded by the filter step. The details of the filtering are explained in Section 4.3. For a particular query, the average processing time required to construct RL_TA, RL_SA and RL_D was approximately 2.082, 1.812 and 1.166 seconds, respectively. The query execution time depends on the number of FOVScenes to be processed, which also varies by query. Thus, we next examine how the processing time changes as the number of videos increases. Figure 10 shows that the processing time grows linearly as a function of the number of videos for the three rankings, i.e., as a function of the number of Fs. The small fluctuations were caused by differences in the number of Fs per video (i.e., $|V^F_k|$) and by variations in the number of overlapping Fs processed after the filter step for a specific query. Consequently, we can compute the average time to process a single F per query as follows. With a processing time of 2.082 seconds per query over 134 videos (a total of 175 minutes, or 10,500 Fs) for RL_TA, the average execution time per F per query was 0.198 milliseconds. Similarly, it was 0.172 and 0.110 milliseconds for RL_SA and RL_D, respectively. These numbers can be used to obtain an estimate of the query processing time for a larger data set. For example, when the size of the query range stays the same but the number of FOVScenes increases to 100,000, we can estimate the average query processing time for RL_TA as 0.198 msec × 100,000 ≈ 19.8 seconds. (Note that there is no direct relation between the number of Fs and the length of a video, because F can be sampled at various intervals.)
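The extrapolation at the end of this section can be written out as a few lines of arithmetic (the numbers are the measurements reported above; the estimate assumes that the per-F cost and the effectiveness of the filter step remain roughly the same at the larger scale).

# Measured for RL_TA: about 2.082 s per query over 10,500 FOVScenes.
per_f_ms = 2.082 * 1000.0 / 10500                 # ~0.198 ms per F per query
estimated_query_s = per_f_ms * 100000 / 1000.0    # ~19.8 s per query for 100,000 Fs
print(round(per_f_ms, 3), round(estimated_query_s, 1))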
[Figure 10: Processing time per query vs. the number of videos, for R_TA, R_SA and R_D]

In this study we focus on the ranking algorithms and do not aim to provide efficient methods for the retrieval and indexing of FOVScene descriptions from storage. The methods presented so far are not optimized for computational efficiency. Hence, it is worth mentioning that calculating the area of overlap between a pie-shaped FOVScene and a polygon-shaped query is computationally expensive and might not be practical for real-time applications. To address this challenge we introduced a histogram based ranking approach in Section 4.4.2 which moves most of the costly computation overhead to an offline preprocessing step, resulting in simpler and faster query processing. Next, we present our findings on the accuracy and efficiency of histogram based ranking.

5.4 Ranking based on Histogram

We built the overlap histograms (OH_1 through OH_134) for all 134 videos as described in Section 4.4. The same 250 queries were processed using these histograms and the relevance scores were calculated for the returned videos based on the metrics proposed in Section 4.4.2. Let RL^G_TA, RL^G_SA and RL^G_D be the rankings obtained from the relevance metrics R^G_TA, R^G_SA and R^G_D, respectively.

First, in order to evaluate the accuracy of RL^G_TA, RL^G_SA and RL^G_D, we compared them to the precise rankings RL_TA, RL_SA and RL_D and investigated the precision for various cell sizes. Recall that we use the exact area of the polygon overlap when calculating the relevance scores for the precise rankings, whereas the histogram approximates the overlap region with grid cells. Therefore, we use the rank lists RL_TA, RL_SA and RL_D as the baselines for comparison. Figure 11 shows the results using the MAP metric for grid cell sizes varying from 25m by 25m to 200m by 200m. Note that the size of the query range was 300m by 300m. MAP results were averaged across all queries in the test. The results show that the precision of all three histogram based rankings decreases linearly as the cell size increases. This is expected, as a larger cell size means a coarser representation of the overlapping area. However, the degradation of the precision was not significant (especially considering the performance gain explained later) when the cell size is small and N is large. For example, for cell sizes smaller than 100m by 100m and N greater than two, the MAP score is greater than 0.9 in Figure 11(b).

Table 6: Rank order comparison between RL^G_TA, RL^G_SA, RL^G_D and RL_TA, RL_SA, RL_D. The average rank-order difference is computed as (1/|Q|) Σ_{q=1}^{|Q|} (1/|RL(q)|) Σ_{i=1}^{|RL(q)|} ROdiff(q,i), where ROdiff(q,i) = |RO(RL(q,i)) − RO(RL^G(q,i))|, RO denotes the rank order, |Q| is the number of queries, and RL/RL^G stand for the corresponding precise and histogram based rank lists.

            Cell size        Cell size        Cell size        Cell size
            25m×25m          50m×50m          100m×100m        200m×200m
RL^G_TA     0.2243           0.3191           0.7508           1.1973
RL^G_SA     0.2345           0.4069           0.7655           1.1116
RL^G_D      0.4378           0.7822           1.0366           1.5415
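The rank-order comparison reported in Table 6 can be computed with a sketch like the following (our formulation; it averages over the videos common to both lists and, as noted in the discussion below, ignores videos that appear only in the histogram based list).

def mean_rank_order_difference(precise_list, histogram_list):
    # Average |rank in precise list - rank in histogram list| over the videos
    # that appear in both lists (1-based ranks); histogram-only videos are ignored.
    precise_rank = {v: i + 1 for i, v in enumerate(precise_list)}
    hist_rank = {v: i + 1 for i, v in enumerate(histogram_list)}
    common = [v for v in precise_list if v in hist_rank]
    if not common:
        return 0.0
    return sum(abs(precise_rank[v] - hist_rank[v]) for v in common) / len(common)

# Example: video 46 moves from rank 1 to rank 2, video 108 from rank 2 to rank 1.
print(mean_rank_order_difference([46, 108, 43], [108, 46, 43]))  # 0.666...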
[Figure 11: MAP at N for (a) RL^G_TA, (b) RL^G_SA and (c) RL^G_D for varying cell sizes]

Recall that the precision only represents the fraction of common videos and ignores the differences between the actual rank orders. Therefore, we also compared the order of the videos between the histogram based rankings and the precise rankings. For each query, we obtained the videos that appear in both RL^G_TA and RL_TA and computed the mean absolute difference between their rank orders. The results were further averaged over all 250 queries. We repeated the same for the rank lists RL^G_SA versus RL_SA and RL^G_D versus RL_D. Table 6 reports the results for different cell sizes. Note that false positives that might appear in the histogram based ranking were not included in the rank order comparison. In Table 6, when the grid cell size is small, the mean order difference was as low as 0.2. Even for large cell sizes the mean order difference was around 1.2, which implies that on average each video was displaced by only ±1 position in the rank list.

Next, we measured the query processing time of the histogram based rankings and compared it with that of the precise rankings. Figure 12 illustrates the processing times per query with respect to the number of videos for both R_TA and R^G_TA when the cell size was 50m×50m. Clearly, the histogram based ranking is vastly superior to R_TA in this respect. This is because most of the costly overlap computations are performed while the OH is being built, as a pre-processing step (e.g., when the video is first loaded into the system). The OHs of all videos are constructed just once and all queries can share them. The result is a short online query processing time. For example, the average query processing time was just 5% of that of R_TA, as shown in Figure 12. Similar results were obtained for the other ranking schemes.

As we have shown, the accuracy of the histogram based ranking is highly dependent on the cell size. The smaller the grid cell size, the better an estimation a histogram OH achieves. However, the time to build the OH increases as the cell size shrinks. We therefore investigated the tradeoff between the precision of the rankings and the computational cost of building the OHs while varying the cell size. We calculated the MAP scores at N = 10 for R^G_TA and recorded the processing times to build the OHs in seconds. Figure 13 shows the change in both the precision and the average processing time to build an OH for different cell sizes. As the cell size increases, the precision decreases linearly while the CPU time decreases exponentially. When the cell size exceeds 75m×75m, the CPU time decreases only gradually, while the precision continues to drop steadily. Thus, we conclude that in our experiments a cell size between 50m×50m and 75m×75m provides a good tradeoff between the accuracy and the build overhead of the histograms.
[Figure 12: Comparison of precise and histogram based query processing (cumulative processing time per query vs. the number of videos)]

[Figure 13: Evaluation of computation time and precision as a function of the grid cell size (MAP at N=10 for R^G_TA and the average processing time to build an OH)]

In our experiments we used a simple data structure to store the histogram data, which is not optimized for searching large video collections. In a practical scenario, where hundreds of concurrent users access the search engine, large amounts of video data will be added to or deleted from the system. There is a need for a storage and index structure that can efficiently handle updates to the histogram while ensuring a short query processing time. Histogram based indexing techniques are well studied in database research [14]. The performance of our histogram based ranking approach can be further improved by adopting an existing indexing technique.

One important and unique advantage of the histogram approach is that it describes both the extent and the density of the overlap between the video FOVScenes and the query region. By analyzing the overlap distribution in a histogram, it is possible for users to further understand the results. Such information can also be quite useful in interactive video search, where the overlap density throughout the query region can be used to guide the user in drilling down to more specific queries. For example, a visualization of the histogram data similar to Figure 8 can be provided to the user for the top ranked videos so that the user can interactively customize the query and easily access the information he or she is looking for. We plan to elaborate on the histogram data analysis as part of our future work.

6 Conclusions and Future Work

In this study we investigated the challenging and important problem of ranking video search results based on the videos' spatial and temporal properties. We built on our prior work on georeferenced video search [5] and introduced three ranking algorithms that consider the spatial, temporal and combined spatio-temporal properties of georeferenced video clips. Our experimental results show that the rankings from the proposed R_TA, R_D and R_SA methods are very similar to results based on user feedback. This demonstrates the practical usability of the proposed relevance schemes. One drawback is that their demanding runtime computations may be a concern in large-scale applications. To improve the efficiency of our approach, we therefore also proposed a histogram based approach that dramatically improves the query response time by moving costly computations to a one-time pre-processing step. The obtained results illustrate how the use of histograms provides high-quality ranking results while requiring only a fraction of the execution time.

There are several future directions through which we plan to further enhance the video search capabilities. (i) In our study we only demonstrated examples of simple spatial range queries. However, the proposed viewable scene model, which combines the camera view direction and the camera location, provides a rich information base for running more sophisticated query types. For example, a user may ask to view the query region from a specific view angle.
We plan to enhance our query mechanism to support such user customized queries as well as other spatio-temporal query types, such as kNN and trajectory queries.

(ii) We plan to explore the best approaches for incorporating our system into a standard web-based video search engine. To enable video search on a larger scale, a standard format for georeferenced video annotations must be established. We plan to elaborate on this issue and look for possible ways to facilitate the integration of other providers' data.

(iii) Several additional factors influence the effective viewable scene in a video, such as occlusions, visibility depth, and resolution. The proposed viewable scene model has to be extended and improved to account for these factors. Occlusions have been well studied in computer vision research. We plan to investigate the addition of an existing occlusion determination algorithm into our model.

References

[1] Camera Calibration Toolbox for Matlab. http://www.vision.caltech.edu/bouguetj/calib_doc/.
[2] Flickr. http://www.flickr.com.
[3] Geobloggers. http://www.geobloggers.com.
[4] Woophy. http://www.woophy.com.
[5] Sakire Arslan Ay, Roger Zimmermann, and Seon Ho Kim. Viewable Scene Modeling for Geospatial Video Search. In MM '08: Proceedings of the 16th ACM International Conference on Multimedia, pages 309–318, New York, NY, USA, 2008. ACM.
[6] Kate Beard and Vyjayanti Sharma. Multidimensional Ranking for Data in Digital Spatial Libraries. International Journal on Digital Libraries, 1(2):153–160, 1997.
[7] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z. Wang. Image Retrieval: Ideas, Influences, and Trends of the New Age. ACM Computing Surveys, 40(2):1–60, 2008.
[8] Boris Epshtein, Eyal Ofek, Yonatan Wexler, and Pusheng Zhang. Hierarchical Photo Organization Using Geo-Relevance. In 15th ACM Intl. Symposium on Advances in Geographic Information Systems (GIS), pages 1–7, 2007.
[9] Shantanu Gautam, Gabi Sarkis, Edwin Tjandranegara, Evan Zelkowitz, Yung-Hsiang Lu, and Edward J. Delp. Multimedia for Mobile Environment: Image Enhanced Navigation. Volume 6073, page 60730F. SPIE, 2006.
[10] Stefan Gobel and Peter Klein. Ranking Mechanisms in Meta-data Information Systems for Geo-spatial Data. In EOGEO Technical Workshop, 2002.
[11] Clarence H. Graham, Neil R. Bartlett, John Lott Brown, Yun Hsia, Conrad C. Mueller, and Lorrin A. Riggs. Vision and Visual Perception. John Wiley & Sons, Inc., 1965.
[12] Eugene Hecht. Optics. Addison-Wesley Publishing Company, 4th edition, August 2001.
[13] Tae-Hyun Hwang, Kyoung-Ho Choi, In-Hak Joo, and Jong-Hun Lee. MPEG-7 Metadata for Video-Based GIS Applications. In Geoscience and Remote Sensing Symposium, vol. 6, pages 3641–3643, 2003.
[14] Yannis Ioannidis. The History of Histograms (abridged). In Proceedings of the VLDB Conference, 2003.
[15] Kalervo Järvelin and Jaana Kekäläinen. Cumulated Gain-based Evaluation of IR Techniques. ACM Transactions on Information Systems, 20(4):422–446, 2002.
[16] Rieko Kadobayashi and Katsumi Tanaka. 3D Viewpoint-Based Photo Search and Information Browsing. In 28th Intl. ACM SIGIR Conference on Research and Development in Information Retrieval, pages 621–622, 2005.
[17] Lyndon S. Kennedy and Mor Naaman. Generating Diverse and Representative Image Search Results for Landmarks. In WWW '08: Proceedings of the 17th International Conference on the World Wide Web, pages 297–306, New York, NY, USA, 2008. ACM.
[18] Kyong-Ho Kim, Sung-Soo Kim, Sung-Ho Lee, Jong-Hyun Park, and Jong-Hyun Lee. The Interactive Geographic Video.
In Geoscience and Remote Sensing Symposium, vol. 1, pages 59–61, 2003.
[19] Ray R. Larson and Patricia Frontiera. Geographic Information Retrieval (GIR) Ranking Methods for Digital Libraries. In JCDL '04: Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries, page 415, New York, NY, USA, 2004. ACM.
[20] Michael S. Lew, Nicu Sebe, Chabane Djeraba, and Ramesh Jain. Content-based Multimedia Information Retrieval: State of the Art and Challenges. ACM Transactions on Multimedia Computing, Communications, and Applications, 2(1):1–19, 2006.
[21] Xiaotao Liu, Mark Corner, and Prashant Shenoy. SEVA: Sensor-Enhanced Video Annotation. In 13th ACM Intl. Conference on Multimedia, pages 618–627, 2005.
[22] Mor Naaman, Yee Jiun Song, Andreas Paepcke, and Hector Garcia-Molina. Automatic Organization for Digital Photographs with Geographic Coordinates. In 4th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 53–62, 2004.
[23] A. Pigeau and M. Gelgon. Building and Tracking Hierarchical Geographical & Temporal Partitions for Image Collection Management on Mobile Devices. In 13th ACM Intl. Conference on Multimedia, 2005.
[24] Kerry Rodden and Kenneth R. Wood. How Do People Manage Their Digital Photographs? In SIGCHI Conference on Human Factors in Computing Systems, pages 409–416, 2003.
[25] Andrew Shapira. Fast Line-Edge Intersections on a Uniform Grid. Pages 29–36, 1990.
[26] Ian Simon and Steven M. Seitz. Scene Segmentation Using the Wisdom of Crowds. In Proc. ECCV, 2008.
[27] Alan F. Smeaton, Paul Over, and Wessel Kraaij. Evaluation Campaigns and TRECVid. In MIR '06: Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, pages 321–330, New York, NY, USA, 2006. ACM Press.
[28] Alan F. Smeaton, Paul Over, and Wessel Kraaij. High-Level Feature Detection from Video in TRECVid: A 5-Year Retrospective of Achievements. In Ajay Divakaran, editor, Multimedia Content Analysis, Theory and Applications, pages 151–174. Springer Verlag, Berlin, 2009.
[29] Cees G. M. Snoek and Marcel Worring. Concept-Based Video Retrieval. Foundations and Trends in Information Retrieval, 2(4):215–322, 2009.
[30] Carlo Torniai, Steve Battle, and Steve Cayzer. Sharing, Discovering and Browsing Geotagged Pictures on the Web. Springer, 2006.
[31] Kentaro Toyama, Ron Logan, and Asta Roseway. Geographic Location Tags on Digital Images. In 11th ACM Intl. Conference on Multimedia, pages 156–166, 2003.
[32] Akira Yanagawa, Shih-Fu Chang, Lyndon Kennedy, and Winston Hsu. Columbia University's Baseline Detectors for 374 LSCOM Semantic Visual Concepts. Technical Report 222-2006-8, Columbia University ADVENT, March 2007.