EFFICIENT MANAGEMENT TECHNIQUES FOR LARGE VIDEO COLLECTIONS

by

Ping-Hao Wu

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2010

Copyright 2010 Ping-Hao Wu

Dedication

This dissertation is dedicated to my mom, my dad, and my sister. Thank you for all of your love and support.

Table of Contents

Dedication
List Of Tables
List Of Figures
Abstract

Chapter 1 Introduction
  1.1 Significance of the Research
  1.2 Review of Previous Work
  1.3 Contribution of the Research
  1.4 Organization of the Dissertation

Chapter 2 Background Review
  2.1 Definition of Duplicate Video
  2.2 Duplicate Video Detection
  2.3 Sequence Alignment and Suffix Array
    2.3.1 Notations
    2.3.2 Suffix Tree
    2.3.3 Suffix Array
    2.3.4 Enhanced Suffix Array
    2.3.5 Finding Maximal Unique Matches
  2.4 Video Classification

Chapter 3 Duplicate Video Detection Based on Camera Transitional Behaviors
  3.1 Introduction
  3.2 Anchor Frame Detection
    3.2.1 Video Temporal Structure
    3.2.2 Shot Boundary Detection
    3.2.3 Camera Motion Detection
  3.3 Signature Matching Using Suffix Array
    3.3.1 Candidate Pruning
    3.3.2 Signature Generation
    3.3.3 Signature Matching
  3.4 Experimental Results and Discussion
    3.4.1 Duplicate Video Detection using Shot Boundaries
    3.4.2 Duplicate Video Detection using Panning/Tilting
    3.4.3 Duplicate Video Detection on Varying Frame Rates

Chapter 4 Video Genre Inference Based on Camera Capturing Models
  4.1 Introduction
  4.2 Camera Capturing Model
    4.2.1 Filming Basics
    4.2.2 Camera Number
    4.2.3 Camera Distance
  4.3 Preliminary Experiment Results
    4.3.1 Camera Distance
    4.3.2 Camera Number
  4.4 Professional versus Amateur Video
    4.4.1 Comparison of Professional and Amateur Video
    4.4.2 Relevant Features
  4.5 Extraction and Inference of Relevant Features
    4.5.1 Activity Segment
    4.5.2 Camera Shakiness
    4.5.3 Sharpness of Frames
    4.5.4 Color Variance
  4.6 Experimental Results
    4.6.1 Experimental Setup
    4.6.2 Classification Accuracy and Discussion
    4.6.3 ROC Performance and Discussion

Chapter 5 Conclusion and Future Work
  5.1 Summary of the Research
  5.2 Future Research Directions

Bibliography

List Of Tables

2.1 Suffix tree applications and their traversal types.
2.2 The enhanced suffix array of string S = banana.
3.1 25 query video clips and their modifications.
3.2 Results of matching percentages with the query video and its ground truth.
3.3 The Time Needed for Each Different Task.
3.4 Ground truth and attack applied for each query.
3.5 Match percentages of queries with their ground truths.
3.6 Queries for crawling video from YouTube.
4.1 Classification results with different classifiers.
4.2 Detailed classification results.

List Of Figures

2.1 Examples of duplicate video copies.
2.2 The suffix tree for S = banana.
2.3 The lcp-interval tree for string S = banana, where dashed-line boxes are not parts of the tree.
3.1 The temporal structure of video.
3.2 Anchor frames and the proposed signature.
3.3 Comparison of the horizontal camera motion components of (a) the original video and (b) the query video attacked with camcording and subtitle insertion.
3.4 An example of filtered horizontal motion and its first-order derivative.
3.5 Histograms of T(t) of (a) the original video, (b) the same video but attacked by strong re-encoding, and (c) an unrelated video.
3.6 An example of matching between two sequences, where red circles indicate the maximal unique matches.
3.7 Query video clips and the corresponding ground truths: (a) Query1, (b) Query3, (c) Query5, (d) Query6, (e) Query9, (f) Query10, (g) Query11, (h) Query13, (i) Query14, and (j) Query15.
3.8 An example of candidate pruning given (a) Query5 and (b) Query19 as the query video.
3.9 Comparison of the mean of the standard deviation for all query video clips.
3.10 Query video clips (left) and their ground truths (right) attacked by (a) camcording and subtitles and (b) camcording.
3.11 Sample frames of duplicate video clips of Query1.
3.12 The precision-recall curves of Query 1 (denoted by crosses) and Query 2 (denoted by circles).
3.13 The precision-recall plots for (a) Query1 and (b) Query2.
4.1 The process of determining new camera frames.
4.2 Different camera distances: (a) extreme long shot, (b) long shot, (c) medium long shot, (d) medium shot, (e) medium close-up, (f) close-up, and (g) extreme close-up.
4.3 Video genres considered in this experiment.
4.4 Several foreground maps.
4.5 The camera distance histogram for each genre.
4.6 Shots of drama03.mp4.
4.7 Shots of a short clip in a drama movie. Frames circled with boxes correspond to the identified cameras.
4.8 Number of shots and number of cameras for each genre.
4.9 Examples of amateur video clips.
4.10 The curves of original and smoothed camera motion in the vertical direction.
4.11 Two ROC curves of amateur video classification, where the dashed and the solid lines correspond to the C4.5 decision tree only and the C4.5 decision tree with the AdaBoost algorithm, respectively.

Abstract

In this research, we focus on two techniques related to the management of large video collections: video copy detection and automatic video classification. After the introductory chapter and a brief review in Chapter 2, our main research results are presented in Chapters 3 and 4.

In Chapter 3, a fast duplicate video detection system based on the camera transitional behavior and the suffix array data structure is proposed. The proposed system matches video clips according to their temporal structures, which are represented by a set of frames corresponding to unique events, called anchor frames. Noticing the natural association between the camera operation and the resulting video, we use the camera transitional behavior to indicate the unique events. Specifically, shot boundaries and the begin and end points of camera panning and tilting movements are detected as anchor frames. The length between adjacent anchor frames is computed to form a one-dimensional sequence, called the gap sequence, which serves as the signature of the video. An efficient gap sequence matching algorithm based on the suffix array data structure is adopted to match two given video signatures, which can achieve linear-time processing. A candidate pruning stage is also proposed to reduce the computation as much as possible. Specifically, video clips that are very unlikely to be duplicates of the input query video are eliminated in this stage before the signature matching is performed.
Experimental results show that the proposed framework is not only efficient (in terms of computational speed) but also effective (in terms of high accuracy) in identifying duplicate video pairs.

In Chapter 4, two novel features that take the shooting process into consideration are first proposed for video genre classification: the number of cameras used in a short time interval, and the distance of the camera to the shooting subject. Preliminary experimental results show that the proposed features capture additional genre-related information, and some conclusions about the genre can be inferred from the proposed features to a certain degree. Then the properties of amateur and professional video clips are observed and analyzed. Although a large amount of work has been proposed by considering cinematic principles, most extracted features are low-level features without much semantic information. In the proposed scheme, features are designed to take the camera operation and the nature of amateur video clips into account. These features address various differences in video quality and editing effects. They are tested on video clips collected from an Internet video sharing website, with several classifiers. Experimental results on this test video data set demonstrate that the camera usage can be inferred from the proposed features and, thus, reliable separation of professional and amateur video contents can be achieved. Concluding remarks and future extensions are given in Chapter 5.

Chapter 1
Introduction

1.1 Significance of the Research

Digital representations have been widely adopted for multimedia contents. This is attributed to the advances in broadband networks, VLSI, and multimedia compression technologies. Broadband networks have provided higher and higher bandwidth in recent years, which makes broadband services such as video-on-demand a reality; VLSI technologies provide more and more computing resources to accomplish complicated jobs; and the rapid development of image/video compression techniques facilitates efficient exchange and storage of multimedia contents. With these technologies, the creation of large online video repositories for easy access becomes feasible, and the size of digital video collections has increased rapidly in recent years. An enormous amount of video is generated and distributed by users every day. Efficient management of a large video database becomes an important issue.

One of the main issues arising from large online video collections is copyright protection. As there is no central management on the Internet, it is inevitable to have duplicates. Various easily accessible video editing tools make the situation even worse. In contrast to the digital watermarking technology, where identification of duplicates is based on previously inserted hidden information, a content-based duplicate video detection system attempts to identify a video clip based on its own content. One advantage of content-based detection is that video clips already on the Internet can be checked to determine whether they are duplicates of other existing ones. In addition, identifying all near-duplicates can also be beneficial to applications such as video search engines, where search results can be clustered so that users may save browsing time.

The other issue is related to efficient access. Nowadays, the online video collection is getting larger and larger. Video files can be generated by anyone using a hand-held camcorder or a camera-enabled mobile phone.
The professional and the user-generated video contents make our video repositories extremely large, so that it is infeasible for humans to handle them manually. It also becomes harder and harder for users to go through the whole video collection to find the video of their interest. To allow efficient browsing, search and retrieval, one intuitive solution is to cluster video clips according to their genres automatically. Then, the choices for users can be narrowed down. Besides online video repositories, other applications include managing television broadcasting archives, video conferencing records, etc.

There are several technical challenges for content-based video copy detection and video classification systems, which are described below.

- For video copy detection, the extracted signature must be discriminative yet robust enough to the different attacks that video might undergo, whether the attacks are unintentional changes made during distribution or intentional attempts to fool the detection system.

- For video classification, the extracted signature should reflect the ways in which the director wants to present the scenes. Specifically, since a different camera operation may lead to different effects and user perception, features related to the shooting process should be considered.

- Since the size of video data is typically much larger than that of image and audio data, the extracted signatures could take a lot of storage space. They should be reduced as much as possible to facilitate efficient storage. In other words, a compact representation is preferred.

- The amount of video in the database (for example, the database of movies in the past several years) is huge. The matching between signatures of two video clips should be performed efficiently. Otherwise, the search for duplicate video, which involves comparing the candidate video with all video clips in the database, would be computationally expensive.

1.2 Review of Previous Work

Automatic management of a large video collection is an active and challenging research area. Its significance is highlighted by the content-based video retrieval evaluation workshop TRECVID, which has been held every year since 2001 [85]. TRECVID provides large data sets and common tasks so that researchers can compare their systems and algorithms under the same conditions. It is devoted to research in automatic segmentation, indexing, and content-based retrieval of digital video. In this section, previous works on video copy detection and video genre classification are briefly reviewed. More details on previous works are given in Chapter 2.

Generally speaking, a video copy detection system consists of two stages: signature (or feature) extraction and signature matching. First, a unique and compact representation of a video clip, called the signature, is extracted. Then, a distance measure between two signatures is defined. If the distance is small enough, the two video clips are said to be duplicates. Numerous algorithms using different signatures and matching algorithms have been proposed. Hampapur et al. [31] compared several features including motion, the color histogram, and the ordinal feature. Yang et al. [98] presented a survey on several different video identification algorithms. Law-To et al. [54] also evaluated and compared several algorithms. Depending on how features are extracted, there are two categories: 1) global signatures and 2) local signatures.
Global signatures are based on the global information of the whole frame, such as the color information [2], [15], [70], the ordinal measure [46], [65], shot boundaries [38], [107], or motion information [31], [46]. Global signatures are in general more compact than local signatures. However, sometimes changes in a local region are not reflected in global information such as the color histogram. Local signatures are designed to capture such changes more reliably. The simplest way is to divide a video frame into several blocks (say, M×N) and compute features for each block, including color, motion, etc. Recently, research has been conducted using signatures based on several points of interest in video frames. As the name suggests, points of interest tend to carry a higher amount of information than other points in the frame. Examples of point-of-interest detectors include the Harris corner detector [34] and SIFT [63]. The similarity between video frames is defined as the distance between the associated feature vectors. To reduce the number of comparisons, sometimes only a subset of frames is selected according to a set of seed vectors [15], [16], [61]. As to search algorithms, several different approaches exist, e.g., the use of indexing structures [38], the sliding window approach [31], [70], or dynamic programming [2], [74], [107].

Automatic video genre classification is a grand challenging problem in the multimedia information retrieval field. A large number of researchers have been working on it for more than two decades, and many different approaches have been proposed [11]. The task involves feature extraction and classification, and researchers have classified genres based on different information modalities (features) present in the video. The visual modality has been used quite commonly in video genre classification. The visual features used in video classification can be roughly categorized into shot-based, color-based, and motion-based features. Shot-based features include shot durations [39], [40], percentages of certain shot transitions [89], etc. Color-based features [23], [24], [35], [97] include the color histogram, average saturation, color variance, etc. Motion-based features [25], [93] include the distribution of camera motion, frame difference, etc.

Other modalities, such as the audio and text modalities, have also been used. A few researchers use cinematic principles or concepts from film theory as additional features for video classification. For example, horror movies are often dark while comedies are usually brighter. Action movies tend to have a shorter shot duration than character movies [91]. Different types of sports have different distributions of camera movements [101]. Some researchers classify an entire video clip into a particular genre while others focus on classifying segments of video (e.g., identifying violent scenes in a movie). Some classify a video file into one of several broad categories such as the movie genre, while others classify a video file into a sub-category such as a specific type of sports video. As to the classification method, it could be a decision tree (e.g., the C4.5 decision tree [75]), the Bayesian approach, the support vector machine (SVM), neural networks, hidden Markov models, Gaussian mixture models, etc.

1.3 Contribution of the Research

Several contributions are made in this research. They are summarized below.

Duplicate video detection technique

Existing video copy detection systems all have some weaknesses.
The features used in most algorithms are frame-based, and direct comparison of feature vectors is required, which is not only sensitive to various attacks but also computationally expensive. In this research, we propose a system that addresses these weaknesses. Specific contributions include the following.

1. Existing systems are all feature-based, while the proposed detection system is syntax-based. Instead of extracting features from frames and comparing them directly, we propose in Chapter 3 a video signature based on the temporal structure, which is very compact. In addition, since the proposed signature does not contain frame characteristics, it is insensitive to various attacks applied to the source video.

2. To extract the temporal structure of a video clip, we propose to utilize the camera transitional behaviors, as they are inherently linked with how the different events in the video are organized. The behaviors that we detect in this work include the shot boundaries and the positions where camera panning and tilting occur.

3. The proposed system performs the matching between two video signatures extremely fast. The matching can be done in O(n) time, where n is the sum of the lengths of the two signatures under consideration. Note that the length of the signature is significantly shorter than that of the video. The proposed system can thus find duplicate video in a large video database in a very short time.

Automatic video classification technique

For automatic video classification, although many existing approaches incorporate the cinematic principles of the film industry, the features extracted are still at a low level and cannot reflect the shooting process well. In Chapter 4, features reflecting different camera usage are proposed.

1. Instead of counting the average number of shots, the number of cameras in a certain short time window is estimated. Consider a typical dialog scene in a drama movie with two persons in conversation. Usually there would be two cameras, each taking care of one person, and the director usually alternates between these two cameras, cutting to the one shooting the person that is talking. Therefore, there might be tens of shot changes during this conversation, but actually there are only two cameras. In this case, the number of cameras conveys more information than the number of shots does.

2. The second feature for video classification involves the distance of the camera to the subject being shot. Different distances could lead to different effects. For example, a big close-up usually creates impact and points out the importance of the subject. Therefore, different genres should have different distributions of the camera distance. This distance can be approximated by the normalized area of foreground objects, which are usually the subjects. Since accurate foreground/background modeling is usually difficult, we propose to use the motion vector field to identify foreground objects, under the assumption that the foreground object is usually moving.

3. Besides features that reflect the camera usage, several visual features have been designed to capture the different characteristics of professional and amateur video contents, including camera shakiness, frame sharpness, activity segments, etc. It is demonstrated that with these features, professional and amateur video contents can be separated effectively.

1.4 Organization of the Dissertation

The rest of this dissertation is organized as follows.
The background of this research is reviewed in Chapter 2. It includes the definition of duplicate video clips and the review of existing techniques for content-based duplicate video detection and automatic video classification. The suffix tree, the suffix array, and their applications to string processing and sequence alignment are also examined in Chapter 2. A novel video copy detection system based on the camera transitional behaviors and the suffix array is proposed in Chapter 3. Two new features that take into account the different effects of different camera usage are proposed in Chapter 4 for the purpose of automatic video genre classification. Several other features that are designed to capture the different characteristics of professional and amateur video are also presented in Chapter 4. Finally, the conclusions of our current research and future research issues are described in Chapter 5.

Chapter 2
Background Review

In this chapter, background knowledge related to this work is reviewed. First, the definition and variations of a duplicate video are discussed in Sec. 2.1. Previous works on the task of duplicate video detection are reviewed in Sec. 2.2. A data structure that is essential to the proposed duplicate video detection algorithm, called the suffix array, as well as its predecessor, the suffix tree, are reviewed in Sec. 2.3. The sequence alignment algorithm that utilizes the suffix array is also reviewed. Finally, related works on video genre classification are reviewed in Sec. 2.4, including the different kinds of features and classifiers that have been used.

2.1 Definition of Duplicate Video

Two video programs need not be pixel-wise identical to be considered duplicate copies. A video program is viewed as a copy of the other as long as their contents look approximately the same to observers. Near-duplicate video programs have almost identical contents, yet possibly with some minor differences in various aspects. One of the major challenges for video copy detection systems is their ability to cope with all possible differences between duplicate copies, including quality degradation, format conversion, and minor editing of the original content. Some examples of duplicate video copies are given in Figure 2.1.

Figure 2.1: Examples of duplicate video copies.

The differences between duplicate video copies can be categorized as follows.

Format conversion: Video could be distributed in a variety of different formats. For example, there are several different television broadcast standards, including NTSC, PAL, and SECAM in analog TV broadcasting and ATSC and DVB in digital TV broadcasting. As for video distributed over the Internet, it can differ in the following aspects.
  - encoding format: MPEG-2, MPEG-4, QuickTime, Windows Media Video, H.264, etc.
  - frame resolution: 320×240, 352×288, 720×480, etc.
  - aspect ratio: 4:3, 16:9, etc.
  - frame rate: 24 fps, 25 fps, 29.97 fps, 30 fps, etc.
  - bit rate: 256 kbps, 512 kbps, 1.5 Mbps, 3 Mbps, etc.

Quality degradation: Due to the process of digitizing and encoding, several artifacts could be introduced, including changes in contrast, brightness and saturation. In addition, because of the lossy nature of existing video coding standards, coding artifacts such as blurring and blocking artifacts are unavoidable if the coding bit-rate is not sufficiently high.

Content editing: Before redistribution, video could undergo one or more editing operations, including the insertion of captions, logos, subtitles, etc.
In addition, since the aspect ratio of the theatrical version of a movie is different from that of its DVD version, frame cropping or padding of black pixels at the frame boundary is often used to compensate for the discrepancy in aspect ratio.

2.2 Duplicate Video Detection

The problem of video copy detection can sometimes be mistaken for that of video indexing and retrieval. Although video copy detection can inherit many techniques from indexing and retrieval systems, these two problems are fundamentally different. That is, image/video indexing and retrieval systems are designed for the identification of semantically similar content in the same visual category [5], [80], while semantic similarity does not imply duplication.

As compared with image processing, video processing inevitably requires much more computation time and storage space. To detect duplicate video copies rapidly, many existing algorithms use global signatures, which are in general more compact than local signatures, so that signature matching can be done more efficiently [15], [16], [31], [38], [107]. The most widely used signatures include color, motion and ordinal features. The matching process is based on a distance function between a pair of feature vectors (or values). To reduce the number of feature vector comparisons, usually only a subset of frames is selected, either according to a set of seed vectors [15], [16], [61] or from key frames of shots.

Lienhart et al. [58] detected TV commercials using the color coherence vector, which is computed for each frame. The sequence of vectors composes the signature of the video, and the edit distance between two signatures is computed to determine if they are duplicates. Since the color information is used directly, it is likely that this algorithm would not work well if there is some degradation in video quality. Adjeroh et al. [2] also used the color feature. The average color and the standard deviation of each subframe were computed, and the similarity was calculated using the edit distance. Naphade et al. [70] computed the YCC color histogram for each frame. Then, a sliding window was used to determine the similarity at each position between two video clips, and the positions where the similarity exceeds a threshold were marked as matches.

Instead of the color feature, the ordinal feature of each frame was computed in [65]. Then, to align two sequences of feature vectors, a sliding window was used. The distance was simply the average distance between feature vectors inside the sliding window. This method is less sensitive to degradation of video quality; however, the computational cost of comparing ordinal features is high. Ng et al. [71] first detected the shot boundaries, clustered the shots, and then built a tree to represent the structure of the video. The similarity between two shots (leaf nodes) was computed based on the color histogram of key frames and the shot length. When comparing two video clips, the trees were compared in a top-down approach, where the similarity of a node is the sum of the similarities of its child nodes. In [15], to reduce the number of feature vector comparisons, a fixed-size subset of frames was selected to represent the whole video clip, and quantized color information was used to form the feature vector of each frame. The selection of the representative frames was based on a seed vector. Thus, the temporal information was lost. The similarity of two video clips was determined by the percentage of similar frames, from the subset, shared by them.
The most significant limitation resulting from the lost temporal information is that this algorithm can only find duplicates of the whole video clip. In [84], each video was summarized into a small number of clusters, each of which contained similar frames and was represented by a hypersphere defined by its position, radius, and density. The percentage of similar frames between two video clips was estimated by multiplying the volume of intersection between the two hyperspheres by the density. The similarity between two video clips was then obtained from the estimated percentage of similar frames. This way, the time needed for similarity comparison is reduced. However, the temporal information is lost.

There are some other global features. The ordinal and temporal features were combined for sequence matching in [46]. In [31], the histogram of the motion vector direction was examined. Since the feature vector of a frame could lie in a high-dimensional space, which is detrimental when the size of the database grows, several efforts have aimed at providing more compact yet reliable signatures. Indyk et al. [38] used the temporal sequence of the occurrence of shot boundaries as the video signature. For example, if shot transitions happen at time instances {3.4, 10, 25.7, ...}, this sequence is used as the video signature. Zobel and Hoad [107] used the duration of shots, color histogram differences, and centroid movements of the brightest and the darkest pixels as their video signatures, which proved to be effective.

On the other hand, local signatures, which are extracted after segmenting video frames into several regions, can capture changes in part of an image more reliably. A popular approach is to divide a frame into M×N blocks and compute either a color, motion, or texture feature for each block. Recently, several researchers have paid attention to another type of local signature based on points of interest, or keypoints, that contain higher information content than other regions [42], [53], [73], [95]. Local signatures such as differential gray-value invariants are extracted around these keypoints. Since there are usually many keypoints, the storage required for their neighborhoods and the computation involved in the matching stage can be a heavy burden.

Techniques to search for the matched position between two video titles depend highly on the extracted signatures. They tend to demand a lot of computation since the video contents under comparison are not aligned perfectly. Various matching methods have been proposed, e.g., exploiting indexing structures [38] or hierarchical clustering algorithms in a graph [16]. However, these techniques are restricted to the problem of whole-video matching. That is, they cannot be used to find partially matched video. To achieve partial video matching, a straightforward sliding window approach is often used [31], [46], [65], [70], which is computationally expensive. Algorithms based on dynamic programming have also been used. Examples include the edit distance in [2], the time warping distance in [74] and [18], and the local alignment in [107]. Dynamic programming demands a complexity of O(mn) when comparing one sequence of length m with another of length n. As a result, the computational complexity grows quadratically with the sequence length, which is not suitable for longer video sequences such as movies or TV programs.
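To make the quadratic cost of such dynamic-programming matchers concrete, the sketch below computes a plain edit distance between two per-frame signature sequences. It is a generic illustration with hypothetical toy inputs, not the exact distance of [2] or the local alignment of [107].

```python
# Generic O(mn) edit-distance sketch between two signature sequences, to
# illustrate why dynamic-programming matching grows quadratically with the
# video length. Toy inputs; not the exact formulation of the cited works.
def edit_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # deletions
    for j in range(n + 1):
        d[0][j] = j                      # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # substitute / match
    return d[m][n]

# e.g. two quantized per-frame feature sequences (toy values)
print(edit_distance([3, 5, 5, 2, 7], [3, 5, 2, 7, 7]))   # -> 2
```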
2.3 Sequence Alignment and Suffix Array

The suffix array [64] is basically a lexicographically sorted list of all suffixes of the input string. The suffix array records only the starting position (called the suffix number) of each suffix. The suffix number array, along with the sorted suffix list, provides a compact representation of the original string.

An earlier form of the suffix array is the suffix tree [30], which is a powerful tool with a variety of applications in string processing and computational biology. Once a suffix tree is constructed, it can be used to efficiently solve numerous string processing problems. It is one of the most important data structures in string processing, especially if the sequences are very long. The advantage of the suffix array over the suffix tree is that the former is simpler and more space efficient. For several applications, the suffix array is a more desirable alternative to the suffix tree.

2.3.1 Notations

Let Σ be a finite ordered alphabet and S be a string of length |S| = n over Σ. Assume that there is a special symbol $ that does not appear anywhere else and is greater than any other symbol. For 0 ≤ i < n, S[i] denotes the character at position i in S. For 0 ≤ i ≤ j < n, S[i...j] denotes the substring of S starting with the character at position i and ending with the character at position j. Substrings of S are called words, and a word w is branching if there are different letters x, y such that wx and wy are words. S_i denotes the i-th nonempty suffix, which is S[i...n−1], the substring of S running from the character at position i to the end of the string.

2.3.2 Suffix Tree

The suffix tree for a string S is a tree whose edges are labeled with strings, with exactly n + 1 leaves numbered from 0 to n. Each internal node (except the root) has at least two children, and each edge corresponds to a nonempty substring of S. No two edges out of a node can have edge labels beginning with the same character. That is, all non-root internal nodes are branching words of string S. Each suffix of S corresponds to exactly one path from the tree's root to a leaf. More specifically, for a leaf node i, the concatenation of the edge labels along the path from the root to leaf node i is exactly the suffix S_i.

However, such a tree does not exist for all strings S. If a suffix of S matches the prefix of another suffix of S, then no suffix tree can be constructed following the above properties, since that suffix would not end at a leaf. Therefore, to avoid this problem, a special terminator symbol $ is appended at the end of the string. This way, no suffix is a prefix of another suffix. Since all non-root internal nodes (words) are branching, there are at most n − 1 such nodes, which means there are at most 2n nodes in total for a string of length n. Fig. 2.2 shows the suffix tree for string S = banana.

Figure 2.2: The suffix tree for S = banana.

Historically, suffix links were invented to achieve linear-time construction of a suffix tree. Later, it was found that suffix links have many applications, such as approximate string matching and computing matching statistics. Suffix links are links between internal nodes; every internal non-root node has a suffix link to another internal node. Consider an internal node with path label (the concatenation of edge labels from the root to the node) xα, where x denotes a single character and α a (possibly empty) substring.
If there is another node with path label α, there is a suffix link from the former (xα) to the latter (α). The dashed lines in Fig. 2.2 illustrate the suffix links between internal nodes with path labels a, na, and ana.

With suffix links, suffix trees can be used to solve numerous string processing problems. The applications can be categorized into three classes [1]: bottom-up traversal of the complete suffix tree, top-down traversal of a subtree of the suffix tree, and traversal of the suffix tree using suffix links. Gusfield et al. [30] discussed various applications of the suffix tree. Abouelhoda et al. [1] listed the type of suffix tree traversal required for each application, as shown in Table 2.1.

Table 2.1: Suffix tree applications and their traversal types (bottom-up, top-down, or via suffix links). The applications listed are supermaximal repeats, maximal repeats, maximal repeated pairs, longest common substring, all-pairs suffix-prefix matching, Ziv-Lempel decomposition, common substrings of multiple strings, exact string matching, exact set matching, matching statistics, and construction of DAWGs.

2.3.3 Suffix Array

The suffix array is a sorted list of all suffixes of a string S. One can build a suffix array in O(n log n) time using a reasonable sorting algorithm for a sequence of length n. Linear-time construction can be achieved by a lexicographic traversal of the suffix tree. Lately, direct construction of the suffix array in linear time and a simpler linear-time suffix tree construction algorithm have been shown to be possible [43], [47], [48].

As in a suffix tree, a unique terminator is usually added at the end of the string before constructing the suffix array. The suffix array is stored as an array of integers in the range from 0 to n, where n is the length of string S. The integers in the array specify the order of the n + 1 suffixes of S$. Let suftab denote the suffix array of S. Then, the i-th entry in the suffix array is the suffix S_suftab[i]. More specifically, S_suftab[0], S_suftab[1], S_suftab[2], ..., S_suftab[n] are the suffixes sorted in ascending order.

2.3.4 Enhanced Suffix Array

Recently, it has been shown that, with the assistance of some additional arrays, a suffix array can be used to replace a suffix tree and solve the same problems in the same time complexity [1]. The additional arrays include the inverse suffix array, the longest common prefix array, the Burrows and Wheeler transformation table, the child table and the suffix link table. Along with one or more of these arrays, the suffix array can be used to solve the same problems. They are defined below.

The inverse suffix array suftab^{-1} is an array such that suftab^{-1}[suftab[i]] = i, for 0 ≤ i ≤ n. Obviously, this array can be computed in linear time given the suffix array.

The Burrows and Wheeler transformation (BWT) is an algorithm used in data compression. The BWT table bwttab is an array containing the BWT information. Specifically, the i-th entry of the BWT array stores the character right before the i-th suffix in the suffix array. Mathematically, if suftab[i] ≠ 0, bwttab[i] = S[suftab[i] − 1]; if suftab[i] = 0, bwttab[i] is undefined. This array can be constructed after scanning over the suffix array once. Thus, the construction time is linear.

The longest common prefix array lcptab stores the length of the longest common prefix between adjacent suffixes. Specifically, lcptab[i] is the length of the longest common prefix of suffixes S_suftab[i−1] and S_suftab[i] for 1 ≤ i ≤ n, and lcptab[0] is set to 0. Usually, the lcp array can be obtained as a by-product when constructing the suffix array. However, in some cases, the lcp information may not be available. It is possible to construct the lcp array in linear time from the string itself and its suffix array [44].
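As a concrete illustration of these definitions, the following minimal Python sketch builds suftab, lcptab and bwttab for S = banana by direct sorting, with the terminator $ treated as the largest symbol per Sec. 2.3.1. It is only a naive illustration, not the linear-time constructions cited above; its output reproduces the rows of Table 2.2 below.

```python
# Naive construction of the enhanced suffix array of S = "banana" by direct
# sorting; an illustration of the definitions above, not the linear-time
# algorithms referenced in the text.
def enhanced_suffix_array(text, terminator="$"):
    s = text + terminator
    n = len(s)
    # The terminator is treated as larger than every other symbol (Sec. 2.3.1).
    key = lambda i: [("\x7f" if c == terminator else c) for c in s[i:]]
    suftab = sorted(range(n), key=key)          # rank -> suffix number

    def lcp(a, b):                              # longest common prefix length
        l = 0
        while l < min(len(a), len(b)) and a[l] == b[l]:
            l += 1
        return l

    lcptab = [0] + [lcp(s[suftab[r - 1]:], s[suftab[r]:]) for r in range(1, n)]
    # bwttab[r]: character preceding the r-th smallest suffix (undefined when
    # that suffix starts at position 0)
    bwttab = [s[p - 1] if p > 0 else None for p in suftab]
    return suftab, lcptab, bwttab, [s[p:] for p in suftab]

suftab, lcptab, bwttab, sorted_suffixes = enhanced_suffix_array("banana")
for row in zip(range(len(suftab)), suftab, lcptab, bwttab, sorted_suffixes):
    print(*row)        # reproduces the rows of Table 2.2
```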
With the lcp array, we can simulate a bottom-up traversal of a suffix tree [1], [44]. In fact, the concept of the lcp-interval tree was introduced in [1]. Each node in the lcp-interval tree corresponds to an l-interval [i...j], where S[suftab[i]...suftab[i] + l − 1] is the longest common prefix of the suffixes S_suftab[i], S_suftab[i+1], ..., S_suftab[j]. The lcp-interval tree is basically the suffix tree without leaves (i.e., without the nodes that contain only one suffix).

Table 2.2: The enhanced suffix array of string S = banana.

  i   suftab   lcptab   bwttab   S_suftab[i]
  0      1        0        b     anana$
  1      3        3        n     ana$
  2      5        1        n     a$
  3      0        0              banana$
  4      2        0        a     nana$
  5      4        2        a     na$
  6      6        0        a     $

Table 2.2 shows the suffix array of string S = banana, along with the lcp array and the BWT array. Fig. 2.3 illustrates the lcp-interval tree of S = banana. Note that the dashed-line nodes, which correspond to leaves in a suffix tree, are not part of the lcp-interval tree; they are shown only for reference. The notation l-[i...j] in each node denotes its corresponding l-interval [i...j]. Comparing Figs. 2.2 and 2.3, we observe that there is a one-to-one correspondence between the lcp-interval tree and the suffix tree except for the leaf nodes.

Figure 2.3: The lcp-interval tree for string S = banana, where dashed-line boxes are not parts of the tree.

To simulate the top-down traversal of a suffix tree on a suffix array, another array called the child table is needed. Each element in the array has three values: up, down, and nextlindex. In short, this table stores the parent-child relationships of the lcp-intervals and, thus, enables the top-down traversal of the lcp-interval tree. The construction of the child table can also be done in linear time. Although each element of the child table contains three fields, only one field is actually needed per element.

Suppose suffix S_suftab[i] is xα, where x is a single character and α a (possibly empty) substring. If j (0 ≤ j ≤ n) satisfies S_suftab[j] = α, the suffix link from xα to α is recorded by link[i] = j. We call it the suffix link of i, and link is the suffix link table. Mathematically, if suftab[i] < n, link[i] = suftab^{-1}[suftab[i] + 1]. A simple construction of the suffix link table requires time complexity O(n log n) [1]. Linear-time construction is also possible [1] using the constant-time range minimum query (RMQ) [6], [7], [81] or a depth-first traversal of the tree structure.

2.3.5 Finding Maximal Unique Matches

This problem originated from the genome comparison field. Specifically, researchers would like to know how well two DNA sequences of closely related organisms align. The software system MUMmer [21], [22], [52] is designed for rapidly aligning entire genomes. In its first step, the maximal unique match (MUM) decomposition between two genomes, S_1 and S_2, is performed. A MUM is defined as a sequence that shows up exactly once in S_1 and once in S_2, and is not contained in any longer such subsequence. The task of finding MUMs in MUMmer is achieved by constructing the suffix tree for the concatenated string S = S_1#S_2, where # is a unique separation character that does not appear anywhere in S_1 or S_2. This step takes O(n) time, where n = |S|. In the MUMmer2 system, the suffix tree is constructed only for one sequence (S_1 or S_2), and the other sequence is 'streamed' against that suffix tree to find all the MUMs. This technique, called matching statistics, was introduced in [14] to solve the approximate string matching problem. In this way, the space requirement is reduced. However, as mentioned earlier, the space requirement for constructing a suffix tree is higher than that of the suffix array. Abouelhoda et al. [1] showed how to find MUMs using the enhanced suffix array, which is more time and space efficient than the algorithm based on the suffix tree. The steps to find MUMs in this algorithm are described below.

1. Find all local maxima in the lcp array lcptab of S = S_1#S_2.

2. For every local maximum, which corresponds to an interval [i...j], check whether i + 1 = j, bwttab[i] ≠ bwttab[j], and suftab[i] < |S_1| < suftab[j].

3. If all of the above are true, report S[suftab[i]...suftab[i] + lcptab[i] − 1] as a MUM.
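A rough sketch of this three-step search is given below. It is only illustrative, not MUMmer or the implementation of [1]: it rebuilds the arrays with a naive quadratic-time helper, and it accepts the two suffixes of a candidate pair in either order rather than insisting on suftab[i] < |S_1| < suftab[j].

```python
# Illustrative MUM search following the three steps above.
def naive_esa(s):
    # Naive (quadratic-time) suftab/lcptab/bwttab construction, only to keep
    # this sketch self-contained; assumes '#' and '$' sort after all letters.
    order = {"#": "\x7e", "$": "\x7f"}
    key = lambda i: [order.get(c, c) for c in s[i:]]
    suftab = sorted(range(len(s)), key=key)
    def lcp(a, b):
        l = 0
        while l < min(len(a), len(b)) and a[l] == b[l]:
            l += 1
        return l
    lcptab = [0] + [lcp(s[suftab[r - 1]:], s[suftab[r]:]) for r in range(1, len(s))]
    bwttab = [s[p - 1] if p > 0 else None for p in suftab]
    return suftab, lcptab, bwttab

def find_mums(s1, s2, sep="#"):
    s = s1 + sep + s2 + "$"
    suftab, lcptab, bwttab = naive_esa(s)
    n1, mums = len(s1), []
    ext = lcptab + [0]                     # pad so the last entry can be a maximum
    for r in range(1, len(lcptab)):
        # Step 1: strict local maximum of the lcp values.
        if not (ext[r] > ext[r - 1] and ext[r] > ext[r + 1]):
            continue
        i, j = r - 1, r                    # the interval [i..j] with j = i + 1
        # Step 2: left-maximality, and one suffix from each sequence
        # (checked here in either order).
        lo, hi = sorted((suftab[i], suftab[j]))
        if bwttab[i] == bwttab[j] or not (lo < n1 < hi):
            continue
        # Step 3: report the shared prefix as a maximal unique match.
        mums.append(s[suftab[i]:suftab[i] + lcptab[r]])
    return mums

print(find_mums("acgattc", "gattcca"))    # toy sequences -> ['gattc']
```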
2.4 Video Classification

In the field of multimedia retrieval, automatic video genre classification is an important and challenging problem, and it has attracted a lot of attention for more than two decades [11]. Its goal is to place each video title into one of several categories, such as movies, news, sports, etc. Most earlier work focused on classifying the entire video, while some attempted to classify smaller video segments (e.g., classifying different segments of a broadcast news program [106]). To classify a video program, one or more features are extracted to capture its characteristics. Then, classifiers are used to place a target video into a certain category. Different modalities, such as the visual, audio and text modalities, have been used to extract the representative information.

Typically, four types of features are used: the text information, the audio information, the visual information, and the multi-modal information. Most visual features are extracted from each frame or from key frames of a shot. They can be categorized into shot-, color-, and motion-based features.

The average shot length or the number of shots is often used as a feature [39], [40], as the shot duration is fundamental to the perception of the content. For example, action movies, commercials, and music videos tend to have shorter shot durations than sports videos that need action continuity, or character movies that need character development [91]. The percentage of each type of video segment transition can also be used as a feature [89].

Color-based features [89], [92] include the color histogram, the average brightness, the average saturation and the luminance or color variance. They are useful in distinguishing genres such as horror movies, which tend to have low light levels, and comedies, which tend to have brighter contents. Thus, color histograms of comedy and horror scenes are very different.

Motion-based features [25], [93] have also been widely used, including the average magnitude and the standard deviation of the motion, which can be determined from the motion vector field, the fraction of frames with motion [50], camera movement (such as zooming, panning and tilting) [101], and pixel-wise frame differencing [39], [78] used to capture the activity level.

Other useful visual features include edge- and texture-based features, which are often used in classifying sports video [55]. The edge features represent the amount and the type of edges present in a video shot. For example, a basketball court has slanted edges in a diagonal long-shot view. Texture features describe the texture of a surface in a shot. For example, the texture content in a soccer game can be a useful feature to distinguish it from a basketball game.
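As a small illustration of the shot- and color-based features just listed, the sketch below computes an average shot length and a crude color variance from hypothetical inputs. The helper names and inputs are assumptions for illustration only, not the exact features of any cited work.

```python
import numpy as np

# Illustrative shot-based and color-based feature extraction from hypothetical
# inputs (frame arrays and detected shot-boundary indices).
def shot_based_features(shot_boundaries, num_frames, fps=25.0):
    # shot_boundaries: frame indices where a new shot starts (excluding frame 0)
    bounds = [0] + sorted(shot_boundaries) + [num_frames]
    durations = np.diff(bounds) / fps                  # shot lengths in seconds
    return {"num_shots": len(durations),
            "avg_shot_length": float(durations.mean())}

def color_based_features(frames):
    # frames: iterable of HxWx3 uint8 RGB arrays
    pixels = np.concatenate([f.reshape(-1, 3).astype(np.float64) for f in frames])
    return {"avg_brightness": float(pixels.mean()),
            "color_variance": float(pixels.var(axis=0).mean())}
```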
Features of other modalities have been studied, too. Popular audio features include the RMS value of the audio signal energy, the zero crossing rate (ZCR), the silence ratio, the fundamental frequency, the frequency bandwidth, and Mel-frequency cepstral coefficients (MFCC) [36], [55], [62], [104]. Some have used text-based features for video classification, since the text feature, which carries a higher-level semantic concept, is easier to understand [10], [41], [60], [106]. However, since the text information is not always readily available from a video file, it is not as popular as the visual and the audio features.

The integration of multiple features gives higher classification accuracy than the use of a single feature [76]. In [25], a three-step process was used to combine the visual, the audio and the cinematic features to obtain better video classification results. Joint audio and visual features were used in [36]. A thorough review of movie content analysis based on joint features was conducted in [56]. A basketball sports video was analyzed using integrated motion, color and edge features [105]. In this work, we focus on visual-based features, since whether a video clip is amateur or professional is directly related to what we perceive through vision.

As to the classification categories, some work focuses on classifying a video clip into one of several broad categories such as TV genres [41], [49] or movie genres [39], [90], while others classify a video clip into a sub-category such as a specific type of sports [4], [82]. In this work, we classify video into two much broader categories: professional and amateur video. Their different characteristics are analyzed and discussed in the next section.

After extracting features from video clips, video classification algorithms use standard classifiers. Commonly used classifiers include decision trees, the Bayesian approach, k-nearest neighbors (kNN), the support vector machine (SVM), and neural networks. kNN is one of the simplest and most commonly used machine learning algorithms for classifying features into multiple classes: a feature vector is assigned to a particular class based on the majority vote of its neighbors. SVM is a state-of-the-art machine learning tool, which has been used by researchers to classify video into various genres. SVM views the input data instances as two sets of vectors in an n-dimensional space, and constructs a separating hyperplane that maximizes the margin between the two data sets. Recently, hidden Markov models (HMM) and Gaussian mixture models (GMM) have become popular. Since a video clip can be represented as a sequence of feature vectors, the hidden Markov model is suitable for modeling the temporal evolution of the video. In HMM-based approaches, an HMM is trained for each class, and the test video is assigned to the class whose HMM produces the highest probability. As to the Gaussian mixture model, it is a linear combination of Gaussian distributions, used because a single Gaussian distribution sometimes cannot model the data very well. A complex probability distribution can be modeled well if the mixture is chosen carefully.
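As an illustration of how such standard classifiers are applied to per-clip feature vectors, the following sketch trains a kNN and an SVM classifier with scikit-learn on randomly generated placeholder data. It only shows the general workflow; the data, feature dimensions and parameters are assumptions, not the experimental setup used later in this dissertation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Placeholder per-clip feature vectors (e.g. [avg shot length, color variance,
# camera shakiness, ...]) and labels: 1 = professional, 0 = amateur.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 4))
y_train = rng.integers(0, 2, size=40)
X_test = rng.normal(size=(10, 4))

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)  # majority vote
svm = SVC(kernel="rbf").fit(X_train, y_train)                    # max-margin hyperplane

print(knn.predict(X_test))
print(svm.predict(X_test))
```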
Chapter 3
Duplicate Video Detection Based on Camera Transitional Behaviors

3.1 Introduction

Digital video technologies have been growing rapidly thanks to the advances in video capturing, coding and the broadband network infrastructure. They make it possible for everyone to easily edit, re-produce, and re-distribute video contents of various formats nowadays. Fueled by the need for video sharing platforms, many websites such as YouTube [99] and Google Video [29] provide space for users to store and share their video contents over the Internet. Copying video, whether legally or illegally, has never been easier. Generally speaking, copyright protection has become a growing concern for content creators and owners. In particular, controlling the copyright of video uploaded by users is a critical issue for these video sharing websites.

Video watermarking is a technique that can be used to monitor the unauthorized use of video contents by embedding visually imperceptible information into the original video. However, watermarking systems suffer from the fact that watermarks can be compromised and the quality of the content can be degraded. The most inconvenient issue is that watermarks must be inserted into the content before it is distributed or before copies are made, so that the ownership can be traced later on. This means watermarking cannot be applied to video contents that are already in the public domain. As an alternative, content-based duplicate copy detection has drawn a lot of attention recently, as it can be applied to all video before or after distribution. The idea of content-based video copy detection is to use the unique characteristics inherent in the video to achieve copyright protection. The unique characteristics, called features or the signature, of a query video are extracted and compared with those of the video contents in the database. An appropriate similarity measure is then applied to see if the given video resembles a certain video in the database.

Another application of duplicate video detection systems is to reduce redundant video copies. The amount of video content available over the Internet has been growing exponentially. As of March 2008, a YouTube search returned about 77.3 million video titles and 2.89 million user channels, and nearly 79 million users watched more than 3 billion video titles in January 2008 alone [94]. Among all video spread over the Internet, a large number of duplicates is unavoidable because of the absence of central monitoring and management. Thus, efficient clustering and elimination of duplicate video copies are needed.

As reviewed in Sec. 2.2, existing video copy detection systems have some shortcomings. The signatures used in most algorithms are frame-based, and direct comparison of signature vectors is required. They are not only sensitive to various attacks but also demand a large amount of computation and storage space. In this chapter, an algorithm is proposed to address these shortcomings. First, instead of extracting signatures from frames and comparing frame-based signatures directly, the video signature is extracted based on the temporal structure of the underlying video. The frames that mark unique events are labeled as anchor frames. Specifically, anchor frames are extracted based on camera transitional behaviors such as shot boundaries and camera panning/tilting. The sequence of lengths between neighboring anchor frames, called the gap sequence, is used to describe the video temporal structure. The resulting one-dimensional video signature is very compact. Since the signature is not related to frame characteristics directly, it is less sensitive to the various attacks applied. Second, we propose an algorithm that performs the matching between two signatures extremely fast.
Motivated by genome sequence alignment, an efficient matching technique using the suffix array data structure is adopted. The proposed system can perform the sequence matching in linear time, while the complexity of conventional duplicate video copy detection algorithms grows at least quadratically with the video length.

The rest of the chapter is organized as follows. By examining the temporal structure of video, a signature extraction technique that detects shot boundaries and camera movement events as anchor frames is proposed in Section 3.2. An efficient signature matching algorithm using the suffix array data structure is described in Section 3.3. Experimental results are shown and discussed in Section 3.4.

3.2 Anchor Frame Detection

3.2.1 Video Temporal Structure

As shown in Fig. 3.1, a video clip can be segmented into a sequence of scenes, which can be further segmented into a sequence of shots. A shot is defined as a continuous segment of frames resulting from a single camera take. Frames within a shot usually have consistent visual characteristics, including color, texture and motion. These frame characteristics are referred to as low-level features of the video content, while the scene and the shot are regarded as high-level features. Most previous work on video copy detection was based on finding one or more low-level features that are insensitive, or invariant, to the changes mentioned in Section 2.1. However, it is apparent that these features could be easily altered even by minor changes. In addition, if features are extracted on a frame basis, the resultant signature would demand a large storage space and a high computational complexity in the search for video copies. Noticing these drawbacks, high-level features are adopted as the video signature for duplicate copy detection.

Although a video title can be seen as a long sequence of still images, it actually contains more information than merely a series of frames. In contrast with still images, video has an additional dimension in time; namely, it is a series of events evolving with time. The temporal composition of these events uniquely defines the characteristics of the underlying video and, therefore, constitutes its temporal structure. Instead of matching the feature vectors of each frame, or of a subset of key frames, the temporal structures of video clips are used to match them.

Figure 3.1: The temporal structure of video.

In this research, the temporal structure of a video clip is represented by a series of anchor frames, which mark important events in video sequences. This is illustrated in Fig. 3.2. The different segments correspond to different events, and the black separators between them are the anchor frames that mark the positions of those events. The lengths between the anchor frames are then extracted to form the proposed signature, which is the series {23, 35, 18, 52, 14, 21, 35} in this example. Noticing the natural association between the camera usage and the video content, we detect the camera transitional behaviors, including shot boundaries and the panning and tilting of the camera, and use them as the anchor frames. The detection of shot boundaries and camera panning/tilting are described in the following subsections, respectively.
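A minimal sketch of how detected anchor-frame positions are turned into the gap-sequence signature is given below. The anchor positions are hypothetical frame indices, chosen only so that the output reproduces the example signature of Fig. 3.2.

```python
# Minimal sketch: turning detected anchor-frame positions (shot boundaries and
# pan/tilt start/end points) into the gap-sequence signature of Sec. 3.2.1.
# The anchor positions below are hypothetical values for illustration.
def gap_sequence(anchor_positions):
    anchors = sorted(anchor_positions)
    return [b - a for a, b in zip(anchors, anchors[1:])]

anchors = [0, 23, 58, 76, 128, 142, 163, 198]
print(gap_sequence(anchors))   # -> [23, 35, 18, 52, 14, 21, 35]
```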
The detection of shot boundaries and the detection of camera panning/tilting are described in the following subsections, respectively.

Figure 3.2: Anchor frames and the proposed signature.

3.2.2 Shot Boundary Detection

Shot boundary detection, also known as temporal video segmentation, is the most basic and, very often, the first step for a myriad of video processing and analysis tasks. Due to its importance, several works have been published that give an overview and comparison of various shot boundary detection algorithms [20], [26], [28], [51], [59], [100]. Recent results have shown that abrupt transition detection is relatively reliable, while gradual transition detection still has room for improvement. Algorithms for detecting gradual transitions can be classified into two categories. The first kind focuses on one or more particular types of transitional effects, including dissolve, fade in/out, and wipe [12], [32], [72]; these suffer from the fact that effect-specific algorithms do not work well on other transitional effects. The second kind deals with all gradual transitions by employing more general models [8], [9], [102]. However, we focus mainly on abrupt shot boundaries, since they are more clearly defined and their detection is more robust than that of gradual transitions, and since the goal is to identify unique events that are present in both the original video and the duplicate rather than to detect all shot boundaries correctly. In fact, as long as the detected and missed transitions are the same in both video sequences, the system still works properly. For the same reason, the detection algorithm is intentionally designed to be relatively simple, which also makes it robust to various kinds of modifications, so that the same event can be detected in both the original and the duplicate. The algorithm is described in the following.

3.2.2.1 Temporal Subsampling

Although abrupt shot change detection is fairly robust, shot boundaries may be reported with minor differences because of differences in the frame rate. For example, the 25 fps version of a video clip can only have shot transitions at multiples of 1/25 = 0.04 second, while its 30 fps version can only have shot transitions at multiples of 1/30 ≈ 0.033 second. Obviously, a shot boundary at time instance t could correspond to time instance t + Δt after the frame rate conversion, where Δt is a small number. Thus, we propose to subsample the input video in the temporal domain before any further processing is performed. In this way, the slight variation in the shot duration caused by the difference in the frame rate can be avoided.

The frame rate subsampling technique has been used in general gradual transition detectors [8], [96] by noticing that, although adjacent frames may show only a small discontinuity during a gradual transition, two frames that are several frames apart should show a relatively large discontinuity if the two frames lie within a gradual transition.
However, this technique relies heavily on the relationship between the length of the transition and the subsampling factor. If the subsampling factor is smaller than the length of the transition, the discontinuity could still be relatively small, resulting in a false negative. On the other hand, if the subsampling factor is too large, too much information in between could be missed. This means that only transitions whose length is approximately the same as the subsampling factor would be detected. Although missed transitions represent errors for shot boundary detection algorithms, they do not cause a problem in our video copy detection system. Since our goal is not to detect gradual transitions correctly but to extract anchor frames that are present in both sequences even after certain transformations or modifications, the system still works properly as long as the detected and missed transitions are the same in both video sequences.

In our experiments, the input video is sampled at an interval of 0.2 seconds (i.e., once per five frames for a video clip at 25 fps). Although some information is lost during the subsampling process, this is acceptable since we intend to capture the structure of the underlying video instead of representing the video by frame-based features. Based on our experience, the interval of 0.2 seconds provides a good trade-off between preserving key visual information and reducing computational complexity.

3.2.2.2 Distance Measurement

Most shot boundary detection algorithms employ a certain distance (or disparity) measure between adjacent frames. When the distance measure between consecutive frames exceeds a threshold (either fixed or adaptive), a shot transition is declared. Distance measures can be roughly classified into two categories: pixel-based and histogram-based. The pixel-based distance measure is highly sensitive to relative camera-object motion, while the histogram-based distance measure is more robust. Among several variants of the histogram-based measure, the color or luminance histogram provides a reliable distance measure, and it is widely used. It is known that the color histogram is not effective in differentiating camera motion from gradual transition effects. Measuring changes in image edges and/or motion information could help, but its complexity is much higher. For the video copy detection problem, more complicated shot boundary detection algorithms could be more sensitive to various attacks. Thus, the shot boundary detection algorithm used in this work is based on the luminance histogram difference along with an adaptive threshold, which is determined from the statistics collected in a local sliding window.

The distance measure d(t) between frame f_t and frame f_{t+1} is calculated using the L_1 norm as

d(t) = \frac{1}{N} \sum_{i=0}^{K} \left| \hat{h}_i(t) - \hat{h}_i(t+1) \right|,   (3.1)

where \{\hat{h}_i(t)\} and \{\hat{h}_i(t+1)\} denote the cumulative luminance histograms of f_t and f_{t+1}, respectively, K is the number of histogram bins, and N is the number of pixels in a frame. A smaller value of K (32 or 64), which corresponds to a histogram with fewer bins, is often used to reduce the sensitivity to noise and camera/object movements. On one hand, if the bin number is too large, small variations can place similar pixel values in different bins, thus increasing the distance between two 'should-be-similar' frames. On the other hand, a bin that is too coarse loses information, so that it may not give the desired discriminating power.
To seek a good balance between the above two extremes, we use a relatively fine bin number, K = 128, together with the cumulative histogram in our system. The cumulative histogram can take cross-bin effects into account. It has been shown to be a special case of the more general and accurate distance measure called the Earth Mover's Distance (EMD) [79].

After computing the distance measure d(t), it is further filtered by the following equation:

\tilde{d}(t) = \left| d(t) - \mathrm{MED}\big( d(t-1), d(t), d(t+1) \big) \right|,   (3.2)

where MED(·) denotes the median filtering operation. The reason for using median filtering is that a short and abrupt transition bears some similarity to impulse noise; that is, both appear as isolated local peaks. We can obtain a more stable temporal signature by filtering out these isolated peaks. Note that histogram-based distance measures have one serious drawback: a short and abrupt illumination change, such as a flashlight, could introduce 'artificial' shot boundaries, since such a change causes adjacent frames to exhibit a significant discontinuity that is then mistaken for a shot boundary. Nevertheless, as long as such artificial shot boundaries are detected consistently in both sequences, they have no effect on our video copy detection system. In fact, since an abrupt illumination change is a unique event, we may even want to detect it and mark the time of its occurrence.

3.2.2.3 Adaptive Thresholding

Having computed the frame distance measure as given in Eq. (3.2), our next step is to determine whether the current frame should be marked as an anchor frame. This is done by thresholding the frame distance function. If the current distance measure exceeds a certain threshold, there might be an abrupt shot transition (e.g., a sudden and large camera/object movement), which means the current frame represents a unique event and should be marked as an anchor frame. The easiest way to proceed is to set some pre-determined threshold. Obviously, this would not perform well if the video under consideration is not stationary, that is, if the video content does not have similar characteristics over time. In addition, such a 'fixed' threshold has to be adjusted manually for different types of video. Since our goal is to build a fully automatic system, an apparent alternative is to determine the threshold according to the statistics within a short time window around the frame of concern. The adaptive threshold T is calculated according to the statistics collected in a temporal window W of length L:

T(t) = \mu_d(t) + \alpha \, \sigma_d(t),   (3.3)

where \alpha is a scale factor, and \mu_d(t) and \sigma_d^2(t) are the sample mean and the sample variance of the filtered distance measure in window W, respectively. Mathematically, \mu_d(t) and \sigma_d^2(t) can be calculated as

\mu_d(t) = \frac{1}{L} \sum_{i \in W} \tilde{d}(i),   (3.4)

\sigma_d^2(t) = \frac{1}{L} \sum_{i \in W} \big( \tilde{d}(i) - \mu_d(t) \big)^2.   (3.5)

More complicated detection algorithms using a probabilistic approach, such as maximum a posteriori (MAP) estimation, or a pattern classification approach, such as a neural network classifier, could be considered as well. However, it is our observation that these algorithms demand higher computational complexity and are more sensitive to various possible changes or attacks.

3.2.3 Camera Motion Detection

It is challenging to detect shot change boundaries robustly for video with gradual transition effects such as camera panning, tilting and zooming. To address this problem, another approach is proposed to identify anchor frames by detecting the camera movement.
In the proposed system, the block-based motion vector field is used to estimate the camera motion parameters.

3.2.3.1 Motion Vector Processing

The change of image intensity between frames can be modeled by the following six-parameter projective transformation [88]:

x' = \frac{p_1 x + p_2 y + p_3}{p_5 x + p_6 y + 1}, \qquad y' = \frac{p_2 x + p_1 y + p_4}{p_5 x + p_6 y + 1}.   (3.6)

To estimate these parameters, some feature points in the current frame are selected first. They are carefully selected to have distinguishable characteristics so that they can be accurately located and matched in the second frame. The corresponding match of each feature point in the second frame must then be found, which is time consuming. An alternative is to use the motion vector field from the compressed stream as the starting point. The center of each block is considered a feature point, and the motion vector of the block is taken as the motion of that feature point. Note that since motion vectors are computed with the aim of reducing residuals, they may not correspond to true motion. As a result, the motion vector field can be viewed as a set of noisy samples of the perspective transformation between two frames. Certain processing techniques for noisy samples are therefore needed, as described later.

Since the GOP structures can vary from video to video, extracting motion vectors from the compressed bitstream is unrealistic. Here, the camera motion is estimated based on the block-based motion vector field obtained by applying a fast motion estimation algorithm [13]. To avoid heavy computations for motion estimation, a motion vector field is computed every 0.2 seconds. Note that the reference frame for the motion estimation is still the previous frame, not the frame that is 0.2 seconds apart. The block size used is 16x16. Only non-zero motion vectors in a frame are used to determine the type of camera motion. When a frame has more than 80% of its motion vectors equal to zero, the camera motion for that frame is set to zero.

3.2.3.2 Camera Motion Event Detection

There are six basic camera operations:
zooming: change of focal length
panning: rotation of the camera around its vertical axis
tilting: rotation of the camera around its horizontal axis
tracking: horizontal traverse of the camera
booming: vertical traverse of the camera
dollying: moving of the camera toward or away from the subject

Since we aim at labeling special events to obtain the video structure, differentiating panning/tilting from tracking/booming is not necessary. In addition, zooming, compared with panning and tilting, happens less often. Therefore, only the estimation of panning and tilting operations is considered. The processes of determining frames corresponding to specific camera events are presented for panning and tilting, respectively, in this subsection. Note that camera panning and tilting are handled separately because of their different natures.

Panning Event Detection

Panning usually involves larger movement than tilting. Thus, a simple method should be able to determine the panning event. Note that we do not need to know the magnitude of the movement; we only care when the panning starts and ends. When a camera panning occurs, the motion vector field usually has a dominant direction. If this dominant motion vector has a larger horizontal component than its temporal neighbors, a panning is in progress. Based on this idea, first the histogram of the motion vector field of the current frame is calculated.
Then, to reduce the effect of noisy samples, local averaging is applied to the motion vector histogram:

h'_{mv}(x, y) = \sum_{i=-w}^{w} \sum_{j=-w}^{w} h_{mv}(x + i, y + j),   (3.7)

where h_{mv}(x, y) denotes the histogram of motion vectors in the current frame, h'_{mv}(x, y) the histogram after local averaging, and w the window size. Here, w = 2 is used, which was determined empirically. Then, all motion vectors are ignored except for the one that corresponds to the maximum of the locally smoothed histogram. This motion vector is chosen to represent the camera motion. As an example, the black curves in Fig. 3.3 show the horizontal component (MV_x) of the camera motion vector over time for a pair consisting of an original video and a duplicate created by camcording and subtitle insertion. The peaks in the curves correspond to fast camera panning operations. The beginning and the end of the sequence are noisy because there are several shot changes in these regions. It is easy to observe the similarity of these two curves. However, they are still quite noisy. To further smooth the curves, the bilateral filter is applied, which is often used for edge-preserving smoothing of 2-D images. It can smooth out noisy regions while preserving edge sharpness. It is a non-linear filter whose output is a weighted average of the input, where the weight depends not only on how close but also on how similar the neighboring and the center samples are. The 1-D version of the bilateral filter is used, which can be written as

y(n) = \frac{\sum_{k=-N}^{N} g_s(k) \, g_r(k, n) \, x(n + k)}{\sum_{k=-N}^{N} g_s(k) \, g_r(k, n)},   (3.8)

where x(n) is the input sequence, y(n) is the filtered sequence, N is the window size, and

g_s(k) = \exp\left( -\frac{k^2}{2 \sigma_s^2} \right),   (3.9)

g_r(k, n) = \exp\left( -\frac{\big( x(n) - x(n + k) \big)^2}{2 \sigma_r^2} \right)   (3.10)

are two Gaussian functions that determine the weight of each sample. Larger weights are given to input samples that are temporally closer to and have values more similar to the current sample, as controlled by g_s(k) and g_r(k, n), respectively. Through this process, the input can be smoothed while edges are preserved at the same time. The results after bilateral filtering with N = 30, \sigma_s = 10, and \sigma_r = 5 are shown by the red curves in Fig. 3.3. The peaks corresponding to fast panning are preserved, while other noisy regions caused by object motion or camera shake are smoothed out.

Figure 3.3: Comparison of the horizontal camera motion components of (a) the original video and (b) the query video attacked with camcording and subtitle insertion.

Figure 3.4: An example of filtered horizontal motion and its first-order derivative.

Since panning usually takes longer and can last for multiple frames, after the noise filtering process we take the first-order derivative of the filtered sequence of the horizontal camera motion. As illustrated in Fig. 3.4, the red solid curve is the filtered sequence and the black dashed curve is its first-order derivative. The peaks in the black dashed curve correspond to the start and the end of each panning, which can be detected easily by picking the peaks.

Tilting Event Detection

Tilting corresponds to vertical camera motion. Since vertical movements are usually smaller and more sudden than horizontal movements, it is expected that the histogram-based method would not work well, and a slightly more complicated method has to be used.
However, since higher-order models such as the six-parameter projective transformation in Eq. (3.6) require more complicated estimation (e.g., iterative least-squares regression) and we only want to label certain unique events, it may be sufficient to adopt a lower-order motion model. It was observed in [88] that, if the perspective distortion between frames is minimal and the camera does not rotate about the axis of the camera lens, the transformation in Eq. (3.6) can be approximated by the following three-parameter transformation:

x' = p_1 x + p_3, \qquad y' = p_1 y + p_4,   (3.11)

where the three parameters p_1, p_3, and p_4 are associated with zooming, panning and tilting, respectively. They can be quickly and easily estimated using the motion vector field and the closed-form expressions derived in [88]. After the parameter estimation, the camera motion vector can be computed for each block. Motion vectors that are too far away from the computed motion are rejected as outliers. Specifically, we compute the residual error for each block according to the computed camera motion vector. The standard deviation of the residual errors for the current frame is then computed to measure the accuracy of these parameters. Data samples that are inconsistent with the estimated parameters (i.e., block-based motion vectors whose residual errors are larger than a threshold calculated from the standard deviation) are rejected as outliers. Then, the parameters are re-estimated from the remaining data samples. This estimation-rejection procedure may iterate a couple of times. By examining the sequence of p_4 only, we can find several peaks that correspond to sudden vertical camera movements. Based on these peaks, we can label the associated frames as anchor frames that indicate a tilting event.

To pick anchor frames, a simple local adaptive thresholding technique is used. Whenever the panning or tilting parameters (the first derivative of the filtered MV_x sequence and the p_4 sequence, respectively) exceed a threshold, we select the current frame as an anchor frame. The threshold T(t) is determined according to the statistics in a local window W. It is computed by

T(t) = \mu(t) + \sigma(t),   (3.12)

where \mu(t) and \sigma(t) are the sample mean and the standard deviation in the local window. If we call the segment between two neighboring anchor frames an event, the 1-D sequence of event lengths captures the temporal evolution of events in the video, which is used as the video signature.

3.3 Signature Matching Using Suffix Array

3.3.1 Candidate Pruning

Given a query video clip, we extract its signature as described in the previous section. Then, the extracted signature is compared to every signature (video) in the database. Since the size of the video database under consideration is usually very large, the computational complexity is very high even if a fast matching algorithm is used for each pair of signatures. To address this issue, we propose the use of a pruning process to eliminate video sequences that are very unlikely to be duplicates of the query video. This is achieved by noticing that the distribution of the color histogram difference of a given video does not change much after minor modifications to the video content. To tolerate various possible modifications, the distribution of the adaptive threshold in Eq. (3.12), which is calculated based on the local statistics of the luminance histogram difference, is used instead. For a given query video, the pruning process is described in the following.
Calculate luminance histogram dierence and T (t) in Eq. 3.12. Normalize T (t) to 0199. Calculate the normalized histogramfh T;q (n)g of T (t) for the input query video, where n is from 0 to 199. For each video in the database, compute the histogram intersection between his- togramsfh T;q (n)g andfh T;d (n)g. If the computed distance is greater than a pre- dened threshold, we claim that the current video is unlikely to be a duplicate and removed from any further processing. Figure 3.5 shows the visualization of three dierent histograms. Figure 3.5(a) and Figure 3.5(b) show the histograms of the original video and the same video but attacked by strong re-encoding (to a much lower bit-rate), respectively. As can be seen from the gures, these two histograms are very similar to each other. Figure 3.5(c) shows the result of an unrelated video clip, which bears no resemblance to the previous two histograms. Sometimes, two unrelated video clips could have similar distributions when they have similar behaviors. In this case, the signature matching process is still performed. The 48 0 50 100 150 200 0 0.005 0.01 0.015 0.02 0.025 (a) 0 50 100 150 200 0 0.005 0.01 0.015 0.02 0.025 0.03 (b) 0 50 100 150 200 0 0.01 0.02 0.03 0.04 0.05 0.06 (c) Figure 3.5: Histograms of T (t) of (a) the original video, (b) the same video but attacked by strong re-encoding, and (c) an unrelated video. 49 purpose of the pruning process is to lter out those video clips in the database that are very unlikely to be duplicates of the query video to reduce as much computational cost as possible. Therefore, some false alarms are allowed in this stage, as long as they would be eliminated in the next stage; namely, signature matching. 3.3.2 Signature Generation Since the video clip could be shifted, padded or cropped temporally and the temporal location of an anchor frame can be shifted by an oset, the actual time of anchor frames is not important. In contrast, the length between adjacent anchor frames would stay the same even after minor content modications. Thus, for a given video clip, after extracting the set of anchor frames, the length between the current anchor frame and its previous one is computed, and all the length information is stored into a one-dimensional sequence. This sequence is the signature for the video content, which is named as the gap sequence. It serves as a good signature since it is highly unlikely for two unrelated video clips to have a long set of consecutive anchor frames with the same gap lengths. It is also worthwhile to point out that, since the set of shot boundaries is only a subset of anchor frames due to down-sampling and median ltering, this sequence may not be necessarily the same as the actual gap sequence obtained by traditional shot change detection algorithms. For video clips that contain detectable shot changes, anchor frames based on shot boundary detection work well. For other clips that lack detectable shots, they usually contain lots of camera movements. Then, anchor frames extracted based on camera motion would work better. To be able to choose proper anchor frames, a metric is developed for each video clip in the pre-processing step. Since video clips of the latter 50 kind have one common property: lots of camera/object movements and less detectable shots, the feature value distribution in a local window should have a small standard deviation since everything changes gradually. 
Thus, the feature value distribution in a local window is studied by computing the mean of the local standard deviation as d = 1 N X t d (t); (3.13) where d (t) is the standard deviation of the ltered histogram dierence at time t as shown in Eq. (3.5) and N is the length of d (t). This metric provides a good indicator on the type of the underlying video. If d is large enough, the shot-based anchor frame is used. Otherwise, the camera-motion-based anchor frame is used. 3.3.3 Signature Matching After obtaining the gap sequence as the video signature, the next task is how to match the signatures of any two video sequences eciently. Under this framework, the video copy detection problem can be converted into the sequence alignment problem, where the sequence to be aligned is the gap sequence. Traditional sequence alignment algorithms rely on dynamic programming. Various algorithms have been proposed to solve this problem. The computational complexity of a na ve version of dynamic programming is O(mn), where m and n are the lengths of the two sequences under consideration. As the sequences get longer, the time needed for alignment grows rapidly, which makes dynamic programming impractical for very long sequences. For the video copy detection 51 application, the amount of video can be huge and their gap sequences could be very long. Then, a data structure that allows ecient matching is needed. The sux array [64] and its variants provides an ideal data structure for string pro- cessing and sequence alignment. One such application in bioinformatics is to identify identical subsequences from two given DNA sequences eciently [1], [87]. Being moti- vated by the genome sequence alignment, we adopt the sux array data structure in the signature matching process. Depending on the attacks made on the video, some of anchor frames could only be detected in one video clip but not in the corresponding duplicate one, which means the gap sequences may not be exactly the same. Thus, instead of matching the gap sequence as a whole, we nd all matching subsequences in them, which is called the problem of identifying maximal unique matches. Given two sequencesS 1 andS 2 , a maximal unique match is dened as a subsequence that occurs exactly once in both sequences but not contained in any longer subsequence. Fig. 3.6 shows an example, where maximal unique matches are labeld by red circles. The underlined 57 in one sequence is the sum of f29; 28g, which occurs because an anchor frame is missed in the top sequence but still detected in the bottom one. The same phenomenon occurs onf22; 15g with 37 andf14; 23g with 37. Finding the set of all maximally unique matches is not computationally trivial. A na ve algorithm would compare O(n 2 ) subsequences in one sequence with those in another sequence, and each comparison requires a complexity of O(n). It is possible to nd all maximal unique matches using the enhanced sux array in linear time [1]. Besides the sux array, two additional arrays are needed. They are 52 ... 14 70 21 92 57 22 15 45 67 40 37 24 67 112 14 23 41 48 29 22 ... ... 30 70 21 92 29 28 37 45 67 40 37 24 67 112 37 41 48 29 22 ... Figure 3.6: An example of matching between two sequences, where red circles indicate the maximal unique matches. the longest common prex (LCP) array and the Burows-Wheeler transformation (BWT) array. The algorithm is described below. Let S 1 and S 2 be the signatures of two video clips, i.e., the gap sequences of the query video and a video clip from the database, respectively. 
We can perform the following to achieve fast matching. Fast gap sequence matching based on the sux array representation 1. Concatenate sequences, S 1 and S 2 , to form a single long sequence S = S 1 #S 2 , where # is a separator symbol that does not occur in S 1 or S 2 . 2. Construct the enhanced sux array representation of S. This step is done in O(n) time, where n =jSj, is the length of S. By exploiting this compact representation, identical segments can be identied very eciently. 3. The maximal unique matches can be identied by nding all local maxima in the LCP array. Specically, if thei th entry of the LCP array exceeds a certain threshold, it is viewed as a local maximum. In addition, the i th and the (i 1) th entry of the BWT array are compared to ensure that the match is not contained by a longer match. The maximal unique match (the identical segment) corresponds to series of consecutive shots with matched gap lengths. 53 3.4 Experimental Results and Discussion Several sets of experiments were performed to evaluate the performance of the proposed duplicate video detection algorithm. In the rst part, Section 3.4.1, several query video clips were tested against a database using the shot boundaries as anchor frames only, while anchor frames that mark camera panning/tilting events were tested in the second part, Section 3.4.2. In these two parts, query video clips either have one or more modications from certain video clips in the database, or are not copies of any one of the video in the database at all. In Section 3.4.3, the proposed algorithm was applied to real-world data. Hundreds of video clips were collected from the internet according to certain query strings. The most popular one was selected as the query, was tested against the rest of the video clips to determine if the video clip being tested is the duplicate of the query clip. 3.4.1 Duplicate Video Detection using Shot Boundaries 3.4.1.1 Experimental Setup The video database used in our experiments is MUSCLE-VCD-2007 [68], which is the corpus used for the CIVR 2007 video copy detection evaluation showcase [19], [68]. The database contains 101 video clips of a total length equal to about 80 hours. These video clips come from several dierent sources such as the web video clips, TV programs and movies, and they cover a wide range of categories, including movies, commercials, sports games, cartoons, etc. They were transcoded into the MPEG-1 format with a frame size 352 288, a frame rate of 25 fps and a bit-rate of 1.15 Mbps. 54 There are 25 query video clips with a total length a little more than 7 hours. They are encoded to be the same format as the video clips in the database. Among them, 15 query video clips were attacked by one or more changes, or transformations while the remaining 10 video clips do not come from the database. The lengths of these queries vary from 6 minutes short video clip to 1 hour TV program. The attacks applied to these query video clips include dierent types of color adjustment, blurring, re-encoding with low bit-rate, cropping, subtitle insertion, camcording, zooming, and resizing. Table 3.1 summarizes the length of each query video clip and the changes made on them. Fig. 3.7 shows several query video clips side-by-side with their corresponding ground truths. In the process of nding matched shots, we only consider a segment of more than 3 consecutive shots that have the same gap lengths as a true positive. 
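As an illustration of this matching step, the sketch below finds shared runs of gap lengths between two signatures in the spirit of Section 3.3.3, but with deliberate simplifications: the suffix array is built by plain sorting rather than the linear-time enhanced construction, and the uniqueness check via the BWT array is omitted. All names are illustrative, and the example sequences are the prefixes shown in Fig. 3.6.

```python
def suffix_array(seq):
    """Plain O(n^2 log n) suffix array: sort suffix start positions lexicographically."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def lcp(a, b):
    """Length of the longest common prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def matched_query_gaps(query_gaps, db_gaps, min_run=3):
    """Indices of query gaps covered by runs of >= min_run gap lengths shared by
    both signatures.  Adjacent suffixes in the suffix array of the concatenated
    sequence that originate from different signatures and share a long prefix
    correspond to such runs."""
    sep = (-1,)                       # separator symbol; gap lengths are assumed positive
    s = tuple(query_gaps) + sep + tuple(db_gaps)
    boundary = len(query_gaps)        # positions after this index belong to db_gaps
    sa = suffix_array(s)
    covered = set()
    for i, j in zip(sa, sa[1:]):
        if (i < boundary) == (j < boundary):
            continue                  # both suffixes come from the same signature
        run = lcp(s[i:], s[j:])
        if run >= min_run:
            q = i if i < boundary else j
            covered.update(range(q, q + run))
    return covered


# Prefixes of the two gap sequences shown in Fig. 3.6.
query = [14, 70, 21, 92, 57, 22, 15, 45, 67, 40, 37]
db    = [30, 70, 21, 92, 29, 28, 37, 45, 67, 40, 37]
hits = matched_query_gaps(query, db)
print(len(hits) / len(query))   # ~0.64: the runs (70, 21, 92) and (45, 67, 40, 37) match
```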
The reason for requiring such a threshold is that subsampling the video temporally has an effect similar to quantizing the gap lengths. Two unrelated video clips could happen to have, say, 2 consecutive shots with the same lengths even though their content is actually different. The percentage of matched shots is then computed, and if it is high enough, the pair of videos under consideration is marked as a duplicate pair.

Table 3.1: The 25 query video clips and their modifications.

Query     Length   Attacks
Query1    6m58s    Color adjustment and blurring
Query2    5m48s    (Not in database)
Query3    6m19s    Re-encoding, color adjustment, and cropping
Query4    5m26s    (Not in database)
Query5    7m43s    Re-encoding with strong compression
Query6    6m02s    Camcording and subtitles
Query7    7m00s    (Not in database)
Query8    8m01s    (Not in database)
Query9    9m13s    Color phase modification and color adjustment
Query10   11m33s   Camcording with an angle
Query11   11m46s   Camcording
Query12   14m41s   (Not in database)
Query13   17m27s   Horizontal flipping
Query14   26m19s   Zooming and subtitles
Query15   42m59s   Resizing
Query16   18m07s   Color adjustment + resize
Query17   28m30s   (Not in database)
Query18   62m19s   Re-encoding (200kbps)
Query19   29m43s   Resize
Query20   7m28s    (Not in database)
Query21   7m44s    (Not in database)
Query22   48m16s   Color adjustment + re-encoding (500kbps)
Query23   14m13s   (Not in database)
Query24   6m57s    (Not in database)
Query25   25m56s   Temporal shift

3.4.1.2 Candidate Pruning

Given a query video, candidate pruning is performed so that video clips that are very unlikely to be duplicates can be eliminated. Figures 3.8(a) and (b) show examples of the histogram intersections between all database video clips and Query5 and Query19, respectively. The red arrows indicate the ground truth video. As shown in the figure, many video clips can be skipped since the values of the histogram intersection are too large.
Query1 has a 53% 57 (a) (b) (c) (d) Figure 3.7: Query video clips and the corresponding ground truths: (a) Query1, (b) Query3, (c) Query5, (d) Query6, (e) Query9, (f) Query10, (g) Query11, (h) Query13, (i) Query14, and (j) Query15 58 (e) (f) (g) (h) Figure 3.7 (continued) 59 (i) (j) (k) (l) Figure 3.7 (continued) 60 !" !#$" !#%" !#&" !#'" !#(" !#)" !#*" !#+" !#," $" $" $$" %$" &$" '$" ($" )$" *$" +$" ,$" $!$" !"#$%&'()*"+$,'#,-.%+* /($(0(#,*1"2,%* 34,'56* (a) !" !#$" !#%" !#&" !#'" !#(" !#)" !#*" !#+" !#," $" $" $$" %$" &$" '$" ($" )$" *$" +$" ,$" $!$" !"#$%&'()*"+$,'#,-.%+* /($(0(#,*1"2,%* 34,'567* (b) Figure 3.8: An example of candidate pruning given (a) Query5 and (b) Query19 as the query video. 61 match, which is the lowest in our test. On the other hand, the 10 query video clips that do not have their ground truths in the database all have 0% or less than 5% matches. Thus, we conclude that the detection of duplicates/non-duplicates is successful for the 22 query clips. The detection rate is 22 out of 25 (or 88%). 3.4.1.4 Storage Space and Computational Complexity Since we use the enhanced sux array to nd all maximal unique matches, the computa- tional complexity is low. Besides, the length of the signature is the same as the number of anchor frames, and the set of all anchor frames is a subset of temporally subsampled frames. This compact signature saves the storage space as well as reduces the compu- tational complexity drastically. The size of the original database is about 35GB, while signatures for all video clips in the database only take 1.2MB, which is about 0.33% of the original database size. To demonstrate the speed of the proposed video copy detection system, the run time information on this data set is obtained using the Linux command time on a computer with 2.16GHz Core 2 Duo CPU, and 2G memory. Table 3.3 summarizes the time needed for anchor frame extraction for the entire database and the query set, and the time for signature comparison. For a video database of length about 80 hour, less than 1 hour is needed in extracting all anchor frames. Besides, only 3 minutes are needed in extracting anchor frames for all query video clips. The power of the proposed sux-array-based approach lies in ecient signature matching. In our experiment, there are 15 query video clips and 101 video clips in the video database, there are 1,515 video pairs for comparison. 62 Table 3.2: Results of matching percentages with the query video and its ground truth. Query Ground Truth Maximum Match Percentages Query1 movie27 53.2% Query2 not in db 0%-5.45% Query3 movie8 87.5% Query4 not in db 0%-6.66% Query5 movie44 80% Query6 movie76 0% Query7 not in db 0%-5.26% Query8 not in db 0%-5.66% Query9 movie9 81.35% Query10 movie21 0% Query11 movie37 0% Query12 not in db 0%-6.38% Query13 movie11 100% Query14 movie17 69.68% Query15 movie68 98% Query16 movie13 74.67% Query17 not in db 0%-3.78% Query18 movie39 56.48% Query19 movie52 61.84% Query20 not in db 0%-4.83% Query21 not in db 0%-7.89% Query22 movie78 62.88% Query23 not in db 0%-6.71% Query23 not in db 0%-5.08% Query25 movie83 81.81% 63 Table 3.3: The Time Needed for Each Dierent Tasks Task # of Sub-task Time Anchor frame extraction (database) 101 videos (80hr) 61m56.867s Anchor frame extraction (query) 25 videos, (7hr) 6m38.640s Signature comparison 101 25 = 2525 1m50.124s Using the proposed approach, only 12 seconds are needed, which means that each pair of comparison only takes about 8 milliseconds on the average. 
3.4.1.5 Characterization of Failure Video Contents As shown in Table 3.2, there are how three query videos that are duplicates of some video programs in the database, but have 0% match when compared with their corresponding ground truths. They are Query6, Query10, and Query11. These 3 query video clips have one common property. That is, they all contain a lot of gradual transitional eects and camera/object motions. Specically, we have the following observations. Query6, which is from a home-made video recorded by a hand-held camcorder, contains several (total 15) transitions at the beginning and the end of the video clip. Except that, there is only one gradual transition in the middle of the video clip with camera shaking and panning motion present. It is dicult to select anchor frames that correspond to certain special events. Query10 is a documentary le that contains many shot transitions, all of which are gradual transitions. Besides, these gradual transitions are all relatively long. They take about 20 frames, which is almost 1 second for a 25 fps video, to complete the transition. Thus, the 0.2-second temporal subsampling of the video still cannot 64 capture such transitions. Besides, we nd it dicult to mark other kinds of anchor frames, since Query10 has very steady contents in each shot. Query11 is a football game video that contains only gradual transitions and a lot of camera and object movements including camera panning, zoom in/out, and players running on the led. Again, there is no easily identiable event that can be marked as anchor frames. The extremely busy imagery is the main reason that this query video cannot be identied as a match with its ground truth. The false negatives on these 3 query video clips indicate that the signature based on luminance histogram works better for video contents that contain more unique, detectable shots, or longer video programs such as movies or TV programs. Actually, video clips like these 3 queries are few, and can be treated dierently. By computing the metric d developed in Section 3.3.2, these failure clips can be singled out. We show the value of d for all query video clips in Fig. 3.9. As shown in this gure, the 3 missed query video clips, i.e., Query6, Query10 and Query11, all have signicantly smaller d (t). 3.4.2 Duplicate Video Detection using Panning/Tilting The video database used here is also MUSCLE-VCD-2007 [68], the same one in the previous experiment. Twelve query video clips were created from two source videos. One of the original video is a 12 minute long amateur video recored by a hand-held camera during a dinner and a show. The other one is a 6 minute summary of a football game 65 0 2 4 6 8 10 12 14 16 Query 0 0.5 1 1.5 2 2.5 3 3.5 Mean of the standard deviation Figure 3.9: Comparison of the mean of the standard deviation for all query video clips. recorded on TV. Both video clips involve plenty camera motion. Twelve query videos are created by applying six dierent attacks to the two source videos: color adjustment (shift in hue) spatial cropping Gaussian noise Gaussian blurring camcording strong recompression with much lower bit-rate (from 1.15Mbps to 200Kbps or 500Kbps). 66 (a) (b) Figure 3.10: Query video clips (left) and their ground truths (right) attacked by (a) camcording and subtitles and (b) camcording. Table 3.4 lists the ground truth video and the attack applied for each query video. Fig. 3.10 shows two of the query video clips side-by-side with their corresponding ground truth videos. 
When nding events with matched lengths, only segments of more than 3 consecutive events are considered as true matches. Then, the percentage of matched events are computed with respect to the signature length of the query video. If it is high enough, the query video is considered a duplicate of the other video in the database. The matching results of 12 queries with their ground truths are shown in Table 3.5. Query videos only have high match percentages when comparing with their corresponding ground truths. 67 Table 3.4: Ground truth and attack applied for each query. Query Ground Truth Attack query01 movie37 Blurring query02 movie37 Color adjustment query03 movie37 Cropping query04 movie37 Noise query05 movie37 Re-compression query06 movie37 Camcording query07 movie76 Blurring query08 movie76 Color adjustment query09 movie76 Cropping query10 movie76 Noise query11 movie76 Re-compression query12 movie76 Camcording Seven of them have match percentages from 63% to 72%, and one has 53%. Although the remaining four have lower match percentages, from 25% to 31%, they can still be claimed as successful detection. The reason is that, when a query video is compared with a video other than its ground truth, the match percentage is almost 0% most time. Few pairs have 5% of match and only 2 pairs have the match percentage a little more than 10%. Thus, most of comparison pairs can be eliminated very quickly. Therefore, queries having 25% to 31% match with certain video in the database are very likely to be duplicates and should be further examined. The speed of the proposed algorithm is demonstrated by measuring the time needed to extract signature and signature comparison, respectively, using the Linux command time on a computer with 2.16GHz Core 2 Duo CPU and 2GB memory. Extracting 68 Table 3.5: Match percentages of queries with their ground truths. Query Match Percentage query01 71% query02 71% query03 64% query04 63% query05 29% query06 31% query07 63% query08 72% query09 69% query10 25% query11 53% query12 26% signatures for 12 query video clips, which have a total length of almost 2 hour, takes about 8.5 minutes, while only 68 seconds are needed for comparing 12 signatures with the 101 signatures in the database, which involves 1,212 paris of comparisons. Combining the approach based on panning/tilting event detection with the one based on shot boundaries, 24 out of 25 query video clips listed in Table 3.1 can be successfully detected. To compare the result with others, the results of all participants of the CIVR 2007 video copy detection evaluation showcase, which can be found at [68], are used. In the evaluation showcase, the rst 15 query listed in Table 3.1 were used. The most accurate result in this event was 86%, which corresponds to successfully detecting 13 queries out of 15, while the proposed system achieved 14 out of 15. Most participants took more than 40 minutes to process and match the query clips against the preprocessed database, while the proposed system took only 15 minutes. Although the fastest result 69 reported in [68] was 14 minute, its accuracy was only 8 out of 15 queries (or 53.33%). Thus, as compared with results of other participants of the evaluation showcase, our algorithm works much faster and achieves a higher detection rate. 3.4.3 Duplicate Video Detection on Varying Frame Rates In this experiment, the proposed system is applied to the real-world video data collected from the Internet. The video clips uploaded to the Internet can be in any frame rate. 
The most commonly used frame rates are 30, 25, 24, and 15 fps. The issue of how to adapt to varying frame rates must be dealt with rst before actually applying the algorithm to them. As described in Section 3.2, the temporal domain sub-sampling is performed before extracting anchor frames. To adapt to varying frame rates, the sub-sampling factor can be computed via s =round ^ s ^ f f ! ; (3.14) wheref is the frame rate of the current video clip, ^ f the target frame rate that we intend to convert, ^ s is the sub-sampling factor associated with the target frame rate. To give an example, if ^ f = 5fps, f = 15fps, ^ s = 6, then s = 2. The gap sequence extraction is conducted after adjusting the temporal sub-sampling factor. On one hand, with adaptive frame skipping, we can keep the time interval between two adjacent frames about the same although the original frame rates of these video clips are dierent. On the other hand, it is very dicult to obtain the exact same time interval due to the discrete nature of sampled frames. To compensate the small discrepancy 70 that might exist, a quantization process is applied to the gap sequence. That is, the gap sequence is quantized by a factor before constructing the sux array for the gap sequence alignment. A coarser quantization (equivalently, a larger quantization factor) should be able to detect more duplicates but, could result in a higher false alarm rate. 3.4.3.1 Data Collection Video clips were collected using TubeKit [83], which is a toolkit to generate YouTube crawlers according to user specication. The created crawlers automatically download video clips that match the specied requirement. Two dierent query strings were chosen and sent to YouTube. Video clips that are in the search results of the query string are downloaded. The strings used and the resulting numbers of video clips are shown in Table 3.6. The number and the percentage of duplicates are also shown in the table. Query1 and Query2 correspond to two very popular video clips on YouTube, a music video and a clip of a recording from a TV talking show, respectively The lengths of Query1 and Query2 are 2 min 48 sec and 5 min 23 sec, respectively. The total lengths of the whole set are roughly 11 hour 10 minutes and 12 hour 23 minutes, respectively. Several examples of duplicate video clips are shown in Fig. 3.11. The dierent properties of these clips include color, aspect ratio, cropping, quality caused by dierent compression settings, etc. The addition of title, logo, and/or subtitles is also a commonly seen modication. Some clips contain fade-in/out at the beginning and/or end, and addition of opening and/or ending credit. There are also some modications in the time domain such as speeding up or slowing down the video, which would totally change the lengths of each event. Strictly speaking, video clips that dier too much in 71 Table 3.6: Queries for crawling video from YouTube Query Duplicates Query String # # Percentage 1 White and Nerdy 221 78 35.29% 2 Jimmy Kimmel Matt Damon 190 57 30% the playback speed should not be regarded as duplicates since the viewing experience is quite dierent. However, they are still labeled as duplicates in the experiment. 3.4.3.2 Performance Measurement We chose one \White and Nerdy" sequence and one \Jimmy Kimmel Matt Damon" sequence as the query sequences and compared each of them with all other video clips to calculate a score and determine if they are duplicates. The percentage of the matched gaps is used as the score. 
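Anticipating the matching rule spelled out next, the sketch below quantizes two gap sequences as described earlier in this subsection and computes the fraction of query gaps covered by runs of at least three equal quantized gaps. A brute-force scan stands in for the suffix-array matcher of Section 3.3.3 to keep the snippet self-contained; the quantization rule (integer division), the threshold value, and the example sequences are illustrative assumptions rather than values taken from the experiments.

```python
def quantize(gaps, q):
    """Coarsen gap lengths before matching.  A larger q tolerates more timing
    drift (e.g., from frame-rate conversion) but raises the false-alarm rate.
    Integer division is one reasonable choice; the text does not fix the rule."""
    return [g // q for g in gaps]


def match_score(query_gaps, db_gaps, min_run=3):
    """Fraction of query gaps covered by runs of at least `min_run` consecutive
    gap lengths that also occur, in order, in the database signature."""
    covered = set()
    for i in range(len(query_gaps)):
        for j in range(len(db_gaps)):
            run = 0
            while (i + run < len(query_gaps) and j + run < len(db_gaps)
                   and query_gaps[i + run] == db_gaps[j + run]):
                run += 1
            if run >= min_run:
                covered.update(range(i, i + run))
    return len(covered) / len(query_gaps)


# Illustrative decision; the actual threshold and quantization factor were
# varied in the experiments to trace the precision-recall curves.
Q, THRESHOLD = 2, 0.3
score = match_score(quantize([14, 70, 21, 92, 57, 23], Q),
                    quantize([14, 70, 21, 93, 57, 23], Q))  # one gap drifts by a frame
is_duplicate = score > THRESHOLD                            # True: quantization absorbs the drift
```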
When the percentage is higher than a threshold, the two video clips under consideration are labeled as duplicates. Note that gaps are considered as matched only when the gap lengths are equal for at least 3 consecutive gaps. The obtained result is then compared with the ground truth, which is established by manual labeling after the inspection of all video clips. The performance is measured by the precision-recall curve, which is generated by varying the threshold of declaring a duplicate and the quantization factor. They are dened as Precision = TP TP +FP ; (3.15) Recall = TP TP +FN ; (3.16) 72 Figure 3.11: Sample frames of duplicate video clips of Query1. 73 where TP is the number of true positive, FP is the number of false positive, TN is the number of true negative, and FN is the number of false negative. 3.4.3.3 Results The precision-recall curves of both queries are shown in Fig. 3.12. Red crosses correspond to Query1 and blue circles correspond to Query2. For Query1, the proposed algorithm achieves a high precision value (95%) when the recall is around 80% as shown in the gure. A high precision means that the false positive rate is low. Thus, if the proposed system indicates that the given video clip is a duplicate, it is almost certain that the claim is true. On the other hand, a lower recall means some duplicates would be missed. When trying to push the recall over 80% by varying the threshold and the quantization factor, the precision would drop to less than 60%. The curve of Query2 has a similar behavior to that of Query1. The precision remains very close to 1 until the recall increases and reaches value around 80%. Most of missed duplicate video clips (i.e., false negatives) have modications in the time domain. Several clips were deliberately speeded up or slowed down. A few clips were re-edited by repeating certain segments or chopping o certain segments. There is also one clip with gradual shot transitions that have never been seen in the original clip. One thing worth mentioning is that some duplicate clips that have frame rates dierent from the original clip were not detected if the gap sequences were not quantized. With a small quantization factor (ne quantization), such as 2 or 4, they can be successfully detected as duplicates. Some of aforementioned clips that were modied in the temporal 74 0 0.2 0.4 0.6 0.8 1 Recall 0.4 0.5 0.6 0.7 0.8 0.9 1 Precision Query1 Query2 Figure 3.12: The precision-recall curves of Query 1 (denoted by crosses) and Query 2 (denoted by circles). domain can be detected by applying a coarser quantizer (a larger quantization factor), but the number of false positives increases at the same time. Another interesting phenomenon is that some duplicate clips for Query2 that were missed seems to form a group themselves. They can be detected as duplicates inside their own group but not with some other duplicate clips. A possible reason is that since this query string corresponds to a clip in a TV show, anybody can record this show when it was aired or on the re-run, and dierent recording devices might have dierent recording mechanisms that somehow changed the time stamps. An algorithm similar to the one proposed in [107] was implemented and compared with the proposed algorithm, which uses dynamic programming for sequence matching. The feature sequences used here is the luminance histogram dierence d(t) as calculated 75 in Eq. (3.1), which is similar to the color-shift signature used in [107]. 
In dynamic programming, a cubic scoring function proposed in [107] is used: S = 20 ( 10) 3 50 ; (3.17) where S is the score and the dierence between symbols ( d(t) in the current case). The precision-recall curves of the algorithms based on the sux array, and dynamic programming are shown in Fig. 3.13(a) and Fig. 3.13(b) for Query1 and Query2, re- spectively. We see that the performance is comparable and superior to [107] for Query1 and Query2, respectively. This demonstrates instability of directly using frame-based features. In addition, the proposed algorithm is about 4 to 5 times faster than dynamic programming in terms of program running time for the signature matching stage. 76 0 0.2 0.4 0.6 0.8 1 Recall 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Precision Suffix Array Dynamic Programming (a) 0 0.2 0.4 0.6 0.8 1 Recall 0.2 0.4 0.6 0.8 1 Precision Suffix Array Dynamic Programming (b) Figure 3.13: The precision-recall plots for (a) Query1 and (b) Query2. 77 Chapter 4 Video Genre Inference Based on Camera Capturing Models 4.1 Introduction Digital representations have been widely adopted for multimedia contents. With ad- vanced technologies in broadband networks and multimedia data compression, the cre- ation of large online video repositories for easy access becomes feasible, and the size of digital video collection increases rapidly in recent years. Eective management of a large video collection becomes an important issue. Automatic video genre classication is one important and challenging problem among the issues of eective management of large video collections, which has attracted a lot of attentions for more than two decades [11]. Its goal is to place each video title in a category (e.g., movies, news, sports, etc.). Most research focuses on classifying the entire video while some attempts to classify smaller video segments (e.g., classifying dierent segments from a broadcast news program [106]). Traditionally, classifying video into several pre-determined categories such as news, sports and commercials involves two steps. First, models are built for each genre from a set of training video clips. Second, video clips with unknown genre are compared with 78 the models of the pre-determined genres. In the rst step, visual and/or audio features are extracted to represent each video clip. Features that explain the variation between genres the most should be chosen. Learning methods are used to bridge the gap between low-level features and genres, which is a high-level semantic concept. In the second step, a proper similarity function is involved to place a target video into a certain category. Previous work on video classication is concerned with classifying video into one of broad categories such as movies, sports, news, or into more specic sub-categories such as dierent kinds of sports. These categories all have one common property. That is, they are shot and edited by professionals and, thus, called professional video. On the other hand, the rapidly dropping price in hand-held cameras and video editing software makes it possible for everyone to become a video producer. Online video sharing websites such as YouTube [99] and Google Video [29] are lled with user-generated contents, called amateur video, these days. These contents are getting more and more popular and the amount of them is getting larger and larger. Since professional and amateur video contents have dierent commercial values, their automatic separation would facilitate the management of large video collections. 
Examples include handling the copyright issues in some professional video clips uploaded by certain users, where amateur video clips are certainly of less importance for content owners such as the movie or TV industries. This is the main focus of our current research. Many video genre classication methods rely on low-level features without considering the shooting process. It is observed that one main dierence between professional and amateur video is the process how they are produced. For example, their characteristics is highly correlated with the number of cameras used to shoot the scene. That is, most 79 amateur video is shot with only a single camera while professional video is shot with multiple cameras. If the number of cameras used in the shooting process can be inferred from the low-level video features, amateur video can be more easily identied even with the presence of minor editing. Based on this observation and others, instead of directly extracting features from video contents, we attempt to infer the shooting scenarios asso- ciated with these two video types. Then, features that can re ect dierent camera usages are extracted to distinguish between professional and amateur video contents. They in- clude: the number of cameras in a shooting scene, camera shakiness, sharpness, color variance, and the distance between the camera and the subject, etc. This chapter is organized as follows. By studying the lming principles, two novel features that model the camera capturing scenario are proposed in Sec. 4.2. These two features are tested on dierent genres of professional video in Sec. 4.3, where it is demonstrated that the genre can be inferred from the proposed features. In Sec. 4.4, the major dierences between professional and amateur video contents are discussed in several aspects. Several features used to capture such dierences are discussed and presented in Sec. 4.5. Experimental results on automatic separating professional and amateur video contents are given in Sec. 4.6. 4.2 Camera Capturing Model Traditional methods use low-level features such as the shot length, color, motion etc. Here, a novel approach based on an intermediate video capturing model is introduced to bridge the gap between low-level features and the video genre. The idea is that the video 80 genre is inherently linked with the video capturing scenario. For example, only a single camera is involved for personal video while multiple cameras and editing are employed for producing professional video. Another example is the dialogue scene, which happens frequently in romance and other character movies. In dialog scenes, two cameras are often used to capture the faces of two actors. Two novel features for video genre classication are proposed based on the video capturing scenario. The rst one is the number of cameras. By comparing the similarity between key frames of shots, we can roughly determine how many scenes there are within a certain time window. The second feature is the distance of the subject to the camera. If the subject is farther away from the camera, it appears smaller in the frame and could be of less importance. On the contrary, if the subject is closer to the camera, the subject is bigger and could be more important. Since the features are extracted by taking the lming process into account, most commonly used classiers can successfully label video clips into dierent genres. This section started by discussing the basis knowledge in lming, especially the eects of dierent usages of cameras. 
Then, two features are proposed. 4.2.1 Filming Basics There are many ways to use cameras depending on how you want to shoot and present the scene. The camera characteristics can be roughly categorized by dierent subject distances, camera angles (including horizontal and vertical positions), focal lengths, or camera levels as explained below. 81 Subject distances: extreme long shot, long shot, full shot, medium shot, head and shoulders close-up, close-up, big close-up, extreme close-up. Horizontal position: front angle, prole angle, rear angle. Vertical position: bird's eye angle, high angle, neutral angle, low angle, worm's eye angle. Lens or focal lengths: wide-angle lens, telephoto lens. Camera levels: normal, dutch angle (tilting the camera o to the side so that the shot is composed with the horizon at an angle to the bottom of the frame). Dierent uses of cameras can have dierent presentations for the same scene, and convey dierent information. For example, close-ups indicate the importance of a subject being lmed and create impacts; longer shots make the scene less intense; high angles make the viewer feel more powerful than the subject while low angles suggest the pow- erfulness of the subject; and dutch angles are often used to create tension or uneasiness. Therefore, it is reasonable to argue that there could be dierent distributions of camera distances or angles for video of a dierent genre. For ction video that wants to tell some stories usually sticks to a moderate position or angle, because they do not want to call attention to themselves. This is especially true for video like comedy, which usually tries to avoid big close-ups. The drama is similar except that tighter angles tend to be chosen to increase emotional intensity. As for action movies, a lot of dutch angles are used. Sometimes, unusual angles are used for some special eects. For example, a sudden bird's eye view or an extreme close-up can startle the audience. However, these special angles are not used often. 82 As for non-ction video such as news, sports and documentaries, the important thing is neutrality. Therefore, camera setups tend to be neutral. The camera angle types range from the long shot to the loose close-up. Extreme setups like the wide angle, extreme close-up, or o-level angles are often avoided when lming video of these genres. For example, on-camera reporters are usually covered in medium shots. Even for the news shots in the eld, similar rules also apply. Long shots or telephoto lens are used often in sports videos, since directors would like the camera to be o the playing eld. Sometimes, closer shots on the players, either on the eld or the bench, coaches, or even the spectators may be desirable. In these occasions, some close-ups may be used. Commercial and music video contents usually contain a lot of editing or shooting eects. In some ways, more extreme angles are preferred to create as dramatic and/or striking feeling as possible. 4.2.2 Camera Number Many existing methods use the average duration of shots or number of shots as one of the feature to classify video. It is based on the observation that action video usually has shorter and more shots to create the intense feeling for viewers, like car chasing, ghting or explosion scene. On the other hand, drama or romance video tends to have longer shots to develop the characters or scenes. Note that a longer duration also implies fewer shots in a x interval. 
However, drama movies sometimes also have shorter shot durations, not necessarily as short as in action movies, but short enough to cause ambiguity. Take dialogue scenes, which happen a lot in drama movies, as an example. Two cameras are used to capture the faces of the two persons in the dialogue. There could be more than 10, or even almost 20, shot changes during a one-minute conversation between two people, depending on the content of the conversation. Action movies sometimes have around 20 shot changes in one minute. Therefore, it is more important to determine the number of cameras used during a period of time than to count the number of shot changes.

To infer the number of cameras, shot boundary detection is first performed [20], [57], [100]. Shot boundaries can be determined by computing a certain similarity (or distance) measure between adjacent frames. If the similarity is below (or the distance is above) some threshold, a shot boundary is declared at the current frame. The same idea can be extended to infer the number of cameras. If a camera is not moving or changing its focus, frames that are shot by the same camera should look similar since they are close in time. The simplest way is to compare every pair of frames during a short period of time. However, this would demand high complexity. An alternative is to extract key frames. Shot change detection is first applied. Then, we may use the first frame of each shot as the key frame [69]. Although there exist more complicated algorithms for determining the key frame of each shot, this simple approach should suffice for our current purpose.

After shot boundary detection, we can simply compare the extracted key frames. If two key frames are similar enough according to a certain distance measure, they are labeled with the same camera index. Note that it is not meaningful to consider the number of cameras over a long time interval since the camera setup changes between scenes. Thus, the number of cameras should be calculated within a short time interval only, say, one minute. Furthermore, since camera motion is common, one key frame may not be sufficient to represent a shot. In this case, we may need to extract more than one key frame from a shot. To achieve this, we adopt the approach presented in [103], which compares each frame with the current key frame in the shot. If the distance between them exceeds a threshold, the current frame is selected as the next (new) key frame. This process is illustrated in Fig. 4.1.

Figure 4.1: The process of determining new camera frames.

The proposed algorithm to infer the number of cameras within a short time interval is summarized below; a code sketch of the procedure follows the steps.

1. Compute the color histogram for the current frame i.

2. Compute the distance between the current and the previous frames. If the distance exceeds a threshold, T_1, the current frame is chosen as a new key frame. For each new key frame, compute the distances to the previous key frames. If the shortest distance is less than a certain threshold T_2, assign the camera index of the most similar key frame to the new key frame.

3. If the condition in Step 2 is not met, compute the distance between the current frame and the previous key frame. If the distance exceeds a threshold, T_3, the current frame is chosen as a new key frame, but with the camera index of the key frame corresponding to the previous shot boundary.

4. Proceed to the next frame by incrementing the frame index i by one.
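As an illustration only (not the exact implementation used in the experiments), the procedure above can be sketched as follows; the color-histogram construction, the L1 histogram distance, and the default values of T1, T2 and T3 are assumptions made for the sketch.

```python
import numpy as np

def color_histogram(frame, bins=16):
    """Concatenated per-channel histogram of an RGB frame, normalized to sum to 1."""
    channels = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
                for c in range(frame.shape[-1])]
    hist = np.concatenate(channels).astype(float)
    return hist / max(hist.sum(), 1.0)

def hist_distance(h1, h2):
    """L1 distance between two normalized histograms (assumed distance measure)."""
    return float(np.abs(h1 - h2).sum())

def count_cameras(frames, T1=0.6, T2=0.35, T3=0.45):
    """Infer the number of cameras used within a short window of frames.

    T1: shot-boundary threshold (Step 2); T2: key-frame similarity threshold
    for reusing a camera index (Step 2); T3: within-shot threshold for
    spawning an extra key frame (Step 3). The values are illustrative.
    """
    key_hists = []   # histograms of the key frames seen so far
    key_cams = []    # camera index assigned to each key frame
    n_cameras = 0
    prev_hist = None
    for frame in frames:
        h = color_histogram(frame)                      # Step 1
        if prev_hist is None:
            key_hists.append(h)
            key_cams.append(0)
            n_cameras = 1
        elif hist_distance(h, prev_hist) > T1:          # Step 2: shot boundary
            dists = [hist_distance(h, kh) for kh in key_hists]
            best = int(np.argmin(dists))
            if dists[best] < T2:
                cam = key_cams[best]                    # reuse an existing camera
            else:
                cam = n_cameras                         # open a new camera
                n_cameras += 1
            key_hists.append(h)
            key_cams.append(cam)
        elif hist_distance(h, key_hists[-1]) > T3:      # Step 3: in-shot camera motion
            key_hists.append(h)
            key_cams.append(key_cams[-1])               # same camera as the shot
        prev_hist = h                                   # Step 4: next frame
    return n_cameras
```

In use, the count would be computed once per one-minute window, as discussed above.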
ThresholdsT 1 ,T 2 andT 3 in the algorithm can be determined from a set of training video sequences. 4.2.3 Camera Distance The distance between the camera and the objects being shot can be classied into the following several types. Long shot (LS) (sometimes wide shot): Long shots typically show the entire subject (for example, human gures) and usually include a large portion of the surroundings to provide a comprehensive view. There are also the extreme long shot (XLS) and the medium long shot (MLS). The extreme long shot is obtained when the camera is at the furthest position from the subject. It usually shows the outside of a building or a landscape. The medium long shot is a distance somewhere between long shot and medium shot. In the case of a human gure, it usually cuts o the feet and ankles. Medium shot (MS): In the medium shot, the subject and its surrounding occupy about the same areas in the frame. In the case of the human gure, a medium shot 86 is from the knees or waist up. It is usually used for dialogue scenes. It is good at showing body languages but lacks the ability to show facial expressions. Close-up (CU): Close-up shots show a very small part of background. They con- centrate on the detail, such as a character's face, which usually lls almost the whole frame. Variations include the medium close-up (MCU), the extreme close-up (XCU), etc. A medium close-up includes the head and the shoulder in the case of a human gure. On the other hand, an extreme close-up magnies the subjects beyond what we usually experience in a real world. In the case of a human gure, it shows the head, usually from the forehead to the chin. When the distance of cameras changes, the main dierence is the object size, which gets bigger from the long shot to the close-up as demonstrated in Fig. 4.2. Thus, by measuring the ratio of the foreground object area to the background area or the entire frame, the relative shot distance can be estimated. Extraction/detection of foreground or moving objects is an important step in appli- cations such as object tracking and identication in a video surveillance system. For some applications, the background information is available in all frames, for example, when the background is static. Instead of modeling foreground objects, the background information allows the detection of the foreground by \learning" and \subtracting" the background from the video frame. Dierent background modeling techniques have been proposed in the literature [33], such as the one based on the edge map [45], the background registration technique [17], the mixture of Gaussian (MoG) model [66], [86], etc. 87 (a) (b) (c) (d) (e) (f) (g) Figure 4.2: Dierent camera distances: (a) extreme long shot, (b) long shot, (c) medium long shot, (d) medium shot, (e) medium close-up, (f) close-up, and (g) extreme close-up. However, when the background is dynamic or the camera is moving, modeling and de- tection of background becomes a challenging problem [3], [67], [77]. Background modeling and subtraction cannot be applied directly to these cases. Usually, motion compensation has to be applied to video frames rst to compensate the movement caused by a moving camera. To this end, parameters of a camera motion model are estimated, usually based on optical ow, or motion vectors of certain feature points. These techniques assume that the underlying camera motion models are accurate enough so that video frames can be well compensated and aligned with others. 
Nevertheless, motion vectors are usually noisy, which means accurate camera motion reconstruction is generally difficult, and the estimation of camera motion parameters is usually a complicated process if a certain accuracy is required. In addition, unlike surveillance video, shot changes occur frequently in movies, where the background model has to be re-initialized from time to time. This makes foreground/background separation even more challenging.

Besides the cumbersome procedure of camera motion estimation, motion compensation of video frames, background modeling, and background subtraction, there is another approach to foreground/background separation. That is, the human visual attention model can be used for this purpose. Foreground objects, usually with more motion, attract more human attention than the background. By examining the motion vector field, it is possible to identify regions that attract more attention. A director usually moves the camera to track the movement of objects. Thus, motion vectors should be approximately equal to a global motion, denoted by (v_1, v_2), which can be roughly estimated by the mean of the motion vector field:

\bar{v}_i = \frac{1}{MN} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} v_i(x, y), \quad i = 1, 2,    (4.1)

where M and N are the numbers of blocks in the column and the row, respectively, and v_i(x, y) is the motion vector at position (x, y). Foreground objects are identified as regions with motion vectors different from the global motion. Let F(x, y) be a map denoting background by 0 and foreground by 1. Then, it can be obtained as

F(x, y) = \begin{cases} 1, & \text{if } |v_i(x, y) - \bar{v}_i| > \sigma_i, \\ 0, & \text{otherwise,} \end{cases}    (4.2)

where \sigma_i is the threshold between foreground and background. If the current motion vector deviates greatly from the mean motion vector, the current block is labeled as foreground. Here, the threshold is selected as the standard deviation of the motion vector field:

\sigma_i^2 = \frac{1}{MN} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} (v_i(x, y) - \bar{v}_i)^2, \quad i = 1, 2.    (4.3)

However, if the background has some homogeneous and/or periodic content, the estimated motion vectors could be wrong, resulting in some background blocks being wrongly labeled as foreground. To remedy this, for each 4x4 block in the current frame, a close neighborhood of the corresponding position in the previous frame is checked. If there is a block in the previous frame that is similar enough to the current block, the current block should be labeled as background. The similarity of two blocks is measured by the sum of absolute differences (SAD) of the luminance, normalized by the sum of the luminance of the current block. Mathematically, we have

D_{x,y} = \frac{\sum_{i=0}^{3} \sum_{j=0}^{3} |I_t(x+i, y+j) - I_{t-1}(x+i+\Delta x, y+j+\Delta y)|}{\sum_{i=0}^{3} \sum_{j=0}^{3} I_t(x+i, y+j)},    (4.4)

where I_t(x, y) is the luminance component of the current frame, and \Delta x and \Delta y define the local neighborhood. If D_{x,y} is smaller than a threshold, the current block is labeled as background; otherwise, the foreground/background label is determined by F(x, y). The camera distance is then estimated by the ratio of the foreground area to the entire frame, which is called the normalized foreground area. The threshold for D_{x,y} is determined empirically. A code sketch of this foreground labeling procedure is given below.
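The following is a minimal sketch of the labeling procedure under stated assumptions: blocks are 4x4, the per-component thresholds of Eq. (4.2) are combined with a logical OR (one plausible reading), and the SAD threshold, search range and helper names are illustrative rather than the values used in the experiments.

```python
import numpy as np

def foreground_map(mv, luma_cur, luma_prev, sad_thresh=0.15, search=2):
    """Label 4x4 blocks as foreground (1) or background (0).

    mv:        motion vector field, shape (M, N, 2), one vector per 4x4 block
    luma_cur:  luminance of the current frame, shape (4*M, 4*N)
    luma_prev: luminance of the previous frame, same shape
    Returns the block-level map F and the normalized foreground area.
    """
    M, N, _ = mv.shape
    vecs = mv.reshape(-1, 2).astype(float)
    v_bar = vecs.mean(axis=0)              # Eq. (4.1): global (camera) motion
    sigma = vecs.std(axis=0)               # Eq. (4.3): per-component threshold
    # Eq. (4.2): foreground if either component deviates beyond the threshold.
    F = (np.abs(mv - v_bar) > sigma).any(axis=2).astype(np.uint8)

    # Eq. (4.4): relabel as background if a nearby 4x4 block of the previous
    # frame matches the current block well (normalized SAD below a threshold).
    H, W = luma_cur.shape
    for bx in range(M):
        for by in range(N):
            if not F[bx, by]:
                continue
            x, y = 4 * bx, 4 * by
            cur = luma_cur[x:x + 4, y:y + 4].astype(float)
            denom = cur.sum() + 1e-6
            matched = False
            for dx in range(-search, search + 1):
                for dy in range(-search, search + 1):
                    px, py = x + dx, y + dy
                    if 0 <= px <= H - 4 and 0 <= py <= W - 4:
                        prev = luma_prev[px:px + 4, py:py + 4].astype(float)
                        if np.abs(cur - prev).sum() / denom < sad_thresh:
                            matched = True
                            break
                if matched:
                    break
            if matched:
                F[bx, by] = 0
    # Normalized foreground area: fraction of blocks labeled as foreground.
    return F, float(F.mean())
```

The per-frame ratio returned here is what Sec. 4.3.1 aggregates into a per-clip feature.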
One commonly used feature in video classification is frame differencing. It measures the amount of motion between frames. If the camera is still, frame differencing can capture the movement of foreground objects. However, if the camera is moving, the content of the entire frame changes and, thus, it is difficult to capture the movement of the foreground. In the current context, since regions that are consistent with the global motion are excluded, the foreground objects can be identified and the camera distance can be estimated.

4.3 Preliminary Experiment Results

Several video programs of different genres were collected from YouTube for two classes, movie and non-movie. The movie class contains action, cartoon, drama, and horror, while the non-movie class contains music video, news, and sports, as shown in Fig. 4.3.

Figure 4.3: Video genres considered in this experiment.

Video programs were first divided into segments of approximately equal duration of 1 minute. Then, they were all encoded in the H.264 video format. There are 20 one-minute clips in each genre, 140 in total, except that action08.mp4 is 20 seconds long. The total length is around 2 hours and 20 minutes.

Note that this amount of video is not enough for both training and testing. In addition, to form a complete representation of each genre, more features that take other aspects such as color into consideration should be included as well. The results shown here only serve as preliminary results that demonstrate the feasibility of the two proposed features. All thresholds used to extract the two features are determined empirically.

4.3.1 Camera Distance

Fig. 4.4 shows some examples of the foregrounds extracted from a short clip in the movie Hancock, a soccer game on TV, and a chasing scene in a horror movie. The light-gray area represents the foreground while the dark-gray area represents the background. As can be seen, the proposed algorithm can successfully extract the areas that attract visual attention the most.

Figure 4.4: Several foreground maps.

Generally speaking, a different camera distance achieves a different visual effect, and different genres should have different distributions of the camera distance. Therefore, the normalized foreground area should not be averaged over the whole clip. Instead, the variance of the area is calculated. In addition, the histogram of the normalized area is calculated. In this work, the foreground area is normalized and quantized to the range 0 to 9, and the histograms are averaged within each genre, as shown in Fig. 4.5.

Figure 4.5: The camera distance histogram for each genre.

We see that the action movie has a more distinct histogram compared with the others. The histograms for music video and sports video are similar to the one for action, but with fewer frames having larger values (bins 3-6). Horror forms another group by itself; it is a little like action but actually quite different, considering its larger value at bin 0. Cartoon, drama, and news form the third group. The common point of these three genres is that they often have a static background. In addition, the scenes of a news presenter bear a strong resemblance to dialogue scenes in dramas. As for cartoon, when one object is moving, other objects tend not to move because this is easier for animators, which gives these scenes a foreground proportion similar to that of news or drama scenes.
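To make the construction of this per-clip descriptor concrete, a small sketch follows; the quantization into ten bins mirrors the description above, while the function name and the normalization by frame count are assumptions.

```python
import numpy as np

def camera_distance_feature(foreground_ratios, bins=10):
    """Variance and 10-bin histogram of per-frame normalized foreground areas.

    foreground_ratios: values in [0, 1], one per frame of a roughly
    one-minute clip (e.g., the ratios produced by foreground_map above).
    """
    r = np.asarray(foreground_ratios, dtype=float)
    quantized = np.clip((r * bins).astype(int), 0, bins - 1)   # range 0..9
    hist = np.bincount(quantized, minlength=bins) / max(len(r), 1)
    return float(r.var()), hist
```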
Note that since the proposed algorithm is based on the motion vector eld, the fore- ground object would not be detected if it is not moving at all. That is the reason that many frames have value 0. Thus, this can be viewed as a metric related to motion as well. However, unlike frame dierence methods which cannot extract the foreground when the camera is moving, the proposed algorithm is able to measure the portion that foreground objects take and, as shown by the results, some information about the video genre can be inferred from the distribution of the foreground ratio. To conclude, the proposed camera distance measure does have dierent distributions among dierent genres. However, other features should be used as well in order to obtain a more precise description of the video genre. For example, more accurate measurement of the motion information can be obtained if it is combined with the frame dierence method. 4.3.2 Camera Number As mentioned earlier, the average shot number is often used in video genre classication, but it fails to consider the fact that there are sometimes fewer scenes than shots. The dialog scene, which happens often in movies that need character development, is such an 95 Figure 4.6: Shots of drama03.mp4. example. Although there might be many shot changes during a conversation, there is actually only 2 cameras covering the two persons. Fig. 4.6 shows the rst frame of each shot in drama03.mp4, which is a short clip in the movie No Country for Old Men. This clip contains a typical dialog scene, where an establishing shot shooting the outside of a store is used to point out the location of the following event. A second shot is one person inside the store followed by another person walking into the store. Then, a conversation between them happens. As seen in the gure, the shot is alternating between the two persons in the conversation. There are 15 shot changes (with 16 shots in total) during this 1 minute clip. However, there are only 3 cameras in this video. The number of cameras (3) is more meaningful than the number of shots (16) in this case. The proposed algorithm successfully picks the rst three frames shown in Fig. 4.6 to represent the cameras used in this clip, which is an establishing shot, a shot of the rst person and a shot of the second person. 96 Fig. 4.7 gives another example. There are 20 shots in drama07.mp4. However, by examining extracted shots, we see that there are only 6 dierent camera angles, which can be represented by the rst 5 and the last key frames. Most of the time, the director just alternated between two cameras to capture face expressions and body movements of the subjects. The frames circled with boxes in Fig. 4.7 correspond to the frames that are extracted by the proposed algorithm to represent the cameras used in this clip. A total of 7 frames is extracted. The rst 6 are correct frames, while the last one, which is a detected shot and not shown in the gure, is a false alarm caused by the sudden motion in the scene where one person ripped o a wallpaper on the wall very quickly. However, the proposed algorithm does not work well in every situation. Sometimes, the camera position changes when being inactive, which cause the frame content to change and can be regarded as a new camera. For example, in drama06.mp4, there are only 4 dierent cameras but 8 are detected, which is caused by the fact that there are several dierent close-ups for subjects besides a normal medium shot. 
Figure 4.8 shows the numbers of shots and cameras for each video genre. In action movies, the number of shots is generally large since shot changes occur frequently to make the scene more intense. In some cases, the number of cameras is much smaller than the number of shots. The reason is that, to make the scene intense, the director sometimes quickly alternates between two cameras. Although several action movie clips have about the same (or only slightly fewer) cameras as shots, it is safe to say that, if the number of cameras is smaller than the number of shots to a certain degree and the number of shots exceeds a certain threshold, the video clip is likely to be an action movie.

Figure 4.7: Shots of a short clip in a drama movie. Frames circled with boxes correspond to the identified cameras.

Figure 4.8: Number of shots and number of cameras for each genre.

Music video also has many shot changes. The difference is that the number of cameras is almost always about the same as the number of shots. Cartoon and sports videos also have about the same number of shots and cameras, but with fewer shots than action movies or music videos. For news video, the number of shots is significantly smaller than for the other genres. The number of cameras depends on the type of segment. If it is an in-studio segment, the number of cameras is smaller than the number of shots since the cameras alternate between each other. However, if the news segment contains shots in the field, the number of cameras tends to be about the same as the number of shots. Horror movies have more shots than drama movies, but both have significantly fewer cameras than shots. In drama movies, the number of cameras is small because of frequent dialogue scenes. As for horror movies, the director sometimes switches quickly among several cameras to create an intense feeling. In addition, scenes in horror movies are usually very dark, which makes differentiating cameras more difficult than in brighter scenes.

Recall that three groups can be formed by inspecting the camera distance histograms. Consider the first group, which contains action, music video, and sports. Action movies and music videos have similar numbers of shots, though action movies tend to have slightly fewer shots than music videos. On the other hand, sports video has both fewer shots and fewer cameras than action and music video. As for cartoon, drama, and news, which belong to the second group, cartoon has more shots and about the same number of cameras; drama and news both have fewer cameras than shots, but news video typically has fewer shots. Therefore, using the number of shots and the number of cameras, it is easy to distinguish the genres within each group.

To conclude, instead of using the number of shot changes alone, the number of cameras should also be considered. By jointly considering the number of shots and the number of cameras, much more information about the shooting process can be inferred. Note also that the number of cameras should only be calculated over a short interval, such as 1 minute in the demonstrated cases.

4.4 Professional versus Amateur Video

4.4.1 Comparison of Professional and Amateur Video

Professional video, as the name suggests, is well authored. Such video clips are created and edited by video/camera professionals, and shot in a controlled studio environment with good camera equipment.
They are generally shot using multiple professional cameras, 100 and edited depending on the genre and story content. Examples of professional video include motion picture video from the movie industry, and TV programs such as news video, sports video, commercial video, etc. They are carefully edited so that they tend to follow the practice of lm theory. In contrast, amateur video is shot by ordinary users such as hobbyists or casual shooters. They usually attempt to capture certain interesting or memorable moments for fun or archiving. Such video clips have minimal or even no editing. Depending on how it is produced, amateur video can be further classied into two categories: video shot by a xed webcam or a hand-held camcorder (including camera-equipped cellphone). Fig. 4.9 shows examples of amateur video clips. Fig. 4.9(a) presents a video type that is common on the internet. The author wants to demonstrate something. In this case, it is a video clip about song playing on an electric guitar, which was shot by a xed webcam. Fig. 4.9(b) is a video clip that was shot during a concert with a camera-equipped cellphone. Figs. 4.9(c) and (d) are video clips that were shot by a hand-held camcorder in a rollercoaster ride and a birthday party, respectively. Generally speaking, professional and amateur video primarily diers in two aspects; namely, video quality and the editing eect. Amateur video can be easily identied by humans for poorer video quality. Although it may not be always true, amateur video in general has a quality issue in pictures, sound or even both. Out-of-focus and poor lighting are two common problems in amateur video, which tend to result in blurred frames and frames with smaller contrast variation, respectively. If a video clip is shot by a hand-held camcorder, shaky camera motion is often observed. The video content may shake or drift caused by the instability of human hands. 101 (a) (b) (c) (d) Figure 4.9: Examples of amateur video clips. Furthermore, many amateur video clips are shot by a single camera and there is no editing on the captured content. However, since it is easier to access video editing software than before, there exist some amateur video clips that have been edited. The editing styles of professional and amateur video tend to be dierent. That is, there is a rhythm in professional video associated with the emotion and the story while amateur editors select shots in a random fashion. For example, amateur video clips are typically created by simply cascading several clips shot at the same event, or shot on the same subjects but dierent time instances. There is no shot selection at all in most amateur video clips. 102 4.4.2 Relevant Features Traditional methods mostly use low-level features such as the shot length, color, motion, etc. for video genre classication. Based on observations discussed in Sec. 4.4.1, several visual-based features to capture the dierent properties between professional and amateur video are considered in this subsection. Specically, we propose a novel feature set by considering video quality and editing eects. 4.4.2.1 Visual Quality Eect: Camera shakiness { If an amateur video clip is shot by a hand-held camcorder, it often contains a certain amount of shaky motion content because of the instability of hands. In contrast, the camera motion in professional video is much smoother. This feature can be determined by computing the dierence between the estimated camera motion and the smoothed camera motion. 
Sharpness of frame { The out-of-focus problem occurs a lot in amateur video clips. In addition, the quality of the recorded video is often poor. Thus, frames may be blurred in many cases. By measure the sharpness of frames, some amateur video clips can be easily identied. This is achieved by computing the edge map of the frame and using the average edge magnitude as the sharpness measure. Color variance { Unlike the professional shooting environment which is well lit, amateur video usually has poor lighting. In addition, a dierent lighting set-up would be used in a dierent scene in a professional video clip while the lighting 103 condition tends to stay the same in an amateur video clip. Therefore, computing he variance of color would be helpful in distinguishing these two kinds of video. 4.4.2.2 Video Editing Eect: Number of cameras { Amateur video usually involves only a single camera while the professional video is often shot with multiple cameras and edited with some software tool so as to integrate multiple shot contents into one scene. By comparing the similarity between key frames of shots, we can roughly infer the number of scenes in a time interval. Distance of cameras { If the subject is farther away from the camera, it appears smaller in the frame and could be of less importance. On the contrary, if the subject is closer to the camera, the subject is bigger and could be more important. By comparing the camera distance, professional and amateur video can be dierentiated since many amateur video clips tend to have a xed camera distance in a short time interval while professional video tends to have multiple camera distances. This is achieved by extracting the size of foreground objects. Activity segment { Consider segments of consecutive frames with higher activity than others. Amateur video clips have either a lot of motion when shooting by hand-held camcorders or very little motion when shooting by xed webcams and, therefore, they have either very short or long such segments of frames. In contrast, since professional video is well-edited, the length of such a segment of frames is neither very long nor very short. 104 4.5 Extraction and Inference of Relevant Features We provide an overview of features that are relevant to automatic professional and ama- teur video separation in Sec. 4.4.2. Some of these relevant visual features can be extracted directly while others may be inferred from other low-level features. In this section, we describe algorithms to extract and/or infer these features in detail. Note that the ex- traction of features \Number of Cameras" and \Camera Distance" follow the processes presented in Sec. 4.2, hence not repeated here. 4.5.1 Activity Segment An activity segment is dened as a segment of consecutive frames that have higher activity then others. Here, we simply use the pixel-wise frame dierencing betweenI t andI ( t1) as the frame activity measure for frame t. If the current frame has an activity larger than a pre-dened threshold, it is classied to be in the activity segment. After marking every frame that is in the activity segment, a segment that is too short is eliminated. Then, the average length of the activity segments is calculated as one of the feature for separating professional and amateur video. The observation is that amateur video tends to have either very short or long activity segments while the length of such a segment in professional video is somewhere in-between. 
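As a rough sketch of this activity-segment measure (the activity threshold, the minimum run length, and the assumption of grayscale frames are all illustrative):

```python
import numpy as np

def average_activity_segment_length(frames, act_thresh=12.0, min_len=5):
    """Average length of high-activity segments in a clip.

    frames: sequence of grayscale frames (2-D uint8 arrays). A frame is
    'active' when the mean absolute pixel difference to the previous frame
    exceeds act_thresh; runs shorter than min_len frames are discarded.
    """
    active = []
    for prev, cur in zip(frames[:-1], frames[1:]):
        diff = np.abs(cur.astype(int) - prev.astype(int)).mean()
        active.append(diff > act_thresh)

    segments, run = [], 0
    for flag in active + [False]:          # sentinel closes the final run
        if flag:
            run += 1
        else:
            if run >= min_len:
                segments.append(run)
            run = 0
    return float(np.mean(segments)) if segments else 0.0
```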
Although this property may not be very useful if we would like to differentiate drama movies or news from amateur video clips, it still helps separate amateur video from professional video in other genres such as music video, sports, etc.

4.5.2 Camera Shakiness

For amateur video clips shot by a hand-held camcorder, the presence of shaky camera motion is common. A measure of camera shakiness is therefore helpful in distinguishing between professional and amateur video clips. To achieve this task, the movement of the camcorder must be determined first, which can be described by a set of camera motion parameters. Then, the unstable motion caused by shaky hand movement should be separated from smooth and stable camera movements such as panning and tilting. The detailed process is described below.

We use the block-based motion vector field to estimate the camera motion parameters. First, a fast motion estimation algorithm [13] is used to obtain the motion vector field. Then, the histogram of the motion vector field, {h_{mv}(x, y)}, of the current frame is computed. To reduce the effect of noisy samples, local averaging is applied to the histogram via

h'_{mv}(x, y) = \sum_{i=-w}^{w} \sum_{j=-w}^{w} h_{mv}(x+i, y+j),    (4.5)

where {h'_{mv}(x, y)} denotes the smoothed motion vector histogram and w determines the size of the smoothing window. Then, the motion vector that corresponds to the maximum of the locally smoothed histogram is selected as the camera motion vector for the current frame:

(v_x, v_y) = \arg\max_{(x, y)} h'_{mv}(x, y).    (4.6)

Note that the frequency of the component corresponding to unstable camera motion is usually much higher than that of the component corresponding to stable camera movement. To separate them, a temporal low-pass filter is applied to both the horizontal and vertical components of the camera motion vector to obtain the stable camera movement:

v_{i,s}(n) = \sum_{j=0}^{W} a_j v_i(n - j), \quad i = x, y,    (4.7)

where v_i(n) is the estimated camera motion at time n, v_{i,s}(n) is the smoothed camera motion, W is the window length, and {a_j} are the coefficients of the temporal low-pass filter. The Hamming window is used for low-pass filtering in this work.

Fig. 4.10 shows the camera motion along the vertical direction for two amateur video clips shot by hand-held camcorders. The black curves are the estimated camera motion while the red curves represent the smoothed camera motion. The deviation of the actual camera motion from the smoothed one is attributed to shaky camera motion. The magnitude of this deviation is used to measure the shakiness of the camera. The mean and the variance of the camera shakiness within a short time interval are used as features for classification.

Figure 4.10: The curves of original and smoothed camera motion in the vertical direction.

4.5.3 Sharpness of Frames

Compared with professional video, some amateur video may suffer from out-of-focus problems and lower visual quality. Thus, it is helpful if we can determine how blurred a frame is. We define a feature called the sharpness of frames, which is measured by the edge magnitude of frames. First, an edge map of a frame is generated using first-order derivative operators such as the Sobel edge detector. Then, the edge magnitude is calculated. The average magnitude is used as one of the features for classification.
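The shakiness and sharpness measures just described can be sketched as follows; the histogram range, the window sizes, the centered (rather than causal) Hamming smoothing, and the simple first-difference gradient standing in for a full Sobel operator are assumptions made for the sketch.

```python
import numpy as np

def camera_motion(mv, mv_range=16, w=1):
    """Dominant camera motion vector of a frame from its block MV field.

    Builds the 2-D histogram of the motion vectors, smooths it with a
    (2w+1)x(2w+1) box sum (Eq. 4.5) and returns the argmax (Eq. 4.6).
    """
    size = 2 * mv_range + 1
    hist = np.zeros((size, size))
    for vx, vy in mv.reshape(-1, 2):
        ix = int(np.clip(vx, -mv_range, mv_range)) + mv_range
        iy = int(np.clip(vy, -mv_range, mv_range)) + mv_range
        hist[ix, iy] += 1
    smooth = np.zeros_like(hist)
    for i in range(-w, w + 1):
        for j in range(-w, w + 1):
            smooth += np.roll(np.roll(hist, i, axis=0), j, axis=1)
    x, y = np.unravel_index(np.argmax(smooth), smooth.shape)
    return x - mv_range, y - mv_range

def shakiness(camera_mv, window=15):
    """Mean and variance of the deviation from the smoothed camera path (Eq. 4.7).

    camera_mv: array of shape (T, 2) with per-frame camera motion vectors.
    """
    taps = np.hamming(window)
    taps /= taps.sum()
    smoothed = np.stack(
        [np.convolve(camera_mv[:, i], taps, mode="same") for i in range(2)],
        axis=1)
    dev = np.linalg.norm(camera_mv - smoothed, axis=1)
    return float(dev.mean()), float(dev.var())

def sharpness(gray):
    """Average first-order gradient magnitude of a grayscale frame (Sec. 4.5.3)."""
    g = gray.astype(float)
    gx = np.abs(np.diff(g, axis=1)).mean()
    gy = np.abs(np.diff(g, axis=0)).mean()
    return gx + gy
```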
4.5.4 Color Variance

Professional video is shot in a controlled environment while amateur video is not. In amateur video, a scene can be too dark or too bright, which causes the frames to have a smaller color variance. In addition, in the shooting of professional video, different lighting conditions can be set up easily, while the lighting tends to remain about the same throughout an amateur video clip. To capture this property, the mean color is first calculated for each frame as a rough estimate of the dominant color. Then, the variance of the mean color is calculated, which represents the degree of diversity in the lighting conditions of the short video clip. Only the luminance component is considered by the proposed scheme in the experiments reported below.

4.6 Experimental Results

4.6.1 Experimental Setup

Video clips for professional and amateur video classification were collected from YouTube. For the professional video class, seven different genres, namely action, cartoon, drama, horror, music video, news, and sports, were collected. Each genre has 23 video clips, which constitute a total of 161 video clips. The amateur video class has 194 video clips. Of the 194 clips, 161 have no or only minor editing, including clips demonstrating certain tricks, techniques or instrument performances shot by a single fixed webcam, as well as recordings of certain events such as birthday parties, rollercoaster rides, or rock concerts shot by a hand-held camcorder. The remaining 33 clips have a certain amount of editing, possibly done by the shooters themselves. Together these comprise a data set of 355 video clips. Each video clip was segmented to an approximate duration of 1 minute, so the total length of the data set is almost six hours. All clips were transcoded into the H.264 format, with bit rates varying from 450 to 850 kbps.

The features extracted for each video clip in our experiments were:

- Average shot duration;
- Number of shots;
- Number of cameras;
- Mean and variance of camera shakiness;
- Variance of luminance;
- Sharpness of frame;
- Average length of activity segments;
- Mean and variance of the normalized foreground area;
- 10-bin histogram of the normalized foreground area.

Several different classifiers were tested with these features, including the naïve Bayes, the logistic regression, the support vector machine (SVM), and the C4.5 decision tree. The AdaBoost algorithm [27] was used to improve the performance of the naïve Bayes classifier and the C4.5 decision tree. A 10-fold cross-validation was performed. That is, the data were randomly divided into 10 parts. For each fold, one part was used as the testing set while the remaining parts were used as the training set. The results of the 10 folds were averaged.

The classification performance is measured in terms of the correct classification accuracy, which is the proportion of correctly predicted examples in the test data set, and the Receiver Operating Characteristic (ROC) curve, which was introduced to evaluate the performance of machine learning algorithms [37]. They are presented in Sec. 4.6.2 and Sec. 4.6.3, respectively.

4.6.2 Classification Accuracy and Discussion

Table 4.1 shows the correct classification results (in percentages) with different classifiers. Overall, the classification accuracy is quite good, and the results show that the proposed features are stable regardless of the type of classifier.
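The cross-validation protocol above can be reproduced roughly as follows; scikit-learn is used here only as an assumed toolchain, with its CART-style decision tree standing in for C4.5 and hypothetical feature/label files, so the numbers it produces will not match Table 4.1 exactly.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# One row per 1-minute clip with the features listed above; labels are
# 0 for amateur and 1 for professional. The file names are hypothetical.
X = np.load("clip_features.npy")
y = np.load("clip_labels.npy")

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Naive Bayes + AdaBoost": AdaBoostClassifier(GaussianNB()),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Decision Tree (CART, in place of C4.5)": DecisionTreeClassifier(),
    "Decision Tree + AdaBoost": AdaBoostClassifier(DecisionTreeClassifier(max_depth=3)),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.4f}")
```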
Table 4.1: Classification results with different classifiers

    Classifier                          Classification Accuracy
    Naïve Bayes                         74.93%
    Naïve Bayes with AdaBoost           76.34%
    Logistic Regression                 83.10%
    SVM                                 83.66%
    C4.5 Decision Tree                  84.23%
    C4.5 Decision Tree with AdaBoost    90.14%

We see that the naïve Bayes classifier, with and without AdaBoost, has the lower classification rates, in the range of 74-76%. The three other classifiers (i.e., the logistic regression, the SVM and the C4.5 decision tree) have correct classification rates clustered around 83-84%. With the AdaBoost algorithm, the performance of the C4.5 decision tree is enhanced by 6% to reach an accuracy of 90.14%, which is the best result we have achieved.

The superior performance of the C4.5 decision tree with the AdaBoost algorithm can be explained as follows. In both professional and amateur video, the diversity within each class is high. For example, the drama movie and the action movie are very different although they both belong to professional video. On the other hand, certain sub-classes in professional video might have properties that are similar to certain sub-classes in amateur video. For example, a news clip and an amateur clip shot by a fixed webcam could both be obtained using a small number of cameras without much motion. In these cases, transforming the input data to a higher-dimensional space, as done in the SVM, may not be very helpful. On the other hand, the C4.5 algorithm is based on the idea of choosing the attribute that is most effective in terms of the normalized information gain to split the data into two subsets [75]. Therefore, although certain features may not have a high discriminant power between sub-classes, other features can be chosen to split the data in the decision tree. For example, the sharpness of frames and the color variance can be used in the above-mentioned example.

Table 4.2 shows the detailed classification results when using the C4.5 decision tree along with the AdaBoost algorithm.

Table 4.2: Detailed classification results

    True class     Classified as Amateur   Classified as Professional   Total   Accuracy
    Amateur        174                     20                           194     89.7%
    Professional   15                      146                          161     90.7%

The correctly classified instances are roughly the same in both classes. The amateur video clips that were falsely classified as professional video either had some editing, or the objects were too close to the camera. In the latter case, a small movement of the object can cause a large change in the frame content, resulting in falsely detected shot changes and, in turn, errors in the estimated number of cameras and foreground area. As for the professional video clips that were falsely classified as amateur video, they look like amateur clips shot by a single fixed webcam. Such video clips are mostly in-studio news clips, which are static with few shot changes and cameras, or clips in drama movies shot with fewer cameras.

4.6.3 ROC Performance and Discussion

Besides the classification accuracy, the Receiver Operating Characteristic (ROC) curve has been introduced to evaluate the performance of various machine learning algorithms. The ROC curve is plotted from the probability of the class prediction. Specifically, it is the curve of the True Positive Rate (TPR) versus the False Positive Rate (FPR) obtained by varying the discrimination threshold of the classifier. Mathematically, TPR and FPR are defined as

TPR = \frac{TP}{TP + FN}, \quad FPR = \frac{FP}{FP + TN},

where TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives, and FN is the number of false negatives.
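As an illustrative aside (not the evaluation code used in this work), the ROC construction just defined can be sketched with a simple threshold sweep; the positive class is taken to be amateur video and the score is whatever probability the classifier assigns to it.

```python
import numpy as np

def roc_points(scores, labels):
    """(FPR, TPR) points obtained by sweeping the discrimination threshold.

    scores: classifier scores/probabilities for the positive (amateur) class.
    labels: ground-truth labels, 1 for positive and 0 for negative.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores)                  # descending score
    tp = np.cumsum(labels[order] == 1)           # true positives at each cut
    fp = np.cumsum(labels[order] == 0)           # false positives at each cut
    tpr = tp / max(int(labels.sum()), 1)         # TP / (TP + FN)
    fpr = fp / max(int((labels == 0).sum()), 1)  # FP / (FP + TN)
    return np.concatenate(([0.0], fpr)), np.concatenate(([0.0], tpr))

def auc(fpr, tpr):
    """Area under the ROC curve via trapezoidal integration."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```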
Fig. 4.11 shows two ROC curves for amateur video classification with the C4.5 decision tree. The dashed line is obtained using the C4.5 decision tree only, while the solid line is the result of using the C4.5 decision tree with the AdaBoost algorithm. As shown in the figure, the two schemes can maintain a high true positive rate over a wide range of false positive rates. The false positive rate can be reduced to as low as 15% and 10% for the C4.5 decision tree only and the C4.5 decision tree with the AdaBoost algorithm, respectively, while preserving a high true positive rate. The two ROC curves have areas under the curve (AUC) of 0.847 and 0.939, respectively, which demonstrates the excellent discriminant power provided by the set of features proposed in this work.

Figure 4.11: Two ROC curves of amateur video classification, where the dashed and the solid lines correspond to the C4.5 decision tree only and the C4.5 decision tree with the AdaBoost algorithm, respectively.

Chapter 5
Conclusion and Future Work

5.1 Summary of the Research

In this research, we proposed a framework for efficient video copy detection, along with two different signatures that capture the temporal structure of video in different aspects. In addition, we examined several new features for automatic video classification by incorporating the effects caused by different camera usage.

In Chapter 3, we presented a novel video copy detection system that enables accurate and extremely efficient identification of duplicate video copies. The proposed algorithm utilizes the video temporal structure as the signature, which is compact and discriminative, yet robust for a large class of video contents. Specifically, unique camera transitional behaviors, including shot boundaries and the starts/ends of camera panning and tilting, were marked as anchor frames, and the sequence of lengths between consecutive anchor frames, called the gap sequence, was used as the signature of the underlying video. To address the rapid growth of video contents, the proposed system uses an efficient data structure called the suffix array to achieve extremely fast matching of signatures. The matching algorithm can identify all the maximal unique matches in linear time and determine whether the video under consideration is a duplicate of a video in the database. A candidate pruning stage was also proposed to avoid performing signature matching for every video in the database by eliminating the video clips that are very unlikely to be duplicates of the query video. It was shown by experimental results that the proposed signature is compact, and that the proposed algorithm is very fast and effective in duplicate video detection.

In Chapter 4, two new features were first proposed for automatic video classification. The first one estimates the number of cameras being used in a short time interval. Noting that videos of different genres should have different distributions of camera distances, a second feature that approximates the camera distance was proposed. Preliminary experimental results demonstrated that the number of cameras, along with the number of shots, conveys more information about the video genre than considering the number of shots alone. Some information about the video genre can also be inferred from the camera distance distribution. We then followed with an analysis of the different natures of professional and amateur video. A set of features was proposed to capture these different natures.
Specically, amateur video clips are often shot using a single camera, with considerable camera shakes, out-of-focus image frames, poor lighting, and often with a xed distance between the camera and the subjects. In contrast, professional clips mostly involve multiple cameras, stable camera motions, better image quality, good lighting, and diversied camera distances. Consequently, the proposed feature set in- cludes: the number of cameras used, the shakiness of the camera, sharpness of frames, and the distance between the camera and subjects. The proposed features were tested on a data set with video clips collected from an online video sharing website, with several 116 well known classiers. It was shown by experimental results that the features are robust to dierent classiers and successfully capture the characteristics caused by the dierent camera usages in the professional and amateur video. With the proposed set of feature, the professional and the amateur video can be separated with high accuracy. 5.2 Future Research Directions To make the research more complete, we discuss possible extensions and issues be solved in the future. There are several cases that the proposed shot-length-based signature works poorly. A simple metric that characterizes such video was proposed. However, the test cases are not enough to determine if this simple metric works well for most of video contents. We may need a more robust and reliable metric to re ect the properties of the video properly so that we can decide which signature to use. A set of query video with dierent types of attacks was tested against the database. We may want to include the analysis of the eect of dierent attacks on the proposed system. It would be useful to understand why the proposed approach is more eective in circumventing some attacks, such as blurring and color adjustment, while merely works for other attacks, such as camcording and noise. Since the proposed copy detection system is based on the temporal structure of video, attacks related to the temporal axis may compromise the system. For exam- ple, inserting redundant frames at random positions, or fast forwarding (or slowing down) the video at 1.1x (or 0.9x) speed would change the temporal structure but 117 may not change the viewing experience much. The feature-based approach may have to come in play in these cases, which means the extension of the proposed, with the power of handling temporal attacks, may need hybrid (i.e., feature-based/syntax- based) approach. The performance of this signature on video whose camera transitional behaviors is not distinctive (that is, video with a substantial number of static scenes) is unclear. We may want to investigate how the performance would be aected if such video is under consideration. In addition, even if the video clip has some camera transitional behaviors, the information would not be enough to represent its temporal structure if the video is not long enough. Video clips that are too short might need to be treated separately with approaches based on frame-based features. For video classication, a much larger database is needed to have more conclusive analysis. The database used in this work is still considered as a smaller database consider the amount of video clips out there on the internet. For the camera distance feature used in video classication, since the proposed approach relies on the motion vector eld, if the foreground objects are not moving, they cannot be properly identied. 
To remedy this, certain memory mechanisms that memorize the previous location of foreground objects may be helpful. However, the computation and the fact that, unlike surveillance videos, general video usually have somewhat frequent shot changes should be taken into consideration for the design of such mechanisms. 118 Bibliography [1] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch, \Replacing sux trees with en- hanced sux arrays," J. of Discrete Algorithms, vol. 2, no. 1, pp. 53{86, 2004. [2] D. Adjeroh, M. Lee, and I. King, \A distance measure for video sequence simi- larity matching," Multi-Media Database Management Systems, 1998. Proceedings. International Workshop on, pp. 72{79, 5-7 Aug 1998. [3] S. Araki, T. Matsuoka, H. Takemura, and N. Yokoya, \Real-time tracking of mul- tiple moving objects in moving camera image sequences using robust statistics," Pattern Recognition, International Conference on, vol. 2, p. 1433, 1998. [4] Y. Ariki and Y. Sugiyama, \Classication of tv sports news by dct features using multiple subspace method," in Pattern Recognition, 1998. Proceedings. Fourteenth International Conference on, vol. 2, Aug 1998, pp. 1488{1491 vol.2. [5] Y. A. Aslandogan and C. T. Yu, \Techniques and systems for image and video retrieval," IEEE Transactions on Knowledge and Data Engineering, vol. 11, no. 1, pp. 56{63, 1999. [6] M. A. Bender and M. Farach-Colton, \The lca problem revisited," in Latin Amer- ican Theoretical INformatics, 2000, pp. 88{94. [7] M. A. Bender, M. Farach-Colton, G. Pemmasani, S. Skiena, and P. Sumazin, \Low- est common ancestors in trees and directed acyclic graphs," J. Algorithms, vol. 57, no. 2, pp. 75{94, 2005. [8] J. Bescos, \Real-time shot change detection over online mpeg-2 video," Circuits and Systems for Video Technology, IEEE Transactions on, vol. 14, no. 4, pp. 475{484, April 2004. [9] P. Bouthemy, M. Gelgon, and F. Ganansia, \A unied approach to shot change detection and camera motion characterization," Circuits and Systems for Video Technology, IEEE Transactions on, vol. 9, no. 7, pp. 1030{1044, Oct 1999. [10] D. Brezeale and D. J. Cook, \Using closed captions and visual features to classify movies by genre," in 7th Int. Workshop Multimedia Data Min. (MDM/KDD), 2006. [11] D. Brezeale and D. Cook, \Automatic video classication: A survey of the liter- ature," Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 38, no. 3, pp. 416{430, May 2008. 119 [12] Z. Cernekova, I. Pitas, and C. Nikou, \Information theory-based shot cut/fade detection and video summarization," Circuits and Systems for Video Technology, IEEE Transactions on, vol. 16, no. 1, pp. 82{91, Jan. 2006. [13] J. Chalidabhongse and C.-C. J. Kuo, \Fast motion vector estimation using multiresolution-spatio-temporal correlations," Circuits and Systems for Video Tech- nology, IEEE Transactions on, vol. 7, no. 3, pp. 477{488, Jun 1997. [14] W. I. Chang and E. L. Lawler, \Sublinear approximate string matching and bio- logical applications," Algorithmica, vol. 12, no. 4-5, pp. 327{344, November 1994. [15] S.-S. Cheung and A. Zakhor, \Ecient video similarity measurement with video signature," Circuits and Systems for Video Technology, IEEE Transactions on, vol. 13, no. 1, pp. 59{74, Jan 2003. [16] ||, \Fast similarity search and clustering of video sequences on the world-wide- web," Multimedia, IEEE Transactions on, vol. 7, no. 3, pp. 524{537, June 2005. [17] S.-Y. Chien, S.-Y. Ma, and L.-G. 
Abstract
In this research, we focus on two techniques related to the management of large video collections: video copy detection and automatic video classification. After the introductory chapter and a brief review in Chapter 2, our main research results are presented in Chapters 3 and 4.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Advanced techniques for high fidelity video coding
Leveraging georeferenced meta-data for the management of large video collections
Precoding techniques for efficient ultra-wideband (UWB) communication systems
Focus mismatch compensation and complexity reduction techniques for multiview video coding
Advanced intra prediction techniques for image and video coding
Network reconnaissance using blind techniques
Advanced machine learning techniques for video, social and biomedical data analytics
Efficient machine learning techniques for low- and high-dimensional data sources
Rate control techniques for H.264/AVC video with enhanced rate-distortion modeling
Robust video transmission in erasure networks with network coding
Efficient coding techniques for high definition video
Temporal perception and reasoning in videos
Texture processing for image/video coding and super-resolution applications
Techniques for efficient cloud modeling, simulation and rendering
Techniques for vanishing point detection
Low complexity mosaicking and up-sampling techniques for high resolution video display
Distributed edge and contour line detection for environmental monitoring with wireless sensor networks
Complexity scalable and robust motion estimation for video compression
Deep learning techniques for supervised pedestrian detection and critically-supervised object detection
Predictive coding tools in multi-view video compression
Asset Metadata
Creator
Wu, Ping-Hao (author)
Core Title
Efficient management techniques for large video collections
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
06/03/2010
Defense Date
01/25/2010
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
camera operation, camera transition, duplicate video detection, OAI-PMH Harvest, suffix array, video classification, video copy detection, video database
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Kuo, C.-C. Jay (committee chair), Ortega, Antonio (committee member), Shahabi, Cyrus (committee member)
Creator Email
pinghaow@usc.edu, pinhoramic@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m3108
Unique identifier
UC155331
Identifier
etd-Wu-3451 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-341017 (legacy record id), usctheses-m3108 (legacy record id)
Legacy Identifier
etd-Wu-3451.pdf
Dmrecord
341017
Document Type
Dissertation
Rights
Wu, Ping-Hao
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu