Temporal Perception and Reasoning in Videos

by Jiyang Gao

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (Electrical Engineering)

May 2019

Copyright 2019 Jiyang Gao

To Shengke, Liping and Yiyue

Acknowledgments

I would like to express my sincere thanks to my advisor, Prof. Ramakant Nevatia, for his guidance and support. I have learned a lot from his deep understanding, broad knowledge and insightful vision of Artificial Intelligence and Computer Vision during the past three and a half years. Beyond research, I have also been very lucky to discuss life and career with Prof. Nevatia; his optimistic attitude, humbleness and toughness have had a significant influence on me. I would like to thank Prof. Antonio Ortega, Prof. Haipeng Luo, Prof. C.-C. Jay Kuo and Prof. Joseph J. Lim for taking their precious time to serve on my qualification and thesis defense committees; Chen Sun for all his kind advice and help on research and my career; James Guo, Zhen Li, Jiang Wang, Shengyang Dai and Jia Li for collaboration at Google; and Kan Chen, Zhenheng Yang, Rama Kovvuri, Runzhou Ge, Chuanzi He and Zhaoheng Zheng for collaboration at USC. I also want to thank the current and previous members of our Computer Vision lab, and all my friends at USC.

I am grateful to my parents, Shengke and Liping, for their understanding and support over the past 26 years. My gratitude is beyond words. Finally, a very special thanks to my fiancée Yiyue Zhang, who has accompanied and encouraged me along the whole journey since 10/10/2015.

Table of Contents

Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Problem Statement
  1.2 Challenges
  1.3 General Approach
    1.3.1 Temporal Unit Regression Network for Temporal Action Proposals
    1.3.2 Cascaded Boundary Regression for Temporal Action Detection
    1.3.3 CTAP: Complementary Temporal Action Proposal Generation
    1.3.4 Reinforced Encoder-Decoder Networks for Action Anticipation
    1.3.5 Temporal Activity Localization via Language Query
    1.3.6 Motion-Appearance Co-Memory Networks for Video Question Answering
  1.4 Proposal Outline
2 Related Work
  2.1 Temporal Action Detection
  2.2 Action Anticipation
  2.3 Action Localization via Language Query
  2.4 Visual Question Answering
3 Temporal Unit Regression Network for Temporal Action Proposals
  3.1 Introduction
  3.2 Methods
    3.2.1 Video Unit Processing
    3.2.2 Clip Pyramid Modeling
    3.2.3 Unit-level Temporal Coordinate Regression
    3.2.4 Loss Function
  3.3 Evaluation
    3.3.1 Metrics
    3.3.2 Experiments on THUMOS-14
    3.3.3 Experiments on ActivityNet
  3.4 Conclusion
4 Cascaded Boundary Regression for Temporal Action Detection
  4.1 Introduction
  4.2 Methods
    4.2.1 Video Unit Feature Extraction
    4.2.2 Video Clip Modeling
    4.2.3 Temporal Coordinate Regression
    4.2.4 Two-Stage Proposal and Detection Pipeline
    4.2.5 Cascaded Boundary Regression
    4.2.6 Loss Function
  4.3 Evaluation
    4.3.1 Experiments on THUMOS-14
    4.3.2 Experiments on TVSeries
  4.4 Conclusion
5 CTAP: Complementary Temporal Action Proposal Generation
  5.1 Introduction
  5.2 Complementary Temporal Action Proposal Generator
    5.2.1 Initial Proposal Generation
    5.2.2 Proposal Complementary Filtering
    5.2.3 Proposal Ranking and Boundary Adjustment
  5.3 Experiments
    5.3.1 Datasets
    5.3.2 Experiment Setup
    5.3.3 Evaluation on THUMOS-14
    5.3.4 Evaluation on ActivityNet v1.3
    5.3.5 Qualitative Results
  5.4 Conclusion
6 Reinforced Encoder-Decoder Networks for Action Anticipation
  6.1 Introduction
  6.2 Reinforced Encoder-Decoder Network
    6.2.1 Video Processing
    6.2.2 Encoder-Decoder Network
    6.2.3 Classification Network
    6.2.4 Reinforcement Module
    6.2.5 Training Procedure
  6.3 Evaluation
    6.3.1 Implementation Details
    6.3.2 Baseline Methods
    6.3.3 Experiments
  6.4 Conclusion
7 Temporal Activity Localization via Language Query
  7.1 Introduction
  7.2 Methods
    7.2.1 Problem Formulation
    7.2.2 CTRL Architecture
    7.2.3 CTRL Training
  7.3 Evaluation
    7.3.1 Datasets
    7.3.2 Experiment Settings
      7.3.2.1 Evaluation Metric
      7.3.2.2 Baseline Methods
      7.3.2.3 System Variants
    7.3.3 Experiments on TACoS
    7.3.4 Experiments on Charades-STA
  7.4 Conclusion
8 Motion-Appearance Co-Memory Networks for Video Question Answering
  8.1 Introduction
  8.2 General Dynamic Memory Networks
  8.3 Motion-Appearance Co-Memory Networks
    8.3.1 Multi-level Contextual Facts
    8.3.2 Motion-Appearance Co-Memory Module
    8.3.3 Answer Module
  8.4 Evaluation
    8.4.1 Dataset
    8.4.2 Implementation Details
    8.4.3 System Baselines
    8.4.4 Experiments on TGIF-QA
  8.5 Conclusion
9 Conclusion and Future Work
Reference List
List of Tables

3.1 Run-time and AR comparison on THUMOS-14.
3.2 Temporal action localization performance (mAP % @tIoU=0.5) evaluated on different proposals on THUMOS-14.
3.3 Temporal action localization performance (mAP %) comparison at different tIoU thresholds on THUMOS-14.
3.4 Proposal generalization performance (AR@F=1.0 %) of TURN-C3D-16 on different subsets of ActivityNet.
3.5 Temporal action localization performance (mAP % @tIoU=0.5) on ActivityNet v1.1.
4.1 Comparison of different coordinate offsets on action localization (% mAP@tIoU=0.5): parameterized, non-parameterized frame-level, non-parameterized unit-level. The performance with no boundary regression is also listed.
4.2 Comparison of cascaded steps K_c^p = 1, 2, 3, 4 for temporal proposal generation (% AR@F=1.0) on THUMOS-14.
4.3 Comparison of cascaded steps K_c^d = 1, 2, 3, 4 for temporal action detection (% mAP@tIoU=0.5) on THUMOS-14.
4.4 Comparison with the state of the art on temporal action proposal generation. Average Recall at Proposal Frequency (AR@F=1.0) performance is reported.
4.5 Temporal action detection performance (mAP %) comparison at different tIoU thresholds on THUMOS-14.
4.6 Comparison of cascaded steps K_c = 0, 1, 2, 3, 4 for temporal action detection (% mAP@tIoU=0.5) on TVSeries. K_c = 0 means that the system only does classification, with no boundary regression.
4.7 Temporal action detection performance (mAP %) comparison at different tIoU thresholds on TVSeries.
5.1 Performance comparison between TAR and TURN [23] on the THUMOS-14 test set. The same unit feature (flow-16) and test sliding windows are used for TAR and TURN for a fair comparison. Average Recall (AR) at different numbers is reported.
5.2 Complementary filtering evaluation on the THUMOS-14 test set, compared with "Union" and "tIoU-selection". Average Recall (AR) at different numbers is reported.
5.3 Comparison of CTAP and other proposal generation methods with the same action detector (SCNN) on the THUMOS-14 test set; mean Average Precision (mAP % @tIoU=0.5) is reported.
5.4 Evaluation of TAR on the ActivityNet v1.3 validation set. AR@100 and AUC of the AR-AN curve are reported.
5.5 Evaluation of complementary proposal filtering on the ActivityNet v1.3 validation set. AR@100 and AUC of the AR-AN curve are reported.
5.6 Generalization evaluation of CTAP on ActivityNet v1.3 (validation set) in terms of AR@100 and AR-AN under-curve area.
5.7 Evaluation of CTAP on the ActivityNet v1.3 validation set. AR@100 and AUC of the AR-AN curve are reported. (The AR@100 of MSRA [101] is not available.)
6.1 Action anticipation comparison on the TVSeries test set (cAP %) and the THUMOS-14 test set (per-frame mAP %) at 1s (4 chunks) with two-stream features.
6.2 Action anticipation comparison (ACC %) on TV-Human-Interaction at T_a = 1s (4 chunks).
6.3 Detailed action anticipation (cAP %) comparison for ED and RED on the TVSeries test set from T_a = 0.25s to T_a = 2.0s with two-stream representations and VGG features.
6.4 Detailed action anticipation (per-frame mAP %) comparison for ED and RED on the THUMOS-14 test set from T_a = 0.25s to T_a = 2s with two-stream representations.
6.5 Comparison on online action detection on the TVSeries test set.
6.6 Online action detection comparison on the THUMOS-14 test set (per-frame mAP %) with two-stream features.
7.1 Comparison of different methods on TACoS.
7.2 Comparison of different methods on Charades-STA.
7.3 Experiments on complex sentence queries.
8.1 Number of samples of different tasks in the TGIF-QA dataset.
8.2 Evaluation of the co-memory attention mechanism on TGIF-QA. "Action" is repetition action (ACC %), "Trans" is state transition (ACC %), "Count" is repetition count (MSE) and "Frame" is frame QA (ACC %).
8.3 Evaluation of dynamic fact ensemble on TGIF-QA. "Action" is repetition action (ACC %), "Trans" is state transition (ACC %), "Count" is repetition count (MSE) and "Frame" is frame QA (ACC %).
8.4 Comparison on cycles of memory update on TGIF-QA. "Action" is repetition action (ACC %), "Trans" is state transition (ACC %), "Count" is repetition count (MSE) and "Frame" is frame QA (ACC %).
8.5 Comparison with the state-of-the-art method on the TGIF-QA dataset. "Action" is repetition action (ACC %), "Trans" is state transition (ACC %), "Count" is repetition count (MSE) and "Frame" is frame QA (ACC %).

List of Figures

1.1 General Model.
1.2 There are two parts in this dissertation: temporal perception and temporal reasoning.
3.1 Temporal action proposal generation from a long untrimmed video. We propose a Temporal Unit Regression Network (TURN) to jointly predict action proposals and refine the locations by temporal coordinate regression.
3.2 Architecture of the Temporal Unit Regression Network (TURN). A long video is decomposed into short video units, and CNN features are calculated for each unit. Features from a set of contiguous units, called a clip, are pooled to create clip features. Multiple temporal scales are used to create a clip pyramid at an anchor unit. TURN takes a clip as input, and outputs a confidence score, indicating whether it is an action instance or not, and two regression offsets of start and end times to refine the temporal action boundaries.
3.3 (a)-(c) show the correlation between temporal action localization performance and TAP performance under different metrics. (d) shows the correlation coefficient between temporal action localization and TAP performance versus video length on the THUMOS-14 dataset.
3.4 Comparison of TURN variants on the THUMOS-14 dataset.
3.5 Proposal performance on the THUMOS-14 dataset under 4 metrics: AR-F, AR-AN, Recall@F-tIoU, Recall@AN-tIoU. For AR-AN and Recall@AN-tIoU, we use the code provided by [13].
3.6 Qualitative examples of retrieved proposals by TURN on the THUMOS-14 dataset. GT indicates ground truth. TP and FP indicate true positive and false positive respectively. "reg prop" and "cls prop" indicate regression proposal and classification proposal.
3.7 Comparison of generalizability on the ActivityNet v1.2 dataset.
4.1 Architecture of the two-stage action detection pipeline with Cascaded Boundary Regression (CBR).
4.2 Unrolled model of Cascaded Boundary Regression (CBR); the parameters of the MLPs are shared.
5.1 The architectures of three baseline methods are shown: (1) SW+R&A: sliding windows are processed by a model for proposal ranking and boundary adjustment, e.g. TURN [23], SCNN [77]; (2) TAG: TAG [109] generates proposals based on unit-level actionness; (3) TAG+R&A: actionness proposals are processed with proposal ranking and boundary adjustment.
5.2 The architecture of the Complementary Temporal Action Proposal (CTAP) generator. "BA" is short for boundary adjustment, "PR" is short for proposal ranking, "ppl" is short for proposal and "bdy" is short for boundary.
5.3 AR-AN curves of the complementary results with flow-16 feature (F16) and two-stream-6 feature (TS6). Complementary filtering proposals outperform sliding windows (SW+TAR) and actionness proposals (TAG+TAR) consistently.
5.4 AN-AR curve and recall@AN=100 curve of CTAP and state-of-the-art methods on the THUMOS-14 test set.
5.5 Visualization of temporal action proposals generated by CTAP. The first two rows show 4 temporal action proposals from 2 videos in THUMOS-14. The last two rows show 4 temporal action proposals from 2 videos in ActivityNet v1.3.
6.1 Anticipating future actions by inferring from history information: the normal images represent past frames and the transparent images represent future frames.
6.2 Reinforced Encoder-Decoder (RED) network architecture for action anticipation.
6.3 Baseline methods for action anticipation. The left one is FC and the right one is EFC.
7.1 Temporal activity localization via language query in an untrimmed video.
7.2 Cross-modal Temporal Regression Localizer (CTRL) architecture. CTRL contains four modules: a visual encoder to extract features for video clips, a sentence encoder to extract embeddings, a multi-modal processing network to generate combined representations for the visual and text domains, and a temporal regression network to produce alignment scores and location offsets.
7.3 Intersection over Union (IoU) and non-Intersection over Length (nIoL).
7.4 Charades-STA construction.
7.5 Performance comparison of different visual encoders.
7.6 Performance comparison of different sentence embeddings.
7.7 Alignment prediction and regression refinement examples in TACoS. The row with gray background shows the ground truth for the given query; the row with blue background shows the sliding window alignment results; the row with green background shows the clip regression results.
7.8 Alignment prediction and regression refinement examples in Charades-STA.
8.1 Answering questions in videos involves both motion and appearance analysis, and usually requires multiple cycles of reasoning, especially for transitive questions. For example, to answer "What does the woman do after looking uncertain?", we need to first localize when the woman looks uncertain, which requires motion evidence for looking uncertain and appearance evidence for the woman, and then focus on what the woman does (smile).
8.2 General Dynamic Memory Network (DMN) [44] architecture. The memory update process for the t-th cycle is: (1) the facts F are encoded by an attention-based GRU in the episodic memory module, where the attention is generated by the last memory m^{t-1}; (2) the final hidden state of the GRU is called the contextual vector c^t, which is used to update the memory m^t together with the question embedding q. The answer is generated from the final memory state m^T.
8.3 The input temporal representations are processed by temporal conv-deconv layers to build multi-layer contextual facts, which have the same temporal resolution but different contextual information.
8.4 The co-memory attention module extracts useful cues from both appearance and motion memories to generate attention g_a^t / g_b^t for motion and appearance separately. Dynamic fact ensemble takes the multi-layer contextual facts A_L / B_L and the attention scores g_a^t / g_b^t to construct proper facts A_L^{s/h} / B_L^{s/h}, which are encoded by an attention-based GRU. The final hidden state c_b^t / c_a^t of the GRU is used to update the memory m_b^t / m_a^t. The final output memory m_h is the concatenation of the motion and appearance memories, and it is used to generate answers.
8.5 Multi-layer contextual facts are dynamically constructed via a soft attention fusion process, which computes weighted-average facts according to the attention.
8.6 Examples of state transition, repetition action, repetition count and frame QA are shown in the 1st, 2nd, 3rd and 4th rows. ST-TP is the temporal attention model from [34]. Green is for correct prediction and red is for wrong prediction.
9.1 There are two parts in this dissertation: temporal perception and temporal reasoning.
Abstract

Temporal perception and reasoning in videos are important Computer Vision problems that are closely related to Machine Learning, Deep Learning, Natural Language Processing and other research areas in Computer Science. In the temporal domain, one of the most important perception problems is temporal action detection: given a long untrimmed video, temporal action detection aims to predict the start and end times of the actions of interest, as well as the action categories. Broadly, temporal action detection can be divided into two types: online and offline. Online action detection refers to detecting actions in streaming videos (e.g. surveillance, self-driving cars), i.e. no future video is available; for offline action detection, the whole video is available at the time of detection. For temporal reasoning, we formulate the problem as video question answering, in which the input is a video and a question related to the video, and the task is to infer the answer from the video. Temporal perception and reasoning are challenging problems because of the high variation in video quality, the complex temporal structures, missing or parallel actions, and the large problem scale. This thesis describes my solutions to temporal perception and reasoning in large-scale video.

Recent advances in computer vision are mostly driven by three key factors: large-scale annotated data, deep neural networks and computing resources. Following similar ideas from the image domain, there have been several successful video-clip-level Convolutional Neural Network (CNN) models, such as C3D [85] and two-stream CNNs [79], which are mainly designed for the action classification problem. However, those models only model short video clips; how to model and analyze long-term temporal information for high-level perception and reasoning tasks still remains unknown.
My basic philosophy is to utilize clip-level models as basic building blocks, and to design high-level networks on top of these base networks for analyzing and understanding the temporal structure embedded in videos. Essentially, the video is converted to a vector sequence, and the problem is transformed into a sequence analysis problem. Different high-level network architectures are designed and used for different tasks, such as encoder-decoder networks for online action detection, temporal convolutional regression networks for offline action detection, and dynamic memory networks for video question answering. This thesis demonstrates the effectiveness of all the algorithms on a range of publicly available video and image datasets.

Chapter 1
Introduction

Capturing and sharing videos has become much easier with the development of mobile phones and the internet, and the demand for semantic analysis of user-generated videos is growing fast. Videos, unlike images, introduce a unique dimension: time. Temporal analysis is an important research direction and has many potential applications, including temporal activity detection, language-based video search, action anticipation and so on. Previous research has mostly focused on action classification [79], which generates video-level action labels but does not predict when the actions happen, and the datasets it works on consist mostly of short trimmed videos. This proposal focuses on the temporal analysis of long untrimmed videos.

1.1 Problem Statement

We first define several concepts which will be used throughout this proposal. A video clip or a video segment is a subset of video frames taken from the same video and continuous in time. An action is a simple human behavior which involves human-human or human-object interactions. Actions usually take a small period of time (e.g. 10 seconds or less) and are distinguished by appearance or motion patterns. Activity is a similar term to action, and the two are interchangeable. In this proposal, we focus on temporal action detection, action anticipation and temporal question answering. Their definitions are:

• Temporal action detection: Given a long video, the task of temporal action detection is to localize intervals where actions of interest take place and also predict the action categories. Actions of interest can be a pre-defined list of actions or an action specified by a natural language description. Action detection can also be classified into online action detection and offline action detection based on the availability of future video: for offline action detection, the whole video is available when detecting actions; for online action detection, only past video is available and future video is not available at the time of detection.

• Temporal action anticipation: Temporal action anticipation refers to detection (i.e. anticipation) of an action before it happens. For action anticipation, future video is not available at the time of detection, which is similar to online action detection.

• Temporal question answering: Given a video and related questions, the task is to generate correct answers. The questions could be counting the number of repetitions of an action, or identifying the state (e.g. actions, places) before (or after) another state.
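To make the inputs and outputs of these three tasks concrete, a small sketch follows; the class and function names are illustrative placeholders and are not part of the proposal.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ActionInstance:
    """A temporal interval in a video plus what happens in it."""
    start_sec: float
    end_sec: float
    label: str        # a category name, or a natural-language description
    score: float = 1.0

# Temporal action detection: video (whole video offline, past frames only online)
# plus the categories of interest -> localized, categorized action instances.
def temporal_action_detection(video, categories: List[str]) -> List[ActionInstance]:
    raise NotImplementedError   # Chapters 3-5 (fixed categories), Chapter 7 (language query)

# Temporal action anticipation: only past observations are available, and the
# prediction refers to a time T_a seconds in the future.
def action_anticipation(past_video, anticipation_sec: float) -> List[ActionInstance]:
    raise NotImplementedError   # Chapter 6

# Temporal question answering: a video and a question -> an answer
# (e.g. a repetition count, or the state before/after another state).
def video_question_answering(video, question: str) -> str:
    raise NotImplementedError   # Chapter 8
```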
These three topics are tightly connected with each other. Temporal action detection with fixed action categories is a fundamental problem in temporal perception, just like object detection in image understanding. However, people may use many different names for the same or similar actions, and pre-defined action lists do not provide such flexibility; thus, using natural language as the action query is a more desirable way. In the setting of action detection, the whole video is available; however, in many real-world applications, such as video surveillance monitoring systems, the videos are recorded in real time, which means actions should be detected without future information. Such situations require the action detection system to work in an online manner, or even to anticipate actions happening seconds ahead; these are online action detection and action anticipation. Online action detection can be viewed as a special case of action anticipation, where the anticipation time is 0. From temporal perception to temporal reasoning, question answering is an important step, which not only requires the system to have strong understanding in both the visual and language domains, but also needs it to use language as guidance to reason over the temporal structure.

1.2 Challenges

Temporal perception and reasoning in large-scale videos is not a trivial problem; one has to tackle the following challenges:

Complex temporal structures. The temporal structure of video activities is complex for two reasons. First, multiple actions can happen in parallel in the videos; even with a large amount of training video, it is still difficult to discover the distinguishable patterns among actions. Second, action-related video segments can appear anywhere in the video; it is non-trivial to annotate all actions precisely, which makes the training data noisy.

Large video variation. There are large variations in videos, especially in web videos captured by amateur users. The variations include lighting conditions, video resolution, camera motion, viewpoint changes, occlusion and so on. This requires strong appearance and motion representations to separate action instances from the large and wide variety of irrelevant background data. Besides, human editing, such as the insertion of logos and captions, is likely to introduce noise for video analysis.

Large problem scale. A video is a sequence of images. In our typical dataset, the average length of a video is 20 minutes, or 50,000 frames, and there are usually hundreds of videos for training. On the one hand, a large amount of temporally correlated images provides more information for temporal analysis; on the other hand, it also requires the feature extraction and other high-level analysis to be efficient.

1.3 General Approach

Deep learning has been applied to many computer vision problems. There are several successful clip-level CNN models, such as C3D and two-stream models, which have proven their effectiveness on the action classification problem. However, those models work on short clips; how to model and analyze long-term temporal information still remains unknown. My basic philosophy is to utilize these clip-level models as base networks, and build high-level networks on top of the base networks to analyze the temporal structure, as shown in Figure 1.1. Different architectures are designed and used for different tasks, such as encoder-decoder networks for action anticipation and temporal convolutional regression networks for action detection.

[Figure 1.1: General Model.]
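A minimal sketch of this two-level design follows: a base network turns the video into a unit-level feature sequence (mocked here with random vectors), and a task-specific high-level network consumes that sequence. The smoothing-plus-linear head is only a placeholder standing in for the real task heads summarized below, and all shapes are illustrative assumptions.

```python
import numpy as np

# Toy stand-in for unit-level features produced by a base network (C3D or a
# two-stream CNN in this thesis): one 512-d vector per short video unit.
rng = np.random.default_rng(0)
feature_sequence = rng.standard_normal((200, 512)).astype(np.float32)  # (num_units, D)

def high_level_head(seq: np.ndarray, num_classes: int = 21) -> np.ndarray:
    """Illustrative 'high-level network': temporal smoothing + linear scoring.

    The real heads are task specific (encoder-decoder for anticipation,
    temporal convolution + regression for detection, a memory network for QA);
    this placeholder only shows that they all consume the same unit-level
    feature sequence, i.e. the video has become a sequence analysis problem.
    """
    kernel = np.ones(5, dtype=np.float32) / 5.0
    smoothed = np.stack([np.convolve(seq[:, d], kernel, mode="same")
                         for d in range(seq.shape[1])], axis=1)
    w = rng.standard_normal((seq.shape[1], num_classes)).astype(np.float32)
    return smoothed @ w                      # per-unit class scores, (num_units, 21)

scores = high_level_head(feature_sequence)
print(scores.shape)
```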
Specifically, I mainly work on the following research projects:

• Temporal action detection with two-stage boundary regression [23, 21].
• Complementary temporal action proposal generation [17].
• Temporal action localization with language query [20, 19].
• Online action anticipation with encoder-decoder networks [22].
• Motion-appearance co-memory networks for video question answering [18].

These research projects can be summarized in Figure 9.1. I will introduce some of my work towards these goals in detail in the following sections.

[Figure 1.2: There are two parts in this dissertation: temporal perception and temporal reasoning.]

1.3.1 Temporal Unit Regression Network for Temporal Action Proposals

Temporal Action Proposal (TAP) generation is an important problem, as fast and accurate extraction of semantically important segments (e.g. human actions) from untrimmed videos is an important step for large-scale video analysis. We propose a novel Temporal Unit Regression Network (TURN) model. There are two salient aspects of TURN: (1) TURN jointly predicts action proposals and refines the temporal boundaries by temporal coordinate regression; (2) fast computation is enabled by unit feature reuse: a long untrimmed video is decomposed into video units, which are reused as basic building blocks of temporal proposals. TURN can run at over 880 frames per second (FPS) on a TITAN X GPU.

1.3.2 Cascaded Boundary Regression for Temporal Action Detection

State-of-the-art methods address the temporal action detection problem by applying action classifiers on sliding windows. Although sliding windows may contain an identifiable portion of an action, they do not necessarily cover the entire action instance, which leads to inferior performance. We adopt a two-stage temporal action detection pipeline with a Cascaded Boundary Regression (CBR) model. Class-agnostic proposals and specific actions are detected in the first and the second stage respectively. CBR uses temporal coordinate regression to refine the temporal boundaries of the sliding windows. The salient aspect of the refinement process is that, inside each stage, the temporal boundaries are adjusted in a cascaded way by feeding the refined windows back into the system for further boundary refinement. We test CBR on THUMOS-14 and TVSeries, and achieve state-of-the-art performance on both datasets. The performance gain is especially remarkable under high tIoU thresholds, e.g. mAP@tIoU=0.5 on THUMOS-14 is improved from 19.0% to 31.0%.
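The cascaded refinement loop can be sketched in a few lines. The sketch below assumes the unit-level offset convention of Eq. (3.1) in Chapter 3; `mock_boundary_regressor` and `pool_window_feature` are illustrative stand-ins for the trained MLP and the feature extractor, not the released CBR implementation.

```python
import numpy as np

def pool_window_feature(unit_features, start, end):
    """Mean-pool cached unit features inside the current window."""
    lo = int(max(0, round(start)))
    hi = int(min(len(unit_features), round(end)))
    return unit_features[lo:max(hi, lo + 1)].mean(axis=0)

def mock_boundary_regressor(window_feature):
    """Stub for the trained MLP regressor: returns (o_s, o_e) offsets in units."""
    return 0.4, -0.3   # a real model predicts these from the window feature

def cascaded_refine(unit_features, start, end, num_steps=3):
    """Cascade: re-extract the feature of the current window, predict offsets,
    shift the boundaries, and feed the refined window back in, for K_c steps."""
    for _ in range(num_steps):
        feat = pool_window_feature(unit_features, start, end)
        o_s, o_e = mock_boundary_regressor(feat)
        # Offsets are defined as (predicted coordinate - ground truth), so the
        # refined estimate subtracts them from the current boundaries.
        start, end = start - o_s, end - o_e
    return start, end

units = np.random.default_rng(0).standard_normal((100, 128))
print(cascaded_refine(units, start=20.0, end=36.0))   # (18.8, 36.9) for this stub
```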
1.3.3 CTAP: Complementary Temporal Action Proposal Generation

Previous proposal generation methods can be divided into two groups: sliding window ranking and actionness score grouping. Sliding windows uniformly cover all segments in a video, but their temporal boundaries are imprecise; grouping-based methods may have more precise boundaries, but they may omit some proposals when the quality of the actionness scores is low. Based on the complementary characteristics of these two methods, we propose a novel Complementary Temporal Action Proposal (CTAP) generator. Specifically, we apply a Proposal-level Actionness Trustworthiness Estimator (PATE) on the sliding window proposals to generate probabilities indicating whether each action can be correctly detected by actionness scores, and the windows with high scores are collected. The collected sliding windows and actionness proposals are then processed by a temporal convolutional neural network for proposal ranking and boundary adjustment. CTAP outperforms state-of-the-art methods on average recall (AR) by a large margin.

1.3.4 Reinforced Encoder-Decoder Networks for Action Anticipation

Action anticipation aims to detect an action before it happens. Many real-world applications in robotics and surveillance are related to this predictive capability. Current methods address this problem by first anticipating visual representations of future frames and then categorizing the anticipated representations into actions. However, the anticipation is based on a single past frame's representation, which ignores the history trend, and it can only anticipate a fixed future time. We propose a Reinforced Encoder-Decoder (RED) network for action anticipation. RED takes multiple history representations as input and learns to anticipate a sequence of future representations. One salient aspect of RED is that a reinforcement module is adopted to provide sequence-level supervision; the reward function is designed to encourage the system to make correct predictions as early as possible.
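The exact reward used by RED is defined in Chapter 6. Purely as an illustration of "reward correct predictions as early as possible", one possible earliness-weighted sequence-level reward is sketched below; the 1/(t+1) shaping is an assumption made for this example, not the thesis's formula.

```python
from typing import List

def earliness_reward(pred_labels: List[int], gt_label: int) -> float:
    """Illustrative sequence-level reward (an assumption, not RED's exact
    definition): the earlier the anticipated sequence first becomes correct,
    the larger the reward; never predicting the ground-truth class gives 0."""
    for t, pred in enumerate(pred_labels):   # t = 0 is the earliest decoder step
        if pred == gt_label:
            return 1.0 / (t + 1)
    return 0.0

# Example: the decoder anticipates 4 future chunks; being right only from the
# third step onward earns less than being right from the first step.
print(earliness_reward([3, 3, 7, 7], gt_label=7))   # 1/3
print(earliness_reward([7, 7, 7, 7], gt_label=7))   # 1.0
```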
1.3.5 Temporal Activity Localization via Language Query

Existing methods for action detection typically train classifiers for a pre-defined list of actions and apply them in a sliding window fashion. However, activities in the wild consist of a wide combination of actors, actions and objects; it is difficult to design an activity list that meets users' needs. We propose to localize activities by natural language queries. Temporal Activity Localization via Language (TALL) is challenging as it requires: (1) a suitable design of text and video representations to allow cross-modal matching of actions and language queries; (2) the ability to locate actions accurately given features from sliding windows of limited granularity. We propose a novel Cross-modal Temporal Regression Localizer (CTRL) to jointly model text queries and video clips, and to output alignment scores and action boundary regression results for candidate clips. For evaluation, we adopt the TACoS dataset, and build a new dataset for this task on top of Charades by adding sentence temporal annotations, called Charades-STA.

1.3.6 Motion-Appearance Co-Memory Networks for Video Question Answering

Video Question Answering (QA) is an important task in understanding video temporal structure. We observe that there are three unique attributes of video QA compared with image QA: (1) it deals with long sequences of images containing richer information not only in quantity but also in variety; (2) motion and appearance information are usually correlated with each other and are able to provide useful attention cues to each other; (3) different questions require different numbers of frames to infer the answer. Based on these observations, we propose a motion-appearance co-memory network for video QA. Our network is built on concepts from the Dynamic Memory Network (DMN) and introduces new mechanisms for video QA. Specifically, there are three salient aspects: (1) a co-memory attention mechanism that utilizes cues from both motion and appearance to generate attention; (2) a temporal conv-deconv network to generate multi-level contextual facts; (3) a dynamic fact ensemble method to construct temporal representations dynamically for different questions. We evaluate our method on the TGIF-QA dataset, and the results outperform the state of the art significantly on all four tasks of TGIF-QA.

1.4 Proposal Outline

The proposal is outlined as follows. We begin in Chapter 2 with a brief overview of the literature on temporal perception, such as action detection and anticipation, and on temporal reasoning. Chapter 3 presents our framework for temporal proposal generation. In Chapter 4, we describe our two-stage action detection model with cascaded boundary regression. In Chapter 5, we present our complementary-filtering-based proposal generator. Chapter 6 presents our framework for anticipating future actions with encoder-decoder networks. The idea of temporal action localization via language query is described in Chapter 7. The co-memory network for video question answering is introduced in Chapter 8. Finally, in Chapter 9, we summarize this dissertation and describe the future directions of our research.

Chapter 2
Related Work

In this chapter, we review the related work on action detection, action anticipation, and action localization via language.

2.1 Temporal Action Detection

Temporal Action Proposal. Temporal action proposal generation has been shown to be an effective step in action detection, and has become more popular in recent years. Two types of methods have been proposed. The first type formulates proposal generation as a binary classification problem on sliding windows. Among them, Sparse-prop [6] uses STIPs [46] and dictionary learning for proposal generation. SCNN-prop [77] is based on training a C3D [85] network for the binary classification task. TURN [23] cuts videos into units and reuses unit-level features for proposals, which improves computational efficiency; it also applies temporal regression to adjust the action boundaries, which improves AR performance. The performance of this type of method is limited by the imprecise temporal boundaries of sliding windows. The second type of method is based on snippet-level actionness scores and applies the Temporal Action Grouping (TAG) [109] method on the score sequence to group continuous high-score regions into proposals. However, TAG may omit correct proposals when the quality of the actionness scores is low. Besides, DAPs [13] and SST [5] are online proposal generators, which can run over the video in a single pass, without the use of overlapping temporal sliding windows. CTAP [17] first tried to fuse the proposal-based method [23] and the frame-based method [96] by complementary filtering, and achieved state-of-the-art performance.

Temporal Action Detection. This task [77, 103, 83, 107] focuses on predicting the action categories, as well as the start and end times of the action instances, in untrimmed videos. S-CNN [77] presents a two-stage action detection model, which first generates proposals and then classifies them. Lin et al. propose a Single Shot Action Detector (SSAD) [48], which skips the proposal generation step and directly detects action instances in untrimmed videos. Gao et al. [21] design a Cascaded Boundary Regression (CBR) network to refine the action boundaries iteratively. SSN [109] presents a mechanism to model the temporal structure of activities, and thus the capability of discriminating between complete and incomplete proposals for precisely detecting actions. R-C3D [98] designs a 3D fully convolutional network, which generates candidate temporal regions and classifies selected regions into specific activities in a two-stage manner. Yuan et al. [108] propose to localize actions by searching for the structured maximal sum of frame-wise classification scores.
Shou et al. [76] design a Convolutional-De-Convolutional (CDC) operation that makes dense predictions at a fine granularity in time to determine precise temporal boundaries. Dai et al. [10] propose a temporal context network, which adopts an architecture similar to Faster R-CNN [71], for temporal action detection. Beyond fixed-category action detection, TALL [20] proposes to use natural language as the query to detect the target actions in videos.

Object Proposals and Object Detection. Recent successful object detection frameworks [25, 24, 71] are built on high-quality object proposals. SelectiveSearch [86] and Edgebox [113] rely on hand-crafted low-level visual features; R-CNN [25] and Fast R-CNN [24] use this type of object proposal as input. RPNs [71] proposed to use anchor boxes and spatial regression for object proposal generation, based on the ConvNet's conv-5 feature map. YOLO [68] proposed to divide the input image into grid cells and estimate object bounding boxes by coordinate regression. SSD [50] further adopted multi-scale grid cells to predict bounding boxes.

2.2 Action Anticipation

Early and Online Action Detection. Hoai et al. [27, 28] first proposed the problem of early event detection. They designed a max-margin framework based on structured output SVMs. Ma et al. [52] address the problem of early action detection; they propose to train an LSTM network with a ranking loss and merge the detection spans based on the frame-wise prediction scores generated by the LSTM. Recently, De Geest et al. [11] published a new dataset for online action detection, which consists of 16 hours (27 episodes) of TV series with temporal annotations for 30 action categories.

Action Anticipation. There have been some promising works on anticipating future action categories. Lan et al. [45] designed a hierarchical representation, which describes human movements at multiple levels of granularity, to predict future actions in the wild. Pei et al. [64] proposed an event parsing algorithm using a Stochastic Context Sensitive Grammar (SCSG) for inferring the goals of agents and predicting their intended actions. Xie et al. [93] proposed to infer people's intentions of performing actions, which is a good clue for predicting future actions. Vondrick et al. [88] proposed to anticipate visual representations by training a CNN on large-scale unlabelled video data.

Reinforcement Learning in Vision. We take inspiration from recent approaches that used REINFORCE [92] to learn task-specific policies. Yeung et al. [103] proposed to learn policies that predict the next observation location for the action detection task using LSTM networks. Mnih et al. [59] proposed to adaptively select a sequence of regions in images and process only the selected regions at high resolution for the image classification task. Ranzato et al. [67] proposed a sequence-level training algorithm for image captioning that directly optimizes the metric used at test time with policy gradient methods.

2.3 Action Localization via Language Query

Sentence-based image retrieval. Given a set of candidate videos/images and a sentence query, this task requires retrieving the videos/images that match the query. Karpathy et al. [36] proposed the Deep Visual-Semantic Alignment (DVSA) model, which uses bidirectional LSTMs to encode sentence embeddings and R-CNN object detectors [25] to extract features from object proposals.
Skip-thought [42] learned a Sent2Vec model by applying skip-gram [58] at the sentence level and achieved top performance on the sentence-based image retrieval task. Sun et al. [82] proposed to discover visual concepts from image-sentence pairs and apply the concept detectors for image retrieval. Gao et al. [19] proposed to learn verb-object pairs as action concepts from image-sentence pairs. Hu et al. [31] and Mao et al. [56] formulated the problem of natural language object retrieval.

Sentence-based video retrieval. For video retrieval, Lin et al. [47] parsed sentence descriptions into semantic graphs, which are then matched to visual concepts in the videos by generalized bipartite matching. Bojanowski et al. [4] tackled the problem of video-text alignment: given a video and a set of sentences with temporal ordering, assign a temporal interval to each sentence. In our setting, only one sentence query is input to the system and temporal ordering is not used.

2.4 Visual Question Answering

Image question answering. Image question answering aims to measure the capability of reasoning about linguistic and image inputs jointly. Many methods have been proposed [100, 9, 32, 51, 99, 15, 3, 2, 94, 54, 8, 75, 39, 30, 110, 104, 111, 112]. Among these models, attention mechanisms [100, 9, 51, 112] provide guidance to deep models on "where to look", and memory mechanisms [44, 94] allow the model to perform multiple reasoning iterations and refine the answer gradually. The question-guided attention mechanism [9] uses the semantic representation of a question as a query to search for the regions in an image that are related to the answer. Yang et al. [100] presented a Stacked Attention Network (SAN) that queries an image multiple times to infer the answer progressively. Lu et al. [51] argued that modeling "what words to listen to" is equally important as modeling "where to look", and proposed a co-attention model that jointly reasons about image-guided and question-guided attention. Instead of directly inferring answers from abstract visual features, Yu et al. [104] developed a semantic attention mechanism to select high-level question-related concepts. The Dynamic Memory Network (DMN), first introduced by Kumar et al. [44] to solve text-based question answering, adopted episodic memories and attention mechanisms which allow multiple cycles of reasoning. Xiong et al. [94] improved the memory and input modules of DMN so that it can be applied to image QA.

Video question answering. Video QA is a relatively new task compared with image QA. Yu et al. [106] adopted a semantic attention mechanism, which combines the concepts detected in videos with text encoding/decoding to generate answers. Compared with images, the temporal domain is unique to videos. A temporal attention mechanism is leveraged to selectively attend to one or more periods of a video in [34, 60, 97]. Besides temporal attention, Jang et al. [34] and Xu et al. [97] also utilized motion information along with appearance information in videos. Recently, Na et al. [61] and Kim et al. [40] both introduced memory mechanisms into their models for video QA. However, their models [61, 40] both lack motion analysis and a dynamic memory update mechanism.

Chapter 3
Temporal Unit Regression Network for Temporal Action Proposals

3.1 Introduction

We address the problem of generating Temporal Action Proposals (TAP) in long untrimmed videos, akin to the generation of object proposals in images for rapid object detection.
As in the case of objects, the goal is to make action proposals have high precision and recall while maintaining computational efficiency. There has been considerable work on the action classification task, where a "trimmed" video is classified into one of a set of specified categories [79, 90]. There has also been work on localizing actions in longer, "untrimmed" videos [16, 77, 107, 103], i.e. temporal action localization. A straightforward way to use action classification techniques for localization is to apply temporal sliding windows; however, there is a trade-off between the density of the sliding windows and computation time. Taking cues from the success of proposal frameworks in object detection tasks [24, 71], there has been recent work on generating temporal action proposals in videos [77, 13, 6] to improve the precision and accelerate the speed of temporal localization.

[Figure 3.1: Temporal action proposal generation from a long untrimmed video. We propose a Temporal Unit Regression Network (TURN) to jointly predict action proposals and refine the locations by temporal coordinate regression.]

State-of-the-art methods [77, 6] formulate TAP generation as a binary classification problem (i.e. action vs. background) and also apply a sliding window approach. Denser sliding windows usually lead to higher recall rates at the cost of computation time. Instead of relying on sliding windows, Deep Action Proposals (DAPs) [13] uses a Long Short-Term Memory (LSTM) network to encode video streams and infer multiple action proposals inside the streams. However, its average recall (AR), which is computed as the average of recall at temporal intersection over union (tIoU) between 0.5 and 1, suffers at small numbers of predicted proposals compared with the sliding window based method [77] (newly released evaluation results from the DAPs authors show that SCNN-prop [77] outperforms DAPs). To achieve high temporal localization accuracy at an efficient computation cost, we propose to use temporal boundary regression. Boundary regression has been a successful practice for object localization, as in [71]; however, temporal boundary regression for actions has not been attempted in past work.

We present a novel method for fast TAP generation: the Temporal Unit Regression Network (TURN). A long untrimmed video is first decomposed into short (e.g. 16 or 32 frame) video units, which serve as basic processing blocks. For each unit, we extract unit-level visual features using off-the-shelf models (C3D and two-stream CNN models are evaluated). Features from a set of contiguous units, called a clip, are pooled to create clip features. Multiple temporal scales are used to create a clip pyramid. To provide temporal context, clip-level features from the internal and surrounding units are concatenated. Each clip is then treated as a proposal candidate and TURN outputs a confidence score, indicating whether it is an action instance or not. In order to better estimate the action boundaries, TURN also outputs two regression offsets for the starting time and ending time of the action in the clip. Non-maximum suppression (NMS) is then applied to remove redundant proposals. The source code is available at https://github.com/jiyanggao/TURN-TAP.
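The thesis does not spell out the NMS step at this point; below is a standard greedy temporal NMS sketch over (start, end, score) triples, which is the usual way such redundant proposals are removed. The function name and the example threshold are illustrative; the experiments later set the NMS threshold to 0.1 below the evaluation tIoU.

```python
def temporal_nms(proposals, threshold=0.5):
    """Greedy 1-D non-maximum suppression over (start, end, score) proposals."""
    def tiou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0

    kept = []
    for p in sorted(proposals, key=lambda x: x[2], reverse=True):
        # keep a proposal only if it does not overlap too much with a kept one
        if all(tiou(p, k) < threshold for k in kept):
            kept.append(p)
    return kept

print(temporal_nms([(0, 10, 0.9), (1, 11, 0.8), (20, 30, 0.7)], threshold=0.5))
# -> [(0, 10, 0.9), (20, 30, 0.7)]
```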
DAPs [13] and Sparse-prop [6] use Average Recall vs. Average Number of retrieved proposals (AR-AN) to evaluate TAP performance. There are two issues with the AR-AN metric: (1) the correlation between the AR-AN of TAP and the mean Average Precision (mAP) of action localization has not been explored; (2) the average number of retrieved proposals is related to the average video length of the test dataset, which makes AR-AN less reliable when evaluating across different datasets. Spatio-temporal action detection [105, 91] uses Recall vs. Proposal Number (R-N); however, this metric does not take video lengths into consideration. There are two criteria for a good metric: (1) it should be capable of evaluating the performance of different methods on the same dataset effectively; (2) it should be capable of evaluating the performance of the same method across different datasets (generalization capability). We should expect better TAP to lead to better localization performance when using the same localizer. We propose a new metric, Average Recall vs. Frequency of retrieved proposals (AR-F), for TAP evaluation. In Section 3.3.2, we validate that the proposed metric satisfies the two criteria by a quantitative correlation analysis between TAP performance and action localization performance.

We test TURN on THUMOS-14 and ActivityNet for TAP generation. Experimental results show that TURN outperforms the previous state-of-the-art methods [13, 77] by a large margin under AR-F and AR-AN. For run-time performance, TURN runs at over 880 frames per second (FPS) with C3D features and 260 FPS with flow CNN features on a single TITAN X GPU. We further plug TURN in as the proposal generation step in existing temporal action localization pipelines, and observe an improvement of mAP from the state-of-the-art 19% to 25.6% (at tIoU=0.5) on THUMOS-14 by changing only the proposals. State-of-the-art localization performance is also achieved on ActivityNet. We show state-of-the-art generalization capability by training TURN on THUMOS-14 and transferring it to ActivityNet without fine-tuning; strong generalization is also shown by testing TURN across different subsets of ActivityNet without fine-tuning.

[Figure 3.2: Architecture of the Temporal Unit Regression Network (TURN). A long video is decomposed into short video units, and CNN features are calculated for each unit. Features from a set of contiguous units, called a clip, are pooled to create clip features. Multiple temporal scales are used to create a clip pyramid at an anchor unit. TURN takes a clip as input, and outputs a confidence score, indicating whether it is an action instance or not, and two regression offsets of start and end times to refine the temporal action boundaries.]

In summary, our contributions are four-fold: (1) We propose a novel architecture for temporal action proposal generation using temporal coordinate regression. (2) Our proposed method achieves high efficiency (>800 FPS) and outperforms previous state-of-the-art methods by a large margin. (3) We show state-of-the-art generalization performance of TURN across different action datasets without dataset-specific fine-tuning. (4) We propose a new metric, AR-F, to evaluate the performance of TAP, and compare AR-F with AR-AN and AR-N by quantitative analysis.

3.2 Methods

In this section, we describe the Temporal Unit Regression Network (TURN) and its training procedure.

3.2.1 Video Unit Processing

As discussed before, the large-scale nature of video proposal generation requires the solution to be computationally efficient; extracting visual features for the same window or for overlapping windows repeatedly should be avoided. To accomplish this, we use video units as the basic processing units in our framework. A video $V$ contains $T$ frames, $V = \{t_i\}_1^T$, and is divided into $T/n_u$ consecutive video units, where $n_u$ is the number of frames in a unit. A unit is represented as $u = \{t_i\}_{s_f}^{s_f+n_u}$, where $s_f$ is the starting frame and $s_f + n_u$ is the ending frame. Units do not overlap with each other. Each unit is processed by a visual encoder $E_v$ to get a unit-level representation $f_u = E_v(u)$. In our experiments, C3D [85], an optical flow based CNN model and an RGB image CNN model [79] are investigated. Details are given in Section 3.3.2.
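A minimal sketch of the unit decomposition in this subsection follows. `toy_encoder` is a stand-in for the visual encoder $E_v$ (C3D or a two-stream CNN in the experiments); the point is that each unit is encoded exactly once, and the resulting feature table is reused by every clip built on top of the units.

```python
import numpy as np

def compute_unit_features(frames: np.ndarray, n_u: int, encoder) -> np.ndarray:
    """Split a video of T frames into T // n_u non-overlapping units and encode
    each unit once, f_u = E_v(u).  The returned table is what all later clip
    features are pooled from, so no window is ever re-encoded."""
    num_units = frames.shape[0] // n_u
    feats = [encoder(frames[i * n_u:(i + 1) * n_u]) for i in range(num_units)]
    return np.stack(feats)                       # shape: (num_units, feature_dim)

# Toy encoder so the sketch runs on its own: per-frame mean intensity, padded to 128-d.
toy_encoder = lambda unit: np.resize(unit.reshape(unit.shape[0], -1).mean(axis=1), 128)

video = np.random.randint(0, 256, size=(512, 64, 64, 3)).astype(np.float32)
unit_features = compute_unit_features(video, n_u=16, encoder=toy_encoder)
print(unit_features.shape)   # (32, 128)
```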
3.2.2 Clip Pyramid Modeling

A clip (i.e. window) $c$ is composed of units, $c = \{u_j\}_{s_u}^{s_u+n_c}$, where $s_u$ is the index of the starting unit and $n_c$ is the number of units inside $c$. $e_u = s_u + n_c$ is the index of the ending unit, and $\{u_j\}_{s_u}^{e_u}$ are called the internal units of $c$. Besides the internal units, context units for $c$ are also modeled: $\{u_j\}_{s_u-n_{ctx}}^{s_u}$ and $\{u_j\}_{e_u}^{e_u+n_{ctx}}$ are the context before and after $c$ respectively, where $n_{ctx}$ is the number of units we consider as context. The internal feature and the context features are pooled from unit features separately by a function $P$. The final feature $f_c$ for a clip is the concatenation of the context features and the internal feature; $f_c$ is given by

$$f_c = P(\{u_j\}_{s_u-n_{ctx}}^{s_u}) \,\|\, P(\{u_j\}_{s_u}^{e_u}) \,\|\, P(\{u_j\}_{e_u}^{e_u+n_{ctx}})$$

where $\|$ represents vector concatenation and mean pooling is used for $P$. We scan an untrimmed video by building window pyramids at each unit position, i.e. at each anchor unit. A clip pyramid $p$ consists of temporal windows with different temporal resolutions, $p = \{c_{n_c}\},\ n_c \in \{n_{c,1}, n_{c,2}, \ldots\}$. Note that although multi-resolution clips have temporal overlaps, the clip-level features are computed from unit-level features, which are calculated only once.
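The clip-level feature construction above can be summarized in a few lines. The zero-padding of missing context at video boundaries is an assumption for this sketch, as the thesis does not describe boundary handling; the array shapes are also illustrative.

```python
import numpy as np

def clip_feature(unit_feats: np.ndarray, s_u: int, n_c: int, n_ctx: int) -> np.ndarray:
    """f_c = P(pre-context) || P(internal units) || P(post-context), with P = mean pooling."""
    e_u = s_u + n_c
    pre = unit_feats[max(0, s_u - n_ctx):s_u]
    internal = unit_feats[s_u:e_u]
    post = unit_feats[e_u:min(len(unit_feats), e_u + n_ctx)]
    pool = lambda x: x.mean(axis=0) if len(x) else np.zeros(unit_feats.shape[1],
                                                            dtype=unit_feats.dtype)
    return np.concatenate([pool(pre), pool(internal), pool(post)])

def clip_pyramid(unit_feats: np.ndarray, anchor: int, scales=(1, 2, 4, 8, 16, 32), n_ctx=4):
    """Multi-scale clips sharing the same anchor unit; every scale reuses the same
    cached unit features, so nothing is re-extracted."""
    return {n_c: clip_feature(unit_feats, anchor, n_c, n_ctx)
            for n_c in scales if anchor + n_c <= len(unit_feats)}

rng = np.random.default_rng(0)
units = rng.standard_normal((200, 128)).astype(np.float32)
pyramid = clip_pyramid(units, anchor=50)
print({k: v.shape for k, v in pyramid.items()})   # each value is (384,) = 3 * 128
```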
The reason we avoid parametrization is that objects can be re-scaled in images due to camera projection, so bounding box coordinates should first be normalized to some standard scale; actions' time spans, however, cannot be easily rescaled in videos.

3.2.4 Loss Function

For training TURN, we assign a binary class label (of being an action or not) to each clip (generated at each anchor unit). A positive label is assigned to a clip if: (1) it has the highest temporal Intersection over Union (tIoU) overlap with a ground truth clip; or (2) it has tIoU larger than 0.5 with any of the ground truth clips. Note that a single ground truth clip may assign positive labels to multiple window clips. Negative labels are assigned to non-positive clips whose tIoU is equal to 0.0 (i.e. no overlap) for all ground truth clips. We design a multi-task loss $L$ to jointly train classification and coordinate regression:

$$L = L_{cls} + \lambda L_{reg} \quad (3.2)$$

where $L_{cls}$ is the loss for action/background classification, which is a standard Softmax loss, $L_{reg}$ is the temporal coordinate regression loss and $\lambda$ is a hyper-parameter. The regression loss is

$$L_{reg} = \frac{1}{N_{pos}}\sum_{i=1}^{N} l_i \left( |\hat{o}_{s,i} - o_{s,i}| + |\hat{o}_{e,i} - o_{e,i}| \right) \quad (3.3)$$

where L1 distance is adopted, $\hat{o}$ is the predicted offset and $o$ is the ground truth offset. $l_i$ is the label, 1 for positive samples and 0 for background samples, and $N_{pos}$ is the number of positive samples; the regression loss is calculated only for positive samples. During training, the ratio of background to positive samples in a mini-batch is set to 10. The learning rate and batch size are set to 0.005 and 128 respectively, and we use the Adam [41] optimizer to train TURN.

3.3 Evaluation

In this section, we introduce the evaluation metrics and experimental setup, and discuss the experimental results.

3.3.1 Metrics

We consider three different metrics to assess the quality of TAP; the major difference lies in how the number of retrieved proposals is counted: Average Recall vs. Number of retrieved proposals (AR-N) [105, 29], Average Recall vs. Average Number of retrieved proposals (AR-AN) [13], and Average Recall vs. Frequency of retrieved proposals (AR-F). Average Recall (AR) is calculated as the mean recall rate over tIoU thresholds between 0.5 and 1.

AR-N curve. In this metric, the number of retrieved proposals (N) is the same for all test videos. The curve plots AR versus the number of retrieved proposals.

AR-AN curve. In this metric, AR is calculated as a function of the average number of retrieved proposals (AN). AN is calculated as $AN = \theta\mu,\ \theta\in(0,1]$, where $\mu = \frac{1}{n}\sum_{i=1}^{n}\sigma_i$ is the average number of all proposals of the test videos, $\theta$ is the ratio of picked proposals to evaluate, $n$ is the number of test videos and $\sigma_i$ is the number of all proposals for each video. By scanning the ratio $\theta$ from 0 to 1, the number of retrieved proposals in each video varies from 0 to the number of all proposals, and thus the average number of retrieved proposals also varies.

AR-F curve. This is the new metric that we propose. We measure average recall as a function of proposal frequency (F), which denotes the number of retrieved proposals per second of video. For a video of length $l_i$ and proposal frequency $F$, the number of retrieved proposals for this video is $R_i = F \cdot l_i$.

We also report the Recall@X-tIoU curve: the recall rate at X with regard to different tIoU thresholds, where X can be the number of retrieved proposals (N), the average number of retrieved proposals (AN) or the proposal frequency (F). For the evaluation of temporal action localization, we follow the traditional mean Average Precision (mAP) metric used in THUMOS-14 and ActivityNet.
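As a concrete reference for the proposed AR-F metric, here is a minimal sketch of computing average recall at a fixed proposal frequency (Python; tIoU thresholds from 0.5 to 1.0 with a step of 0.05 are assumed as above, and the data-structure names are illustrative).

```python
import numpy as np

def tiou(a, b):
    """Temporal IoU between two [start, end] segments (in seconds)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def average_recall_at_frequency(videos, freq=1.0,
                                thresholds=np.arange(0.5, 1.01, 0.05)):
    """AR@F: average recall over tIoU thresholds when each video keeps
    its top (freq * video_length) proposals.

    videos: list of dicts with keys 'length' (seconds),
            'proposals' (score-sorted [start, end] list) and
            'gt' (ground-truth [start, end] list).
    """
    recalls = []
    for th in thresholds:
        matched, total = 0, 0
        for v in videos:
            keep = int(round(freq * v['length']))     # R_i = F * l_i
            props = v['proposals'][:keep]
            for g in v['gt']:
                total += 1
                if any(tiou(p, g) >= th for p in props):
                    matched += 1
        recalls.append(matched / max(total, 1))
    return float(np.mean(recalls))
```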
A prediction is re- garded as positive only when it has correct category prediction and tIoU with ground truth higher than a threshold. We use the ocial evaluation toolkit provided by THUMOS-14 and ActivityNet. 3.3.2 Experiments on THUMOS-14 Datasets. The temporal action localization part of THUMOS-14 contains over 20 hours of videos from 20 sports classes. This part consists of 200 videos in validation set and 213 videos in test set. TURN model is trained on the validation set, as the training set of THUMOS-14 contains only trimmed videos. Experimental setup. We perform the following experiments: (1) dierent temporal proposal evaluation metrics are compared; (2) the performance of TURN and other TAP generation methods are compared under evaluation metrics (i.e AR-F and AR-AN) men- tioned above; (3) dierent TAP generation methods are compared on the temporal action localization task with the same localizer/classier. Specically, we feed the proposals into a localizer/classier, which outputs the condence scores of 21 classes (20 classes of action plus background). Two localizer/classiers are adopted: (a) SVM classiers: one-vs-all linear SVM classiers are trained for all 21 classes using C3D fc6 features; (b) S-CNN localizer: the pre-trained localization network of S-CNN [77] is adopted. For TURN model, the context unit numbern ctx is 4, is 2.0, the dimension of middle layerf m is 1000, temporal window pyramids is built withf1; 2; 4; 8; 16; 32g units. We test TURN with dierent unit sizes n u 2f16; 32g, and dierent unit features, including C3D [85], optical ow based CNN feature and RGB CNN feature [79]. The NMS threshold is 16 set to be 0.1 smaller than tIoU in evaluation. We implement TURN model in Tensor ow [1]. Comparison of dierent evaluation metrics. To validate the eectiveness of dierent evaluation metrics, we compare AR-F, AR-N, AR-AN by a correlation analysis with localization performance (mAP). We generate seven dierent sets of proposals, in- cluding random proposals, slidinig windows and variants of S-CNN [77] proposals (details are given in the supplementary material). We then test the localization performance using the proposals, as shown in Figure 3.3 (a)-(c). SVM classiers are used for localization. (a) (b) (c) (d) Figure 3.3: (a)-(c) show the correlation between temporal action localization performance and TAP performance under dierent metrics. (d) shows correlation coecient between temporal action localization and TAP performance versus video length on THUMOS-14 dataset. A detailed analysis of correlation and video length is given in Figure 3.3 (d). The test videos are sorted by video lengths and then divided evenly into four groups. The average video length of the group is the x-axis, and y-axis represents the correlation coecient between action localization performance and TAP performance of the group. Each point in 3.3 (d) represents the correlation of TAP and localization performance of one group under dierent evaluation metrics. As can be observed in Figure 3.3, the correlation coecient between mAP and AR-F is consistently higher than 0.9 at all video lengths. In contrast, correlation of AR-N and mAP is aected by video length distribution. Note that, AR-AN also shows a stable correlation with mAP, this is partially because the TAP generation methods we use generate proportional numbers of proposals to video length. To assess generalization, assume that we have two dierent datasets,S 0 andS 1 , whose average number of all proposals are 0 and 1 respectively. 
As introduced before, average number of retrieved proposals =;2 (0; 1] is dependent on . When we compare AR at some certain AN = x between S 0 and S 1 , as 0 and 1 are dierent, we need to set dierent 0 and 1 . It means that the ratios between retrieved proposals and all 17 generated proposals are dierent for S 0 and S 1 , which make the AR calculated for S 0 and S 1 at the same AN = x can not be compared directly. For AR-F, the number of proposals retrieved is based on \frequency", which is independent with the average number of all generated proposals. In summary, AR-N cannot evaluate TAP performance eectively on the same dataset, as number of retrieved proposals should vary with video lengths. AR-AN cannot be used to compare TAP performance among dierent datasets, as the retrieval ratio depends on dataset's video length distribution, which makes the comparison unreasonable. AR-F satises both requirements. Comparison of visual features. We test TURN with three unit-level features to assess the eect of visual features on AR performance: C3D [85] features, RGB CNN features with temporal mean pooling and dense ow CNN [95] features. The C3D model is pre-trained on Sports1m [37], all 16 frames in a unit are input into C3D and the output offc6 layer is used as unit-level feature. For RGB CNN features, we uniformly sample 8 frames from a unit, extract \Flatten 673" features using a ResNet [26] model (pre-trained on training set of ActivityNet v1.3 dataset [95]) and compute the mean of these 8 features as the unit-level feature. For dense ow CNN features, we sample 6 consecutive frames at the center of a unit and calculate optical ow [14] between them. The ows are then fed into a BN-Inception model [95, 33] that is pre-trained on training set of ActivityNet v1.3 dataset [95]. The output of \global pool" layer of BN-Inception is used as the unit-level feature. As shown in Figure 3.4, dense ow CNN feature (TURN-FL) gives the best results, indicating optical ow can capture temporal action information eectively. In contrast, RGB CNN features (TURN-RGB) show inferior performance and C3D (TURN-C3D) gives competitive performance. Temporal context and unit-level coordinate regression. We compare four variants of TURN to show the eectiveness of temporal context and unit regression: (1) binary cls w/o ctx: binary classication (no regression) without the use of temporal context, (2)binary cls w/ ctx: binary classication (no regression) with the use of context, (3)frameregw/ctx: frame-level coordinate regression with the use of context and (4)unit reg w/ ctx: unit-level coordinate regression with the use of context (i.e. our full model). The four variants are compared with AR-F curves. As shown in Figure 3.4, temporal context helps to classify action and background by providing additional information. As shown in AR-F curve, unit reg w/ ctx has higher AR than the other variants at all frequencies, indicating that unit-level regression can eectively rene the proposal location. Some TURN proposal results are shown in Figure 3.6. Comparison with state-of-the-art. We compare TURN with the state-of-the- art methods under AR-AN, AR-F, Recall@AN-tIoU, Recall@F-tIoU metrics. The TAP generation methods include DAPs [13], SCNN-prop [77], Sparse-prop [6], sliding window, and random proposals. For DAPs, Sparse-prop and SCNN-prop, we plot the curves using the proposal results provided by the authors. 
\Sliding window proposals" include all sliding windows of length from 16 to 512 overlapped by 75%, each window is assigned with a random score. \Random proposals" are generated by assigning random starting and ending temporal coordinates (ending temporal coordinate is larger than starting temporal coordinate), each random window is assigned with a random score. As shown 18 Figure 3.4: Comparison of TURN variants on THUMOS-14 dataset in Figure 3.5, TURN outperforms the state-of-the-art consistently by a large margin under all four metrics. AR-F AR-AN Recall@F-tIoU Recall@AN-tIoU Figure 3.5: Proposal performance on THUMOS-14 dataset under 4 metrics: AR-F, AR- AN, Recall@F-tIoU, Recall@AN-tIoU. For AR-AN and Recall@AN-tIoU, we use the codes provided by [13] How unit size aects AR and run-time performance? The impact of unit size on AR and computation speed is evaluated with n u 2f16; 32g. We keep other hyper- parameters the same as in Section 3.3.2. Table 3.1 shows comparison of the three TURN variants (TURN-FL-16, TURN-FL-32, TURN-C3D-16) and three state-of-the-art TAP methods, in terms of recall (AR@F=1.0) and run-time (FPS) performance. We randomly select 100 videos from THUMOS-14 validation set and run TURN-FL-16, TURN-FL-32 and TURN-C3D-16 on a single Nvidia TITAN X GPU. The run-time of DAPs [13] and SCNN-prop [77] are provided in [13], which were tested on a TITAN X GPU and a GTX 980 GPU respectively. The hardware used in [6] is not specied in the paper. As can be seen, there is a trade-o between AR and FPS: smaller unit size leads to higher recall rate, and also higher computational complexity. We consider unit size 19 Table 3.1: Run-time and AR Comparison on THUMOS-14. method AR@F=1.0 (%) FPS DAPs [13] 35.7 134.3 SCNN-prop [77] 38.3 60.0 Sparse-prop [6] 33.3 10.2 TURN-FL-16 43.5 129.4 TURN-FL-32 42.4 260.6 TURN-C3D-16 39.3 880.8 as temporal coordinate precision, for example, unit size of 16 and 32 frames represent approximately half second and one second respectively. The major part of computation time comes from unit-level feature extraction. Smaller unit size leads to more number of units, which increases computation time; on the other hand, smaller unit size also increases temporal coordinate precision, which improves the precision of temporal re- gression. C3D feature is faster than ow CNN feature, but with a lower performance. Compared with state-of-the-art methods, we can see that TURN-C3D-16 outperforms current state-of-the-art AR performance, but accelerates computation speed by more than 6 times. TURN-FL-16 achieves the highest AR performance with competitive run- time performance. TURN for temporal action localization. We feed proposal results of dierent TAP generation methods into the same temporal action localizers/classiers to compare the quality of proposals. The value of mAP@tIoU=0.5 is reported in Table 3.2. TURN outperforms all other methods in both the SVM classier and S-CNN localizer. Sparse- prop, SCNN-prop and DAPs all use C3D to extract features. It is worth noting that the localization results of four dierent proposals suit well with their proposal performance under AR-F metric in Figure 3.5: the methods that have better performance under AR-F achieve higher mAP in temporal action localization. Table 3.2: Temporal action localization performance (mAP % @tIoU=0.5) evaluated on dierent proposals on THUMOS-14. 
Method | DAPs SVM [13] | Our SVM | S-CNN
Sparse-prop [6] | 7.8 | 8.1 | 15.3
DAPs [13] | 13.9 | 9.5 | 16.3
SCNN-prop [77] | 7.6 (see note) | 14.0 | 19.0
TURN-C3D-16 | - | 16.4 | 22.5
TURN-FL-16 | - | 17.8 | 25.6
Note: this number should be higher, as the DAPs authors adopted an incorrect frame rate when using S-CNN proposals.

A more detailed comparison with state-of-the-art localization methods is given in Table 3.3. It can be seen that, by applying TURN with linear SVM classifiers for action localization, we achieve performance comparable with the state-of-the-art methods. By further incorporating the S-CNN localizer, we outperform all other methods by a large margin at all tIoU thresholds. These experimental results demonstrate the high quality of TURN proposals. TURN helps action localization in two aspects: (1) TURN serves as the first stage of a localization pipeline (e.g. S-CNN, SVM) to generate high-quality TAP, and thus increases localization performance; (2) TURN accelerates localization pipelines by filtering out many background segments, thus reducing unnecessary computation.

Table 3.3: Temporal action localization performance (mAP %) comparison at different tIoU thresholds on THUMOS-14.
tIoU | 0.1 | 0.2 | 0.3 | 0.4 | 0.5
Oneata et al. [62] | 36.6 | 33.6 | 27.0 | 20.8 | 14.4
Yeung et al. [103] | 48.9 | 44.0 | 36.0 | 26.4 | 17.1
Yuan et al. [107] | 51.4 | 42.6 | 33.6 | 26.1 | 18.8
S-CNN [77] | 47.7 | 43.5 | 36.3 | 28.7 | 19.0
TURN-C3D-16 + SVM | 46.4 | 41.5 | 34.3 | 24.9 | 16.4
TURN-FL-16 + SVM | 48.3 | 43.2 | 35.1 | 26.2 | 17.8
TURN-C3D-16 + S-CNN | 48.8 | 45.5 | 40.3 | 31.5 | 22.5
TURN-FL-16 + S-CNN | 54.0 | 50.9 | 44.1 | 34.9 | 25.6

Figure 3.6: Qualitative examples of retrieved proposals by TURN on THUMOS-14 dataset. GT indicates ground truth; TP and FP indicate true positive and false positive respectively. "reg prop" and "cls prop" indicate regression proposal and classification proposal.

3.3.3 Experiments on ActivityNet

Datasets. The ActivityNet datasets provide rich and diverse action categories. There are three releases of the ActivityNet dataset: v1.1, v1.2 and v1.3. All three versions define a 5-level hierarchy of action classes, where nodes on higher levels represent more abstract action categories. For example, the node "Housework" on level-3 has child nodes "Interior cleaning", "Sewing, repairing, & maintaining textiles" and "Laundry" on level-4. From the hierarchical action category definition, a subset can be formed by including all action categories that belong to a certain node.

Experiment setup. To compare with previous work, we conduct experiments on v1.1 (on the "Works" and "Sports" subsets) for temporal action localization [7, 103], and on v1.2 for proposal generalization capability, following the same evaluation protocol as in [13]. On v1.3, we design a different experimental setup to test TURN's cross-domain generalization capability: four subsets with distinct semantic meanings are selected, including "Participating in Sports, Exercise, or Recreation", "Vehicles", "Housework" and "Arts and Entertainment". We also check that the action categories in different subsets are not semantically related: for example, "archery" and "dodge ball" in the "Sports" subset, "changing car wheels" and "fixing bicycles" in the "Vehicles" subset, "vacuuming floor" and "cleaning shoes" in the "Housework" subset, and "ballet" and "playing saxophone" in the "Arts" subset. The evaluation metrics include the AR@AN curve for temporal action proposals and mAP for action localization.
AR@F=1.0 is reported for comparing proposal performance on different subsets. The validation set is used for testing, as the test set is not publicly available. To train TURN, we set the number of frames in a unit $n_u$ to 16, the context unit number $n_{ctx}$ to 4, L to 6 and $\lambda$ to 2.0. We build the temporal window pyramid with $\{2, 4, 8, 16, 32, 64, 128\}$ units. The NMS threshold is set to be 0.1 smaller than the tIoU used in evaluation. For the temporal action localizer, SVM classifiers are trained with two-stream CNN features on the "Sports" and "Works" subsets.

Generalization capability of TURN. One important property of TAP is that it is expected to generalize beyond the categories it is trained on. On ActivityNet v1.2, we follow the same evaluation protocol as [13]: the model is trained on the THUMOS-14 validation set and tested on three different sets of ActivityNet v1.2: the whole set of ActivityNet v1.2 (all 100 categories), ActivityNet v1.2 ∩ THUMOS-14 (the 9 categories shared between the two) and ActivityNet v1.2 ≤ 1024 frames (videos with unseen categories with annotations up to 1024 frames). To avoid any possible dataset overlap and enable direct comparison, we use C3D (pre-trained on Sports1M) as the feature extractor, the same as DAPs. As shown in Figure 3.7, TURN has better generalization capability on all three sets.

Figure 3.7: Comparison of generalizability on ActivityNet v1.2 dataset (average recall versus average number of retrieved proposals for sliding windows, DAPs and TURN-C3D-16 on the three evaluation sets).

On ActivityNet v1.3, we implement a different setup for evaluating generalization capability on subsets that contain semantically distinct actions: (1) we train TURN on one subset and test on the other three subsets; (2) we train on the ensemble of all 4 subsets and test on each subset. TURN is trained with C3D unit features to avoid any overlap of training data. We also report the performance of sliding windows (lengths of 32, 64, 128, 256, 512, 1024 and 2048, overlap 50%) on each subset. Average recall at frequency 1.0 (AR@F=1.0) is reported in Table 3.4. The left-most column lists the subsets used for training; the numbers of action classes and training videos in each subset are shown in brackets. The top row lists the subsets used for testing. The off-diagonal elements indicate that the training data and test data are from different subsets; the diagonal elements indicate that they are from the same subset. As can be seen in Table 3.4, the overall generalization capability is strong. Specifically, the generalization capability when training on the "Sports" subset is the best among the subsets, which may indicate that more training data leads to better generalization performance. The "Ensemble" row shows that using training data from other subsets does not harm the performance on each subset.

Table 3.4: Proposal generalization performance (AR@F=1.0 %) of TURN-C3D-16 on different subsets of ActivityNet.
Training subset (classes, videos) | Arts | Housework | Vehicles | Sports
Sliding Windows | 24.44 | 27.63 | 27.59 | 25.72
Arts (23, 685) | 44.30 | 44.38 | 40.85 | 38.43
Housework (10, 373) | 40.27 | 44.30 | 38.65 | 36.54
Vehicles (5, 238) | 38.43 | 40.05 | 42.22 | 30.70
Sports (26, 1294) | 43.26 | 43.58 | 41.40 | 46.62
Ensemble (64, 2590) | 45.30 | 48.12 | 42.33 | 46.72

TURN for temporal action localization.
Temporal action localization perfor- mance is evaluated and compared on \Works" and \Sports" subsets of ActivityNet v1.1. TURN trained with dense ow CNN features is used for comparison. On v1.1, TURN- FL-16 proposal is fed into one-vs-all SVM classiers which trained with two-stream CNN features. From the results shown in Table 3.5, we can see that TURN proposals improve localization performance. Table 3.5: Temporal action localization performance (mAP% @tIoU=0.5) on ActivityNet v1.1 Subsets [7] [103] Sliding Windows TURN-FL-16 Sports 33.2 36.7 27.3 37.1 Work 31.1 39.9 29.6 41.2 23 3.4 Conclusion We presented a novel and eective Temporal Unit Regression Network (TURN) for fast TAP generation. We proposed a new metric for TAP: Average Recall-Proposal Frequency (AR-F). AR-F is robustly correlated with temporal action localization performance and it allows performance comparison among dierent datasets. TURN can runs at over 880 FPS with the state-of-the-art AR performance. TURN is robust on dierent visual fea- tures, including C3D and dense ow CNN features. We showed the eectiveness of TURN as a proposal generation stage in localization pipelines on THUMOS-14 and ActivityNet. 24 Chapter 4 Cascaded Boundary Regression for Temporal Action Detection 4.1 Introduction Temporal action detection in long videos is an important and challenging problem, which has been receiving increasing attention recently. Given a long video, the task of action detection is to localize intervals where actions of interest take place and also predict the action categories. Good progress has been achieved in action classication [79, 85], where the task is to predict action classes in "trimmed" videos. Current state-of-the-art methods [62, 107, 77] on action detection extend classication methods to detection by applying action classi- ers on dense sliding windows. However, while sliding windows may contain an identi- able portion of the action, they do not necessarily cover the entire action instance or they could contain extraneous background frames, which may lead to inferior performance. Similar observations have also been made for use of sliding windows in object detection [74]. Inspired by object detection, Shouet al. [77] proposed a two-stage pipeline for action detection, called SCNN. In the rst stage, it produces actionness scores for multi-scale sliding windows and outputs the windows with high scores as class-agnostic temporal pro- posals; in the second stage, SCNN categorizes the proposals to specic actions. However, SCNN still suers from the imprecision of sliding window intervals. To improve temporal localization accuracy, recently a method called TURN [23] pro- posed to use temporal boundary regression. TURN takes sliding windows and their surrounding context as input and renes their temporal boundaries by learning a bound- ary regressor. We propose that the process of boundary estimation can be improved by deploying a cascade, where a regressed clip is fed back to the system for further rene- ment. As the system could observe dierent content in each round of renement, the boundary can be rened gradually. We adapt a two-stage action detection model with temporal coordinate regression. In the rst stage, our model takes sliding windows as input, and outputs class-agnostic tem- poral proposals. In the second stage, our model detects actions based on the proposals. 
The salient aspect in our model is that, inside each stage, we propose to use Cascaded Boundary Regression (CBR) to adjust temporal boundaries in a regression cascade, where regressed clips are fed back to the system for further boundary renement. The main dierences compared with TURN are (1) the use of cascaded boundary regression, (2) a 25 full pipeline for action detection (TURN only focus on proposal generation). We evaluate CBR on two challenging datasets: THUMOS-14 and TVSeries [11]. CBR outperforms state-of-the-art methods on both temporal action proposal generation and action detec- tion tasks by a large margin. The performance gain is especially remarkable under high IoU thresholds, e.g map@tIoU=0.5 on THUMOS-14 is improved from 19.0% to 31.0%. Our contributions are two-fold: (1) We propose a Cascaded Boundary Regression method for temporal boundary esti- mation, which is shown to be eective on both proposal generation and action detection. (2) We evaluate CBR on both proposal generation and action detection, and achieve state-of-the-art performance on both THUMOS-14 and TVSeries [11]. 4.2 Methods In this section, we describe the two-stage Cascaded Boundary Regression (CBR) network and the training procedure, its architecture is shown in Figure 4.1. This work is inspired by [23], so we use similar notations in the following. Temporal Boundary Regressor Stage 1: CBR-Proposal SlidingWindows UnitFeature Extractor Unit-level features Stage 2: CBR-Detection Proposals Detection Figure 4.1: Architecture of two-stage action detection pipeline with Cascaded Boundary Regression (CBR) 4.2.1 Video Unit Feature Extraction A video V containing T frames, V = ff i g T 1 , is divided into T=u f consecutive video units , where u f is the number of frames in a unit. A video unit can be represented as u =ft i g fs+u f 1 fs , wheref s is the starting frame,f s +u f 1 is the ending framef e . Units do not overlap with each other. Each unit is processed by a visual encoder E v to get a unit-level representationf u =E v (u). In our experiments, C3D [85] and two-stream CNN models [79] are investigated. Details are given in Section 4. 4.2.2 Video Clip Modeling A clip c is composed of units, c =fu j g us+cu1 us , where u s is the index of starting unit and c u is the number of units inside c. u e = u s +c u 1 is the index of ending unit u e , andfu j g ue us are called internal units of c. Besides the internal units, surrounding units for c are also modeled. fu j g us1 usnctx andfu j g ue+nctx ue+1 are the surrounding units before and after c respectively, n ctx is the number of units we consider. The surrounding units 26 provide temporal context for clips, which are important for temporal boundary inferring. Internal feature and context features are pooled from unit-level features separately by mean pooling operation P . The nal feature f c for a clip is the concatenation of context features and the internal feature. f c =P (fu j g su sunctx )k P (fu j g eu su )k P (fu j g eu+nctx eu ) (4.1) wherek represents vector concatenation. We scan a video by multi-scale temporal sliding windows. The temporal sliding windows are modeled by two parameters: window length l i and window overlap o i . Note that, although multi-scale clips would have temporal overlaps, the clip-level features are computed from unit-level features, which are only calculated once. 4.2.3 Temporal Coordinate Regression We rst introduce temporal coordinate regression and then introduce the two-stage pro- posal and detection pipeline. 
Our goal is to design a method that is able to estimate the temporal boundaries of actions. For spatial boundary regression, previous works [71, 24] use parameterized coordinate offsets: the boundary coordinates are first parameterized by the central coordinates and the size (i.e. length and width) of the bounding box, and the offsets are calculated based on these parameterized coordinates. In the temporal coordinate setting, the parameterized offsets can be represented as

$$o_x = (x_{gt} - x_{clip})/l_{clip}, \qquad o_l = \log(l_{gt}/l_{clip}) \quad (4.2)$$

where $x$ and $l$ denote the clip's center coordinate and clip length respectively; $x_{gt}$, $x_{clip}$ are for the ground truth clip and the test clip (likewise for $l$).

Instead of using parameterization, the non-parameterized offsets use the start and end coordinates directly. Specifically, there are two levels of coordinates: frame-level and unit-level. The frame-level coordinate is the index of the frame $f_i$; the unit-level coordinate is the index of the unit $u_j$. For an action instance, the ground truth start and end coordinates $t_s^{gt}$ and $t_e^{gt}$ are usually annotated in seconds, which can always be converted to frame level (multiplied by the FPS) as $f_s^{gt}$ and $f_e^{gt}$; the unit-level ground truth coordinates are then calculated by rounding:

$$u_s^{gt} = \langle f_s^{gt}/u_f \rangle, \qquad u_e^{gt} = \langle f_e^{gt}/u_f \rangle \quad (4.3)$$

where $\langle\cdot\rangle$ represents rounding and $u_f$ is the number of frames in a unit. The non-parameterized regression offsets are

$$o_s = s_{clip} - s_{gt}, \qquad o_e = e_{clip} - e_{gt} \quad (4.4)$$

where $s_{clip}$, $e_{clip}$ are the start and end coordinates of the input clip, which can be at frame level or unit level, and $s_{gt}$, $e_{gt}$ are the coordinates of the matched ground truth action instance. The intuition behind unit-level coordinate regression is that, as the basic unit-level features are extracted to encode $u_f$ frames, the features may not be discriminative enough to regress coordinates at the frame level. Compared with frame-level regression, unit-level coordinate regression is easier to learn, though with coarser boundaries.

4.2.4 Two-Stage Proposal and Detection Pipeline

Inspired by the proposal and detection pipeline in object detection, we design a two-stage pipeline for temporal action detection, in which class-agnostic proposals and class-specific detections are generated respectively, as shown in Figure 4.1. In both stages, temporal coordinate regression is used to infer temporal action boundaries. Specifically, given a clip $c=\langle s, e\rangle$, it is first processed by the proposal network, which outputs two boundary regression offsets $\langle o_s, o_e\rangle$ and an actionness score $p$ indicating whether $c$ is an action instance. If the output $p$ is higher than a threshold, the detection network takes $c'=\langle s', e'\rangle$ (the new temporal boundaries) as input and generates $n+1$ softmax scores $p^z$, $z\in[1, n+1]$, and $n$ pairs of boundary offsets $\langle o_s^z, o_e^z\rangle$, $z\in[1, n]$, where $n$ is the number of action categories.

4.2.5 Cascaded Boundary Regression

Figure 4.2: Unrolled model of Cascaded Boundary Regression (CBR); the parameters of the MLPs are shared.

In each stage (i.e. the proposal and detection stages), boundary regression is applied in a cascaded manner: the output boundaries are fed back as input to the network for further refinement, as shown in Figure 4.2. For the proposal network, given an input clip $c=\langle s, e\rangle$, the output clip $c^1=\langle s^1, e^1\rangle$ is fed back as input for a second round of refinement, and the second output is $c^2=\langle s^2, e^2\rangle$.
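To illustrate this cascaded refinement, here is a minimal sketch of the proposal-stage loop (Python; `proposal_net` is a hypothetical stand-in for the shared regressor, and non-parameterized unit-level offsets are assumed).

```python
def cascaded_refine(clip, proposal_net, steps=3):
    """Run several cascade steps of boundary regression on one clip.

    clip: (start_unit, end_unit)
    proposal_net: callable returning (offset_start, offset_end, actionness)
                  for a clip; the same network is reused at every step.
    Returns the refined clip and the product of per-step actionness scores.
    """
    s, e = clip
    score = 1.0
    for _ in range(steps):
        o_s, o_e, p = proposal_net((s, e))
        # Non-parameterized offsets are defined as (clip - ground truth),
        # so the refined boundary subtracts the predicted offset.
        s, e = s - o_s, e - o_e
        score *= p          # final score is the product of step scores
    return (s, e), score
```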
In general, the iteration process takes $K_c^p$ steps; the final boundaries and the actionness score for $c$ are

$$c^{K_c^p} = \langle s^{K_c^p}, e^{K_c^p}\rangle, \qquad p = \prod_{i=1}^{K_c^p} p^i \quad (4.5)$$

The cascade process of the detection network is similar to that of the proposal network. The detection network outputs $n$ pairs of temporal boundary offsets and $n+1$ category scores. Among the $n$ non-background categories, we take the category with the highest score as the prediction $z$ and pick the corresponding boundary offsets $\langle o_{s,z}, o_{e,z}\rangle$. The refined clip $c^1=\langle s_z^1, e_z^1\rangle$ is fed back into the network. After $K_c^d$ steps, the final boundaries and score for the predicted category $z$ are

$$c^{K_c^d} = \langle s_z^{K_c^d}, e_z^{K_c^d}\rangle, \qquad p = \prod_{i=1}^{K_c^d} p_z^i \quad (4.6)$$

The proposal network and the detection network are trained separately; details can be found in the next section. In either stage (proposal or detection), each cascade step could be trained separately, but here we have chosen to use the same network parameters in each step for simplicity.

4.2.6 Loss Function

To train CBR, we collect training samples from sliding windows, whose parameters (overlap and window length) will be introduced in Section 4. A class label is assigned to a sliding window if: (1) it has the highest temporal Intersection over Union (tIoU) overlap with a ground truth clip; or (2) it has tIoU larger than 0.5 with any of the ground truth clips. Note that a single ground truth clip may assign its label to multiple window clips. Negative labels are assigned to non-positive clips whose tIoU is equal to 0.0 (i.e. no overlap) for all ground truth clips. We design a multi-task loss $L$ to jointly train classification and coordinate regression:

$$L = L_{cls} + \lambda L_{reg} \quad (4.7)$$

where $L_{cls}$ is the classification loss, which is a standard cross-entropy loss: for the proposal network, $L_{cls}$ is a binary cross-entropy loss; for the detection network, it is a standard multi-class cross-entropy loss. $L_{reg}$ is the temporal coordinate regression loss and $\lambda$ is a hyper-parameter, which is set empirically. The regression loss is

$$L_{reg} = \frac{1}{N}\sum_{i=1}^{N}\sum_{z=1}^{n} l_i^z \left[ R(\hat{o}_{s,i}^z - o_{s,i}^z) + R(\hat{o}_{e,i}^z - o_{e,i}^z) \right] \quad (4.8)$$

where $R$ is the L1 distance, $N$ is the batch size and $n$ is the total number of categories. $l_i^z$ is the label: when the $i$-th sample is from category $z$, $l_i^z=1$, otherwise $l_i^z=0$. $\hat{o}$ is the estimated regression offset and $o$ is the ground truth offset. For parameterized offsets, $o_{s,i}^z$ and $o_{e,i}^z$ are replaced by $o_{x,i}^z$ and $o_{l,i}^z$.

The learning rate and batch size are set to 0.005 and 128 respectively, and we use the Adam [41] optimizer to train CBR. The ratio of background to non-background samples in a mini-batch is set to 10 for training the proposal network. For training the detection network, the number of background samples is equal to the average number of samples over all categories. $\lambda$ is set to 2 for both the proposal and detection networks.

4.3 Evaluation

We evaluate the effectiveness of the proposed Cascaded Boundary Regression (CBR) on the standard benchmarks THUMOS-14 and TVSeries, for both temporal action proposal generation and action detection.

Unit-level Feature Extraction. C3D unit-level features: the C3D model is pre-trained on Sports1M [37]; we uniformly sample 16 frames in a unit and input them into C3D, and the output of the fc6 layer is used as the unit-level feature. Two-stream features: we use the two-stream model [95] that is pre-trained on the ActivityNet v1.3 training set.
In each unit, the central frame is sampled to calculate the appearance CNN feature, which is the output of "Flatten 673" layer in ResNet [26]. For the motion feature, we sample 6 consecutive frames at the center of a unit and calculate optical ows [14] between them; these ows are then fed into the pretrained BN-Inception model [95, 33] and the output of "global pool" layer is extracted. The motion features and the appearance features are concatenated into 4096-dimensional vectors, which are used as unit-level features. 4.3.1 Experiments on THUMOS-14 We rst introduce the datasets and the experiment setup, then discuss the experimental results on THUMOS-14. Dataset. The temporal action localization part of THUMOS-14 contains over 20 hours of videos from 20 sport classes. There are 200 untrimmed videos in validation set and 213 untrimmed videos in test set. The training set of THUMOS-14 contains only trimmed videos. We train our model on the validation set and test it on the test set. Experimental setup. We perform the following experiments on THUMOS-14: (1) explore components in the proposed framework: (a) parametrized osets vs non- parameterized unit-level regression vs non-parameterized frame-level regression, (b) cas- caded steps for boundary regression; (2) comparison with state-of-the-art approaches. The unit size u f is 16, the surrounding unit number n ctx is set to 4. The sliding win- dow lengths and overlaps aref16(16); 32(16); 64(16); 128(32); 256(64); 512(128)g, where the numbers out of brackets are lengths of sliding windows, and the numbers in brackets are the corresponding overlaps of the sliding windows. Temporal coordinate regression. To explore which type of coordinate osets is most eective for boundary regression in temporal action detection, we test three types: (a) parameterized coordinate osets, which are similar to the ones in object detection [71], (b) non-parameterized frame-level coordinate osets and (c) non-parameterized unit-level coordinate osets. The results of temporal action detection are listed in Table 4.1. The cascade step K c is set to be 1 for both proposal stage and detection stage. Both C3D feature and Two-stream CNN feature are tested. We can see that all three regression osets provide improvement over "no regres- sion". Unit-level osets are more eective than frame-level osets; we think the reason is that, the features are extracted at unit level, frame-level coordinates contain redundant information, which may make the regression task more dicult. The performance of pa- rameterized coordinate osets is lower than that of non-parameterized unit-level osets. We think that the reason is that unlike objects which can be re-scaled in images with 30 Table 4.1: Comparison of dierent coordinate osets on action localization (% mAP@tIoU=0.5): parameterized, non-parameterized frame-level, non-parameterized unit-level. The performance with no boundary regression is also listed. no regression parameterized non-para, frame-level non-para, unit-level CBR-C3D 16.7 19.4 18.8 20.5 CBR-TS 22.3 26.1 25.3 27.7 camera projection, actions' time spans can not be easily re-scaled in videos, although the time spans of the same action can be varied in dierent videos. Therefore, "time" itself work as a standard scale for action instances. Cascaded boundary regression. We explore the eects of boundary regression cascade. Cascade step K p c and K d c are the number of boundary regression conducted in proposal stage and detection stage respectively. 
The results are shown in Table 4.2 and Table 4.3. We investigate the cascade step with C3D feature and two-stream CNN feature. Non-parameterized unit-level coordinate oset is adopted. Table 4.2: Comparison of cascaded step K p c = 1; 2; 3; 4 for temporal proposal generation (% AR@F=1.0) on THUMOS-14. K p c = 1 K p c = 2 K p c = 3 K p c = 4 CBR-C3D 38.6 39.6 39.4 37.8 CBR-TS 42.7 44.5 45.2 44.8 For the proposal network (shown in Table 4.2), we can see that cascaded boundary re- gression increase the performance from 42:7 to 45:0 for two-stream features, and from 38:6 to 39:6 for C3D features. WhenK p c = 3, two-stream CBR achieves the best performance, and after the performance peak, the performance drops slightly. Table 4.3: Comparison of cascaded step K d c = 1; 2; 3; 4 for temporal action detection (% mAP@tIoU=0.5) on THUMOS-14. K d c = 1 K d c = 2 K d c = 3 K d c = 4 CBR-C3D 21.5 22.7 22.4 22.2 CBR-TS 28.4 31.0 30.5 30.2 To test the eects of cascaded boundary regression for action detection, we xK p c = 3. As shown in Table 4.3, we observe a similar trend as in proposal generation: whenK d c = 2, CBR increases the performance from 28:4 to 31:0 for two-stream features, and from 21:5 to 22:7 for C3D features. After K d c = 2, the performance becomes saturated. Comparison with state-of-the-art on temporal proposal generation. We compare CBR-P with state-of-the-art methods on temporal action proposal generation, including SCNN-prop [77], DAPs[13] and Sparse-prop [6] and TURN [23]. The results are shown in Table 4.4. To fairly compare with TURN, we also provide the performance using only optical ow CNN features, which is the same for TURN-FL. We can see that CBR- FL outperforms state-of-the-art (TURN-FL) and CBR-TS provides further improvement over CBF-FL. 31 Table 4.4: Comparison with state-of-the-art on temporal action proposal generation. Average Recall at Proposal Frequency (AR@F=1.0) performance are reported. Method Sparse-prop[6] DAPs[13] SCNN-prop[77] TURN-FL [23] CBR-FL CBR-TS AR@AN=200 32.3 34.1 37.2 42.8 43.5 44.2 AR@F=1.0 33.3 35.7 38.3 43.5 44.4 45.2 Table 4.5: Temporal action detection performance (mAP %) comparison at dierent tIoU thresholds on THUMOS-14. tIoU 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Oneata et al.[62] 36.6 33.6 27.0 20.8 14.4 8.5 3.2 Yeung et al.[103] 48.9 44.0 36.0 26.4 17.1 - - Yuan et al. [107] 51.4 42.6 33.6 26.1 18.8 - - SCNN [77] 47.7 43.5 36.3 28.7 19.0 10.3 5.3 CBR-C3D 48.2 44.3 37.7 30.1 22.7 13.8 7.9 CBR-TS 60.1 56.7 50.1 41.3 31.0 19.1 9.9 Comparison with state-of-the-art on temporal action detection. We com- pare our method with other state-of-the-art temporal action localization methods on THUMOS-14, the results are shown in Table 4.5. We compare with the challenge results [62], and recent methods including based on segment window C3D [77], score pyramids [107] and deep recurrent reinforcement learning [103]. Both SCNN and CBR-C3D are based on C3D features, we can see that CBR-C3D outperforms SCNN at all tIoU thresh- olds, especially at high tIoU, which shows the eectiveness of CBR. If two-stream features are adopted, CBR outperforms state-of-the-art methods by 12% at tIoU=0.5. 4.3.2 Experiments on TVSeries We rst introduce the datasets and the experiment setup, then discuss the experimental results on TVSeries [11]. Dataset. The TVSeries Dataset [11] is a realistic, large-scale dataset for temporal action detection, which contains 16 hours of videos (27 episodes) from six recent popular TV series. 
30 daily life action categories are dened in TVSeries, such as "close door", "drive car", "wave". There are totally 6231 action instances annotated with start and end times and action categories in the dataset. The train/validation/test sets contain 13/7/7 episodes respectively. Experimental setup. We test the cascaded steps of boundary regression for both proposal generation and action detection, and then compare with state-of-the-art perfor- mance on action detection. The unit sizeu f is 6, the surrounding unit numbern ctx is set to 4. The sliding window lengths and overlaps aref12(6); 24(6); 48(12); 72(18); 96(24); 192(48); 384(96)g, where the numbers out of brackets are lengths of sliding window, and the numbers in brackets are the corresponding overlaps of the sliding window. Cascaded boundary regression. We explore the eects of cascaded boundary regression on TVSeries. Note that K c is the cascaded step. We investigate the cascaded step with two-stream CNN features. 32 Table 4.6: Comparison of cascaded step K c = 0; 1; 2; 3; 4 for temporal action detection (% mAP@tIoU=0.5) on TVSeries. K c = 0 means that the system only do classication, no boundary regression. K c = 0 K c = 1 K c = 2 K c = 3 K c = 4 Proposal (AR@F=1.0) 20.4 24.3 25.6 26.1 25.9 Detection (mAP@IoU=0.2) 6.2 8.8 9.5 9.2 9.0 As shown in Table 4.6, comparing withK c = 0 andK c = 1, we can see that temporal coordinate regression brings a big improvement, which shows its eectiveness. We can also see that whenK c = 3, CBR achieves the best performance for proposal network. To test detection network, we x K p c = 3, the results show that when K d c = 2 CBR achieves the best performance for action detection. After the peak, the performance starts to decrease. The performance distribution of cascaded step is consistent with THUMOS-14. Comparison with state-of-the-art on action detection. We compare CBR with state-of-the-art performance on TVSeires in Table 4.7. Overall, we can see that TVSeries is a more challenging dataset than THUMOS-14. To provide another comparison, we train SVM classiers based on two-stream features, which is shown in Table 4.7 as SVM-TS. The SVM classiers are trained and tested using the same samples as CBR, which are described in Section 3.6; clip-level features are mean-pooled from unit-level features. We can see that with the same features, CBR outperforms the SVM-based classiers by 3.5% at tIoU=0.2. At tIoU = 0:2, CBR achieves 9.5, while the state-of-the-art method FV [11] only achieves 4:9. We also report mAP performance at tIoU = 0:1 and tIoU = 0:3, which are 11.0 and 7.9 respectively. Table 4.7: Temporal action detection performance (mAP %) comparison at dierent tIoU thresholds on TVSeries. tIoU CNN [11] LSTM [11] FV [11] SVM-TS CBR-TS 0.1 - - - 7.3 11.0 0.2 1.1 2.7 4.9 6.0 9.5 0.3 - - - 4.6 7.9 4.4 Conclusion We present a novel two-stage action detection pipeline with Cascaded Boundary Regres- sion (CBR), which achieves state-of-the-art performance on standard benchmarks. In the rst stage, temporal proposals are generated; based on the proposals, actions are de- tected in the second stage. Cascaded boundary regression are conducted in both stages. Detailed experiments and analysis on cascaded steps are conducted, which show the ef- fectiveness of CBR for both temporal proposal generation and action detection. Dierent temporal regression oset settings are also investigated and discussed. State-of-the-art performance has been achieved on both THUMOS-14 and TVSeires dataset. 
33 Chapter 5 CTAP: Complementary Temporal Action Proposal Generation 5.1 Introduction We focus on the task of generating accurate temporal action proposals in videos; akin to object proposals for object detection [71], temporal action proposals are intended to capture \clips" or temporal intervals in videos that are likely to contain an action. There has been some previous work in this topic and it has been shown that, as expected and in analogy with object proposals, quality of temporal action proposals has a direct in uence on the action detection performance [23, 77]. High quality action proposals should reach high Average Recall (AR) with as few number of retrieved proposals as possible. The existing action proposal generation methods can be considered to belong to two main types. The rst type is sliding-window based, which takes clips from sliding windows as input, and outputs scores for proposals. SCNN-prop [77] is a representative of this type; it applies a binary classier to rank the sliding windows. TURN [23] adopts temporal regression in additional to binary classication to adjust the boundary of sliding windows. The architecture of this type is outlined as \SW-R&A" in Fig. 5.1. Sliding windows uniformly cover all segments in the videos (thus cover every ground truth segment), however the drawback is that the temporal boundaries are imprecise, in spite of the use of boundary adjustment, and thus high AR is reached at large number of retrieved of proposals, as shown in circle A in Fig. 5.1. The second type of action proposal generation methods can be summarized as action- ness score based. It applies binary classication on a ner level, i.e., unit or snippet (a few contiguous frames) level, to generate actionness scores for each unit. A Temporal Action Grouping (TAG) [109] technique, derived from the watershed algorithm [72], is designed to group continuous high-score regions as proposals. Each proposal's score is calculated as the average of its unit actionness scores. The structure is shown as \TAG" in Fig. 5.1. This type of method generates high precision boundaries, as long as the quality of actionness scores is high. However, the actionness scores have two common failure cases: having high scores at background segments, and having low scores at action segments. The former case leads to generation of wrong proposals, while the latter case may omit some correct proposals. These lead to the upper bound of AR performance limited at a low value (circle B in Fig. 5.1). 34 slidingwindows actionnessscore actionnessscore ranking adjustment C B A TAG ranking adjustment TAG SW+R&A TAG TAG+R&A Figure 5.1: The architectures of three baseline methods are shown: (1) SW+R&A: slid- ing windows are processed by a model for proposal ranking and boundary adjustment, e.g.TURN[23], SCNN [77]; (2) TAG: TAG [109] generate proposals based on unit-level actionness; (3) TAG+R&A: actionness proposals are processed with proposal ranking and boundary adjustment. Based on the above analysis, ranking-sliding-window and grouping-actionness-score methods have two complementary properties: (1) The boundaries from actionness-based proposals are more precise as they are predicted on a ner level, and window-level rank- ing could be more discriminative as it takes more global contextual information; (2) actionness-based methods may omit some correct proposals when quality of actionness scores is low, sliding windows can uniformly cover all segments in the videos. 
Adopting the rst complementary characteristic helps to resolve the rst failure case of action- ness proposals (i.e., generating wrong proposals). As shown in Fig. 5.1, a window-level classier is applied after TAG to adjust boundaries and rank the proposals, which cor- responds to model \TAG+R&A". Such combination has higher AR at low number of retrieved proposals compared to the sliding-window-based method (circle C in Fig. 5.1). However, it still fails to solve the second failure case, when actionness scores are low at true action segments, TAG is unable to generate these proposal candidates. This results in the limited performance upper bound as shown in circle B, Fig. 5.1. To address this, we further explore the complementary characteristics, and propose to adaptively select sliding windows to ll the omitted ones in actionness proposals. We propose a novel Complementary Temporal Action Proposal (CTAP) generator consisting of three modules. The rst module is an initial proposal generator, which out- puts actionness proposals and sliding-window proposals. The second module is a proposal complementary lter collects missing correct ones from sliding windows (addressing the second failure case of actionness score). Specically, the complementary lter applies a binary classier on the initial proposals to generate the probabilities indicating whether the proposals can be detected by actionness and TAG correctly, this classier is called proposal-level actionness trustworthiness estimator. The third module ranks the propos- als and adjusts the temporal boundaries. Specically, we design a temporal convolutional neural network, rather than simple temporal mean pooling used in TURN [23], to preserve the temporal ordering information. 35 slidingwindows > PATE + filtered proposals … " ( % ) { % } TAG { * } { , } tconv bdy BA actionnessscore PR BA bdy ppl boundaryunits boundaryunits proposalunits … Figure 5.2: The architecture of Complementary Temporal Action Proposal (CTAP) gen- erator. \BA" is short for boundary adjustment, \PR" is short for proposal ranking, \ppl" is short for proposal and \bdy" is short for boundary. We evaluated the proposed method on THUMOS-14 and ActivityNet v1.3; exper- iments show that our method outperforms state-of-the-art methods by a large margin for action proposal generation. We further apply the generated temporal proposals on the action detection task with a standard detector, and show signicant performance improvements consistently. In summary, our contribution are three-fold: (1) We proposed a novel Complementary Temporal Action Proposal (CTAP) generator which uses the complementary character- istics of actionness proposals and sliding windows to generate high quality proposals. (2) We designed a new boundary adjustment and proposal ranking network with tem- poral convolution which can eectively save the ordering information on the proposal boundaries. (3) We evaluated our method on two large scale datasets (THUMOS-14 and ActivityNet v1.3) and our model outperforms state-of-the-art methods by a large margin. 5.2 Complementary Temporal Action Proposal Generator In this section, we present the details of the Complementary Temporal Action Proposal (CTAP) generator. There are three stages in the pipeline of CTAP. The rst stage is to generate initial proposals, which come from two sources, one is actionness score and TAG [109], the other is sliding windows. The second stage is complementary ltering. 
As we discussed before, TAG omits some correct proposals when the quality of the actionness scores is low (i.e. low actionness scores on action segments), whereas sliding windows uniformly cover all segments in videos. Thus, we design a complementary filter to collect high-quality complementary proposals from sliding windows to fill in the omitted actionness proposals. The third stage is boundary adjustment and proposal ranking, which is composed of a temporal convolutional neural network.

5.2.1 Initial Proposal Generation

In this part, we first introduce video pre-processing, then present the actionness score generation, the temporal grouping process and the sliding window sampling strategy.

Video pre-processing. Following previous work [23], a long untrimmed video is first cut into video units or snippets, each containing $n_u$ continuous frames. A video unit $u$ is processed by a visual encoder $E_v$ to extract the unit-level representation $x_u = E_v(u) \in \mathbb{R}^{d_f}$. In our experiments, we use the two-stream CNN model [79, 95] as the visual encoder; details are given in Sec 5.3.2. Consequently, a long video is converted into a sequence of unit-level features, which are used as basic processing units later.

Actionness score. Based on unit-level features, we train a binary classifier to generate an actionness score for each unit. Specifically, we design a two-layer temporal convolutional neural network, which takes $t_a$ continuous unit features as input, $x \in \mathbb{R}^{t_a \times d_f}$, and outputs a probability for each unit indicating whether it is background or action, $p_x \in \mathbb{R}^{t_a}$:

$$p_x = \sigma(t_{conv}(x)), \qquad t_{conv}(x) = \mathcal{F}(\varphi(\mathcal{F}(x; \mathbf{W}_1)); \mathbf{W}_2) \quad (5.1)$$

where $\mathcal{F}(\cdot; \mathbf{W})$ denotes a temporal convolution operator and $\mathbf{W}$ is the weight of its convolution kernel. In this network, $\mathbf{W}_1 \in \mathbb{R}^{d_f \times d_m \times k \times k}$ and $\mathbf{W}_2 \in \mathbb{R}^{d_m \times 1 \times k \times k}$ ($k$ is the kernel size) are training parameters, $\varphi(\cdot)$ is a non-linear activation function, and $\sigma(\cdot)$ is the sigmoid function. After generating the probability $p_x$ for each sequence of continuous unit features $x$, the loss is calculated as the cross-entropy over the input samples within the batch:

$$L_{act} = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i^{\top}\log(p_{x_i}) + (1-y_i)^{\top}\log(1-p_{x_i}) \right] \quad (5.2)$$

where $y_i \in \mathbb{R}^{t_a}$ is a binary sequence for each input $x_i$ indicating whether each unit in $x_i$ contains action (label 1) or not (label 0), and $N$ is the batch size.

Actionness proposal generation strategy. We follow [109] and implement a watershed algorithm [72] to segment the 1-D sequence of scores. Given each unit's actionness score, raw proposals are generated whose units all have scores larger than a threshold. For some neighboring raw proposals, if the time duration (i.e. the maximum end time minus the minimum start time among these raw proposals) is larger than a given ratio of the whole video length, we group them as a proposal candidate. We iterate over all possible combinations of the threshold and the grouping ratio to generate proposal candidates, and apply Non-Maximum Suppression (NMS) to eliminate redundant proposals. The output actionness proposals are denoted as $\{b_j\}$.

Sliding window sampling strategy. Unlike actionness proposals, which depend on the actionness score distribution, sliding windows can uniformly cover all segments in the videos. The goal is to maximize the match with ground truth segments (high recall) while keeping the number of sliding windows as low as possible. In our experiments, different combinations of window size and overlap ratio are tested on the validation set. The sliding windows are denoted as $\{a_k\}$. Detailed settings are given in Sec 5.3.2.
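To make the proposal generation strategy above concrete, here is a minimal sketch of the thresholding step that produces raw proposals and of a greedy temporal NMS used to remove redundant candidates (Python; the neighbor-grouping rule between these two steps is omitted, and the function names are illustrative rather than the released implementation).

```python
def raw_proposals(scores, unit_sec, threshold):
    """Raw proposals: maximal runs of consecutive units whose actionness
    exceeds the threshold, returned as (start_sec, end_sec) pairs."""
    segments, start = [], None
    for i, s in enumerate(list(scores) + [float('-inf')]):  # sentinel closes the last run
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            segments.append((start * unit_sec, i * unit_sec))
            start = None
    return segments

def temporal_nms(proposals, scores, tiou_thresh=0.95):
    """Greedy 1-D NMS: keep the highest-scoring proposal and drop others
    whose tIoU with a kept proposal exceeds the threshold (0.95 in the text)."""
    order = sorted(range(len(proposals)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        s_i, e_i = proposals[i]
        redundant = False
        for j in keep:
            s_j, e_j = proposals[j]
            inter = max(0.0, min(e_i, e_j) - max(s_i, s_j))
            union = (e_i - s_i) + (e_j - s_j) - inter
            if union > 0 and inter / union > tiou_thresh:
                redundant = True
                break
        if not redundant:
            keep.append(i)
    return [proposals[i] for i in keep]
```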
5.2.2 Proposal Complementary Filtering

As discussed before, actionness proposals can be more precise but less stable, while sliding windows are more stable but less precise. The goal of the second stage is to collect, from the sliding windows, proposals that could be omitted by TAG. The core of this stage is a binary classifier whose input is a sequence of unit features (i.e. a proposal) and whose output is the probability that this proposal can be correctly detected by the unit-level actionness scores and TAG. This classifier is called the Proposal-level Actionness Trustworthiness Estimator (PATE).

PATE training. The training samples are collected as follows. Given a video, the ground truth segments $\{g_i\}$ are matched with the actionness proposals $\{b_j\}$. For a ground truth segment $g_i$, if there exists an actionness proposal $b_j$ whose temporal Intersection over Union (tIoU) with $g_i$ is larger than a threshold $\theta_c$, we label $g_i$ as a positive sample ($y_i=1$); if no such $b_j$ exists, $g_i$ is labelled as a negative sample ($y_i=0$). The unit-level features inside $g_i$ are mean-pooled into a single proposal-level feature $x_{g_i} \in \mathbb{R}^{d_f}$. PATE outputs trustworthiness scores indicating the probability that a proposal can be correctly detected by the actionness scores and TAG:

$$s_i = \sigma\left(\mathbf{W}_4\,\varphi(\mathbf{W}_3 x_{g_i} + \mathbf{b}_3) + \mathbf{b}_4\right) \quad (5.3)$$

where $\mathbf{W}_3 \in \mathbb{R}^{d_f \times d_m}$, $\mathbf{W}_4 \in \mathbb{R}^{d_m \times 1}$, $\mathbf{b}_3 \in \mathbb{R}^{d_m}$ and $b_4 \in \mathbb{R}$ are training parameters; other notations are similar to Eq. 5.1. The network is trained with a standard cross-entropy loss over the training samples in each batch ($N$ is the batch size):

$$L_{pate} = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i\log(s_i) + (1-y_i)\log(1-s_i) \right] \quad (5.4)$$

Complementary filtering. At test time, we apply the trustworthiness estimator to every proposal from the sliding windows $\{a_k\}$. For an input proposal, the trustworthiness score $p_t$ tells us how trustworthy the actionness scores are on the video content of this proposal. For a sliding window $a_k$, if $p_t(a_k)$ is lower than a threshold $\theta_a$ (meaning TAG may fail on this segment), the sliding window is collected. The collected proposals from sliding windows, together with all actionness proposals, are denoted as $\{c_m\}$ and are sent to the next stage, which ranks the proposals and adjusts the temporal boundaries. We call this process complementary filtering; the name derives from somewhat similar processes used in estimation theory.

5.2.3 Proposal Ranking and Boundary Adjustment

The third stage of CTAP is to rank the proposals and adjust the temporal boundaries. TURN [23] also does this, but it uses mean pooling to aggregate temporal features, which loses the temporal ordering information. Instead, we design a Temporal convolutional Adjustment and Ranking (TAR) network which uses temporal convolutional layers to aggregate the unit-level features.

TAR Architecture. Suppose that the start and end units (i.e. the temporal boundaries) of an input proposal $c_m$ are $u_s$, $u_e$. We uniformly sample $n_{ctl}$ unit-level features inside the proposal, called proposal units, $x_c \in \mathbb{R}^{n_{ctl} \times d_f}$. We also sample $n_{ctx}$ unit features at the start and end boundaries respectively, namely $[u_s - n_{ctx}/2,\, u_s + n_{ctx}/2]$ and $[u_e - n_{ctx}/2,\, u_e + n_{ctx}/2]$, called boundary units (denoted as $x_s \in \mathbb{R}^{n_{ctx} \times d_f}$, $x_e \in \mathbb{R}^{n_{ctx} \times d_f}$). Boundary units and proposal units are illustrated in Fig. 5.2. These three feature sequences (one for the proposal units and two for the boundary units) are input to three independent sub-networks.
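A minimal sketch of assembling these three input sequences from precomputed unit features is given below (Python/NumPy; the boundary clamping is an added assumption, as the thesis does not state how out-of-range indices are handled).

```python
import numpy as np

def tar_inputs(unit_features, u_s, u_e, n_ctl=8, n_ctx=4):
    """Sample proposal units and boundary units for TAR.

    unit_features: (num_units, d_f) array; u_s, u_e: start/end unit indices.
    Returns (x_s, x_c, x_e): start-boundary, proposal, end-boundary sequences.
    """
    num_units = len(unit_features)
    clamp = lambda idx: np.clip(idx, 0, num_units - 1)

    # n_ctl proposal units sampled uniformly inside [u_s, u_e]
    ctl_idx = clamp(np.linspace(u_s, u_e, n_ctl).round().astype(int))
    # n_ctx boundary units centered on each boundary
    s_idx = clamp(np.arange(u_s - n_ctx // 2, u_s + n_ctx // 2))
    e_idx = clamp(np.arange(u_e - n_ctx // 2, u_e + n_ctx // 2))

    return unit_features[s_idx], unit_features[ctl_idx], unit_features[e_idx]
```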
5.2.3 Proposal Ranking and Boundary Adjustment

The third stage of CTAP is to rank the proposals and adjust their temporal boundaries. TURN [23] does this as well; however, it uses mean-pooling to aggregate temporal features, which loses the temporal ordering information. Instead, we design a Temporal convolutional Adjustment and Ranking (TAR) network which uses temporal convolutional layers to aggregate the unit-level features.

TAR Architecture. Suppose that the start and end units (i.e., the temporal boundary) of an input proposal c_m are u_s and u_e. We uniformly sample n_ctl unit-level features inside the proposal, called proposal units x_c ∈ R^{n_ctl × d_f}. We also sample n_ctx unit features at the start and end boundaries respectively, covering [u_s - n_ctx/2, u_s + n_ctx/2] and [u_e - n_ctx/2, u_e + n_ctx/2], called boundary units (denoted as x_s ∈ R^{n_ctx × d_f}, x_e ∈ R^{n_ctx × d_f}). Boundary units and proposal units are illustrated in Fig. 5.2. These three feature sequences (one for the proposal units and two for the boundary units) are input to three independent sub-networks. The proposal ranking sub-network outputs the probability of action, and the boundary adjustment sub-networks output regression offsets. Each sub-network contains two temporal convolutional layers, which can be represented as:

o_s = t_conv(x_s),   p_c = \sigma(t_conv(x_c)),   o_e = t_conv(x_e)    (5.5)

where o_s, o_e and p_c denote the offset predictions for the start and end boundaries and the action probability for the proposal, respectively. Other notations are the same as in Eq. 5.1. Similar to TURN [23], we use non-parameterized regression offsets. The final score for a proposal a_k from the sliding windows is the classification score multiplied by the PATE score, p_t(a_k) p_c(a_k). The actionness proposals simply use p_c as the final score.

TAR Training. To collect training samples, we use dense sliding windows to match with the groundtruth action segments. A sliding window is assigned to a groundtruth segment if: (1) it has the highest tIoU overlap with a certain groundtruth segment among all other windows; or (2) it has tIoU larger than 0.5 with any one of the groundtruth segments. We use the standard Softmax cross-entropy loss to train the proposal ranking sub-network and an L1 distance loss for the boundary adjustment sub-networks. Specifically, the regression loss can be expressed as

L_reg = (1/N_pos) \sum_{i=1}^{N_pos} l_i ( |o_{s,i} - o*_{s,i}| + |o_{e,i} - o*_{e,i}| )    (5.6)

where o_{s,i} is the predicted start offset, o_{e,i} is the predicted end offset, o*_{s,i} is the groundtruth start offset and o*_{e,i} is the groundtruth end offset. l_i is the label, 1 for positive samples and 0 for background samples. N_pos is the number of positive samples in a mini-batch, as the regression loss is calculated only for positive samples. Similar to Eq. 5.4, a cross-entropy objective is calculated to guide the learning of the prediction score p_c for each proposal.

5.3 Experiments

We evaluate CTAP on the THUMOS-14 [35] and ActivityNet v1.3 [7] datasets.

5.3.1 Datasets

THUMOS-14 contains 1010 and 1574 videos for validation and testing purposes from 20 sport classes. Among them, 200 and 212 videos are labeled with temporal information in the validation and test sets respectively. Following the settings of previous work [23, 77], we train our model on the validation set and conduct evaluation on the test set. ActivityNet v1.3 consists of 19,994 videos collected from YouTube, labeled with 200 classes. The whole dataset is divided into three disjoint splits: training, validation and test, with a ratio of 50%, 25% and 25%, respectively. Since the annotation of the test split is not publicly available (for competition purposes), we compare and report the performances of different models on the validation set.

5.3.2 Experiment Setup

Unit-level feature extraction. We use the two-stream model [95] as the visual encoder E_v, pre-trained on the ActivityNet v1.3 training set. In each unit, the central frame is sampled to calculate the appearance CNN feature; it is the output of the "Flatten 673" layer in ResNet [26]. For the motion feature, we sample 6 consecutive frames at the center of a unit and calculate optical flows between them; these flows are then fed into the pre-trained BN-Inception model [33] and the output of the "global pool" layer is extracted. The motion features and the appearance features are both 2048-dimensional, and are concatenated into 4096-dimensional vectors (d_f = 4096), which are used as unit-level features. On THUMOS-14, we test our model with two settings of unit features, Flow-16 and Twostream-6.
Flow-16 only uses dense flow CNN features, and the unit size is set to 16, the same as in [23] (n_u = 16); Twostream-6 uses two-stream features and a unit size of 6 (n_u = 6). On ActivityNet v1.3, two-stream features are used and the unit size is 16 (Twostream-16, n_u = 16).

Actionness score generation. We set the kernel size of each temporal convolution to 3 (k = 3). The stride of the temporal convolution is 1. We choose the rectified linear unit (ReLU) as the non-linear activation function \varphi. The first temporal convolution output dimension is d_m = 1024. t_a is set to 4. The batch size is 128, the learning rate is 0.005, and the model is trained for about 10 epochs.

TAG algorithm. Following the setting of [109], we set the initial value of τ to 0.085. To enumerate all possible combinations of (τ, γ), we first iterate τ in the range [0.085, 1) with a step of 0.085. In each iteration, we further iterate γ in the range [0.025, 1] with a step of 0.025. The NMS threshold is set to 0.95 to eliminate redundant proposals.

PATE setting. We set the first fully-connected layer's output dimension to d_m = 1024. θ_a is set to 0.1 on both THUMOS-14 and ActivityNet v1.3. The batch size is 128 and the learning rate is 0.005. PATE is trained for about 10 epochs.

TAR setting. On THUMOS-14, we uniformly sample 8 unit features inside each proposal (n_ctl = 4), and 4 unit features as context (n_ctx = 4). On ActivityNet v1.3, we set n_ctl = 8 and n_ctx = 4. d_m is set to 1024. TAR is optimized using the Adam algorithm [41]. The batch size is 128 and the learning rate is 0.005. TAR is trained for 10 epochs on THUMOS-14 and 4 epochs on ActivityNet v1.3.

Evaluation Metrics. For the temporal action proposal generation task, Average Recall (AR) is usually used as the evaluation metric. Following previous work, we use tIoU thresholds from 0.5 to 1.0 with a step of 0.05 on THUMOS-14 and from 0.5 to 0.95 with a step of 0.05 on ActivityNet v1.3. We draw the curve of AR at different Average Numbers (AN) of retrieved proposals to evaluate the relationship between recall and proposal number, which is called the AR-AN curve. On ActivityNet v1.3, we also use the area under the AR-AN curve (AUC) as a metric, where AN varies from 0 to 100. For the evaluation of temporal action detection, we follow the traditional mean Average Precision (mAP) metric used in THUMOS-14. A prediction is regarded as positive only when it has the correct category prediction and a tIoU with ground truth higher than a threshold. We use the official toolkit of THUMOS-14.

Table 5.1: Performance comparison between TAR and TURN [23] on THUMOS-14 test set. The same unit feature (flow-16) and test sliding windows are used for TAR and TURN for fair comparison. Average Recall (AR) at different numbers is reported.
Method      AR@50   AR@100   AR@200
TURN [23]   21.75   31.84    42.96
TAR         22.99   32.21    45.08

Table 5.2: Complementary filtering evaluation on THUMOS-14 test set, compared with "Union" and "tIoU-selection". Average Recall (AR) at different numbers is reported.
Method                          AR@50   AR@100   AR@200
Union                           25.80   34.70    46.19
tIoU-selection                  30.35   38.34    42.41
PATE complementary filtering    31.03   40.23    50.13

5.3.3 Evaluation on THUMOS-14

In this part, we evaluate our method on the THUMOS-14 dataset. First, we compare our proposal ranking and boundary adjustment module TAR with TURN [23]. Second, we evaluate the effectiveness of PATE and the proposal complementary filtering module. Third, we compare our full model with state-of-the-art methods, and finally we apply our proposals to the action detection task to verify their performance advantage.

TAR vs TURN [23].
As we presented before, TURN [23] uses temporal mean pooling to aggregate features; it loses the temporal ordering information, which could be important for boundary adjustment. TAR uses temporal convolution to extract temporal information from unit features, and adopts independent sub-networks for proposal ranking and boundary adjustment. To compare fairly with TURN, we use flow-16 features and the same test sliding window settings as TURN. As shown in Table 5.1, TAR outperforms TURN at AN=50, 100 and 200, which shows the effectiveness of TAR.

Complementary filtering. Besides using PATE in the proposal complementary filtering, we design two baseline methods to combine the sliding windows and actionness proposals. The first method is a simple "union", in which we simply put all actionness proposals and all sliding windows together and send them into the TAR module for ranking and adjustment. The second method is tIoU-based: all actionness proposals are selected; we calculate the tIoU between the sliding windows and the actionness proposals, and if there exists a sliding window whose highest tIoU with all actionness proposals is lower than 0.5, it is selected. We use flow-16 unit features and the same test sliding windows as in the "TAR vs TURN" experiments. The results are shown in Table 5.2. We can see that complementary filtering achieves the best AR at every AN (50, 100 and 200). The performance of "Union" suffers at low AN, but is higher than "tIoU-selection" at AN=200. We believe the reason is that the simple union method adds too many low-quality proposals from the sliding windows, and tIoU-selection eliminates some high-quality windows simply based on the tIoU threshold. Complementary filtering dynamically generates trustworthiness scores on different windows, which makes the selection process more effective.

Figure 5.3: AR-AN curves of the complementary results with flow-16 features (F16) and two-stream-6 features (TS6). Complementary filtering proposals outperform sliding windows (SW+TAR) and actionness proposals (TAG+TAR) consistently.

Table 5.3: Comparison of CTAP and other proposal generation methods with the same action detector (SCNN) on THUMOS-14 test set; mean Average Precision (mAP % @tIoU=0.5) is reported.
Method      Sparse [6]   DAPs [13]   SCNN-prop [77]   TURN [23]   TAG [109]   CTAP-F16   CTAP-TS6
tIoU=0.5    15.3         16.3        19.0             25.6        25.9        27.9       29.9

We also show the AR performance of the two sources, actionness proposals and sliding windows, in Fig. 5.3. Both the flow-16 (F16) feature and the twostream-6 (TS6) feature are illustrated. It can be seen that the performance of the complementary proposals is consistently higher than that of the actionness proposals (TAG+TAR) and sliding windows (SW+TAR) at every AN, which shows that our method can effectively select high-quality complementary proposals from sliding windows to fill in the ones omitted by the actionness proposals.

Comparison with state-of-the-art methods. We compare our full model with state-of-the-art methods on the THUMOS-14 dataset using the Average Recall vs. Average Number of proposals (AR-AN) curve and the recall@100-tIoU curve, as shown in Fig. 5.4. It can be seen that our model outperforms the state-of-the-art models by a large margin on both curves. Specifically, for AR@100, the performance of CTAP is around 43%, while the state-of-the-art method TURN [23] only achieves about 32%.

CTAP for temporal action detection.
To verify the quality of our proposals, we feed CTAP proposals into SCNN [77], and compare with other proposal generation methods on the same action detector (SCNN). The results are shown in Table 5.3. We can see that our CTAP-TS6 achieves the best performance, and outperforms the state-of-the-art proposal methods TURN [23] and TAG [109] by over 4%, which proves the effectiveness of the proposed method.

Figure 5.4: AR-AN curve and recall@AN=100 curve of CTAP and state-of-the-art methods on THUMOS-14 test set.

5.3.4 Evaluation on ActivityNet v1.3

Evaluation of TAR. To show the effectiveness of TAR, we report the AR@100 values and the area under the AR-AN curve for different models in Table 5.4. For sliding window proposals, we observe that TAR's prediction (SW-TAR) achieves 18.29% and 6.86% improvements in AR@100 and AUC compared to those of TURN [23] (SW-TURN). The results show that TAR is more effective in temporal boundary adjustment and proposal ranking. For actionness proposals, we observe that TAR achieves a 10.70% increase compared to TURN [23] on AUC.

Table 5.4: Evaluation of TAR on ActivityNet v1.3 validation set. AR@100 and AUC of the AR-AN curve are reported.
Compared Method    AR@100   AUC    |   Our Method   AR@100   AUC
SW-TURN [23]       49.73    54.16  |   SW-TAR       68.02    61.02
TAG-TURN [109]     63.46    53.92  |   TAG-TAR      64.01    64.62

Evaluation of PATE. Based on TAR, we further explore the function of PATE complementary filtering. We evaluate three different models: (1) sliding window proposals with TAR (SW-TAR); (2) actionness proposals with TAR (TAG-TAR); (3) PATE complementary proposals with TAR (our full model, CTAP). The AR@100 and AUC performances of the different models are reported in Table 5.5. CTAP achieves consistently better AR@100 and AUC compared to SW-TAR and TAG-TAR, which shows its advantage of selecting complementary proposals from sliding windows to fill in the ones omitted by the actionness proposals.

Table 5.5: Evaluation of complementary proposal filtering on ActivityNet v1.3 validation set. AR@100 and AUC of the AR-AN curve are reported.
Compared Method    SW-TAR   TAG-TAR   CTAP
AR@100             68.02    64.01     73.17
AUC                61.02    64.62     65.72

Comparison with state-of-the-art methods. CTAP is compared with state-of-the-art methods on the ActivityNet v1.3 validation set by the Average Recall at top 100 ranked proposals (AR@100) and the area under the AR-AN curve (AUC). In Table 5.7, we find that CTAP achieves 2.60% and 1.32% increases in AR@100 compared with the state-of-the-art methods MSRA [101] and Prop-SSAD [49] respectively.

Generalization ability of proposals. We evaluate the generalization ability of CTAP on the ActivityNet v1.3 validation set. Following the setting of [96], we evaluate the AR@100 and the AR-AN area under curve (AUC) for 100 seen classes and 100 unseen classes respectively. In Table 5.6, we observe that CTAP achieves better performance on the 100 seen classes. On the 100 unseen classes, there is only a slight drop in AR@100 and AUC, which shows the generalizability of CTAP.

Table 5.6: Generalization evaluation of CTAP on ActivityNet v1.3 (validation set) in terms of AR@100 and AR-AN area under curve.
               Seen (100 classes)   Unseen (100 classes)
AR@100         74.06                72.51
AR-AN (AUC)    66.01                64.92

5.3.5 Qualitative Results

We further visualize some temporal action proposals generated by CTAP. As shown in Fig. 5.5, CTAP is able to select the most suitable initial proposals from actionness proposals or sliding windows, and then adjust their temporal boundaries more precisely.
5.4 Conclusion

Previous methods for temporal action proposal generation can be divided into two groups, sliding window ranking and actionness score grouping, which are complementary to each other: sliding windows uniformly cover all segments in videos, but their temporal boundaries are imprecise; actionness score based methods may have more precise boundaries but may omit some proposals when the quality of the actionness scores is low. We propose a novel Complementary Temporal Action Proposal (CTAP) generator, which collects high-quality complementary proposals from sliding windows and actionness proposals. A temporal convolutional network for proposal ranking and boundary adjustment is also designed. CTAP outperforms state-of-the-art methods by a large margin on both the THUMOS-14 and ActivityNet v1.3 datasets. Further experiments on action detection show consistent, large performance improvements.

Table 5.7: Evaluation of CTAP on ActivityNet v1.3 validation set. AR@100 and AUC of the AR-AN curve are reported. (The AR@100 of MSRA [101] is not available.)
Compared Method    MSRA [101]   prop-SSAD [49]   CTAP
AR@100             -            73.01            73.17
AUC                63.12        64.40            65.72

Figure 5.5: Visualization of temporal action proposals generated by CTAP. The first two rows show 4 temporal action proposals from 2 videos in THUMOS-14. The last two rows show 4 temporal action proposals from 2 videos in ActivityNet v1.3.

Chapter 6
Reinforced Encoder-Decoder Networks for Action Anticipation

6.1 Introduction

Action anticipation refers to the detection (i.e., anticipation) of an action before it happens. Many real-world applications rely on this predictive capability; for example, a surveillance system can raise an alarm before an accident happens and allow for intervention, and robots can use anticipation of human actions to make better plans and interactions [43]. Note that online action detection [11] can be viewed as a special case of action anticipation, where the anticipation time is 0. Action anticipation is challenging for many reasons. First, it needs to overcome all the difficulties of action detection, which require strong discriminative representations of video clips and the ability to separate action instances from the large and wide variety of irrelevant background data. Then, for anticipation, the representation needs to capture sufficient historical and contextual information to make future predictions that are seconds ahead.

Figure 6.1: Anticipating future actions by inferring from history information: the normal images represent past frames and the transparent images represent future frames.

State-of-the-art methods on online action detection [52, 11, 102] learn LSTM networks to encode history information and predict actions based on the hidden state of the LSTM. For action anticipation, early work [45, 64] was based on traditional hand-crafted features. Recently, Vondrick et al. [88] proposed to use deep neural networks to first anticipate visual representations of future frames and then categorize the anticipated representations into actions. However, the future representation is anticipated based on a single past frame's representation, while actions are better modeled over a clip, i.e., multiple frames. Besides, their model only anticipates for a single fixed time; it is desirable to be able to anticipate a sequence of continuous future representations. To address the anticipation challenges, we propose a Reinforced Encoder-Decoder (RED) network.
The encoder-decoder network takes continuous steps of history visual representations as input and outputs a sequence of anticipated future representations. These anticipated representations are processed by a classification network for action classification. A squared loss is used for the representation anticipation and a cross-entropy loss is used for action category anticipation (classification) during training. One drawback of the traditional cross-entropy loss is that it only optimizes the encoder-decoder networks greedily at each time step, and lacks sequence-level optimization [67]. We propose to use reinforcement learning to train the encoder-decoder networks at the sequence level. The reward function is designed to encourage the model to make the correct anticipations as early as possible. We test RED on TVSeries [11], THUMOS-14 and TV-Human-Interaction [63] for action anticipation and online action detection, and achieve state-of-the-art performance.

6.2 Reinforced Encoder-Decoder Network

RED contains three modules: a video representation extractor; an encoder-decoder network to encode history information and anticipate future video representations; a classification network to anticipate action categories; and a reinforcement module to calculate rewards, which is incorporated in the training phase using a policy gradient algorithm [92]. The architecture is shown in Figure 6.2.

6.2.1 Video Processing

A video is segmented into small chunks; each chunk contains f = 6 consecutive frames. The video chunks are processed by a feature extractor E_v, which takes a video chunk u_i as input and outputs the chunk representation V_i = E_v(u_i). More details on video pre-processing and feature extractors can be found in Section 6.3.1.

6.2.2 Encoder-Decoder Network

The encoder-decoder network uses an LSTM network as its basic cell. The input to this network is a vector sequence S_in = {V_i}, i ∈ [t - T_enc, t), where vector V_i is a chunk visual representation, T_enc is the length of the input sequence, and t is the current time point in the video. After the last input vector has been read, the decoder LSTM takes over the last hidden state of the encoder LSTM and outputs a prediction for the target sequence S_out = {V̂_j}, j ∈ [t, t + T_dec), where T_dec is the length of the output sequence, i.e., the number of anticipation steps. The target sequence consists of the representations of the video chunks that come after the input sequence. The goal of the decoder LSTM is to regress future visual representations, based on the last hidden state of the encoder network. The loss function for training the encoder-decoder networks is the squared loss,

L_reg = (1/N) \sum_{k=1}^{N} \sum_{j=1}^{T_dec} || V̂_j^k - V_j^k ||    (6.1)

where N is the batch size, V̂_j^k is the anticipated representation and V_j^k is the ground truth representation.

Figure 6.2: Reinforced Encoder-Decoder (RED) network architecture for action anticipation.

6.2.3 Classification Network

The output vector sequence S_out of the encoder-decoder network is processed by the classification network, which has two fully connected layers, to output a classification distribution over action categories. The loss function for classification is the cross-entropy loss:

L_cls = -(1/N) \sum_{k=1}^{N} \sum_{t=1}^{T_dec} log( p(y_t^k | y_{1:t-1}^k) )

where p(y_t^k) is the probability score.
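The following is a minimal sketch (assumptions, not the authors' exact model) of the encoder-decoder of Sec. 6.2.2 together with the classification head of Sec. 6.2.3: T_enc chunk features in, T_dec anticipated features and class scores out. Feeding each prediction back into the decoder and the class count (e.g., 30 actions plus background) are assumptions; the text leaves these details open.

import torch
import torch.nn as nn

class REDSketch(nn.Module):
    def __init__(self, feat_dim=4096, hidden=4096, t_dec=8, num_classes=31):
        super().__init__()
        self.t_dec = t_dec
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder_cell = nn.LSTMCell(feat_dim, hidden)
        self.project = nn.Linear(hidden, feat_dim)        # anticipated chunk feature
        self.classify = nn.Sequential(                    # two FC layers (Sec. 6.2.3)
            nn.Linear(feat_dim, 1024), nn.ReLU(), nn.Linear(1024, num_classes))

    def forward(self, history):                           # history: (B, T_enc, feat_dim)
        _, (h, c) = self.encoder(history)
        h, c = h[0], c[0]                                 # seed decoder with last encoder state
        step_in, feats, logits = history[:, -1], [], []
        for _ in range(self.t_dec):
            h, c = self.decoder_cell(step_in, (h, c))
            v_hat = self.project(h)
            feats.append(v_hat)
            logits.append(self.classify(v_hat))
            step_in = v_hat                               # feed prediction back in (assumption)
        return torch.stack(feats, 1), torch.stack(logits, 1)

# toy usage: squared loss of Eq. (6.1) plus per-step cross-entropy
model = REDSketch()
hist, future = torch.randn(2, 16, 4096), torch.randn(2, 8, 4096)
labels = torch.randint(0, 31, (2, 8))
feats, logits = model(hist)
l_reg = (feats - future).pow(2).sum(-1).sqrt().mean()
l_cls = nn.functional.cross_entropy(logits.reshape(-1, 31), labels.reshape(-1))
print(l_reg.item(), l_cls.item())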
6.2.4 Reinforcement Module

A natural expectation for action anticipation is to make the correct anticipation as early as possible. For example, consider two anticipation sequences "000111" and "001110", and assume the ground truth sequence is "011111", where "1" indicates that an action category is happening and "0" indicates that no action is happening (i.e., background). "001110" gives the correct anticipation earlier than "000111", so we consider it a better anticipation at the sequence level. However, the cross-entropy loss would not capture such sequence-level distinctions, as it is calculated at each step to output higher confidence scores on the ground truth category, and no sequence-level information is involved.

To incorporate sequence-level rewards, we bring reinforcement learning into our system. The anticipation module (the encoder-decoder networks) and the classification module (the FC layers) together can be viewed as an agent, which interacts with the external environment (the feature vector taken as input at every time step). The parameters of this agent define a policy, whose execution results in the agent making a prediction. In the action detection and anticipation setting, a prediction refers to predicting the action category in the sequence at each time step. After making a prediction, the agent updates its internal state (the hidden state of the LSTM) and observes a reward.

We design a reward function to encourage the agent to make the correct anticipation as early as possible. Assume the agent outputs an anticipation sequence {ŷ_t} and the corresponding ground truth labels are {y_t}. In the ground truth label sequence, we denote the time position t_f at which the label starts to change from background to some action class as the transferring time; for example, in "001111", t_f = 2. At each step t of the anticipation, the reward r_t is calculated as

r_t = α / (t + 1 - t_f),  if t ≥ t_f and ŷ_t = y_t;   r_t = 0,  otherwise    (6.2)

where α is a constant parameter. If the agent makes the correct prediction at the transferring time t_f, it receives the largest reward. The reward for making a correct anticipation decays with time. The cumulative reward of the sequence is calculated as R = \sum_{t=1}^{T_dec} r_t. The goal is to maximize the expected reward R when interacting with the environment, that is, to encourage the agent to output correct anticipations as early as possible. More formally, the policy of the agent induces a distribution over possible anticipation sequences p_{y(1:T)}, and we want to maximize the reward under this distribution:

J(θ) = E_{p(y(1:t); θ)} [ \sum_{t=1}^{T_dec} r_t ] = E_{p(y(1:t); θ)} [R]    (6.3)

where y(1:t) is the action category predicted by our model. To maximize J(θ), as shown in REINFORCE [92], an approximation to the gradient is given by

∇_θ J ≈ (1/N) \sum_{k=1}^{N} \sum_{t=1}^{T_dec} ∇_θ log π(a_t^k | h_{1:t}^k, a_{1:t-1}^k) R_t^k    (6.4)

where π is the agent's policy. In our case, the policy is the probability distribution over action categories at each time step. Thus, the gradient can be written as

∇_θ J ≈ (1/N) \sum_{k=1}^{N} \sum_{t=1}^{T_dec} ∇_θ log p(y_t^k | y_{1:t-1}^k) (R_t^k - b_t^k)    (6.5)

where b_t^k is a baseline reward, which is estimated by a separate network. This network consists of two fully connected layers and takes the last hidden state of the encoder network as input.
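Here is a small sketch of the reward schedule above, assuming the decaying form α/(t + 1 - t_f) as reconstructed in Eq. (6.2). It only illustrates how earlier correct anticipations accumulate a larger return R; the learned baseline b_t of Eq. (6.5) is not modeled.

import numpy as np

def step_rewards(pred, gt, alpha=1.0):
    """Per-step rewards: largest at the transferring time t_f, decaying afterwards."""
    gt = np.asarray(gt)
    action_steps = np.nonzero(gt > 0)[0]
    if len(action_steps) == 0:
        return np.zeros(len(gt))
    t_f = action_steps[0]                       # first step where background -> action
    r = np.zeros(len(gt))
    for t in range(len(gt)):
        if t >= t_f and pred[t] == gt[t]:
            r[t] = alpha / (t + 1 - t_f)
    return r

gt    = [0, 1, 1, 1, 1, 1]                      # "011111"
early = [0, 0, 1, 1, 1, 1]                      # correct from the second step
late  = [0, 0, 0, 1, 1, 1]                      # correct only from the third step
for name, pred in [("early", early), ("late", late)]:
    r = step_rewards(pred, gt)
    print(name, "rewards:", r, "R =", r.sum())  # the earlier sequence earns the larger return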
6.2.5 Training Procedure

The goal of the encoder-decoder networks is to anticipate future representations, so unlabelled video segments (whether or not they contain actions) can be used as training samples. As the positive segments (i.e., segments where some action is happening) form only a very small part of the videos in the whole dataset, RED is trained by a two-stage process. In the first stage, the encoder-decoder networks for representation anticipation are trained with the regression loss L_reg on all training videos in the dataset as initialization. In the second stage, the encoder-decoder networks are optimized by the overall loss function L, which includes the regression loss L_reg, the cross-entropy loss L_cls introduced by the classification networks, and J introduced by the reinforcement module, on the positive samples in the videos:

L = L_reg + L_cls - J    (6.6)

The classification network is only trained in the second stage by L_cls and J. The training samples for the first stage do not require any annotations, so they can be collected at any position in the videos. Specifically, at a time point t, the input sequence for the encoder networks is [V_{t-T_enc}, V_t), and the output ground truth sequence for the decoder networks is [V_t, V_{t+T_dec}), where V_t is the visual representation at t. For the second stage, the training samples are collected around positive action intervals. Specifically, given a positive action interval [t_s, t_e], the central time point t can be selected from t > t_s - T_enc to t < t_e. After picking the central time point t, [V_{t-T_enc}, V_t) is used as the input sample for the encoder, [V_t, V_{t+T_dec}) is used as the output ground truth visual representations for the anticipated sequence, and [y_t, y_{t+T_dec}) are the ground truth action labels for the anticipated sequence, which are used in the classification cross-entropy loss.

6.3 Evaluation

We evaluate our approach on standard benchmarks for action anticipation, including TVSeries [11], THUMOS-14 and TV-Human-Interaction [63]. As no previous action anticipation results are available on THUMOS-14 and TVSeries, we build several strong baselines (including one similar to [88]) to compare with.

6.3.1 Implementation Details

We extract frames from all videos at 24 Frames Per Second (FPS). The video chunk size is set to 6, so each chunk is 0.25 seconds long in the original video. We investigate two video feature extractors: a two-stream CNN model [95] and a VGG-16 [80] model. For the two-stream model, we use the central frame of each chunk to calculate the appearance CNN feature; the output of the "Flatten 673" layer in ResNet is extracted. We calculate the optical flows [14] between the 6 consecutive frames in a chunk and feed them into the motion stream; the output of the "global pool" layer in BN-Inception [33] is used. We concatenate the motion features and the appearance features into 4096-dimensional vectors, which are used as the chunk representation. To provide a fair comparison with [11], we also use VGG-16 features: the central frame in a chunk is sampled and processed by a VGG-16 model (pre-trained on ImageNet), and the output of fc6 is used as the chunk representation. We set T_enc = 16 and T_dec = 8, which correspond to 4 seconds and 2 seconds respectively. The hidden state of the LSTM is 4096-dimensional and the number of LSTM layers is 1. We also tried a 2048-dimensional hidden state, but it showed worse performance. α in Equation (6.2) is set to 1. Adam [41] is adopted to train RED, the learning rate is set to 0.001, and the batch size is 64.

Figure 6.3: Baseline methods for action anticipation. The left one is FC and the right one is EFC.
6.3.2 Baseline Methods

FC: We implement a method similar to that in [88], in which fully connected layers are trained to anticipate the future representation, as shown in Figure 6.3 (left). Specifically, the chunk representation at time t is input to two fully connected layers and regressed to the future chunk representation at time t + T_ant, where T_ant is the anticipation time. The anticipated representations are processed by another classification network, which consists of two fully connected layers; its outputs are the action category anticipations. T_ant is set to 4 chunks, which is 1 second.

Encoder-FC (EFC): The FC baseline above only considers a single representation for anticipation. This baseline extends the fully connected layers with an LSTM encoder network, which encodes the representations from t - T_enc to t and anticipates the representation at t + T_ant, as shown in Figure 6.3 (right). The classification layers take the anticipated representations to anticipate action categories. We set T_enc = 16, the same as for RED. T_ant is set to 4.

Encoder-Decoder (ED): This baseline is similar to RED, but without the reinforcement module. The EFC baseline considers a sequence of history representations and outputs a single anticipated video representation at time t + T_ant. Instead of anticipating a single future representation, the decoder learns to anticipate multiple continuous representations from time t to t + T_dec. We set T_enc = 16 and T_dec = 8, the same as for RED.

The aforementioned baseline models are trained by a two-stage process similar to RED's. In the first stage, the representation anticipation networks, i.e., the 2-layer FC anticipation, EFC anticipation or encoder-decoder anticipation networks, are trained by a regression loss on all the videos in the dataset (THUMOS-14 or TVSeries) as initialization. In the second stage, the anticipation networks are trained by the regression loss and a cross-entropy loss, which is introduced by the classification network, on the positive samples in the videos.

6.3.3 Experiments

Dataset. The TVSeries dataset [11] is a realistic, large-scale dataset for temporal action detection; it contains 16 hours of videos (27 episodes) from six recent popular TV series. TVSeries contains 30 daily life action categories, such as "answer phone", "drive car" and "close door". The temporal detection part of THUMOS-14 contains over 20 hours of videos from 20 sport classes. The training set of THUMOS-14 contains only trimmed videos, which are not suitable for action anticipation and action detection. There are 200 and 213 untrimmed videos in the validation and test sets respectively. We train RED on the validation set and test it on the test set. The TV-Human-Interaction dataset [63] contains 300 video clips extracted from 23 different TV shows. There are 4 interactions in the dataset: hand shakes, high fives, hugs and kisses. The video lengths range from 1 second to 10 seconds.

Experiment setup. Training the representation anticipation (stage 1) needs tens of hours of video data. TVSeries and THUMOS-14 have enough videos, but TV-Human-Interaction only contains tens of minutes. So we train and test our model on both TVSeries and THUMOS-14, and compare the performance of RED for anticipation times T_a = 0.25s to 2.0s. For TV-Human-Interaction, we use the stage-1 model trained on THUMOS-14 or TVSeries, and train stage-2 on TV-Human-Interaction.
On TVSeries, the metric we use is calibrated Average Precision (cAP), which was proposed by [11]. Calibrated precision (cPrec) uses a parameter w, the ratio between negative frames and positive frames, in the calculation of precision, cPrec = TP / (TP + FP/w), so that the average precision is calculated as if there were an equal number of positive and negative frames. On THUMOS-14, we report per-frame mean Average Precision (mAP) performance, as used in [102]. For TV-Human-Interaction, we report classification accuracy (ACC).

Table 6.1: Action anticipation comparison on TVSeries (cAP %) test set and THUMOS-14 (per-frame mAP %) test set at 1s (4 chunks) with two-stream features.
                             FC     EFC    ED     RED
TVSeries (cAP@T_a=1s %)      72.4   73.3   74.6   75.5
THUMOS-14 (mAP@T_a=1s %)     31.7   33.9   36.8   37.5

Comparison of action anticipation methods. We first compare the performance at anticipation time T_a = 1s. The results on TVSeries and THUMOS-14 are shown in Table 6.1. Overall, RED outperforms FC on both datasets. Comparing FC and EFC, we can see that encoding multiple history steps of representations improves the anticipation performance consistently. The difference between ED and EFC is the presence of the decoder networks: EFC uses fully connected layers to anticipate only one future representation, while ED uses decoder networks to anticipate a sequence of continuous representations. Comparing ED and EFC, we can see that anticipating the future representations step by step makes the anticipations more accurate. Comparing ED and RED, it can be seen that the proposed reinforcer benefits action anticipation.

Table 6.2: Action anticipation comparison (ACC %) on TV-Human-Interaction at T_a = 1s (4 chunks).
                 Vondrick et al. [88] (THUMOS)   RED-VGG (TVSeries)   RED-TS (THUMOS)
ACC@T_a=1s (%)   43.6                            47.5                 50.2

We compare RED with Vondrick et al. [88] on TV-Human-Interaction at anticipation time T_a = 1s, as shown in Table 6.2. Stage-1 of RED is trained on TVSeries (with VGG features) or THUMOS-14 (with two-stream features); stage-2 is trained on TV-Human-Interaction. Note that Vondrick et al. [88] trained the representation anticipation model on the THUMOS dataset and trained SVM classifiers for action anticipation. We can see that both RED-VGG and RED-TS outperform [88]. Note that [88] uses AlexNet features, so possibly some of our gain comes just from the use of stronger features. On the other hand, we use a much smaller subset of THUMOS (20 hours vs 400 hours) for training. Also, Table 6.1, which includes a comparison with our implementation of [88] using the same TS features as RED, does indicate that our network design is responsible for a significant part of the improvements.

Table 6.3: Detailed action anticipation (cAP %) comparison of ED and RED on TVSeries test set from T_a = 0.25s to T_a = 2.0s with two-stream representations and VGG features.
time       0.25s   0.5s   0.75s   1.0s   1.25s   1.5s   1.75s   2.0s
ED-VGG     71.0    70.6   69.9    68.8   68.0    67.4   67.0    66.7
RED-VGG    71.2    71.0   70.6    70.2   69.2    68.5   67.5    66.8
ED-TS      78.5    78.0   76.3    74.6   73.7    72.7   71.7    71.0
RED-TS     79.2    78.7   77.1    75.5   74.2    73.0   72.0    71.2

Table 6.4: Detailed action anticipation (per-frame mAP %) comparison of ED and RED on THUMOS-14 test set from T_a = 0.25s to T_a = 2.0s with two-stream representations.
time       0.25s   0.5s   0.75s   1.0s   1.25s   1.5s   1.75s   2.0s
ED-TS      43.8    40.9   38.7    36.8   34.6    33.9   32.5    31.6
RED-TS     45.3    42.1   39.6    37.5   35.8    34.4   33.2    32.1

Varying anticipation time. The encoder-decoder network allows for sequence anticipation, unlike FC or EFC, which can only anticipate at a fixed future time.
A detailed comparison between "ED" and "RED" is shown in Table 6.3 and Table 6.4. We test RED with two types of video representations on TVSeries: the two-stream model and VGG-16. On THUMOS-14, we report two-stream performance. The anticipation time ranges from 0.25s to 2s, which corresponds to 1 video chunk and 8 video chunks (each chunk contains 6 frames, and the frame extraction rate is 24). "TS" stands for two-stream. As shown in Table 6.3, the reinforcement module consistently improves the encoder-decoder networks for action anticipation at each step, for both two-stream features and VGG features. We think this is because the sequence-level supervision introduced by the designed reward function makes the optimization more effective. The reward function together with the cross-entropy loss not only trains the system to make the correct action anticipation at each single step, but also encourages it to produce the correct sequence prediction as early as possible. The results on THUMOS-14 are consistent with those on TVSeries: the sequence-level optimization from reinforcement learning consistently benefits action anticipation.

Comparison on online action detection. As discussed before, online action detection can be treated as a special case of action anticipation, where the anticipation time is T_a = 0. We use the results at the minimum anticipation time (T_a = 0.25s) for online action detection. As shown in Table 6.5, the previous state-of-the-art methods based on CNN, LSTM and Fisher Vectors (FV) in [11] achieved 60.8, 64.1 and 74.3 respectively. [11] uses VGG features for the "CNN" and "LSTM" methods, the same as our RED-VGG. RED-VGG achieves 71.2, outperforming 64.1 by a large margin. RED-TS achieves the highest performance, 79.2. Similar to TVSeries, we use the results at T_a = 0.25s as the results of online action detection on THUMOS-14. The results are shown in Table 6.6, and we can see that RED outperforms state-of-the-art methods.

Table 6.5: Comparison on online action detection on the TVSeries test set.
          CNN [11]   LSTM [11]   FV [11]   RED-VGG   RED-TS
cAP (%)   60.8       64.1        74.3      71.2      79.2

Table 6.6: Online action detection comparison on THUMOS-14 test set (per-frame mAP %) with two-stream features.
          two-stream [79]   LSTM [102]   MultiLSTM [102]   RED
mAP (%)   36.2              39.3         41.3              45.3

6.4 Conclusion

We propose Reinforced Encoder-Decoder (RED) networks for action anticipation. RED takes multiple history representations as input and learns to anticipate a sequence of future representations, which are fed into classification networks to categorize actions. The salient aspect of RED is that a reinforcement module is adopted to provide sequence-level supervision. The reward function is designed to encourage the system to make correct predictions as early as possible. RED is jointly optimized by the cross-entropy loss, the squared loss and the reward function via a two-stage training process. Experimental results on action anticipation show the effectiveness of the proposed reinforcement module and the encoder-decoder network.

Chapter 7
Temporal Activity Localization via Language Query

7.1 Introduction

Activities in the wild consist of diverse combinations of actors, actions and objects over various periods of time. Earlier work focused on classification of video clips that contained a single activity, i.e., where the videos were trimmed. Recently, there has also been significant work in localizing activities in longer, untrimmed videos [81, 52].
One major limitation of existing action localization methods is that they are restricted to pre-defined lists of actions. Although the lists of activities can be relatively large [7], they still have difficulty covering complex activity queries, for example, "A person runs to the window and then looks out.", as shown in Figure 7.1. Hence, it is desirable to use natural language queries to localize activities. Use of natural language not only allows for an open set of activities but also natural specification of additional constraints, including objects and their properties as well as relations between the involved entities. We propose the task of Temporal Activity Localization via Language (TALL): given a temporally untrimmed video and a natural language query, the goal is to determine the start and end times of the described activity inside the video.

Figure 7.1: Temporal activity localization via language query in an untrimmed video.

For traditional temporal action localization, most current approaches [81, 52, 77, 103, 107] apply activity classifiers trained with optical flow-based methods [89, 79] or Convolutional Neural Networks (CNNs) [80, 85] in a sliding window fashion. A direct extension to support natural language queries is to map the queries into a discrete label space. However, it is non-trivial to design a label space that has enough coverage for such activities without losing useful details in users' queries. To go beyond discrete activity labels, one possible solution is to embed visual features and sentence features into a common space [37, 42, 57]. However, for temporal localization of activities, it is unclear what a proper visual model for extracting visual features for retrieval is, and how to achieve high precision of the predicted start/end times. Although one could densely sample sliding windows at different scales, doing so is not only computationally expensive but also makes the alignment task more challenging, as the search space increases. An alternative to dense sampling is to adjust the temporal boundaries of proposals by learning regression parameters; such an approach has been successful for object localization, as in [71]. However, temporal regression has not been attempted in past work and is more difficult, as activities are characterized by a spatio-temporal volume, which may lead to more background noise.

These challenges motivate us to propose a novel Cross-modal Temporal Regression Localizer (CTRL) model that jointly models the text query, video clip candidates and their temporal context information to solve the TALL task. CTRL generates alignment scores along with location regression results for candidate clips. It utilizes a CNN model to extract visual features of the clips and a Long Short-Term Memory (LSTM) network to extract sentence embeddings. A cross-modal processing module is designed to jointly model the text and visual features; it calculates element-wise addition, element-wise multiplication and direct concatenation. Finally, multilayer networks are trained for visual-semantic alignment and clip location regression. We design non-parameterized and parameterized location offsets for temporal coordinate regression. In the parameterized setting, the length and the central coordinate of the clip are first parameterized by the ground truth length and coordinate. In the non-parameterized setting, the start and end coordinates are used directly.
We show that the non-parameterized setting works better, unlike the case for object boundary regression. To facilitate research on TALL, we also generate sentence temporal annotations for the Charades [78] dataset; we name the result Charades-STA. We evaluate our methods on the TACoS and Charades-STA datasets with the metric "R@n, IoU=m", which represents the percentage of queries for which at least one of the top-n results (start and end pairs) has IoU with the ground truth larger than m. Experimental results demonstrate the effectiveness of our proposed CTRL framework. In summary, our contributions are two-fold: (1) we propose a novel problem formulation of Temporal Activity Localization via natural Language (TALL) query; (2) we introduce an effective Cross-modal Temporal Regression Localizer (CTRL) which estimates alignment scores and temporal action boundaries by jointly modeling the language query and video clips. (Source code is available at https://github.com/jiyanggao/TALL.)

7.2 Methods

In this section, we describe our Cross-modal Temporal Regression Localizer (CTRL) for Temporal Activity Localization via Language (TALL) and its training procedure in detail. CTRL contains four modules: a visual encoder to extract features for video clips, a sentence encoder to extract embeddings, a multi-modal processing network to generate combined representations for the visual and text domains, and a temporal regression network to produce alignment scores and location offsets between the input sentence query and the video clips.

Figure 7.2: Cross-modal Temporal Regression Localizer (CTRL) architecture. CTRL contains four modules: a visual encoder to extract features for video clips, a sentence encoder to extract embeddings, a multi-modal processing network to generate combined representations for the visual and text domains, and a temporal regression network to produce alignment scores and location offsets.

7.2.1 Problem Formulation

We denote a video as V = {f_t}_{t=1}^{T}, where T is the number of frames in the video. Each video is associated with temporal sentence annotations A = {(s_j, τ_j^s, τ_j^e)}_{j=1}^{M}, where M is the number of sentence annotations of the video V and s_j is a natural language sentence describing a video clip, with τ_j^s and τ_j^e as its start and end times in the video. The training data are the sentence and video clip pairs. The task is to predict one or more (τ^s, τ^e) pairs for an input natural language sentence query.

7.2.2 CTRL Architecture

Visual Encoder. For a long untrimmed video V, we generate a set of video clips C = {(c_i, t_i^s, t_i^e)}_{i=1}^{H} by temporal sliding windows, where H is the total number of clips of the video V, and t_i^s and t_i^e are the start and end times of video clip c_i. We define the visual encoder as a function F_ve(c_i) that maps a clip c_i and its context to a feature vector f_v, whose dimension is d_s. Inside the visual encoder, a feature extractor E_v is used to extract clip-level feature vectors; its input is n_f frames and its output is a vector of dimension d_v. For one video clip c_i, we consider the clip itself (as the central clip) and its surrounding clips (as context clips) c_{i,q}, q ∈ [-n, n], where q is the clip shift and n is the shift boundary. We uniformly sample n_f frames from each clip (central and context clips).
The feature vector of the central clip is denoted as f_v^ctl. For the context clips, we use a pooling layer to calculate a pre-context feature f_v^pre = (1/n) \sum_{q=-n}^{-1} E_v(c_{i,q}) and a post-context feature f_v^post = (1/n) \sum_{q=1}^{n} E_v(c_{i,q}). The pre-context feature and post-context feature are pooled separately, as the end and the start of an activity can be quite different and both could be critical for temporal localization. f_v^pre, f_v^ctl and f_v^post are concatenated and then linearly transformed to the feature vector f_v with dimension d_s, which serves as the visual representation for clip c_i.

Sentence Encoder. A sentence encoder is a function F_se(s_j) that maps a sentence description s_j to an embedding space whose dimension is d_s (the same as the visual feature space). Specifically, a sentence embedding extractor E_s is used to extract a sentence-level embedding f'_s, followed by a linear transformation layer, which maps f'_s to f_s with dimension d_s, the same as the visual representation f_v. We experiment with two kinds of sentence embedding extractors: one is an LSTM network which takes a word as input at each step, with the hidden state of the final step used as the sentence-level embedding; the other is an off-the-shelf sentence encoder, Skip-thought [42]. More details are discussed in Section 7.3.

Multi-modal Processing Module. The inputs of the multi-modal processing module are a visual representation f_v and a sentence embedding f_s, which have the same dimension d_s. We use vector element-wise addition (+), vector element-wise multiplication (⊙) and vector concatenation (‖) followed by a Fully Connected (FC) layer to combine the information from both modalities. The addition and multiplication operations allow additive and multiplicative interactions between the two modalities and do not change the feature dimension. The FC layer allows interaction among all elements; its input dimension is 2 d_s and its output dimension is d_s. The outputs from all three operations are concatenated to construct a multi-modal representation f_sv = (f_s ⊙ f_v) ‖ (f_s + f_v) ‖ FC(f_s ‖ f_v), which is the input to our core module, the temporal localization regression network.

Temporal Localization Regression Networks. The temporal localization regression network takes the multi-modal representation f_sv as input and has two sibling output layers. The first one outputs an alignment score cs_{i,j} between the sentence s_j and the video clip c_i. The second one outputs clip location regression offsets. We design two kinds of location offsets. The first is the parameterized offset t = (t_p, t_l), where t_p and t_l are the parameterized central point offset and length offset respectively. The parameterization is as follows:

t_p = (p - p_c) / l_c,   t_l = log(l / l_c)    (7.1)

where p and l denote the clip's center coordinate and clip length respectively; the variables p and p_c are for the predicted clip and the test clip (likewise for l). The second is the non-parameterized offset t = (t_s, t_e), where t_s and t_e are the start and end point offsets:

t_s = s - s_c,   t_e = e - e_c    (7.2)

where s and e denote the clip's start and end coordinates respectively. Temporal coordinate regression can be thought of as clip location regression from a test clip to a nearby ground-truth clip; as the original clip could be either too tight or too loose, the regression process tends to find better locations.
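To illustrate the module above, here is a minimal sketch (assumptions, not the authors' released code) of the multi-modal fusion f_sv = (f_s ⊙ f_v) ‖ (f_s + f_v) ‖ FC(f_s ‖ f_v) followed by the two sibling heads, an alignment score and a pair of temporal offsets. The extra hidden layer before the heads is an assumption standing in for the "multilayer networks" mentioned in the text.

import torch
import torch.nn as nn

class CrossModalHead(nn.Module):
    def __init__(self, d_s=1000):
        super().__init__()
        self.fuse_fc = nn.Linear(2 * d_s, d_s)       # FC over the concatenation f_s || f_v
        self.hidden = nn.Linear(3 * d_s, d_s)        # joint layer over f_sv (assumed)
        self.align = nn.Linear(d_s, 1)               # alignment score cs_{i,j}
        self.regress = nn.Linear(d_s, 2)             # (t_s, t_e) or (t_p, t_l)

    def forward(self, f_v, f_s):                     # both: (batch, d_s)
        f_sv = torch.cat([f_s * f_v,                 # element-wise multiplication
                          f_s + f_v,                 # element-wise addition
                          self.fuse_fc(torch.cat([f_s, f_v], dim=1))], dim=1)
        h = torch.relu(self.hidden(f_sv))
        return self.align(h).squeeze(1), self.regress(h)

# toy usage
head = CrossModalHead()
score, offsets = head(torch.randn(4, 1000), torch.randn(4, 1000))
print(score.shape, offsets.shape)                    # torch.Size([4]) torch.Size([4, 2])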
7.2.3 CTRL Training

Multi-task Loss Function. CTRL contains two sibling output layers, one for alignment and the other for regression. We design a multi-task loss L over a mini-batch of training samples to jointly train for visual-semantic alignment and clip location regression:

L = L_aln + α L_reg    (7.3)

where L_aln is for visual-semantic alignment, L_reg is for clip location regression, and α is a hyper-parameter which controls the balance between the two task losses. The alignment loss encourages aligned clip-sentence pairs to have positive scores and misaligned pairs to have negative scores:

L_aln = (1/N) \sum_{i=0}^{N} [ α_c log(1 + exp(-cs_{i,i})) + \sum_{j=0, j≠i}^{N} α_w log(1 + exp(cs_{i,j})) ]    (7.4)

where N is the batch size, cs_{i,j} is the alignment score between sentence s_j and video clip c_i, and α_c and α_w are hyper-parameters which control the weights between positive (aligned) and negative (misaligned) clip-sentence pairs. The regression loss L_reg is calculated for the aligned clip-sentence pairs. A sentence annotation s_j contains start and end times (τ_j^s, τ_j^e); the aligned sliding window clip c_i has (t_i^s, t_i^e). The ground truth offsets t* are calculated from the start and end times.

L_reg = (1/N) \sum_{i=0}^{N} [ R(t*_{x,i} - t_{x,i}) + R(t*_{y,i} - t_{y,i}) ]    (7.5)

where x and y indicate p and l for parameterized offsets, or s and e for non-parameterized offsets, and R(t) is the smooth L1 function.

Sampling Training Examples. To collect training samples, we use multi-scale temporal sliding windows with [64, 128, 256, 512] frames and 80% overlap. (Note that, at test time, we only use coarsely sampled clips.) We use the following strategy to collect training samples T = {[(s_h, τ_h^s, τ_h^e), (c_h, t_h^s, t_h^e)]}_{h=0}^{N_T}. Each training sample contains a sentence description (s_h, τ_h^s, τ_h^e) and a video clip (c_h, t_h^s, t_h^e). For a sliding window clip c from C with temporal annotation (t^s, t^e) and a sentence description s with temporal annotation (τ^s, τ^e), we align them as a pair of training samples if they satisfy: (1) the Intersection over Union (IoU) is larger than 0.5; (2) the non-Intersection over Length (nIoL) is smaller than 0.2; and (3) one sliding window clip can be aligned with only one sentence description. The reason we use nIoL is that we want most of the sliding window clip to overlap with the assigned sentence; simply increasing the IoU threshold would harm the regression layers (regression aims to move a clip from low IoU to high IoU). As shown in Figure 7.3, although the IoU between c and s1 is about 0.5, if we assign c to s1 it will disturb the model, because c contains information about s2.
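The following is a small sketch of the IoU / nIoL test used to form training pairs and of the non-parameterized regression targets of Eq. (7.2). The thresholds follow the text; the data layout is an illustrative assumption.

def iou(clip, gt):
    inter = max(0.0, min(clip[1], gt[1]) - max(clip[0], gt[0]))
    union = max(clip[1], gt[1]) - min(clip[0], gt[0])
    return inter / union if union > 0 else 0.0

def niol(clip, gt):
    """Portion of the sliding window clip that lies outside the sentence annotation."""
    inter = max(0.0, min(clip[1], gt[1]) - max(clip[0], gt[0]))
    length = clip[1] - clip[0]
    return (length - inter) / length

def make_training_pair(clip, gt, iou_thresh=0.5, niol_thresh=0.2):
    """Return non-parameterized offsets (t_s, t_e) if the pair qualifies, else None."""
    if iou(clip, gt) > iou_thresh and niol(clip, gt) < niol_thresh:
        return (gt[0] - clip[0], gt[1] - clip[1])     # ground-truth start/end offsets
    return None

# toy usage: a 128-frame window against a sentence annotation (values in frames)
print(make_training_pair(clip=(100, 228), gt=(110, 240)))   # -> (10, 12)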
TALL needs clip-level sentence annotation: sentence descriptions with start and end time, which are not provided in the original Charades dataset. We noticed that the names of activity categories in Charades are parsed from the video-level descriptions, so many of activity names appear in descriptions. Another observation we make is that most descriptions in Charades share a similar syntactic structure: consisting of multiple sub-sentences, which are connected by comma, period and conjunctions, such as \then", \while", \after", \and". For example, \A person is sitting down by the door. They stand up and start carefully leaving some dishes in the sink". Based on these observations, we designed a semi-automatic way to generate sentence temporal annotation. The rst step is sentence decomposition: a long sentence is split to sub-sentences by a set of conjunctions (which are collected by hand ), and for each sub- sentence, the subject ( parsed by Stanford CoreNLP [55] ) of the original long sentence is added to start. The second step is keyword matching: we extract keywords for each activity categories and match them to sub-sentences, if they are matched, the temporal annotation (start and end time) are assigned to the sub-sentences. The third step is a human check: for each pair of sub-sentence and temporal annotation, we (two of the co- authors) checked whether the sentence made sense and whether they matched the activity annotation. An example is shown in Figure 7.4. 61 Sit Stand Up Video ActivityAnnotation Sentence Sub-sentences decomposition Sub Sub keyword matching Aperson is sitting down by the door. They stand up and start carefully leaving some dishes in the sink. Sub 0:A person is sitting down by the door. Sub 1:They stand up. Sub 2:Theystart carefully leaving some dishes in the sink [2.4, 4.2]: Stand up [1.3, 2.4]: Sit Sentence ActivityAnnotation STA Sub-Sentences [1.3, 2.4]: Aperson is sitting down by the door [2.4, 4.2]: They stand up. Figure 7.4: Charades-STA construction. Although TACoS and Charades-STA are challenging, their lengths of queries are limited to single sentences. To explore the potential of CTRL framework on handling longer and more complex sentences, we build a complex sentence set. Inside each video, we connect consecutive sub-sentences to make complex query, each complex query contains at least two sub-sentences, and is checked to make sure that the time span is less than half of the video length. We use them for test purpose only. In total, there are 13898 clip-sentence pairs in Charades-STA training set, 4233 clip-sentence pairs in test set and 1378 complex sentence quires. On average, there are 6.3 words per non-complex sentence, and 12.4 words per complex sentence. 7.3.2 Experiment Settings We will introduce evaluation metric, baseline methods and our system variants in this part. 7.3.2.1 Evaluation Metric We adopted a similar metric used by [31] to compute \R@n, IoU=m", which means that the percentage of at least one of the top-n results having Intersection over Union (IoU) larger than m. This metric itself is on sentence level, so the overall performance is the average among all the sentences. R(n;m) = 1 N P N i=1 r(n;m;s i ), where r(n;m;s i ) is the recall for a query s i , N is total number of queries and R(n;m) is the averaged overall performance. 
7.3.2.2 Baseline Methods

We consider two sentence-based image/video retrieval baseline methods: visual-semantic alignment with LSTM (VSA-RNN) [36] and visual-semantic alignment with Skip-thought vectors (VSA-STV) [42]. For these two baseline methods, we use the same training samples and test sliding windows as for CTRL.

VSA-RNN. This baseline method is similar to the model in DVSA [36]. We use a regular LSTM instead of a BRNN to encode the input description. The hidden state size of the LSTM is 1024 and the output size is 1000. Video clips are processed by a C3D network pre-trained on Sports1M [37]. The 4096-dimensional fc6 vector is extracted and linearly transformed to 1000 dimensions, which is used as the clip-level feature. Cosine similarity is used to calculate the confidence score between the clip and the sentence. Hinge loss is used to train the model. At test time, we compute the alignment score between the input sentence query and all the sliding windows in the video.

VSA-STV. Instead of using an RNN to extract the sentence embedding, we use an off-the-shelf Skip-thought [42] sentence embedding extractor. A Skip-thought vector is 4800-dimensional; we linearly transform it to 1000 dimensions. The visual encoder is the same as for VSA-RNN.

Verb and Object Classifiers. We also implement baseline methods based on annotations of pre-defined actions and objects. The TACoS dataset also contains pre-defined action and object annotations at clip level. These object and action annotations come from the original MPII-Compositive dataset [73]. 54 categories of actions and 81 categories of objects are involved in the training set. We use the same C3D features as above to train action classifiers and object classifiers. Each classifier is based on a 2-layer fully connected network: the size of the first layer is 4094 and the size of the second layer is the number of categories. The test sentences are parsed by Stanford CoreNLP [55], and verb-object (VO) pairs are extracted using the sentence dependencies. The VO pairs are matched with the action and object annotations based on string matching. The alignment score between a sentence query and a clip is the score of the matched action and object classifier responses. "Verb" means that we only use the action classifiers; "Verb+Obj" means that both action classifiers and object classifiers are used.

7.3.2.3 System Variants

We experiment with variants of our system to test the effectiveness of our method. CTRL(aln): we do not use regression and train CTRL with only the alignment loss L_aln. CTRL(reg-p): CTRL is trained with the alignment loss L_aln and the parameterized regression loss L_reg-p. CTRL(reg-np): context information is considered and CTRL is trained with the alignment loss L_aln and the non-parameterized regression loss L_reg-np. CTRL(loc): SCNN [77] proposed to use an overlap loss to improve activity localization performance. On top of our pure alignment model (without regression), we implement a similar loss function considering clip overlap as in SCNN:

L_loc = \sum_i 0.5 ( (1/(1 + e^{-cs_{i,i}}))^2 / IoU_i - 1 )

where cs_{i,i} and IoU_i are respectively the alignment score and the Intersection over Union (IoU) between the aligned pairs of sentence and clip in a mini-batch. The major difference is that SCNN solves a classification problem, so it uses Softmax scores, while in our case we consider an alignment problem. The overall loss function is L_scnn = L_aln + L_loc. For this method, we use C3D as the visual encoder and Skip-thought as the sentence encoder.
Figure 7.5: Performance comparison of different visual encoders (C3D, VGG+Mean Pooling, LRCN); Recall@1/5/10 against IoU.

7.3.3 Experiments on TACoS

In this part, we discuss the experimental results on TACoS. First, we compare the performance of different visual encoders; second, we compare two sentence embedding methods; third, we compare the performance of the CTRL variants and the baseline methods. The length of the sliding windows for test is 128 with overlap 0.8; multi-scale windows are not used. We empirically set the context clip number n to 1 and the length of the context window to 128 frames. The dimensions of f_v, f_s and f_sv are all set to 1000. We set the batch size to 64; the networks are optimized by the Adam [41] optimizer on an Nvidia TITAN X GPU.

Comparison of visual features. We consider three clip-level visual encoders: C3D [85], LRCN [12] and VGG+Mean Pooling [37]. Each of them takes a clip with 16 frames as input and outputs a 1000-dimensional feature vector. For C3D, the fc6 feature vector is extracted and then linearly transformed to 1000 dimensions. For LRCN and VGG pooling, we extract fc6 of VGG-16 for each frame; the LSTM's hidden state size is 256. We use Skip-thought as the sentence embedding extractor and the other parts of the model are the same as CTRL(aln). There are three groups of curves, Recall@10, Recall@5 and Recall@1 respectively, shown in Figure 7.5. We can see that C3D performs generally better than the other two methods. LRCN's performance is inferior; the reason may be that the dataset is relatively small and not sufficient to train the LSTM well.

Comparison of sentence embedding. For the sentence encoder, we consider two commonly used methods: word2vec+LSTM [31] and Skip-thought [42]. In our implementation of word2vec, we train a skip-gram model on the English dump of Wikipedia. The dimension of the word vectors is 500 and the hidden state size of the LSTM is 512. For the Skip-thought vector, we linearly transform it from 4800 dimensions to 1000 dimensions. We use C3D as the visual feature extractor and the other parts are the same as CTRL(aln). From the results, shown in Figure 7.6, we can see that the performance of Skip-thought is generally better than word2vec+LSTM. We conjecture the reason is that the scale of TACoS is not large enough to train the LSTM well (compared with the counterpart datasets in object detection, such as ReferIt [38] and Flickr30k Entities [66], which contain over 100k sentences).

Figure 7.6: Performance comparison of different sentence embeddings (Recall@1 against IoU).

Comparison with other methods. We test our system variants and the baseline methods on TACoS and report the results for IoU ∈ {0.1, 0.3, 0.5} and Recall@{1, 5}. The results are shown in Table 7.1. "Random" means that we randomly select n windows from the test sliding windows and evaluate Recall@n with IoU=m. All methods use the same C3D features. VSA-RNN uses the end-to-end trained LSTM as the sentence encoder and all other methods use pre-trained Skip-thought as the sentence embedding extractor. We can see that the visual retrieval baselines (i.e. VSA-RNN, VSA-STV) lead to inferior performance, even compared with our pure alignment model CTRL(aln).
We believe the major reasons are two-fold: 1) the multilayer alignment network learns better alignment than the simple cosine similarity model, which is trained with a hinge loss; 2) visual retrieval models do not encode temporal context information in a video. Pre-defined classifiers also produce inferior results. We think this is mainly because the pre-defined actions and objects are not precise enough to represent sentence queries. By comparing Verb and Verb+Obj, we can see that additional object information (such as "knife", "egg") helps to represent sentence queries.

Temporal action boundary regression. As described before, we implemented a temporal localization loss function similar to the one in SCNN [77], which considers clip overlap. Experimental results show that CTRL(loc) does not bring much improvement over CTRL(aln), perhaps because CTRL(loc) still relies on clip selection from sliding windows, which may not overlap with the ground truth well. CTRL(reg-np) outperforms CTRL(aln) and CTRL(loc) significantly, showing the effectiveness of the temporal regression model. By comparing CTRL(reg-p) and CTRL(reg-np) in Table 7.1, it can be seen that the non-parameterized setting helps the localizer regress the action boundary to a more accurate location. We think the reason is that, unlike objects, which can be re-scaled in images due to camera projection, actions' time spans cannot be easily rescaled in videos (we do not consider slow motion and fast motion). Thus, to do boundary regression effectively, object bounding box coordinates should first be normalized to some standard scale, but for actions, time itself is the standard scale. Some prediction and regression results are shown in Figure 7.7. We can see that the alignment prediction gives a coarse location, which is limited by the fixed window length; the regression model helps to refine the clip's temporal boundaries to a higher-IoU location.

Table 7.1: Comparison of different methods on TACoS.
Method          R@1 IoU=0.5   R@1 IoU=0.3   R@1 IoU=0.1   R@5 IoU=0.5   R@5 IoU=0.3   R@5 IoU=0.1
Random          0.83          1.81          3.28          3.57          7.03          15.09
Verb            1.62          2.62          6.71          3.72          6.36          11.87
Verb+Obj        8.25          11.24         14.69         16.46         21.50         26.60
VSA-RNN         4.78          6.91          8.84          9.10          13.90         19.05
VSA-STV         7.56          10.77         15.01         15.50         23.92         32.82
CTRL (aln)      10.67         16.53         22.29         19.44         29.09         41.05
CTRL (loc)      10.70         16.12         22.77         18.83         31.20         45.11
CTRL (reg-p)    11.85         17.59         23.71         23.05         33.19         47.51
CTRL (reg-np)   13.30         18.32         24.32         25.42         36.69         48.73

Table 7.2: Comparison of different methods on Charades-STA.
Method          R@1 IoU=0.5   R@1 IoU=0.7   R@5 IoU=0.5   R@5 IoU=0.7
Random          8.51          3.03          37.12         14.06
VSA-RNN         10.50         4.32          48.43         20.21
VSA-STV         16.91         5.81          53.89         23.58
CTRL (aln)      18.77         6.53          54.29         23.74
CTRL (loc)      20.19         6.92          55.72         24.41
CTRL (reg-p)    22.27         8.46          57.83         26.61
CTRL (reg-np)   23.63         8.89          58.92         29.52

7.3.4 Experiments on Charades-STA

In this part, we evaluate the CTRL models and the baseline methods on Charades-STA and report the results for IoU ∈ {0.5, 0.7} and Recall@{1, 5}, which are shown in Table 7.2. The lengths of the sliding windows for test are 128 and 256, with window overlap 0.8. It can be seen that the results are consistent with those on TACoS. CTRL(reg-np) shows a significant improvement over CTRL(aln) and CTRL(loc). The non-parameterized setting (CTRL(reg-np)) works consistently better than the parameterized setting (CTRL(reg-p)). Figure 7.8 shows some prediction and regression results. We also test complex sentence queries on Charades-STA. As shown in Table 7.3, "CTRL" means that we simply input the whole complex sentence into the CTRL model.
"CTRL+Fusion" means that we input each sentence of a complex query separately into CTRL, and then do a late fusion. Specifically, we compute the average alignment score over all sentences, and take the minimum of all start times and the maximum of all end times as the start and end time of the complex query.

Table 7.3: Experiments with complex sentence queries.
Method        R@1 IoU=0.5   R@1 IoU=0.7   R@5 IoU=0.5   R@5 IoU=0.7
Random        11.83         3.21          43.28         18.17
CTRL          24.09         8.03          69.89         32.28
CTRL+Fusion   25.82         8.32          69.94         32.81

Figure 7.7: Alignment prediction and regression refinement examples in TACoS. The row with gray background shows the ground truth for the given query; the row with blue background shows the sliding window alignment results; the row with green background shows the clip regression results.

Figure 7.8: Alignment prediction and regression refinement examples in Charades-STA.

Although the random performance in Table 7.3 (complex) is higher than that in Table 7.2 (single), the gain over the random performance remains similar, which indicates that CTRL is able to handle complex queries consisting of multiple activities well. Comparing CTRL and CTRL+Fusion, we can see that CTRL could be an effective first step for complex queries, if combined with other fusion methods. In general, we observed two types of common hard cases: (1) long query sentences increase the chances of failure, likely because the sentence embeddings are not discriminative enough; (2) videos that contain similar activities but with different objects (e.g. in the TACoS dataset, "put a cucumber on the chopping board" and "put a knife on the chopping board") are hard to distinguish from each other.

7.4 Conclusion

We addressed the problem of Temporal Activity Localization via Language (TALL) and proposed a novel Cross-modal Temporal Regression Localizer (CTRL) model, which uses temporal regression for activity location refinement. We showed that non-parameterized offsets work better than parameterized offsets for temporal boundary regression. Experimental results show the effectiveness of our method on TACoS and Charades-STA. In the future, we will work on complex sentence queries for composite activities.

Chapter 8
Motion-Appearance Co-Memory Networks for Video Question Answering

8.1 Introduction

Understanding video temporal structure is an important topic in computer vision. To achieve this goal, various tasks have been proposed, such as temporal action localization [77, 21], action anticipation [22] and video prediction [87]. Besides these tasks, video Question Answering (QA) [34, 84] is another challenging task, which requires not only the understanding of video temporal structure but also joint reasoning over videos and texts. In this chapter, we tackle the problem of video QA. Image and text question answering have achieved much progress recently.
The success comes in part from the application of attention mechanisms [100, 51] and memory mechanisms [44] in deep neural networks. Attention mechanisms tell the neural network "where to look", while memory mechanisms refine answers over multiple reasoning cycles. Video QA is different from image QA [54, 51] in two aspects: (1) the questions are more about temporal reasoning over the videos, e.g. motion transitions and action counting, than about spatial attributes such as colors and spatial locations, which requires effective temporal representation modeling; (2) the input source is a sequence of images rather than a single image, which contains richer information to "remember", not only in quantity but also in variety (appearance, motion, transition), and this makes the reasoning process more complicated.

Figure 8.1: Answering questions in videos involves both motion and appearance analysis, and usually requires multiple cycles of reasoning, especially for transitive questions. For example, for "What does the woman do after look uncertain?", we need to first localize when the woman looks uncertain, which requires motion evidence for looking uncertain and appearance evidence for the woman, and then focus on what the woman does (smile).

Dynamic Memory Networks (DMN) [44, 94] were originally proposed for text and image question answering. A DMN contains a memory module that encodes the input sources over multiple cycles and an attention mechanism that allows the reading process to focus on different contents in each cycle. Although DMN contains an input module and a memory module that are able to read and remember long sequences of information, which makes it applicable to videos, directly applying such a method to the video QA task would not give satisfying results, because it lacks motion analysis (especially joint analysis of motion and appearance in videos) and temporal modeling. To strengthen the memory mechanism, Na et al. [61] proposed a read-write memory network that jointly encodes movie appearance and caption content; however, it lacks motion analysis and dynamic memory updates. Xu et al. [97] exploited appearance and motion via gradually refined attention, where the motion and appearance features are fused together.

We observe two unique attributes of answering questions in videos. The first is that the motion and appearance information are usually correlated with each other in the reasoning process. For example, in answering the question "what does the woman do after look uncertain?" as shown in Figure 8.1, we need to first localize the "the woman looks uncertain" action, which requires motion evidence for looking uncertain and appearance evidence for the woman; after that, we need to ignore the man's interval, and then focus on what the woman does (smile). Appearance and motion information are both involved in the reasoning process and provide attention cues to each other. The second attribute is that different types of questions may require representations built from different numbers of frames; for example, "what is the color of the bulldog?" needs only a single frame to produce the answer, while "How many times does the cat lick?" needs an understanding of the whole video. Based on these observations, we propose a motion-appearance co-memory network for video QA. Our model is built on concepts of DMN/DMN+ [44, 94], so we share the same terms with DMN [44], such as facts, memory and attention.
Specifically, a video is converted to a sequence of motion and appearance features by the two-stream models [95]. The motion and appearance features are then fed into a temporal convolutional and deconvolutional neural network to build multi-level contextual facts, which have the same temporal resolution but represent different contextual information. These contextual facts are used as the input facts to the memory networks. The co-memory networks hold two separate memory states, one for motion and one for appearance. To jointly model and interact with the motion and appearance information, we design a co-memory attention mechanism that takes motion cues for appearance attention generation, and appearance cues for motion attention generation. Based on these attentions, we design a dynamic fact ensemble method to produce temporal facts dynamically at each cycle of fact encoding. We evaluate our model on the TGIF-QA dataset [34], and outperform the state-of-the-art performance significantly on all four tasks in TGIF-QA. The novelty of our method is three-fold compared with DMN/DMN+ [44, 94]: (1) We design a co-memory attention mechanism to jointly model motion and appearance information. (2) We use temporal conv-deconv networks to build multi-level contextual facts for video QA. (3) We introduce a method called dynamic fact ensemble to dynamically produce temporal facts in each cycle of fact encoding. In the following, we first introduce related work, and then outline the DMN/DMN+ framework. In Section 8.3, we present our motion-appearance co-memory network in detail, and in Section 8.4, we show the evaluation of our method on TGIF-QA.

8.2 General Dynamic Memory Networks

As our work is closely related to DMN [44, 94], we begin by introducing the general framework of DMN. It contains four distinct modules: an input module, a question module, an episodic memory module and an answer module, as shown in Figure 8.2.

Figure 8.2: General Dynamic Memory Network (DMN) [44] architecture. The memory update process for the t-th cycle is: (1) the facts F are encoded by an attention-based GRU in the episodic memory module, where the attention is generated by the last memory m^{t-1}; (2) the final hidden state of the GRU is called the contextual vector c^t, which is used to update the memory m^t together with the question embedding q. The question answer is generated from the final memory state m^T.

Fact module. The fact module converts the input data (e.g. text, image, video) into a set of vectors called facts, denoted as F = [f_1, f_2, ..., f_L], where L is the total number of facts. For text-based QA, [44] used a Gated Recurrent Unit (GRU) to encode all text information; for image-based QA, [94] adopted a bi-directional GRU to encode the local region visual features into globally-aware facts.

Question module. The question module converts the question into an embedding q. Specifically, [44, 94] used a GRU to encode the question sentence and used the final hidden state of the GRU as the question embedding.

Episodic memory module. Episodic memory is designed to retrieve the relevant information from the facts. To extract information related to the questions from the facts more effectively, especially when transitive reasoning is required by the questions, the episodic memory module iterates over the input facts for multiple cycles, and updates the memory after each cycle.
There are two important mechanisms in the episodic memory module: an attention mechanism and a memory update mechanism. Suppose that the updated memory after the t-th cycle is m^t, the fact set is F = [f_1, f_2, ..., f_L], and the question embedding is q; then the attention gate g_i^t is given by

g_i^t = F_a(f_i, m^{t-1}, q)    (8.1)

where F_a is an attention function which takes the fact vector f_i at step i, the memory m^{t-1} at cycle t-1 and the question q as inputs, and outputs a scalar value g_i^t, which represents the attention value for the fact f_i in cycle t. To effectively use the ordering and positional information in videos, an attention-based GRU is designed. Instead of using the original update gate in the GRU, the attention gate g_i^t is used; the update equation for the modified GRU is

h_i = g_i^t * h̃_i + (1 - g_i^t) * h_{i-1}    (8.2)

The final hidden state of the attention-based GRU is used as the contextual feature c^t for updating the episodic memory m^t. Together with the question embedding q and the memory for cycle t-1, the t-th cycle memory is updated by

m^t = F_m(m^{t-1}, c^t, q)    (8.3)

where F_m is a memory update function. The final memory m^T is passed to the answer module to generate the final answers, where T is the number of memory update cycles.

Answer module. The answer module takes both q and m^T to generate the model's predicted answer. Different answer decoders may be applied for different tasks, e.g. a softmax output layer for single-word answers.

8.3 Motion-Appearance Co-Memory Networks

In this section, we present our motion-appearance co-memory networks, including the multi-level contextual facts, the co-memory module and the answer module. The question module remains the same as the one in traditional DMN.

Figure 8.4: The co-memory attention module extracts useful cues from both appearance and motion memories to generate attentions ga^t/gb^t for motion and appearance separately. The dynamic fact ensemble takes the multi-layer contextual facts A_L/B_L and the attention scores ga^t/gb^t to construct the ensembled facts, which are encoded by an attention-based GRU. The final hidden state c_b^t/c_a^t of the GRU is used to update the memory m_b^t/m_a^t. The final output memory m_h is the concatenation of the motion and appearance memories, and it is used to generate answers.

8.3.1 Multi-level Contextual Facts

The videos are cut into small units [23] (a unit is a short sequence of frames). For each video unit, we use two-stream CNN models [95] to extract unit-level motion and appearance features. More feature pre-processing details are given in Section 8.4. The sequences of unit-level appearance features and motion features are represented as {a_i} and {b_i} respectively. To build multiple levels of temporal representations, where each level represents different contextual information, we use temporal convolutional layers to model the temporal contextual information and de-convolutional layers to recover the temporal resolution, as shown in Figure 8.3.

Figure 8.3: The input temporal representations are processed by temporal conv-deconv layers to build multi-layer contextual facts, which have the same temporal resolution but different contextual information.
Specifically, the lowest level feature sequence is built directly from the unit features: A_L^1 = {a_i}, B_L^1 = {b_i}. The convolutional layers compute a feature hierarchy consisting of temporal feature sequences at several scales with a scaling step of 2: F_L^1, F_{L/2}^2, F_{L/4}^3, ..., as shown in Figure 8.3. Note that F could be A (for appearance features) or B (for motion features). The de-convolutional pathway hypothesizes higher-resolution features F_L^2, F_L^3 by upsampling the temporally coarser, but semantically stronger, feature sequences. Thus, F_L^1, F_L^2 and F_L^3 have the same resolution but different temporal contextual coverage. Note that we only show 3 levels in Figure 8.3; more levels could be modeled by adding more convolutional and de-convolutional layers. F_L = {F_L^1, F_L^2, ..., F_L^N} is termed the contextual facts.

8.3.2 Motion-Appearance Co-Memory Module

In this part, we introduce the co-memory attention mechanism and the dynamic fact ensemble method.

Co-memory attention. The questions in video QA usually involve both appearance and motion. Appearance usually provides useful cues for motion attention, i.e. guides the focus on motion content, and vice versa. To allow interaction between appearance and motion, we design a co-memory attention mechanism. Specifically, two separate memory modules are used to hold the motion memory m_b^t and the appearance memory m_a^t, where t is the index of the memory update cycle. As indicated before, when the networks read motion facts to update the motion memory, the appearance memory provides useful cues to generate attentions; the motion memory is likewise helpful for updating the appearance attention. Therefore, m_b^{t-1} and m_a^{t-1} are both used to generate attentions for motion and appearance fact encoding in the t-th cycle. As we build multiple levels of facts, we generate an attention score for each fact vector at each level. The motion attention gate for fact b_j^i is gb_{i,j}^t and the appearance attention gate for fact a_j^i is ga_{i,j}^t, where t is the cycle index, i is the level of the fact representation and j is the temporal step of the facts.

za_{i,j}^t = tanh( W_a^2 a_j^i + W_a^1 [m_a^{t-1}; q] )
ga_{i,j}^t = W_a^4 za_{i,j}^t + W_a^3 [m_b^{t-1}; q]    (8.4)

zb_{i,j}^t = tanh( W_b^2 b_j^i + W_b^1 [m_b^{t-1}; q] )
gb_{i,j}^t = W_b^4 zb_{i,j}^t + W_b^3 [m_a^{t-1}; q]    (8.5)

where W_a^1, W_a^2, W_a^3, W_a^4, W_b^1, W_b^2, W_b^3 and W_b^4 are weight parameters. ga_{i,j}^t and gb_{i,j}^t are the attentions used in the dynamic fact ensemble and the memory update.

Dynamic fact ensemble. As shown in Section 8.3.1, we build a multi-layer contextual fact set F_L = {F_L^1, F_L^2, ..., F_L^N} for motion and appearance separately; the levels have the same temporal resolution but represent different contextual information. There are two reasons that the facts should be selected dynamically: (1) different types of questions may require different levels of representation, e.g. the "bulldog color" and the "cat lick" questions given in Section 8.1; (2) during the multiple cycles of fact reading, each cycle may focus on a different level of information. We designed an attention-based fact ensemble method, shown in Figure 8.5. For simplicity, we use g_{i,j}^t to represent the attention gate, which is actually ga_{i,j}^t for appearance and gb_{i,j}^t for motion. We calculate a Softmax over g_{i,j}^t along the level axis (i.e. i) to get attention scores s_{i,j}^t. The ensembled facts can be represented as

F_s^t : { f_j^t = Σ_{i=0}^{N} s_{i,j}^t f_j^i }_{j=1}^{L}    (8.6)
where f_j^i is the fact vector of level i and step j in the contextual facts F_L. The attention scores used in the later fact encoding process are given by

s_j^t = softmax( (1/N) Σ_{i=0}^{N} g_{i,j}^t ),  j = 1, 2, ..., L    (8.7)

where the softmax is computed along the j axis.

Figure 8.5: Multi-layer contextual facts are dynamically constructed via a soft attention fusion process, which computes a weighted average of the facts according to the attention.

Memory update. The fact encoding processes are conducted separately for motion and appearance; each adopts an attention-based GRU [94] to generate the contextual vectors c_a^t and c_b^t for appearance and motion in the t-th cycle. The motion memory m_b^t and the appearance memory m_a^t are updated separately as follows:

m_a^t = FC([m_a^{t-1}; q; c_a^t])    (8.8)
m_b^t = FC([m_b^{t-1}; q; c_b^t])    (8.9)

where FC denotes a fully-connected layer with ReLU as the non-linear activation. The final output memory m_h is the concatenation of m_a^T and m_b^T, where T is the number of cycles.

8.3.3 Answer Module

Following [34], we model the four tasks in TGIF-QA [34] as three different types: multiple-choice, open-ended numbers and open-ended words. For multiple-choice, we use a linear regression function that takes the memory state m_h and outputs a real-valued score for each answer candidate:

s = W_m^T m_h    (8.10)

where W_m are weight parameters. The model is optimized by a hinge loss between the scores for correct answers s_p and the scores for incorrect answers s_n: max(0, 1 + s_n - s_p). This decoder is used to solve the repetition action and state transition tasks. For open-ended numbers, we also use a linear regression function which takes the memory state m_h and outputs an integer-valued answer:

s = [W_n^T m_h + b]    (8.11)

where [·] denotes rounding. We adopt an l2 loss between the ground-truth value and the predicted value to train the model, which is used to solve the repetition count task. For open-ended words, we treat the task as a classification problem. A linear function that takes the final memory state m_h, followed by a softmax layer, is adopted to generate answers:

o = softmax(W_w^T m_h + b)    (8.12)

where W_w are weight parameters and b is a bias. Cross-entropy loss is used to train the model, and this type of decoder is used in the frame QA task. For each task, we train a separate model with the answer decoder and loss mentioned above. The model for each task is trained and evaluated individually.

8.4 Evaluation

In this section, we describe the dataset and evaluation settings, and discuss the experimental results.

8.4.1 Dataset

We evaluate the proposed model on the TGIF-QA dataset [34], which is a large-scale dataset introduced by Jang et al. for video QA. The dataset consists of 165k QA pairs collected from 71k animated Tumblr GIFs. There are four types of tasks: repetition count, repetition action, state transition and frame QA. The first three tasks are unique to videos and require temporal reasoning to answer them.

Table 8.1: Number of samples of the different tasks in the TGIF-QA dataset.
# QA pairs   Action   Trans    Count    Frame
Training     20,475   52,704   26,843   39,392
Testing      2,274    6,232    3,554    13,691
Total        22,749   58,936   30,397   53,083

Tasks. Repetition count is an open-ended task to count the number of repetitions of an action (e.g. "How many times does the cat lick?"). There are 11 possible answers (i.e. from 0 to 10+) in total. Repetition action is a 5-option multiple-choice task, which asks for the name of the action that happened a specific number of times (e.g. "what does the duck do 3 times?").
State transition is also a 5-option multiple-choice task, which can be answered by understanding the transition between two states in a video (e.g. "What does the woman do after drink water?"). Besides these, TGIF-QA also provides a traditional frame QA task (i.e. image QA). The image QA questions of previous datasets [3, 70, 53] can be answered by extracting effective information from a single given image; but for frame QA in the TGIF-QA dataset, the model needs to find the most relevant frame among all frames in the video to answer the question correctly. Frame QA is defined as an open-ended task. The numbers of QA pairs of TGIF-QA for the four tasks are shown in Table 8.1.

Metric. For the task of repetition count, the Mean Square Error (MSE) between the predicted count value and the ground-truth count value is used for evaluation. For repetition action, state transition and frame QA, classification accuracy (ACC) is used as the evaluation metric.

8.4.2 Implementation Details

Appearance and motion features. Since the frames per second (FPS) of the GIFs in TGIF-QA [34] vary, we extract frames from all GIFs with the FPS that is specified by the corresponding GIF file. The long videos are cut into small units; each unit contains 6 frames. To extract unit-level video features, we use ResNet-152 [26] to process the central frame of a unit, and the output of the "pool5" layer (∈ R^2048) of ResNet-152 is used as our appearance feature. To utilize motion information, we extract optical flow inside a video unit, and use the flow CNN from the two-stream model [95] to get unit-level flow features. Specifically, the two-direction dense optical flows [14], which are calculated between adjacent frames in a six-consecutive-frame unit, are fed into the pre-trained flow CNN model, which is a BN-Inception network [33]. Then we take the feature map of the "global pool" layer (∈ R^1024) as the raw optical flow features. Finally, we down-sample the feature dimension by average pooling and get a 2048-dimensional vector as our two-direction optical flow feature. In this process, we pad the first or last frame if there are not enough frames centered at a step. We set the temporal resolution of the video features to 34; long feature sequences are cut and short ones are padded.

Contextual facts. The output channel number of each layer in the conv-deconv networks is 1024, the temporal conv filter size is 3 with stride 1, the deconv layers use stride 2, and the max pool filter size is 2 with stride 2. We build N = 3 layers of contextual facts.

Co-memory module. The sizes of the memory states m_a and m_b are set to 1024. The hidden state size of the GRU for fact encoding is 512. za_{i,j}^t and zb_{i,j}^t in equations (8.4) and (8.5) are 512-dimensional.

Question and answer embedding. For each word in the question, we use a pre-trained word embedding model [65] to convert it to a 300-dimensional vector. All words in the question are processed by a two-layer GRU, whose hidden state size is 512. The final hidden state is used as the question embedding. For state transition and repetition action, the candidate answers are sequences of words, so we use the same method as the one for encoding questions to encode the answers.

Training details. We set the batch size to 64. The Adam optimizer [41] is used to optimize the model, with the learning rate set to 0.001. For each task, we train the model for 50 epochs.
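To make the conv-deconv configuration above concrete, here is a minimal PyTorch-style sketch of a three-level contextual fact module; the class and variable names, the padding choices and the assumption that the sequence length is a multiple of 4 are ours, so treat it as an illustration rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ContextualFacts(nn.Module):
    """Three-level temporal conv-deconv sketch (1024 channels, conv filter 3 / stride 1,
    max-pool 2 / stride 2, deconv stride 2), assuming the input length is a multiple of 4."""
    def __init__(self, in_dim=2048, dim=1024):
        super().__init__()
        self.conv1 = nn.Conv1d(in_dim, dim, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv1d(dim, dim, kernel_size=3, stride=1, padding=1)
        self.conv3 = nn.Conv1d(dim, dim, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)
        self.up2 = nn.ConvTranspose1d(dim, dim, kernel_size=2, stride=2)   # L/2 -> L
        self.up3 = nn.Sequential(                                          # L/4 -> L
            nn.ConvTranspose1d(dim, dim, kernel_size=2, stride=2),
            nn.ConvTranspose1d(dim, dim, kernel_size=2, stride=2))
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (batch, feature_dim, L) unit-level appearance or motion features
        f1 = self.relu(self.conv1(x))                # level 1, length L
        f2 = self.relu(self.conv2(self.pool(f1)))    # level 2, length L/2
        f3 = self.relu(self.conv3(self.pool(f2)))    # level 3, length L/4
        # de-convolution recovers the temporal resolution, so all levels share length L
        return f1, self.up2(f2), self.up3(f3)

# Example: a1, a2, a3 = ContextualFacts()(torch.randn(8, 2048, 32))
```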
8.4.3 System Baselines

Besides the co-memory networks, there are two direct methods to make use of the motion and appearance information: fact concatenation and memory concatenation, which are used as system baselines.

Fact concatenation. This baseline method simply concatenates the input motion facts and appearance facts, {b_i} and {a_i}, along the feature dimension. The concatenated vectors {h_i}, which are (d_b + d_a)-dimensional, are used as the input facts for the multi-level contextual fact module. Only one memory module is used.

Memory concatenation. In this baseline method, instead of concatenating the input facts, we use two separate memory modules, one for appearance and the other for motion, and concatenate the final motion memory state m_b^T and the final appearance memory state m_a^T; the concatenated vector is used to decode answers. The co-memory attention mechanism is not used in this baseline.

8.4.4 Experiments on TGIF-QA

We first evaluate the co-memory attention module by comparing it with the two baseline methods, "fact concatenation" and "memory concatenation". Second, we evaluate the multi-level contextual facts and the dynamic fact ensemble. Finally, we compare our method with the previous state-of-the-art methods.

Co-memory attention. In this experiment, we set the number of contextual fact layers to 1, and the dynamic fact ensemble is not used. The number of memory updates is T = 2. We compare the co-memory attention mechanism with "fact concatenation" (fact-concat) and "memory concatenation" (mem-concat) to see the effectiveness of co-memory attention; the results are shown in Table 8.2. We can see that co-memory attention outperforms fact-concat and mem-concat on all four tasks, which shows the effectiveness of the co-memory attention mechanism. We believe the reason is that co-memory attention exploits the knowledge that motion and appearance provide useful cues to each other in attention generation.

Table 8.2: Evaluation of the co-memory attention mechanism on TGIF-QA. "Action" is repetition action (ACC %), "Trans" is state transition (ACC %), "Count" is repetition count (MSE) and "Frame" is frame QA (ACC %).
Method        Action   Trans   Count   Frame
Fact-concat   65.0     71.2    4.34    49.9
Mem-concat    64.5     70.7    4.39    50.2
Co-memory     66.8     73.2    4.21    51.0

Contextual facts and dynamic fact ensemble. The dynamic fact ensemble collaborates with the multi-level contextual facts to construct a proper temporal fact representation, so we test them together. We build 3 layers of contextual facts and run experiments to test the dynamic fact ensemble module. We use "fact concatenation" as the top memory network. The results are shown in Table 8.3, where "w/o ensemble" means that we do not build the multi-level contextual facts, but just use a single temporal conv layer (filter size 1) to convert the appearance and motion features into 1024-dimensional vectors, which are used as input facts. It can be seen that the ensemble provides better results. We believe the reason is that the attention-based fact fusion optimizes the ensemble process by using a weighted average of the contextual facts, and avoids relying on only one of them, which may make the facts sub-optimal.

Table 8.3: Evaluation of the dynamic fact ensemble on TGIF-QA. "Action" is repetition action (ACC %), "Trans" is state transition (ACC %), "Count" is repetition count (MSE) and "Frame" is frame QA (ACC %).
Method         Action   Trans   Count   Frame
w/o ensemble   65.0     71.2    4.34    49.9
w/ ensemble    66.3     72.5    4.30    50.4
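To make the mechanisms evaluated in Tables 8.2 and 8.3 concrete, below is a minimal sketch of the co-memory attention gates (Equations 8.4 and 8.5) and the soft fact fusion (Equations 8.6 and 8.7) for one modality; the use of PyTorch, the tensor shapes and the parameter names are our own assumptions, and the motion branch would be symmetric with the roles of the two memories swapped.

```python
import torch
import torch.nn as nn

class CoMemoryAttention(nn.Module):
    """Attention gates and soft fact fusion for the appearance branch (illustrative sketch)."""
    def __init__(self, fact_dim=1024, mem_dim=1024, q_dim=512, hid=512):
        super().__init__()
        self.w2 = nn.Linear(fact_dim, hid, bias=False)          # W_a^2
        self.w1 = nn.Linear(mem_dim + q_dim, hid, bias=False)   # W_a^1
        self.w4 = nn.Linear(hid, 1, bias=False)                 # W_a^4
        self.w3 = nn.Linear(mem_dim + q_dim, 1, bias=False)     # W_a^3

    def forward(self, facts, m_own, m_other, q):
        # facts: (N_levels, L, fact_dim) appearance contextual facts for one sample
        # m_own: appearance memory m_a^{t-1}; m_other: motion memory m_b^{t-1}; q: question embedding
        own = torch.cat([m_own, q], dim=-1)
        other = torch.cat([m_other, q], dim=-1)
        z = torch.tanh(self.w2(facts) + self.w1(own))          # Eq. 8.4, first line
        g = (self.w4(z) + self.w3(other)).squeeze(-1)          # (N_levels, L) attention gates
        s_level = torch.softmax(g, dim=0)                      # softmax over levels (for Eq. 8.6)
        fused = (s_level.unsqueeze(-1) * facts).sum(dim=0)     # (L, fact_dim) ensembled facts
        s_step = torch.softmax(g.mean(dim=0), dim=0)           # per-step attention (Eq. 8.7)
        return fused, s_step                                   # fed to the attention-based GRU

# Example (single sample):
# fused, att = CoMemoryAttention()(torch.randn(3, 34, 1024),
#                                  torch.zeros(1024), torch.zeros(1024), torch.randn(512))
```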
Table 8.4: Comparison of the number of memory update cycles on TGIF-QA. "Action" is repetition action (ACC %), "Trans" is state transition (ACC %), "Count" is repetition count (MSE) and "Frame" is frame QA (ACC %).
Method   Action   Trans   Count   Frame
T = 1    65.1     69.9    4.35    50.5
T = 2    66.8     73.2    4.21    51.0
T = 3    66.5     73.1    4.24    51.1

How many cycles of memory update are sufficient? We test the co-memory attention model with different numbers of memory updates, T = 1, 2, 3, to see how many cycles of memory update are sufficient for the video QA task. The dynamic fact ensemble is not used in this experiment. The results are shown in Table 8.4. We can see that two cycles (T = 2) of memory update give the best performance on the "Action", "Trans" and "Count" tasks. For "Frame", T = 2 and T = 3 have similar results. Comparing the results of T = 2 and T = 1 on "Trans", we can see that T = 2 improves the performance by 3.3%; we believe the reason is that multiple cycles of fact reading and memory update allow the model to focus on different parts of the video in each cycle. The performance begins to saturate at T = 3.

Comparison with state-of-the-art methods. There are two versions of TGIF-QA; we report performance on the second version, which is released by the authors of [34] on arXiv. The first version is originally reported in the CVPR version of [34]. The state-of-the-art method [34] on TGIF-QA adopted a dual-LSTM based approach with both spatial and temporal attention. Originally, their model is trained on C3D [85] temporal features and ResNet-152 [26] frame features. However, our method adopts a flow CNN model (Inception) for motion and ResNet-152 for appearance. Thus, for a fair comparison, we train their model (https://goo.gl/SVKTP9) with our features on all four tasks in TGIF-QA. The results are shown in Table 8.5. In Table 8.5, "SP" means spatial attention, "TP" means temporal attention, "(R+C)" means ResNet-152 features and C3D features, and "(R+F)" means ResNet-152 features and flow CNN features (our features). We also list the methods "VIS-LSTM" [70] and "VQA-MCB" [15], whose results are provided in [34]. (Footnote 1: We found an evaluation mistake in [34] (https://goo.gl/SVKTP9) on the count task; the new performances updated by the authors are listed here.)

Figure 8.6: Examples of state transition, repetition action, repetition count and frame QA are shown in the 1st, 2nd, 3rd and 4th rows. ST-TP is the temporal attention model from [34]. Green is for correct predictions and red is for wrong predictions.

Table 8.5: Comparison with the state-of-the-art methods on the TGIF-QA dataset. "Action" is repetition action (ACC %), "Trans" is state transition (ACC %), "Count" is repetition count (MSE) and "Frame" is frame QA (ACC %).
Model                  Action   Trans   Frame   Count
VIS+LSTM (aggr) [70]   46.8     56.9    34.6    5.09
VIS+LSTM (avg) [70]    48.8     34.8    35.0    4.80
VQA-MCB (aggr) [15]    58.9     24.3    25.7    5.17
VQA-MCB (avg) [15]     29.1     33.0    15.5    5.54
Yu et al. [106]        56.1     64.0    39.6    5.13
ST (R+C) [34]          60.1     65.7    48.2    4.38
ST-SP (R+C) [34]       57.3     63.7    45.5    4.28
ST-SP-TP (R+C) [34]    57.0     59.6    47.8    4.56
ST-TP (R+C) [34]       60.8     67.1    49.3    4.40
ST-TP (R+F)            62.9     69.4    49.5    4.32
Co-memory (w/o DFE)    66.8     73.2    51.0    4.21
Co-memory (full)       68.2     74.3    51.5    4.10

There are two co-memory variants shown in Table 8.5: "Co-memory (w/o DFE)" uses co-memory attention with T = 2 memory updates, but no dynamic fact ensemble; "Co-memory (full)" uses co-memory attention with T = 2 memory updates and the dynamic fact ensemble (soft fusion) on 3-layer contextual facts. We can see that our method outperforms the state-of-the-art methods significantly on all four tasks. Some visualization examples are shown in Figure 8.6.

8.5 Conclusion

Compared with image QA, video QA deals with long sequences of images, which contain richer information in both quantity and variety. In addition, motion and appearance information are both important for video analysis; they are usually correlated with each other and able to provide useful attention cues to each other. Motivated by these observations, we propose a motion-appearance co-memory network for video QA. Specifically, we design a co-memory attention mechanism that utilizes cues from both motion and appearance to generate attention, a temporal conv-deconv network to generate multi-level contextual facts, and a dynamic fact ensemble method to construct temporal representations dynamically for different questions. We evaluate our method on the TGIF-QA dataset, and it outperforms the state-of-the-art performance significantly.

Chapter 9
Conclusion and Future Work

In the previous chapters, I described my efforts towards solving temporal perception and reasoning problems, from temporal action detection to video question answering. The research scope is outlined in Figure 9.1. We discussed the two main aspects of action detection: online and offline. For the online part, we extend it to action anticipation and propose an encoder-decoder based model for frame-wise prediction; for the offline part, we developed the first complete two-stage action detector based on temporal regression. Then we combine the advances from both aspects with a complementary filtering idea. To address the shortcomings of action detection with a fixed category set, we propose the "localization by language" task, which is also part of temporal reasoning. Finally, we tackle the core problem in temporal reasoning, video question answering, with a dynamic memory network. Extensive and comprehensive evaluation is done on a range of publicly available video and image datasets for all proposed algorithms, and the state-of-the-art performances show the effectiveness of our methods. However, there are still some fundamental questions left.

First, actions are currently defined as intervals between some start and end times. However, such temporal boundaries are not as clear as object spatial boundaries. For example, should the start time of "slam dunk" be the time that the player jumps, or the time that the player puts the ball into the basket? In many cases, there are no clear temporal boundaries for actions or activities. It may be better to represent action instances by a "central point", that is, a central and highlight position inside the action instance. I think this concept of "central point" benefits both annotation quality and applications.
Raters would not suffer from the confusing start/end times, and could improve labelling speed and quality by focusing on the central part of the actions. For many applications, such as video search on YouTube, most users do not really care about the exact start and end times of an action; what they want are suggestions of potential highlight video clips, and such suggestions can also be represented by the "central point".

Second, we still do not have an effective, robust and simple enough model for video classification and video embedding extraction. High-level video understanding tasks rely considerably on progress in more fundamental tasks, such as video classification, just as advances in image classification have boosted object detection and segmentation. However, we still do not have clear conclusions on what really works and what does not in fundamental video classification models, at least not like the ones in the image domain. Although video classification is beyond the scope of this dissertation, I would like to point out that one of the biggest restrictions in developing temporal perception models is the basic video embedding extractor from video classification tasks.

Figure 9.1: There are two parts in this dissertation: temporal perception (frame-based detection, proposal-based detection with TURN+CBR, proposal+frame detection with CTAP, and anticipation with RED) and temporal reasoning (detection by language with TALL, and video question answering with Co-memory).

Reference List

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. 2015.
[2] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In CVPR, 2016.
[3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In ICCV, 2015.
[4] Piotr Bojanowski, Rémi Lajugie, Edouard Grave, Francis Bach, Ivan Laptev, Jean Ponce, and Cordelia Schmid. Weakly-supervised alignment of video with text. In ICCV, 2015.
[5] Shyamal Buch, Victor Escorcia, Chuanqi Shen, Bernard Ghanem, and Juan Carlos Niebles. Sst: Single-stream temporal action proposals. In CVPR, 2017.
[6] Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In CVPR, 2016.
[7] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
[8] Kan Chen, Jiyang Gao, and Ram Nevatia. Knowledge aided consistency for weakly supervised phrase grounding. In CVPR, 2018.
[9] Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. Abc-cnn: An attention based convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960, 2015.
[10] Xiyang Dai, Bharat Singh, Guyue Zhang, Larry S. Davis, and Yan Qiu Chen. Temporal context network for activity localization in videos. In ICCV, 2017.
[11] Roeland De Geest, Efstratios Gavves, Amir Ghodrati, Zhenyang Li, Cees Snoek, and Tinne Tuytelaars. Online action detection. In ECCV, 2016.
[12] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell.
Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[13] Victor Escorcia, Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. Daps: Deep action proposals for action understanding. In ECCV, 2016.
[14] Gunnar Farnebäck. Two-frame motion estimation based on polynomial expansion. In Scandinavian conference on Image analysis. Springer, 2003.
[15] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, 2016.
[16] Chuang Gan, Naiyan Wang, Yi Yang, Dit-Yan Yeung, and Alex G Hauptmann. Devnet: A deep event network for multimedia event detection and evidence recounting. In CVPR, 2015.
[17] Jiyang Gao, Kan Chen, and Ram Nevatia. Ctap: Complementary temporal action proposal generation. In The European Conference on Computer Vision (ECCV), September 2018.
[18] Jiyang Gao, Runzhou Ge, Kan Chen, and Ram Nevatia. Motion-appearance co-memory networks for video question answering. In CVPR, 2018.
[19] Jiyang Gao, Chen Sun, and Ram Nevatia. Acd: Action concept discovery from image-sentence corpora. In ICMR. ACM, 2016.
[20] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In ICCV, 2017.
[21] Jiyang Gao, Zhenheng Yang, and Ram Nevatia. Cascaded boundary regression for temporal action detection. In BMVC, 2017.
[22] Jiyang Gao, Zhenheng Yang, and Ram Nevatia. Red: Reinforced encoder-decoder networks for action anticipation. In BMVC, 2017.
[23] Jiyang Gao, Zhenheng Yang, Chen Sun, Kan Chen, and Ram Nevatia. Turn tap: Temporal unit regression network for temporal action proposals. In ICCV, 2017.
[24] Ross Girshick. Fast r-cnn. In ICCV, 2015.
[25] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[27] Minh Hoai and Fernando De la Torre. Max-margin early event detectors. In CVPR, 2012.
[28] Minh Hoai and Fernando De la Torre. Max-margin early event detectors. International Journal of Computer Vision, 107(2):191-202, 2014.
[29] Jan Hosang, Rodrigo Benenson, Piotr Dollár, and Bernt Schiele. What makes for effective detection proposals? IEEE transactions on pattern analysis and machine intelligence, 38(4):814-830, 2016.
[30] Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to reason: End-to-end module networks for visual question answering. In ICCV, 2017.
[31] Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, and Trevor Darrell. Natural language object retrieval. In CVPR, 2016.
[32] Ilija Ilievski, Shuicheng Yan, and Jiashi Feng. A focused dynamic attention model for visual question answering. arXiv preprint arXiv:1604.01485, 2016.
[33] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015.
[34] Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In CVPR, 2017.
[35] Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. In CVPR Workshop, 2015.
[36] Andrej Karpathy and Li Fei-Fei.
Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
[37] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[38] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara L Berg. Referitgame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
[39] Jin-Hwa Kim, Sang-Woo Lee, Donghyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Multimodal residual learning for visual qa. In NIPS, 2016.
[40] Kyung-Min Kim, Min-Oh Heo, Seong-Ho Choi, and Byoung-Tak Zhang. Deepstory: video story qa by deep embedded memory networks. In IJCAI, 2017.
[41] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[42] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In NIPS, 2015.
[43] Hema Swetha Koppula and Ashutosh Saxena. Anticipating human activities for reactive robotic response. In Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, pages 2071-2071. IEEE, 2013.
[44] Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. In ICML, 2016.
[45] Tian Lan, Tsung-Chuan Chen, and Silvio Savarese. A hierarchical representation for future action prediction. In ECCV, 2014.
[46] Ivan Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107-123, 2005.
[47] Dahua Lin, Sanja Fidler, Chen Kong, and Raquel Urtasun. Visual semantic search: Retrieving videos via complex textual queries. In CVPR, 2014.
[48] Tianwei Lin, Xu Zhao, and Zheng Shou. Single shot temporal action detection. In MM. ACM, 2017.
[49] Tianwei Lin, Xu Zhao, and Zheng Shou. Temporal convolution based action proposal: Submission to activitynet 2017. CVPR Workshop, 2017.
[50] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, and Scott Reed. Ssd: Single shot multibox detector. In ECCV, 2016.
[51] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, 2016.
[52] Shugao Ma, Leonid Sigal, and Stan Sclaroff. Learning activity progression in lstms for activity detection and early detection. In CVPR, 2016.
[53] Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS, 2014.
[54] Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask your neurons: A neural-based approach to answering questions about images. In ICCV, 2015.
[55] Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. The stanford corenlp natural language processing toolkit. In ACL, 2014.
[56] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
[57] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. Deep captioning with multimodal recurrent neural networks (m-rnn). In ICLR, 2015.
[58] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[59] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in neural information processing systems, pages 2204-2212, 2014.
[60] Jonghwan Mun, Paul Hongsuck Seo, Ilchae Jung, and Bohyung Han. Marioqa: Answering questions by watching gameplay videos. In ICCV, 2017.
[61] Seil Na, Sangho Lee, Jisung Kim, and Gunhee Kim. A read-write memory network for movie story understanding. In ICCV, 2017.
[62] Dan Oneata, Jakob Verbeek, and Cordelia Schmid. The lear submission at thumos 2014. In ECCV Workshop, 2014.
[63] Alonso Patron-Perez, Marcin Marszalek, Andrew Zisserman, and Ian D Reid. High five: Recognising human interactions in tv shows. In BMVC, 2016.
[64] Mingtao Pei, Yunde Jia, and Song-Chun Zhu. Parsing video events with goal inference and intent prediction. In ICCV, 2011.
[65] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In EMNLP, 2014.
[66] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015.
[67] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016.
[68] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
[69] Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1:25-36, 2013.
[70] Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. In NIPS, 2015.
[71] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[72] Jos BTM Roerdink and Arnold Meijster. The watershed transform: Definitions, algorithms and parallelization strategies. Fundamenta informaticae, 2000.
[73] Marcus Rohrbach, Michaela Regneri, Mykhaylo Andriluka, Sikandar Amin, Manfred Pinkal, and Bernt Schiele. Script data for attribute-based recognition of composite activities. In ECCV, 2012.
[74] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
[75] Kevin J Shih, Saurabh Singh, and Derek Hoiem. Where to look: Focus regions for visual question answering. In CVPR, 2016.
[76] Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, and Shih-Fu Chang. CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In CVPR, 2017.
[77] Zheng Shou, Dongang Wang, and Shih-Fu Chang. Temporal action localization in untrimmed videos via multi-stage cnns. In CVPR, 2016.
[78] Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, 2016.
[79] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[80] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[81] Bharat Singh, Tim K. Marks, Michael Jones, Oncel Tuzel, and Ming Shao.
A multi-stream bi-directional recurrent neural network for fine-grained action detection. In CVPR, 2016.
[82] Chen Sun, Chuang Gan, and Ram Nevatia. Automatic concept discovery from parallel text and visual corpora. In ICCV, 2015.
[83] Chen Sun, Sanketh Shetty, Rahul Sukthankar, and Ram Nevatia. Temporal localization of fine-grained actions in videos by domain transfer from web images. In ACM MM, 2015.
[84] Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Movieqa: Understanding stories in movies through question-answering. In CVPR, 2016.
[85] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
[86] Jasper RR Uijlings, Koen EA van de Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154-171, 2013.
[87] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. In ICLR, 2017.
[88] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In CVPR, pages 98-106, 2016.
[89] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Action recognition by dense trajectories. In CVPR, 2011.
[90] Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
[91] Limin Wang, Yu Qiao, Xiaoou Tang, and Luc Van Gool. Actionness estimation using hybrid fully convolutional networks. In CVPR, 2016.
[92] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229-256, 1992.
[93] Dan Xie, Sinisa Todorovic, and Song-Chun Zhu. Inferring "dark matter" and "dark energy" from videos. In ICCV, 2013.
[94] Caiming Xiong, Stephen Merity, and Richard Socher. Dynamic memory networks for visual and textual question answering. In ICML, 2016.
[95] Yuanjun Xiong, Limin Wang, Zhe Wang, Bowen Zhang, Hang Song, Wei Li, Dahua Lin, Yu Qiao, Luc Van Gool, and Xiaoou Tang. Cuhk & ethz & siat submission to activitynet challenge 2016. arXiv preprint arXiv:1608.00797, 2016.
[96] Yuanjun Xiong, Yue Zhao, Limin Wang, Dahua Lin, and Xiaoou Tang. A pursuit of temporal accuracy in general activity detection. CoRR, 2017.
[97] Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In ACM Multimedia, 2017.
[98] Huijuan Xu, Abir Das, and Kate Saenko. R-C3D: Region convolutional 3d network for temporal activity detection. In ICCV, 2017.
[99] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV, 2016.
[100] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In CVPR, 2016.
[101] Ting Yao, Yehao Li, Zhaofan Qiu, Fuchen Long, Yingwei Pan, Dong Li, and Tao Mei. MSR asia msm at activitynet challenge 2017: Trimmed action recognition, temporal action proposals and dense-captioning events in videos. In CVPR Workshop, 2017.
[102] Serena Yeung, Olga Russakovsky, Ning Jin, Mykhaylo Andriluka, Greg Mori, and Li Fei-Fei. Every moment counts: Dense detailed labeling of actions in complex videos. arXiv preprint arXiv:1507.05738, 2015.
[103] Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei. End-to-end learning of action detection from frame glimpses in videos. In CVPR, 2016. [104] Dongfei Yu, Jianlong Fu, Tao Mei, and Yong Rui. Multi-level attention networks for visual question answering. In CVPR, 2017. [105] Gang Yu and Junsong Yuan. Fast action proposals for human action detection and search. In CVPR, 2015. [106] Youngjae Yu, Hyungjin Ko, Jongwook Choi, and Gunhee Kim. End-to-end concept word detection for video captioning, retrieval, and question answering. In CVPR, 2017. [107] Jun Yuan, Bingbing Ni, Xiaokang Yang, and Ashraf A. Kassim. Temporal action localization with pyramid of score distribution features. In CVPR, 2016. [108] Zehuan Yuan, Jonathan C. Stroud, Tong Lu, and Jia Deng. Temporal action local- ization by structured maximal sums. In CVPR, 2017. [109] Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. Temporal action detection with structured segment networks. In ICCV, 2017. [110] Chen Zhu, Yanpeng Zhao, Shuaiyi Huang, Kewei Tu, and Yi Ma. Structured attentions for visual question answering. In ICCV, 2017. [111] Yuke Zhu, Daniel Gordon, Eric Kolve, Dieter Fox, Li Fei-Fei, Abhinav Gupta, Roozbeh Mottaghi, and Ali Farhadi. Visual semantic planning using deep successor representations. In ICCV, 2017. [112] Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7w: Grounded question answering in images. In CVPR, 2016. [113] C Lawrence Zitnick and Piotr Doll ar. Edge boxes: Locating object proposals from edges. In ECCV, 2014. 91
Abstract
Temporal perception and reasoning in videos are important Computer Vision problems that are closely related to Machine Learning, Deep Learning, Natural Language Processing and other research areas in Computer Science. In the temporal domain, one of the most important perception problems is temporal action detection: given an untrimmed long video, temporal action detection aims to predict the start and end times of the actions of interest as well as their action categories. Broadly, temporal action detection can be divided into two types: online and offline. Online action detection refers to detecting actions in streaming videos (e.g., surveillance, self-driving cars), i.e., no future video is available
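To make the task definition above concrete, the following is a minimal, hypothetical Python sketch (not code from the thesis) showing how a single temporal action detection can be represented as a start/end interval with a category and a confidence score, and how it can be scored against a ground-truth segment with temporal intersection-over-union; the Detection class and temporal_iou helper are illustrative names introduced only for this example.

from dataclasses import dataclass

@dataclass
class Detection:
    # One temporal action detection: a time interval plus a predicted category.
    start: float   # start time in seconds
    end: float     # end time in seconds
    label: str     # predicted action category
    score: float   # detection confidence

def temporal_iou(seg_a, seg_b):
    # Temporal intersection-over-union between two (start, end) intervals.
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical example: one prediction evaluated against a ground-truth segment.
pred = Detection(start=12.4, end=18.9, label="long jump", score=0.87)
gt = (11.0, 19.5)  # ground-truth start/end in seconds
iou = temporal_iou((pred.start, pred.end), gt)
print(f"tIoU = {iou:.2f}, correct at tIoU >= 0.5: {iou >= 0.5}")

In standard temporal action detection evaluation, a prediction is counted as correct when its temporal IoU with a ground-truth segment of the same category exceeds a chosen threshold, commonly 0.5.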
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Human activity analysis with graph signal processing techniques
Robust representation and recognition of actions in video
Coded computing: a transformative framework for resilient, secure, private, and communication efficient large scale distributed computing
Unsupervised learning of holistic 3D scene understanding
Incorporating aggregate feature statistics in structured dynamical models for human activity recognition
Model based view-invariant human action recognition and segmentation
Matrix factorization for noise-robust representation of speech data
Interactive learning: a general framework and various applications
Deep learning models for temporal data in health care
Multimodal reasoning of visual information and natural language
Landmark-free 3D face modeling for facial analysis and synthesis
Structured visual understanding and generation with deep generative models
Grounding language in images and videos
Understanding and optimizing internet video delivery
Efficient transforms for graph signals with applications to video coding
Modeling, learning, and leveraging similarity
Advanced machine learning techniques for video, social and biomedical data analytics
Event detection and recounting from large-scale consumer videos
Statistical lesion detection in dynamic positron emission tomography
Reinforcement learning with generative model for non-parametric MDPs
Asset Metadata
Creator
Gao, Jiyang
(author)
Core Title
Temporal perception and reasoning in videos
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
01/31/2019
Defense Date
11/08/2018
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
action recognition, activity detection, OAI-PMH Harvest, video analysis
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Nevatia, Ram (committee chair), Luo, Haipeng (committee member), Ortega, Antonio (committee member)
Creator Email
greatjoy1203@gmail.com, jiyangga@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-116764
Unique identifier
UC11675373
Identifier
etd-GaoJiyang-7037.pdf (filename), usctheses-c89-116764 (legacy record id)
Legacy Identifier
etd-GaoJiyang-7037.pdf
Dmrecord
116764
Document Type
Dissertation
Rights
Gao, Jiyang
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA