TOWARD SITUATION AWARENESS: ACTIVITY AND OBJECT RECOGNITION

by Jiaping Zhao

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2016. Copyright 2016 Jiaping Zhao

Dedication

To my mom, dad and sisters.

Acknowledgments

First and foremost, I would like to sincerely thank my advisor, Dr. Laurent Itti. You are always so passionate about research, and you are the most hard-working person in our lab; you set a very good example not just for me, but for all your students. In research, your wisdom and scientific intuition always inspire me, and many of my new ideas were born during discussions with you. Without your research guidance and financial support, I could not have completed this work well in these several years. I really appreciate having had the chance to work with such an excellent advisor!

I would like to thank Dr. Farhan Baluch and Dr. Christian Sigian; both of you gave me a lot of help from my move-in day, and helped me get familiar with the facilities and the projects in iLab. We also had some wonderful collaborations on research projects, and your patience, responsibility and commitment made those collaborations fruitful.

I want to thank my lab members as well: Chin-Kai Chang, Jens Windau, Shane Grant, Jimmy Tanner, Rorry Brenner and Chen Zhang. I spent four really wonderful years together with you. We did research together as an unbeatable team, and we lived together as a family. I truly appreciate those days with you.

Of course, I would like to give my sincere thanks and love to my family: my father, Xincheng Zhao; my mother, Xiuzi Li; and my sisters, Hanmei Zhao and Xiaomei Zhao. Your financial and spiritual support is what got me here!
Contents

Dedication
Acknowledgments
Abstract

1 Introduction

2 Time Series Classification
2.1 Abstract
2.2 Introduction
2.3 Previous Work
2.4 Methodology
2.4.1 Algorithm overview
2.4.2 Feature points detector
2.4.3 HOG-1D descriptor
2.4.4 DTW-MDS descriptor
2.4.5 Time series Encoding
2.4.6 Computational complexity analysis
2.5 Experimental Validation
2.5.1 Hybrid Sampling Vs. uniform and random sampling
2.5.2 Complementarity of HOG-1D and DTW-MDS
2.5.3 Comparison with other descriptors and the state-of-the-art algorithms
2.5.4 Empirical time complexity
2.5.5 Sensitivity analysis
2.6 Conclusions

3 Time Series Alignment
3.1 Abstract
3.2 Introduction
3.3 Related work
3.4 shape Dynamic Time Warping
3.4.1 Dynamic Time Warping
3.4.2 shape Dynamic Time Warping
3.5 Shape descriptors
3.5.1 Raw-Subsequence
3.5.2 PAA
3.5.3 DWT
3.5.4 Slope
3.5.5 Derivative
3.5.6 HOG1D
3.5.7 Compound shape descriptors
3.6 Alignment quality evaluation
3.7 Experimental validation
3.7.1 Sequence alignment
3.7.2 Time series classification
3.7.3 Sensitivity to the size of neighborhood
3.8 Conclusion

4 Metric Learning in Time Series
4.1 Abstract
4.2 Introduction
4.3 Related work
4.4 Local distance metric learning in DTW
4.4.1 Dynamic Time Warping
4.4.2 Local distance metric learning
4.5 Experiments
4.5.1 Experimental settings
4.5.2 Effectiveness of local distance metric learning
4.5.3 Effects of hyper-parameters
4.5.4 Comparison with state of the art algorithms
4.6 Discussion and Conclusion

5 Time Series Decomposition
5.1 Abstract
5.2 Introduction
5.3 Related work
5.4 Methodology
5.4.1 HOG-1D descriptor
5.4.2 Univariate Time Series Decomposition
5.4.3 Multivariate Time Series Decomposition
5.5 Experiments
5.6 Conclusion

6 Object Recognition: multi-task learning
6.1 Abstract
6.2 Introduction
6.3 Related work
6.4 A brief introduction of the iLab-20M dataset
6.5 Network Architecture and its Optimization
6.5.1 Architecture
6.5.2 Optimization
6.6 Experiments
6.6.1 Dataset setup
6.6.2 CNNs setup
6.6.3 Performance evaluation
6.6.4 Decoupling of what and where
6.6.5 Feature visualizations
6.6.6 Extension to ImageNet object recognition
6.7 Conclusion

7 Object Recognition: disentangling CNN
7.1 Abstract
7.2 Introduction
7.3 Related work
7.4 Method
7.4.1 Network Architecture
7.5 Experiments
7.5.1 iLab-20M dataset
7.5.2 Washington RGB-D dataset
7.5.3 ImageNet
7.6 Conclusions

8 Conclusions

Reference List
Abstract

Situation awareness refers to using sensors to observe the user's environment and to infer the user's situation from those perceptions. Being aware of the user's current situation makes it possible to offer useful cognitive assistance. Typical examples of cognitive assistance include giving a warning to the driver when the sensors detect that the driver has closed his eyes (fatigue), rejecting incoming phone calls when the camera observes that the user is in a meeting, etc. In this paper, we develop algorithms to infer the user's situation from sensed data, concretely, which activities the user is performing and which objects are in the scene. We use Google Glass, a wearable mobile device, as an example: it has embedded IMU sensors and a first-person camera; the former capture streams of acceleration and angular speed, and the latter records video streams. The former streams are multivariate time series, while the latter are image frame sequences. At the current stage, we analyze time series and image frames separately: concretely, we infer the user's current activities from time series, and recognize objects from images. By knowing the user's activities and the objects in the scene, it is possible to infer the user's situation and provide cognitive assistance to the user.

Activity recognition from time series is not as well studied as activity recognition from videos. In this paper, we first develop a time series classification pipeline, in which we introduce a new feature point detector and two novel shape descriptors. Our pipeline outperforms state-of-the-art classification approaches significantly. The developed pipeline naturally applies to activity recognition. Then we invent a novel temporal sequence alignment algorithm, named shape Dynamic Time Warping (shapeDTW). We show empirically that shapeDTW outperforms DTW for sequence alignment both qualitatively and quantitatively. When the shapeDTW distance is used as the distance measure under the nearest neighbor classifier for time series classification, it significantly outperforms the DTW distance measure, which is widely recognized as the best distance measure to date. By using the shapeDTW distance under the nearest neighbor classifier, we further improve activity recognition accuracies compared with our previous recognition pipeline. At last, we develop a time series decomposition algorithm, which splits heterogeneous time series sequences into homogeneous segments. This algorithm facilitates the data collection process: we can collect various activities for hours or days, and then use this algorithm to automatically segment the sequences, which greatly reduces manual labeling work.

We then address object recognition from natural images. Although contemporary deep convolutional networks have advanced object recognition by a big step, the underlying mechanism is still largely unclear. Here, we attempt to explore the mechanism of object recognition using a large-scale image dataset, iLab-20M, which contains 20 million images shot under controlled turntable settings. Compared with the ImageNet dataset, iLab-20M is parametric, with detailed pose and lighting information for each image, and we show that this auxiliary information can benefit object recognition. First, we formulate object recognition in a CNN-based multi-task learning framework, design a specific skip-connection pattern, and show its superiority to single-task learning theoretically and empirically. Moreover, we introduce a two-stream CNN architecture, which disentangles object identity from its instantiation factors (e.g., pose, lighting) and learns more discriminative identity representations. We experimentally show that the features learned from iLab-20M generalize well to other datasets, including ImageNet and Washington RGB-D.
Chapter 1
Introduction

The concept of situation awareness refers to the perception of the elements in the environment within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future [? ]. Situation awareness focuses on modelling and understanding the user's environment, and helps the user to be aware of his current situation and to anticipate future events. Often, situation awareness is divided into three levels: environmental perception, the acquisition of data from the environment; situation understanding, the analysis, interpretation and reasoning over perceptual data to comprehend the user's situation; and cognitive assistance, the ability to offer helpful assistance to users.

With the development of cloud computing and mobile computing technologies, situation awareness becomes possible on a modern mobile device, e.g., Google Glass. Current mobile devices have numerous embedded sensors, like camera, GPS, accelerometer, gyroscope, etc., which can collect multi-modal data continuously from the user's physical environment. By developing intelligent algorithms, it is possible to do real-time scene understanding and user situation inference from acquired video streams and recorded acceleration and GPS readings. Subsequently, the mobile device can provide cognitive assistance to users based on the inferred user context: e.g., rejecting incoming phone calls when a user is in a meeting, giving a warning to a driver when the driver is fatigued (by blink/eye-closure detection), and providing cognitive assistance to users with cognitive impairment or decline.

In the rest of the paper, we address activity recognition from time series data and object recognition from static images separately. Using Google Glass as an example, it has IMU sensors, such as an accelerometer and gyroscope, and an ego-centric video camera. The IMU sensors capture streams of acceleration and angular speed records, while the camera records video streams. The former streams are multivariate time series, while the latter are image frame sequences. At the current stage, we analyze time series and image frames separately: concretely, we infer the user's current activities from time series, and recognize objects from images. By knowing the user's activities and the objects in the scene, it is possible to infer the user's situation and provide cognitive assistance to the user. Although we use Google Glass as an example, our algorithms are applicable to any similar mobile device with embedded cameras and IMUs.

We first address activity recognition from time series, and then object recognition from images. We split activity recognition into four chapters: time series classification, time series alignment, metric learning in time series and time series decomposition. In the last two chapters, we develop two new deep convolutional architectures to recognize objects from static images.
Chapter 2
Time Series Classification

2.1 Abstract

Time series classification (TSC) arises in many fields and has a wide range of applications. Here we adopt the bag-of-words (BoW) framework to classify time series. Our algorithm first samples local subsequences from time series at feature-point locations when available. It then builds local descriptors and models their distribution by Gaussian mixture models (GMM), and at last it computes a Fisher Vector (FV) to encode each time series. The encoded FV representations of time series are readily used by existing classifiers, e.g., SVM, for training and prediction. In our work, we focus on detecting better feature points and crafting better local representations, while using existing techniques to learn the codebook and encode time series. Specifically, we develop an efficient and effective peak and valley detection algorithm for real-case time series data. Subsequences are sampled from these peaks and valleys, instead of being sampled randomly or uniformly as was done previously. Then, two local descriptors, Histogram of Oriented Gradients (HOG-1D) and Dynamic Time Warping - Multidimensional Scaling (DTW-MDS), are designed to represent sampled subsequences. The two descriptors complement each other, and their fused representation is shown to be more descriptive than either one individually. We test our approach extensively on 43 UCR time series datasets, and obtain significantly improved classification accuracies over existing approaches, including NNDTW and the shapelet transform.

2.2 Introduction

Time series classification (TSC) has numerous applications in many fields, including data mining, machine learning, signal processing, computational biology, etc. Typical classification approaches can be categorized as instance-based (e.g., the one-nearest-neighbor classifier with Euclidean distance (NN-Euclidean) or dynamic time warping distance (NNDTW)), shapelet [139, 58, 86, 91], feature-based [143, 30], and local pattern-frequency histogram based methods [105, 106, 10, 128, 30]. Instance-based methods, like NNDTW, have been successfully used for TSC and shown to be very hard to beat [8, 34, 131], but they are usually less interpretable.
Shapelet is another promising method for TSC: it discovers subsequences that are discriminative of class membership and provides more interpretable results, but searching for shapelets on large datasets becomes time-consuming or even intractable [102]. Feature-based methods do show promising classification results, but their capabilities are largely attributed to strong classifiers like SVM, adaboost and random forest, instead of being due to better global/local features and representations. Our work belongs to the local pattern-frequency histogram methods, and we exploit the general Bag-of-Words (BoW) framework.

A typical BoW framework consists of three major steps: (1) local feature point detection and description, (2) codebook generation and (3) signal encoding. Afterwards, any classifier can be trained on the signal encodings to do the final classification. The performance of a BoW framework implementation depends on all steps. In the computer vision community, many efforts have been made to improve each step. Regarding local feature detection and description, successful feature extractors (e.g., SIFT [89], Space Time Interest Points (STIPs) [79]) have been developed to detect local feature points, and manually-crafted descriptors (e.g., Histogram of Gradients [26], Motion Boundary Histogram (MBH) [27]) have been invented to represent local 2D image patches and 3D visual cuboid patterns around feature points. However, as reviewed below, fewer developments have been made for 1D time series descriptors. The next step, codebook generation, attempts to model the local descriptor space and to provide a partition of that space; two typical ways are K-means and Gaussian Mixture Models (GMM). For the last step, encoding, there is a large family of research studies; several representative encoding methods are vector quantization (hard voting) [118], sparse coding [137] and Fisher Vector encoding [98].

Figure 2.1: Algorithm pipeline (1. training time series; 2. feature points extraction; 3. hybrid sampling; 4. subsequence representation by HOG-1D and DTW-MDS; 5. time series encoding by GMM and Fisher Vector; 6. linear kernel SVM). We adopt a typical BoW framework: (1) local feature extraction and representation; (2) codebook generation; (3) time series encoding. The encoded feature vectors are used by any classifier to do the classification. In our work, we focus on detecting better feature points and obtaining better local representations, while using existing techniques to learn the codebook and encode time series. Specifically, we develop an effective algorithm to detect peaks and valleys from time series, and subsequences are sampled from these feature points (steps 2, 3). Then we introduce two descriptors, Histogram of Oriented Gradients (HOG-1D) and Dynamic Time Warping - Multidimensional Scaling (DTW-MDS), to represent local subsequences (step 4). Afterwards, we fit the distribution of subsequence descriptors by a K-component Gaussian Mixture Model (GMM) and encode each time series by a Fisher Vector (FV) (step 5). At last, a linear kernel SVM is employed to classify time series based on the FV encodings. The pipeline shows the process for training time series; a test time series T_t goes through the same process, and the only difference is in step 5: T_t is directly encoded as a FV by the GMM parameters learned from training (without any further GMM model fitting).
In this work, we adopt the BoW pipeline and focus on improving the first step, designing better local feature extractors and descriptors, while using existing techniques for the second and third steps; specifically, GMM is used to produce the codebook and the Fisher Vector [98] is employed to encode the time series.

While local feature extractors are well studied in the computer vision community, in the time series community no widely used extractors exist yet, so most methods sample feature points either uniformly or randomly. In this paper, we introduce an efficient and effective feature point extractor, which detects all peaks and valleys, termed landmarks, from time series. Afterwards, subsequences centered on landmarks are sampled. Landmark-based sampling gives deterministic and phase-invariant subsequences, while uniform or random sampling is affected by the phase of the time series [10]. Due to the observation that dense sampling outperforms sparse interest-point sampling in image classification [18] and activity recognition [127], in experiments we adopt a hybrid sampling strategy: first sample subsequences from landmarks, then sample uniformly in "flat" featureless regions of the signal. In this way, information from both feature-rich and feature-less intervals is incorporated in the sampled subsequences. We show experimentally that this new hybrid sampling strategy outperforms both uniform and random sampling significantly.

To the best of our knowledge, little recent literature on time series classification is focused on developing better local descriptors for local time series subsequences. Commonly used local features are often simple, including mean, variance and slope [10, 30, 106]. However, statistical features like mean and variance cannot characterize local segment shapes well. Although slope incorporates shape information, it will underfit the shape of local subsequences if the interval (here, a subsequence is divided into equal-length non-overlapping temporal intervals and represented as a sequence of interval slopes) is too long, and becomes sensitive to noise if the interval is too short. Symbolic Aggregate approXimation (SAX) [83] is shown to be a good representation for time series; however, its usage in a BoW framework [84] creates a large codebook, resulting in high-dimensional encoding vectors for time series, which inevitably burdens downstream classifier training and prediction. Other widely used and somewhat older time series representations include Discrete Fourier Transform (DFT) coefficients, Discrete Wavelet Transform (DWT), piecewise linear approximation (PLA), etc. It is important to clarify that SAX, DFT, DWT and PLA have been used in general to represent the whole time series, instead of local subsequences.

In our work, we propose two new local descriptors, namely the Histogram of Oriented Gradients of 1D time series (HOG-1D) and Dynamic Time Warping - Multidimensional Scaling (DTW-MDS), which are shown experimentally to be quite descriptive of local subsequence shapes. These two descriptors have individual advantages: HOG-1D consists of statistical histograms, and is therefore robust to noise; moreover, HOG-1D is invariant to y-axis magnitude shift. While DTW-MDS is sensitive to noise and magnitude shift, it is more invariant to stretching, contraction and temporal shifting. The two descriptors thus complement each other.
By fusing them into a single descriptor, the fused one, HOG-1D+DTW-MDS, combines the benefits of both descriptors, becomes more descriptive of subsequences, and is thus more discriminative for classification tasks. Experimental results show that our fused descriptor outperforms existing descriptors, such as DFT, DWT and Zhang's [143, 134], significantly on 43 UCR datasets for time series classification. Here DFT, DWT and Zhang's are used to represent local subsequences, instead of the whole time series. All local descriptors, including our fused one, work under the same classification pipeline: (1) feature point extraction, (2) local subsequence representation, (3) time series encoding by Fisher Vector, (4) linear kernel SVM classification. In addition, we compare the TSC performance of our fused descriptor with two state-of-the-art algorithms, NNDTW and the shapelet transform [58], on 41 UCR datasets, and ours achieves the best performance on 22 of them (including ties). A Wilcoxon signed rank test on the relative accuracy boost (see section 2.5.3 for its definition) shows that our fused descriptor improves relative classification accuracies significantly compared to NNDTW (p < 0.0017) and the shapelet transform (p < 0.0452). Our algorithm performs well on the UCR datasets, which have fixed-length time series instances; however, it is also applicable to datasets with variable-length time series instances, since the Fisher Vector is essentially a normalized encoder, making encodings largely invariant to time series length.

Our contributions are severalfold: (1) we introduce a simple but effective feature point extractor, which detects a set of landmarks from time series; (2) we explicitly design two local subsequence descriptors, namely HOG-1D and DTW-MDS, which are descriptive of local shapes and complement each other; (3) we obtain significantly improved classification accuracies using our fused descriptor when compared with two competing state-of-the-art TSC algorithms, NNDTW and the shapelet transform, and three existing descriptors, DFT, DWT and Zhang's, on 43 UCR datasets. Our algorithm pipeline is shown in Fig. 2.1 (code available at: https://github.com/jiapingz/TSClassification).

2.3 Previous Work

Time series classification (TSC) methods can be categorized into instance-based, shapelet, feature-based and pattern frequency histogram methods.

Instance-based methods predict labels of test time series based on their similarity to the training instances. The most popular similarity metrics include the Euclidean distance and elastic distances, e.g., the dynamic time warping (DTW) distance. Using a single nearest neighbor, with Euclidean distance (NN-Euclidean) or DTW distance (NNDTW), has demonstrated successful time series label prediction. DTW allows time series to be locally shifted, contracted and stretched, and the lengths of time series hence need not be the same. Therefore, DTW usually gives a better similarity measurement than the Euclidean distance, and NNDTW has been shown to be very hard to beat on many datasets [8]. A number of more complex elastic distance measures have been proposed, including longest common subsequences (LCSS) [126], Edit distance with Real Penalty (ERP) [20] and Edit Distance on Real sequence (EDR) [21]. However, in [101], the authors claimed that no other elastic distance measure outperforms DTW by a statistically significant amount, and that DTW is the best measure.
Instance-based approaches, like NN-Euclidean and NNDTW, are accurate, but they are less interpretable, since they are based on global matching and provide limited insight into the temporal characteristics.

A shapelet is a localized time series subsequence that is discriminative of class membership; it was first proposed and used by Ye and Keogh [139] for time series classification. The original shapelet algorithm [139] searches for shapelets recursively and builds a decision tree using different shapelets as splitting criteria. However, the expressiveness of shapelets is limited to binary decision questions. In [91], the authors proposed logical-shapelets, specifically conjunctions or disjunctions of shapelets, which are shown to be more expressive than a single shapelet and to experimentally outperform the original shapelet algorithm. The above two algorithms embed shapelet discovery in a decision tree, while in [58] the authors separate shapelet discovery from the classifier by finding the best k shapelets in a single scan of all time series. The shapelets are used to transform the data, where each attribute in the new dataset represents the distance of a time series to one of the k shapelets. Hills et al. demonstrate that the transformed data, in conjunction with more complex classifiers, produces better accuracies than the embedded shapelet tree. Since shapelets are localized class-discriminative subsequences, shapelet-based methods have greater interpretability than global instance-based matching. The main drawback is the time complexity of searching for shapelets, and subsequent research, e.g., [102], focuses on developing efficient shapelet-searching algorithms.

Feature-based methods generally consist of two sequential steps: extract features and train a classifier on those features. Typical global features include statistical features, like variance and mean, PCA coefficients, DFT coefficients, and the zero-crossing rate [143]. These features are extracted either from the time domain or from transformed domains, like the frequency domain and principal component space. Afterwards, the extracted features either go through feature selection procedures to prune less significant ones [143], or are fed directly into complex classifiers, like a multi-layer neural network [92]. Global features lose temporal information, although it is potentially informative for classification. In [106], the authors extracted features from intervals of time series, constructed and then boosted binary stumps on these interval features, and trained an SVM on the outputs of the boosted binary stumps. In [30], the authors extracted simple interval features as well, including mean, variance and slope, trained a random forest ensemble classifier, and showed better performance than NNDTW. Although feature-based methods have shown promising classification results, their capabilities are largely attributed to strong classifiers such as SVM, adaboost and random forest, instead of being due to better global/local features and representations.

Another popular approach is based on pattern frequency histograms, widely known as bag of words (BoW). The BoW approach incorporates word frequencies but ignores their locations. In time series applications, several recent papers adopted BoW ideas. Lin et al. [84] first symbolize time series by SAX, then slide a fixed-size window to extract a contiguous set of SAX words, and at last use the frequency distribution of SAX words as a representation for the time series. Baydogan et al.
[10] propose a similar bag-of-features framework. They sample subsequences of varying lengths randomly; use the mean, variance, slope and temporal location t to represent each subsequence; then utilize random forest classification to estimate class probabilities of each subsequence; and finally represent the raw time series by summarizing the subsequence class-probability distribution information. They showed superior or comparable results to competing methods like NNDTW on UCR datasets [91]. Wang et al. [128] adopted a typical bag of words framework to classify biomedical time series data; they sample subsequences uniformly and represent them by DWT. Grabocka and Schmidt-Thieme [50] introduce a similar BoW pipeline to classify time series: they sample subsequences from time series instances uniformly, learn latent patterns and membership assignments of each subsequence to those patterns, and sum up the membership assignments of the subsequences of a time series as the representation of that time series. The time series representations are then classified by a polynomial kernel SVM. Our work belongs to this category, but emphasizes detecting better local feature points and developing better local subsequence representations.

There are two recent papers using local descriptors as well [16, 130]. In [16], the authors attempt to improve the efficiency of traditional DTW computation; concretely, they extract local feature points, match them by their descriptors and compute local band constraints (based on the matched pairs) applicable during the execution of the DTW algorithm. In this way, they only have to compute the accumulated distances within the band, and the DTW computation efficiency is improved. Our work is different from [16] in that we use local descriptors to improve classification accuracies, while [16] use local descriptors to improve DTW computation efficiency. In [130], the authors extract local features from multivariate time series by leveraging meta-data, and their method for local feature extraction is only applicable to multivariate time series data with known correlations and dependencies among the different dimensions. The UCR datasets are univariate time series datasets, and their method cannot be used here.

2.4 Methodology

2.4.1 Algorithm overview

We follow the classic bag-of-words pipeline closely to do the classification. As mentioned, we focus on developing better local feature extractors and better local subsequence representations, while using existing algorithms for the following steps; specifically, GMM is used to generate the codebook and the Fisher Vector [98] is employed to encode the time series. We propose a simple but effective feature point extractor, which robustly detects peaks and valleys from real-case time series, and local subsequences are sampled from these feature points instead of being sampled randomly or uniformly as previously done. Afterwards, we introduce two descriptors, namely HOG-1D and DTW-MDS, to represent the local subsequences. The two descriptors have individual advantages and complement each other, and both capture local shapes well. Since each subsequence has two descriptors, we can either use them separately, or fuse them into a single descriptor.
In the case of fusion, the two descriptors are first $\ell_2$-normalized and then concatenated to form a new descriptor, i.e., $d^i = [\, d^i_{HOG\text{-}1D}/\|d^i_{HOG\text{-}1D}\|_2, \; d^i_{DTW\text{-}MDS}/\|d^i_{DTW\text{-}MDS}\|_2 \,]$, where $d^i_{HOG\text{-}1D}$ and $d^i_{DTW\text{-}MDS}$ are the HOG-1D and DTW-MDS descriptors of the $i$th subsequence, and $d^i$ is its concatenated new descriptor, termed HOG-1D+DTW-MDS.

To do TSC, we first extract subsequences and represent them by either (1) HOG-1D, (2) DTW-MDS, or (3) HOG-1D+DTW-MDS descriptors, then learn a generative K-component GMM to model the distribution of local descriptors based on the training data, and at last encode each time series by a Fisher Vector [98]. Subsequently, an SVM classifier with a linear kernel is used for training and testing based on the Fisher Vector representations of the time series. The details of local feature extraction and representation, codebook generation and global time series encoding are given in the following sections.
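As a concrete illustration of the fusion step above, the following minimal sketch $\ell_2$-normalizes the two descriptors of one subsequence and concatenates them; the array names and dimensionalities (a 16-D HOG-1D and a 20-D DTW-MDS, as used later in the experiments) are illustrative only.

```python
import numpy as np

def fuse_descriptors(d_hog1d, d_dtw_mds, eps=1e-12):
    """Fuse HOG-1D and DTW-MDS descriptors of one subsequence:
    l2-normalize each descriptor, then concatenate them."""
    d_hog1d = np.asarray(d_hog1d, dtype=float)
    d_dtw_mds = np.asarray(d_dtw_mds, dtype=float)
    d_hog1d = d_hog1d / (np.linalg.norm(d_hog1d) + eps)
    d_dtw_mds = d_dtw_mds / (np.linalg.norm(d_dtw_mds) + eps)
    return np.concatenate([d_hog1d, d_dtw_mds])

# Example: a 16-D HOG-1D descriptor and a 20-D DTW-MDS descriptor
# yield a 36-D fused HOG-1D+DTW-MDS descriptor.
fused = fuse_descriptors(np.random.rand(16), np.random.rand(20))
assert fused.shape == (36,)
```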
2.4.2 Feature points detector

To the best of our knowledge, the existing literature uses regular constant-step sliding-window sampling (a.k.a. uniform sampling) or random sampling strategies to extract subsequences from long time series [84, 10, 128]. In the case of uniform sampling, a fixed-size window is slid along the temporal axis with a constant stride and the subsequence within each sliding window is sampled. In the case of random sampling, subsequences are extracted from random locations on the time series. There is inevitable randomness in both sampling strategies: different starting points make the sampled subsequences differ under uniform sampling, while randomness under random sampling is inherent. This partially motivates us to design a procedure that makes the sampled subsequences deterministic. Concretely, we propose to extract temporal feature points first and then sample subsequences from there; if the feature points are deterministic, the sampled subsequences are fixed each time. Compared with non-feature points, local feature points are more descriptive of local shapes and more robust to noise. Successful local feature point detectors include the 2D image feature point detector SIFT [89] and the 3D spatio-temporal video feature point detector STIPs [79]. SIFT and STIPs are widely used for object recognition in 2D images and activity recognition in videos. Inspired by their great performance, we introduce a 1D temporal feature point detector, aiming at reaching higher downstream TSC accuracies. The feature point detection makes the following steps invariant, to some extent, to time series phase.

We define temporal feature points to be peaks and valleys of the time series. Given a time series $T = (t_1, t_2, \ldots, t_n)$, a peak or valley $t_i$ on a noise-free time series satisfies $(t_i - t_{i-1}) \cdot (t_{i+1} - t_i) < 0$, i.e., its left and right derivatives change sign. However, this simple criterion for feature point detection fails to work for real-case signals, since there are many small bumps on the signals, and many false positives are detected as a result. Since valley detection in a raw time series $T$ can be transformed to peak detection in $-T$, we will focus on peak detection in the raw time series $T$. If our algorithm returns $t_i$ as a peak when it satisfies $t_i > t_{i-1}$ and $t_i > t_{i+1}$, many points found in this way lie on ascending or descending slopes and are false peaks. The challenge is to find 'better' local maxima while discarding 'fake' ones.

To be more selective of peaks, a straightforward modification is: only if $t_i > t_{i-1} + \Delta$ and $t_i > t_{i+1} + \Delta$, where $\Delta > 0$, is $t_i$ returned as a local peak. The larger $\Delta$ is, the more selective the algorithm is and the fewer peaks are returned; on the other hand, many true positives will be missed when $\Delta$ is set to be large. We can relax the constraint: $t_i$ is returned as a peak if it is larger than one neighbor by a gap $\Delta$. We name this algorithm Algo1.

Algo1: $t_i$ is returned as a peak when it satisfies either (1) $t_i > t_{i-1} + \Delta$ and $t_i > t_{i+1}$, or (2) $t_i > t_{i-1}$ and $t_i > t_{i+1} + \Delta$.

However, Algo1 will inevitably miss some true positives. For example, in Fig. 2.2, $p$ is a local peak on both segments; suppose $\Delta_1$ is the same in both cases, but $\Delta_2$ on the left segment is much smaller than $\Delta_2$ on the right segment. Then $p$ on the left segment is less likely to satisfy the peak condition in Algo1, indicating that $p$ on the left segment may not be returned as a peak.

Figure 2.2: Local maxima: Algo1 will miss some true local optima. $p$ is a local peak on both segments; suppose $\Delta_1$ is the same in both cases, but $\Delta_2$ on the left segment is much smaller than $\Delta_2$ on the right segment. Then $p$ on the left segment is less likely to satisfy the peak condition in Algo1, indicating that $p$ on the left segment may be missed.

We introduce Algo2, which is guaranteed to find both the peaks found by Algo1 and some true positives missed by Algo1.

Algo2: Given real-case time series data $T = (t_1, t_2, \ldots, t_n)$, keep only the temporal points whose left and right derivatives change sign, i.e., $(t_i - t_{i-1}) \cdot (t_{i+1} - t_i) < 0$, and organize them in the same temporal order as they appear in the raw time series to form a trimmed time series $T' = (t'_1, t'_2, \ldots, t'_m)$. Using $T'$ as input, return $t'_i$ as a peak when it satisfies either (1) $t'_i > t'_{i-1} + \Delta$ and $t'_i > t'_{i+1}$, or (2) $t'_i > t'_{i-1}$ and $t'_i > t'_{i+1} + \Delta$.

Compared with Algo1, there is a trimming step in Algo2, and the subsequent peak extraction process remains the same.

Algo2 is guaranteed to detect the peak points detected by Algo1. Proof: let $(\ldots, A', A, p, B, B', \ldots)$ be part of the raw time series $T$ as shown in Fig. 2.2 (the right segment), and assume $p$ is returned by Algo1 as a peak; then $p$ must satisfy either (1) $p > A + \Delta$ and $p > B$, or (2) $p > A$ and $p > B + \Delta$. In either case, $(p - A) \cdot (B - p) < 0$, i.e., $p$ is a left/right derivative sign-changing point and will be kept in the trimmed time series $T'$. We now show that $p$ will be returned by Algo2 as a peak as well. Case one: if both $A$ and $B$ are sign-changing points, then the segment $-A-p-B-$ is also a segment of the trimmed time series $T'$; then $p$ again satisfies either (1) $p > A + \Delta$ and $p > B$, or (2) $p > A$ and $p > B + \Delta$, and is returned by Algo2 as a peak. Case two: if only one of $A$ and $B$ is a sign-changing point, without loss of generality assume $A$ is the sign-changing point and is kept in $T'$ as the left neighbor of $p$. Let the new right neighbor of $p$ be $B''$; then $B''$ is a sign-changing point on the right side of $B$ in $T$. If $B'' \ge B$, then there must be some point $\hat{B}$ in the raw time series $T$ between $B$ and $B''$ satisfying $\hat{B} < B$ and $\hat{B} < B''$, otherwise $B$ would be a sign-changing point; in that case $\hat{B}$ is a sign-changing point and would be the right neighbor of $p$, contradicting the assumption. Therefore $B'' \le B$, and $p$ in the trimmed time series $T'$ will satisfy the criterion defined in Algo1 as well and be returned.
Case three: both $A$ and $B$ are non-sign-changing points; since this case can be reduced to case two, we omit its proof.

Algo2 will return more peak points than Algo1. From the above analysis, the peak condition is easier to satisfy in the trimmed time series, resulting in more returned peak points. A concrete example: under some magnitude constraint on $\Delta$, $\Delta_1$ and $\Delta_2$, the peak $p$ on the left segment of Fig. 2.2 will be missed by Algo1, but will easily be detected by Algo2.

In practice, we use a slightly modified version of Algo2 to detect peaks, and name the new algorithm Algo3: after obtaining a trimmed time series $T' = (t'_1, t'_2, \ldots, t'_m)$ as in Algo2, return $t'_i$ as a peak when it satisfies either (1) $t'_i > \min(t'_{<i}) + \Delta$ and $t'_i > t'_{i+1}$, or (2) $t'_i > t'_{i-1}$ and $t'_i > \min(t'_{>i}) + \Delta$. Note that in Algo2, $t'_i$ is compared with its immediate left and right neighbors $t'_{i-1}$ and $t'_{i+1}$, while in Algo3, $t'_i$ is compared with its immediate right (left) neighbor $t'_{i+1}$ ($t'_{i-1}$) and some neighbor $\min(t'_{<i})$ ($\min(t'_{>i})$) from its left (right) side. In the constraint, $\min(t'_{<i})$ ($\min(t'_{>i})$) denotes the point of minimum value residing between $t'_i$ and its closest peak to the left (right). Compared with Algo2, Algo3 is more robust to noise, and easily removes many 'false peaks' on ascending or descending slopes (see the feature point detection results in Fig. 2.3). The only parameter in Algo3 is $\Delta$, which is set to be some ratio of the value range of the time series instance, i.e., $\Delta = \lambda \cdot (\max(T) - \min(T))$, where $0 < \lambda < 1$ (see the supplementary materials for details and our demo code).

Figure 2.3: Feature point extraction by Algo3: a real-case time series and the peak and valley points (red circles) returned by running Algo3. Visually, all the true positives are extracted while the false positives are suppressed.

Figure 2.4: Hybrid sampling: in feature-rich regions, sample from feature points; in flat regions, sample uniformly. Red circles are detected feature points, and magenta circles are evenly-spaced points in flat regions. Subsequences centered around both point types are sampled; the second row shows examples of sampled subsequences.

Fig. 2.3 shows a typical example of feature point detection by Algo3 on a noisy time series. As we can see, Algo3 achieves both high precision and high recall of feature points, i.e., it detects most visually true peaks/valleys while suppressing false positives. Afterwards, subsequences of fixed length centered on these landmarks are extracted. However, in practice we exploit a hybrid sampling strategy: in feature-rich regions we sample subsequences from landmarks, while in "flat" regions we sample uniformly. In this way, information from both feature-rich and feature-less intervals is preserved, and the time series are well characterized by the hybrid-sampled subsequences. Hybrid sampling is illustrated in Fig. 2.4. In the following two sections, we introduce two descriptors, HOG-1D and DTW-MDS, to represent the sampled subsequences.
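The following sketch illustrates the Algo3-style landmark detection and the hybrid choice of subsequence centers. It is a simplified reading of the algorithm above, not the released implementation: valleys are obtained by running the peak detector on $-T$, the gap test uses the minimum over the entire left/right context rather than only the stretch up to the nearest detected peak, and $\lambda$ and the fill-in stride are illustrative defaults.

```python
import numpy as np

def detect_landmarks(T, lam=0.05):
    """Algo3-style peak detection (valleys: run the same routine on -T).
    1) trim T to its derivative-sign-change points;
    2) keep a trimmed point if it exceeds a neighbor on one side and clears
       a gap Delta = lam * (max(T) - min(T)) on the other side.
    Simplification: the gap test uses the minimum over the whole left/right
    context, not only the stretch up to the nearest detected peak."""
    T = np.asarray(T, dtype=float)
    delta = lam * (T.max() - T.min())
    d = np.diff(T)
    keep = np.where(d[:-1] * d[1:] < 0)[0] + 1          # sign-change points of T
    vals, peaks = T[keep], []
    for j in range(1, len(keep) - 1):
        left_ok = vals[j] > vals[:j].min() + delta and vals[j] > vals[j + 1]
        right_ok = vals[j] > vals[j - 1] and vals[j] > vals[j + 1:].min() + delta
        if left_ok or right_ok:
            peaks.append(keep[j])
    return np.array(peaks, dtype=int)

def hybrid_sample_centers(T, lam=0.05, stride=5):
    """Hybrid sampling of subsequence centers: landmarks (peaks of T and of -T)
    plus uniformly spaced centers inside 'flat' stretches with no landmark."""
    landmarks = np.union1d(detect_landmarks(T, lam), detect_landmarks(-T, lam))
    centers = set(landmarks.tolist())
    bounds = np.r_[0, landmarks, len(T) - 1]
    for a, b in zip(bounds[:-1], bounds[1:]):
        if b - a > stride:                               # featureless gap: fill uniformly
            centers.update(range(int(a) + stride, int(b), stride))
    return np.array(sorted(centers), dtype=int)
```

Fixed-length subsequences centered on the returned indices would then be cropped and passed to the descriptors described next.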
2.4.3 HOG-1D descriptor

The Histogram of Oriented Gradients (HOG) was first introduced by Dalal and Triggs [26] for object detection. They showed that local object appearance and shape are well captured by the distribution of local intensity gradients, and empirically demonstrated its excellent performance for pedestrian detection. Later on, Klaser et al. [73] generalized the key HOG concept to the 3D spatio-temporal video domain and developed the histograms of oriented 3D spatio-temporal gradients (HOG-3D) descriptor. They applied the HOG-3D descriptor to several action datasets, and obtained the state-of-the-art activity recognition results at that time.

Based on the success of HOG and HOG-3D, we here introduce a new HOG-1D descriptor for 1D time series data. We inherit the key concepts from HOG and adapt them to 1D temporal data. Assume a subsequence $s = (p_1, p_2, \ldots, p_l)$ of length $l$, divided into $n$ constant-length overlapping or non-overlapping intervals $I = \{I_1, I_2, \ldots, I_n\}$, where the cardinality of each interval is $|I_i| = c$. Within each interval $I_i$, a 1D histogram of oriented gradients is accumulated over all temporal points in $I_i$. Concatenating the $n$ interval histograms forms the descriptor of the subsequence $s$, which we term the HOG-1D descriptor. Fig. 2.5 shows the generation process of the HOG-1D descriptor of a sample subsequence. The statistical nature of histograms makes HOG-1D less sensitive to observation noise, while the concatenation of sequential histograms captures temporal information well. When the number of intervals decreases to 1 (i.e., $n = 1$), the descriptiveness of HOG-1D is weakened since temporal information is lost, while when $n$ increases, the cardinality of each interval decreases (supposing the $n$ intervals do not overlap), and the HOG of a short interval becomes more sensitive to noise. In practice, setting $n = 2, 3$ works well (the influence of the number of intervals on classification performance is analyzed in the supplementary materials).

Figure 2.5: HOG-1D descriptor: a subsequence $s$ is shown as a green line. At each temporal point $p_i$ on $s$, a centered gradient is estimated, with the blue arrow indicating its direction and magnitude. The subsequence is divided into 3 overlapping intervals, boxed by magenta, red and cyan rectangles. In each interval, a histogram of oriented gradients (HOG) is accumulated over all temporal points in that interval, and shown under that interval. Concatenation of the 3 HOGs gives the HOG-1D descriptor of the subsequence. Gradient orientations lie within $(-90^\circ, 90^\circ)$; in this figure 8 evenly spaced orientation bins are used, resulting in a 24-D HOG-1D descriptor.

In the following, we give implementation details of the HOG computation within each interval. Given an interval $I = (p_{t_1}, p_{t_1+1}, \ldots, p_{t_2})$ with time span $(t_1, t_2)$, the magnitude of the gradient at temporal point $t$ ($t_1 \le t \le t_2$) is calculated as $|g_t| = |\sigma \cdot (p_{t+1} - p_{t-1}) / ((t+1) - (t-1))| = |\sigma \cdot \tfrac{1}{2}(p_{t+1} - p_{t-1})|$, and its orientation is $\arctan(g_t)$, which lies within $(-90^\circ, 90^\circ)$. Here $\sigma$ is a global scaling factor, accounting for different time series sampling frequencies. Specifically, for time series sampled at a high frequency, adjacent observations $p_{t-1}$ and $p_t$ are almost the same, making $|g_t| \approx 0$ and $\arctan(g_t) \approx 0^\circ$ everywhere. Under this scenario, the HOG over interval $I$ would be a spiked distribution, making HOG-1D unable to distinguish different local shapes. In practice, $\sigma$ is set such that gradient orientations distribute approximately evenly within $(-90^\circ, 90^\circ)$ (see the algorithm in the supplementary material for searching for $\sigma$). After gradient computation at each temporal point, the next step is to accumulate gradient votes within orientation bins and obtain the HOG over the bins. The orientation bins are evenly spaced within $(-90^\circ, 90^\circ)$. Typically, a gradient votes for its two neighboring bins, and the votes are determined bi-linearly in terms of the angular distances from the gradient orientation to the bin centers.
In our experiments, we exploit a kernel-smoothed voting strategy, i.e., a gradient votes for all orientation bins, and the voting magnitude for the $i$th bin $b_i$ is determined as $|g_t| \cdot \exp\{-\tfrac{1}{2}(\arctan(g_t) - \angle(b_i))^2 / \hat{\sigma}^2\}$, where $\angle(b_i)$ is the orientation of bin $b_i$ and $\hat{\sigma}$ is a scale factor indicating the decay rate of the Gaussian smoothing kernel. HOG-1D is insensitive to noise and invariant to y-axis magnitude shift; however, since magnitude information sometimes benefits TSC, we introduce another subsequence descriptor, DTW-MDS, which complements HOG-1D by accounting for magnitude shift, as well as contraction and stretching distortions.
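A minimal sketch of the HOG-1D computation just described, for one subsequence with non-overlapping intervals and the Gaussian-smoothed soft voting; the global scale $\sigma$ and the smoothing width (the $\hat{\sigma}$ above, called `kappa` in the code to avoid a name clash) are passed in as fixed illustrative values rather than searched for as in the supplementary material.

```python
import numpy as np

def hog_1d(s, n_intervals=2, n_bins=8, sigma=1.0, kappa=0.3):
    """HOG-1D descriptor of a subsequence s.
    Centered gradients g_t = sigma * 0.5 * (s[t+1] - s[t-1]) are binned by their
    orientation arctan(g_t) in (-90, 90) degrees; each gradient votes for every
    bin with a Gaussian weight in angular distance (width kappa, radians), and
    one histogram per interval is built and concatenated."""
    s = np.asarray(s, dtype=float)
    g = sigma * 0.5 * (s[2:] - s[:-2])                  # gradients at interior points
    theta = np.arctan(g)                                # orientations in (-pi/2, pi/2)
    bin_centers = np.linspace(-np.pi / 2, np.pi / 2, 2 * n_bins + 1)[1::2]
    hists = []
    for chunk_t, chunk_g in zip(np.array_split(theta, n_intervals),
                                np.array_split(np.abs(g), n_intervals)):
        w = np.exp(-0.5 * ((chunk_t[:, None] - bin_centers[None, :]) / kappa) ** 2)
        hists.append((chunk_g[:, None] * w).sum(axis=0))
    return np.concatenate(hists)                        # n_intervals * n_bins dims

# e.g. a length-40 subsequence with 2 intervals and 8 bins gives a 16-D descriptor
desc = hog_1d(np.sin(np.linspace(0, 3, 40)))
assert desc.shape == (16,)
```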
2.4.4 DTW-MDS descriptor

There are dozens of similarity measures for time series. The most straightforward measure is the Euclidean distance; however, it does not handle distortion and misalignment in time. DTW is another ubiquitous measure, which accounts for nonlinear distortions in the temporal dimension. Dozens of alternative elastic distance measures have been invented, but experimental tests on forty datasets suggested that none of them consistently beats DTW [101]. In this paper, we choose DTW as the distance measure between time series subsequences. Let $s_p = (p_1, p_2, \ldots, p_m)$ and $s_q = (q_1, q_2, \ldots, q_n)$ denote two subsequences of length $m$ and $n$. Sometimes a warping window size constraint $\triangle$ is further enforced to make the temporal indices of matched points $p_i$ and $q_j$ satisfy $|i - j| \le \triangle$; however, in our experiments, we use DTW without the warping window size constraint. Assume there are $N$ subsequences $\{s_1, s_2, \ldots, s_N\}$; by computing pairwise subsequence DTW distances, we get an $N \times N$ symmetric DTW distance matrix $d_{DTW}$.

In order to use DTW in kernel-based classifiers like SVM, several attempts have been made to derive kernels from $d_{DTW}$; examples include the Gaussian dynamic time warping (GDTW) kernel function $K(s_i, s_j) = \exp\{-\gamma\, d_{DTW}(s_i, s_j)\}$ and the negative dynamic time warping (NDTW) kernel function $K(s_i, s_j) = -d_{DTW}(s_i, s_j)$. However, kernel matrices constructed by these functions are not positive semi-definite (PSD); thus, the efficacy of SVM cannot be enjoyed by these kernels. Empirical results in [53] showed that SVM with either the GDTW or the NDTW kernel has inferior performance in TSC compared with NNDTW. Some attempts have been made to construct a PSD kernel matrix from a distance measure between two time series [25]: the authors considered the similarity scores spanned by all possible alignments, instead of just the score of the best alignment, and derived a kernel from this formulation. This kernel is shown to be PSD and can be used in kernel-based classifiers. However, construction of a valid PSD kernel from some dissimilarity matrix is unfortunately a non-trivial task. To work around the difficulty of fixing the DTW kernel matrix to be PSD, and at the same time to enjoy the superiority of the DTW distance measure and the strong performance of SVM, we introduce the DTW-MDS descriptor for subsequences. Briefly, we use the multidimensional scaling (MDS) algorithm to find a layout of the $N$ subsequences in a space $\Omega$ ($\Omega \in \mathbb{R}^h$) such that the pairwise subsequence DTW distances $d_{DTW}$ are preserved as well as possible in $\Omega$. Each subsequence is then represented by a vector $x$ ($x \in \mathbb{R}^h$) in $\Omega$, and $x$ is termed the DTW-MDS descriptor.

Given an $N \times N$ DTW distance matrix $d_{DTW}$ computed between $N$ subsequences $\{s_1, s_2, \ldots, s_N\}$, MDS aims to find an $h$-dimensional descriptor $x_i$ ($x_i \in \mathbb{R}^h$, $i = 1, 2, \ldots, N$) for each subsequence, such that for each pair of subsequences $s_i$ and $s_j$, their Euclidean distance $\|x_i - x_j\|_2$ in the $h$-dimensional space matches their DTW distance $d_{DTW}(s_i, s_j)$ as well as possible. Mathematically, we define the representation error for each pairwise DTW distance as $e_{ij} = \|x_i - x_j\|_2 - d_{DTW}(s_i, s_j)$; then MDS minimizes the following normalized stress:

$$X = \arg\min_{\{x_1, x_2, \ldots, x_N\}} \frac{1}{Z} \sum_{1 \le i < j \le N} e_{ij}^2 \qquad (2.1)$$

where $Z$ is a normalization factor. In practice, we set $Z$ to the sum of squares of the pairwise DTW distances, i.e., $Z = \sum_{1 \le i < j \le N} d_{DTW}^2(s_i, s_j)$. The solution $X$ of problem (2.1) is an $N \times h$ matrix, with the $i$th row $x_i$ being the descriptor for subsequence $s_i$. Problem (2.1) can be solved in a coordinate descent fashion: first, determine an updating order of the rows, typically either random or sequential (e.g., $1 \to 2 \to \ldots \to N \to 1 \to 2 \to \ldots$). Then update each row $x_i$ while keeping all other rows $x_j$ ($j \ne i$) fixed; in this case, problem (2.1) becomes a quadratic program in $x_i$, which is convex, and $x_i$ can be solved for by the standard Levenberg-Marquardt algorithm. After $N$ iterations, all rows $x_i$ ($i = 1, 2, \ldots, N$) have been updated once. We repeat this process and terminate either when reaching the maximum number of iterations, or when the update $\|\Delta x_i\|$ falls below a threshold $\varepsilon$. Fig. 2.6 shows the process of computing DTW-MDS descriptors.

Figure 2.6: DTW-MDS descriptor: the first column shows $N$ sampled subsequences; the second column shows an $N \times N$ symmetric dynamic time warping distance matrix $d_{DTW}$, with each entry $d_{ij}$ indicating the DTW distance between subsequences $i$ and $j$; the third column shows that, by applying MDS to $d_{DTW}$, we obtain $N$ vectors in an $h$-dimensional space, each being the DTW-MDS descriptor of one subsequence.

In the above way, each subsequence $s_i$ from the training time series is coded by an $h$-dimensional vector $x_i$. However, we do not see subsequences of test time series during training. Here, we can use the same framework to encode each test subsequence. Given a test subsequence $\hat{s}$, we compute its DTW distance $d_{DTW}(\hat{s}, s_i)$ to each training subsequence $s_i$ ($s_i \in S_{train}$), and the DTW-MDS descriptor $\hat{x}$ of $\hat{s}$ can be obtained by solving the following minimization program:

$$\arg\min_{\hat{x}} \sum_{s_i \in S_{train}} \{\|\hat{x} - x_i\|_2 - d_{DTW}(\hat{s}, s_i)\}^2 \qquad (2.2)$$

This least-squares problem can be solved by the Levenberg-Marquardt algorithm as well. In experiments, we set the maximum number of iterations to 50, and use this as the termination criterion for both Program 2.1 and Program 2.2.

Solving Programs 2.1 and 2.2 becomes time- and space-consuming when the number of training subsequences $N$ grows too large, so in practice we develop an approximate algorithm to compute DTW-MDS descriptors, in which both the space consumption and the time cost to compute the DTW-MDS descriptor of a subsequence are independent of $N$. The key of our approximation algorithm is to choose $R$ representative subsequences from the $N$ training subsequences, first encode the $R$ representatives by solving Program 2.1, and then encode each of the remaining $(N - R)$ training subsequences and the test subsequences by solving Program 2.2 (see the supplementary materials for algorithm details).

After transforming pairwise subsequence DTW distances into descriptors, existing valid kernel functions like RBF and polynomial kernels are ready to be used in kernel-machine classifiers (e.g., SVM). Another advantage of the DTW-MDS descriptor is that it can be fused with other features extracted from the same subsequence. When these features and the DTW-MDS descriptor are complementary to each other, the merged feature vector has the potential to further boost classification accuracy. In our experiments, we show that by fusing DTW-MDS with HOG-1D, better classification accuracies are obtained on most datasets.
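A sketch of the DTW-MDS construction for training subsequences: compute the pairwise DTW distance matrix, then embed it into $\mathbb{R}^h$ with metric MDS so that Euclidean distances between descriptors approximate DTW distances. scikit-learn's SMACOF-based MDS is used here as a stand-in for the coordinate-descent Levenberg-Marquardt solver described above, and the plain $O(mn)$ DTW has no warping-window constraint; test subsequences would be encoded separately by solving Program 2.2 against the training embedding.

```python
import numpy as np
from sklearn.manifold import MDS

def dtw_distance(x, y):
    """Plain dynamic time warping distance (no warping-window constraint)."""
    m, n = len(x), len(y)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]

def dtw_mds_descriptors(subsequences, h=20, seed=0):
    """DTW-MDS descriptors: pairwise DTW distances -> h-dim MDS embedding, so
    that Euclidean distances between rows approximate the DTW distances.
    (SMACOF-based MDS stands in for the coordinate-descent solver in the text.)"""
    N = len(subsequences)
    d = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            d[i, j] = d[j, i] = dtw_distance(subsequences[i], subsequences[j])
    mds = MDS(n_components=h, dissimilarity="precomputed", random_state=seed)
    return mds.fit_transform(d)                         # N x h matrix of descriptors
```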
2.4.5 Time series Encoding

After feature point detection, local subsequence extraction and representation, each time series $\hat{T}$ is represented by a set of sampled subsequences $\{\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_{\hat{N}}\}$, and each subsequence is described by a descriptor vector (either HOG-1D, DTW-MDS, or HOG-1D+DTW-MDS) $\hat{y}_i \in \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_{\hat{N}}\}$. Time series encoding aims at computing a vector representation for $\hat{T}$, based on its local subsequence representations $\{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_{\hat{N}}\}$. Typical encoding methods include hard-voting-based (e.g., BoW), reconstruction-based (e.g., LLC [129]) and super-vector-based (e.g., Fisher Vector [98]) approaches. As experimentally demonstrated in [97], super-vector-based encoding is more effective than the other two encodings for action recognition. In our experiments, we choose the Fisher Vector encoding [98] for time series.

The Fisher Vector encoding [98] is derived from the Fisher Kernel, which constructs kernels from probabilistic generative models; in this way, we apply generative models in a discriminative setting and take advantage of both. The construction of the Fisher Vector starts by learning a Gaussian mixture model (GMM) from a set of local descriptors $X = \{x_1, x_2, \ldots, x_N\}$. The probability density function of a GMM with $K$ components is given by:

$$p(x; \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x; \mu_k, \Sigma_k) \qquad (2.3)$$

where $\theta = (\pi_1, \mu_1, \Sigma_1, \ldots, \pi_K, \mu_K, \Sigma_K)$ is the vector of parameters of the model, including the component prior probability $\pi_k$, the mean $\mu_k \in \mathbb{R}^D$ and the positive definite covariance matrix $\Sigma_k \in \mathbb{R}^{D \times D}$ of each Gaussian component. For simplicity, the covariance matrices are assumed to be diagonal, so the GMM is fully specified by $(2D + 1) \times K$ scalar parameters. Given a set of $N$ local descriptors $X = \{x_1, x_2, \ldots, x_N\}$, the parameters of the GMM are learned by maximum likelihood estimation using the Expectation Maximization algorithm.

After fitting a GMM, given a set of descriptors $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_{\hat{N}}\}$ sampled from some time series $\hat{T}$, let $\gamma_k^n$ be the soft assignment of descriptor $\hat{y}_n$ to Gaussian component $k$ ($k \in \{1, 2, \ldots, K\}$), and define the vectors $G_{\mu,k}^{\hat{Y}}$ and $G_{\sigma,k}^{\hat{Y}}$ as:

$$G_{\mu,k}^{\hat{Y}} = \frac{1}{\hat{N}\sqrt{\pi_k}} \sum_{n=1}^{\hat{N}} \gamma_k^n \left( \frac{\hat{y}_n - \mu_k}{\sigma_k} \right), \qquad G_{\sigma,k}^{\hat{Y}} = \frac{1}{\hat{N}\sqrt{2\pi_k}} \sum_{n=1}^{\hat{N}} \gamma_k^n \left[ \left( \frac{\hat{y}_n - \mu_k}{\sigma_k} \right)^2 - 1 \right] \qquad (2.4)$$

each of which is $D$-dimensional. The Fisher Vector of the set of local descriptors $\hat{Y}$ is then given by the concatenation of $G_{\mu,k}^{\hat{Y}}$ and $G_{\sigma,k}^{\hat{Y}}$ over all $K$ Gaussian components, resulting in a $2DK$-dimensional vector:

$$FV^{\hat{Y}} = [G_{\mu,1}^{\hat{Y}}, G_{\sigma,1}^{\hat{Y}}, \ldots, G_{\mu,K}^{\hat{Y}}, G_{\sigma,K}^{\hat{Y}}] \qquad (2.5)$$

which is the Fisher Vector of the time series $\hat{T}$. In our experiments, the local descriptors are either (1) HOG-1D, (2) DTW-MDS, or (3) HOG-1D+DTW-MDS. The GMM is first learned to fit the distribution of local descriptors from the training time series; then each training time series can be encoded by a Fisher Vector as in Eq. 2.5. Given a test time series, after subsequence sampling and representation, it is represented by a set of descriptors $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_{\hat{N}}\}$, and following Eq. 2.4 and Eq. 2.5, the test time series can be encoded by a Fisher Vector as well.

Power normalization: as shown in [98], as the number of Gaussian components $K$ increases, the Fisher Vector becomes sparser, and dot products are poor measures of similarity on sparse vectors. Therefore, they propose to power-normalize each dimension of the raw Fisher Vector by the same power factor $\alpha$:

$$FV_i^N = \mathrm{sign}(FV_i) \cdot |FV_i|^{\alpha}, \quad i \in \{1, 2, \ldots, 2DK\} \qquad (2.6)$$

In experiments, the power factor $\alpha$ is determined by cross-validation on the training data. At this point, each time series $T$ has a normalized encoding $FV_T^N = [FV_1^N, FV_2^N, \ldots, FV_{2DK}^N]$; together with its label, we train a linear kernel SVM to do the classification.
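A sketch of the encoding step of Eqs. 2.4-2.6 using scikit-learn's diagonal-covariance GaussianMixture; the component count K and the power factor alpha shown are placeholders for the cross-validated values used in the experiments.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_codebook(all_train_descriptors, K=64, seed=0):
    """Fit a K-component diagonal-covariance GMM on the pooled training
    subsequence descriptors (the 'codebook' of the BoW pipeline)."""
    return GaussianMixture(n_components=K, covariance_type="diag",
                           random_state=seed).fit(all_train_descriptors)

def fisher_vector(Y, gmm, alpha=0.5):
    """Fisher Vector of one time series from its descriptor set Y (n x D),
    following Eqs. 2.4-2.5, with the power normalization of Eq. 2.6."""
    Y = np.atleast_2d(Y)
    n, D = Y.shape
    pi, mu, var = gmm.weights_, gmm.means_, gmm.covariances_   # (K,), (K,D), (K,D)
    sig = np.sqrt(var)
    gamma = gmm.predict_proba(Y)                               # n x K soft assignments
    fv = []
    for k in range(len(pi)):
        z = (Y - mu[k]) / sig[k]                               # n x D standardized
        g_mu = (gamma[:, k:k + 1] * z).sum(0) / (n * np.sqrt(pi[k]))
        g_sig = (gamma[:, k:k + 1] * (z ** 2 - 1)).sum(0) / (n * np.sqrt(2 * pi[k]))
        fv.extend([g_mu, g_sig])
    fv = np.concatenate(fv)                                    # 2*D*K dimensions
    return np.sign(fv) * np.abs(fv) ** alpha                   # power normalization
```

A linear SVM (e.g. sklearn.svm.LinearSVC) would then be trained on the Fisher Vectors of the training series and applied to those of the test series.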
Until now, each time series T has a normalized encoding FV_T^N (FV_T^N = [FV_1^N, FV_2^N, ..., FV_{2DK}^N]); together with its label, we train a linear-kernel SVM to perform the classification.

2.4.6 Computational complexity analysis

Our classification pipeline is composed of sequential steps, and its time complexity is the sum of the time costs of the individual steps. Define the notation as follows: let L be the length of a time series, N_train the number of training time series instances, l and n_train the length and number of sampled training subsequences, and r_train the number of representative training subsequences (see its definition in Sec. 2.4.4). Our pipeline consists of feature point extraction, HOG-1D computation, DTW-MDS computation, FV encoding and linear SVM classification, whose time complexities during training are O(N_train·L), O(n_train·l), O(r_train²·l²) + O(n_train·h³), O(|d|·n_train + K·n_train²) and O(N_train) respectively, where |d| is the dimensionality of the subsequence descriptor. Therefore, the total time cost is O(N_train·L) + O(n_train·l) + O(r_train²·l²) + O(n_train·h³) + O(|d|·n_train + K·n_train²). Since in general n_train ≫ N_train and n_train ≫ r_train, and l, h, |d| are usually small positive values (e.g., l = 40, h = 20, |d| ≤ 36 in our experiments), the training time cost is quadratic in n_train, i.e., O(n_train²). Although n_train is usually large in practice, training is done offline.

At test time, to classify a time series, let n_test be the number of subsequences extracted from that time series; then subsequence extraction, HOG-1D computation, DTW-MDS computation, FV encoding and SVM classification have complexities O(L), O(n_test·l), O(n_test·r_train·l²) + O(n_test·h³), O(n_test) and O(1) respectively, and the overall test complexity is O(n_test·r_train·l²) + O(n_test·h³) + O(L), where n_test ≪ L and l, h, r_train are small positive values as well; therefore, testing can be done online.

NNDTW has a time cost of O(N_train·L²) in general; Shapelet Transform [58] contains a time-consuming shapelet search, and in general takes O(N_train²·L³) time. Shapelet Transform is thus usually more time-consuming than our algorithm and NNDTW. Although in general our algorithm has a higher time cost than NNDTW, our time cost at test is usually much cheaper than that of NNDTW. As the number of training time series increases, the efficiency gain of our algorithm at test is further enlarged, since our test time cost is independent of N_train.

2.5 Experimental Validation

We extensively test our feature point extractor and subsequence descriptors on 43 UCR time series datasets [23].
All results reported throughout the experiments are obtained under the following fixed settings: (1) the subsequence length l is set to 40 (l = 40) on all datasets; (2) the stride of the sliding window in the case of uniform sampling is set to 5 (s = 5); (3) in computing HOG-1D descriptors, each subsequence is divided into 2 non-overlapping segments, and the gradient orientation bins are evenly spaced over (−90°, 90°) with the number of bins set to 8, so the dimensionality of HOG-1D is 16; (4) in computing DTW-MDS descriptors, the DTW distance is calculated without a warping window size constraint, and the dimensionality h of DTW-MDS is set to 20 (h = 20); (5) we use the Bag-of-Words framework to encode time series, specifically using GMM clustering to generate the codebook and the Fisher Vector to encode time series; (6) we use a linear-kernel SVM classifier for classification. The two tunable parameters are the number of components K in the GMM and the power normalization factor α, and different datasets have different optimal parameters. In experiments, K and α on each dataset are set by cross-validation on the training data.

Baseline descriptors: we compare with two widely known time series representations, the Discrete Fourier Transform (DFT) and the Discrete Wavelet Transform (DWT), and another representation introduced in [143, 134] (Zhang's). Although all three representations were proposed to represent whole time series, here we use them to represent time series subsequences. (1) DFT: we keep the first half of the coefficients and use them as the representation of the subsequence. (2) DWT: we use the Haar wavelet basis and decompose each subsequence into 3 levels. The detail wavelet coefficients of the three levels and the approximation wavelet coefficients of the third level are concatenated to form the final representation. (3) Zhang's: Zhang et al. [143] augment conventional statistical features with a set of physical features; the combination of statistical and physical features forms the representation. For details of the feature sets, please refer to their paper.

                                  HOG-1D  DTW-MDS  HOG-1D+DTW-MDS  Zhang's  DFT    DWT
Hybrid vs. Uniform, p (×1.0e-7)   .2933   .1648    .3632           .6840    .1834  .3631
Hybrid vs. Random,  p (×1.0e-7)   .2365   .1120    .1834           .1967    .1385  .1648

Table 2.1: Wilcoxon signed rank test between the performance of hybrid and uniform (random) sampling. Under the Wilcoxon signed rank test, the null hypothesis is that the performance differences between hybrid and uniform (random) sampling come from a distribution whose median is 0, at the 1% significance level. All p-values are much smaller than 0.01, showing that the null hypothesis is strongly rejected. This indicates that hybrid sampling outperforms both uniform and random sampling in a statistically significant way.

Baseline TSC methods: we compare with two TSC methods, NNDTW and the shapelet transform [58]. NNDTW is shown to be very hard to beat [8], and in experiments we use the one nearest neighbor classifier under the DTW distance with the warping window size constraint. Classifiers built on the shapelet transform [58] are shown to be more accurate than the original tree-based shapelet classifier [139] on a wide range of datasets. Both NNDTW and the shapelet transform are top ranked TSC methods, and therefore they are used as baselines for comparison.

2.5.1 Hybrid Sampling Vs. uniform and random sampling

We compare three different sampling strategies: uniform, random and the proposed hybrid sampling.
We keep the classification pipeline (local subsequence extraction and representation + FV time series encoding + linear SVM classification) fixed, but only change the subsequence sampling method. To test whether the superiority of hybrid sampling is independent of the local descriptor, we test 6 descriptors, including HOG-1D, DTW-MDS, HOG-1D+DTW-MDS, DWT, DFT and Zhang's.
Figure 2.7: Performance comparison of the hybrid, uniform and random sampling strategies, across different descriptors and different datasets. We test 6 different descriptors, including HOG-1D, DTW-MDS, HOG-1D+DTW-MDS, DWT, DFT and Zhang's, on 43 UCR time series datasets under the 3 different sampling methods. Performance accuracies of the different descriptors are obtained under the BoW encoding and linear kernel SVM pipeline. As seen in the plot, hybrid sampling works almost consistently better than uniform and random sampling by a large gap, both across different descriptors and across different datasets. Quantitatively, we run the Wilcoxon signed rank test between the performance of hybrid and uniform (random) sampling; the p-values are listed in Table 2.1, indicating that hybrid sampling outperforms both uniform and random sampling significantly (p < 0.01).

Figure 2.8: Complementarity of the HOG-1D and DTW-MDS descriptors. HOG-1D is invariant to y-axis shift and insensitive to noise, while DTW-MDS is more contraction, stretching and translation invariant. The fusion of the two descriptors benefits from their individual advantages, and outperforms each separate descriptor on most of the UCR datasets. The two plots in the 1st row show t-SNE [123] visualizations of the HOG-1D and DTW-MDS descriptors of 246 subsequences. As seen from both plots, similarly shaped subsequences are displayed proximately, indicating that both descriptors capture shapes very well (here, the subsequence colors are independent of t-SNE and are used purely for visual clarity: the 246 subsequences are k-means clustered into 10 clusters, with a random color assigned to each cluster). The two plots in the 2nd row show performance comparisons between HOG-1D+DTW-MDS and HOG-1D (DTW-MDS). On most datasets, the fused descriptor outperforms each separate one. By running the Wilcoxon signed rank test, the p-value between the fused descriptor and HOG-1D (DTW-MDS) is 6.7·10⁻⁷ (1.5·10⁻⁵), showing that the fused descriptor performs significantly better.

Uniform sampling: the stride s of the sliding window is set to 5 (s = 5), making contiguous windows overlap by half. For each time series, we randomly choose a start sampling point from (l/2, l) (l is the length of the subsequence, l = 40 in experiments) and then slide the window at stride s to obtain the following subsequences. Since the subsequences obtained by uniform sampling depend on the start sampling point, we repeat each experiment 10 times, and the mean classification accuracy is reported.
Random sampling: to make the number of sampled subsequences from a time series the same as in uniform sampling, ⌊L/s⌋ subsequences (where L is the length of the time series and s denotes the stride of the sliding window) are randomly sampled from each time series. Similarly, we repeat each experiment 10 times, and the mean classification accuracy is reported.

Hybrid sampling: in this case, subsequences are obtained by (1) sampling from landmark points and (2) uniformly sampling from flat regions with stride 5 (s = 5). Sampling from landmark points results in deterministic subsequences, while uniform sampling introduces some randomness because of the start sampling point. Since in all 43 UCR datasets flat regions take up only a minimal proportion of a time series, we do the uniform sampling at flat regions only once and ignore the impact of randomness. The reported accuracy is based on one experiment.

The performance of the different sampling strategies under the same descriptor on the 43 datasets is shown in Fig. 2.7. Hybrid sampling almost consistently outperforms both uniform and random sampling across different descriptors and different datasets, and in most cases the performance gap is quite large. Quantitatively, we perform a Wilcoxon signed rank test between the performance of hybrid and uniform (random) sampling, and the p-values are listed in Table 2.1. As seen, hybrid sampling works significantly better than both uniform and random sampling (p ≪ 0.01). This is partially attributed to the phase-invariance of landmark-point based sampling, i.e., sampling from landmark points makes the sampled subsequences independent of the phase of the time series. It is also due to the fact that subsequences at feature points represent the time series better than those from non-feature points. Sampling from landmark points results in deterministic subsequences, decreasing the randomness to 0, while uniform and random sampling contain much randomness, which results from time series phase shift and sampling initializations. Since hybrid sampling is shown to be superior to both uniform and random sampling, all experiments in the following sections use hybrid sampling to obtain subsequences.

Figure 2.9: Comparison with the state-of-the-art TSC algorithms. We compare with two state-of-the-art algorithms: NNDTW (with warping window constraints) and the shapelet transform [58]. The performance of our fused descriptor HOG-1D+DTW-MDS is obtained under BoW encoding and linear kernel SVM classification. The first two plots show performance comparisons between HOG-1D+DTW-MDS and NNDTW (Shapelet Transform); as observed, HOG-1D+DTW-MDS performs better on most datasets. The 3rd plot shows the number of datasets on which each algorithm wins (including ties) against the other two, and ours gets the best performance on 22 out of 41. A Wilcoxon signed rank test shows that our method improves relative classification accuracies significantly compared to NNDTW (p < 0.0017) and the shapelet transform (p < 0.0452).
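The Wilcoxon signed rank tests reported in Table 2.1 (and in later tables) can be reproduced with a single SciPy call; the accuracy arrays below are placeholders, not actual results.

```python
# Minimal example of the Wilcoxon signed rank test used throughout this chapter.
import numpy as np
from scipy.stats import wilcoxon

acc_hybrid  = np.array([0.93, 0.88, 0.75, 0.99, 0.81, 0.64])   # placeholder per-dataset accuracies
acc_uniform = np.array([0.90, 0.84, 0.70, 0.97, 0.78, 0.60])   # placeholder per-dataset accuracies
stat, p_value = wilcoxon(acc_hybrid, acc_uniform)
print(p_value)   # reject the zero-median null hypothesis when p < 0.01
```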
2.5.2 Complementarity of HOG-1D and DTW-MDS

We qualitatively show how descriptive of local shapes HOG-1D and DTW-MDS are, and quantitatively report the classification accuracies of HOG-1D, DTW-MDS and HOG-1D+DTW-MDS on the 43 UCR datasets.

First, we visualize the HOG-1D (16-D) and DTW-MDS (20-D) descriptors by t-SNE [123]. t-SNE displays high-dimensional data by giving each data point a location in a 2D or 3D map, and has been shown to reveal the manifold structure of high-dimensional datasets. The 2D (or 3D) visualization of t-SNE preserves neighborhood relationships among the high-dimensional points; therefore, the mapped 2D (or 3D) points of similar descriptors have similar coordinates and are grouped and displayed proximately, while descriptors from different manifolds have dissimilar dimension-reduced 2D (or 3D) coordinates and are displayed in different groups. The 1st row in Fig. 2.8 shows the 2D t-SNE visualization of the HOG-1D and DTW-MDS descriptors. The plot is generated in four steps: (1) sample subsequences from time series by hybrid sampling; (2) calculate their descriptors (HOG-1D or DTW-MDS); (3) use t-SNE to map the descriptors to 2D locations; (4) plot the corresponding raw subsequences at their 2D locations. As we see, subsequences of visually similar shapes are displayed proximately, while dissimilar subsequences are spatially separated. This indicates that visually similarly-shaped subsequences have similar descriptors (HOG-1D and DTW-MDS), while differently-shaped subsequences possess different descriptors. The t-SNE visualization shows qualitatively that both HOG-1D and DTW-MDS capture local subsequence shapes quite well.

Then we quantitatively compare the classification performance of the 3 descriptors. The two plots in the 2nd row of Fig. 2.8 compare the classification performance of HOG-1D+DTW-MDS against HOG-1D (DTW-MDS). Most points lie above the diagonal, showing better performance of the fused descriptor than either HOG-1D or DTW-MDS. By running a Wilcoxon signed-rank test between the classification accuracies of HOG-1D+DTW-MDS and HOG-1D (DTW-MDS), we get p-values of 6.7·10⁻⁷ (1.5·10⁻⁵), which shows that the fused descriptor outperforms both HOG-1D and DTW-MDS significantly (p ≪ 0.01). The classification error rates of the 3 descriptors are listed in Table 2.3. The better performance demonstrates that the fused descriptor takes advantage of the individual descriptors, becomes more descriptive of subsequences, and thus is more discriminative for classification tasks. In fact, HOG-1D and DTW-MDS are complementary to each other: HOG-1D is more robust to noise and invariant to y-axis magnitude shift, while DTW-MDS is sensitive to noise and magnitude shift but is more invariant to stretching, contraction and temporal shifting.
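A minimal sketch of the four-step t-SNE visualization described above, using scikit-learn; the descriptor matrix here is a random placeholder, and the thesis plots the raw subsequences at the returned 2D locations rather than scatter points.

```python
# Sketch of the t-SNE visualization of local descriptors (Fig. 2.8, top row).
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

descriptors = np.random.rand(246, 16)          # placeholder for 246 HOG-1D descriptors
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(descriptors)
plt.scatter(xy[:, 0], xy[:, 1], s=8)           # the thesis draws the raw subsequences
plt.show()                                     # at these 2D locations instead
```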
2.5.3 Comparison with other descriptors and the state-of-the-art algorithms

We compare our fused descriptor HOG-1D+DTW-MDS with 3 other subsequence descriptors: DFT, DWT and Zhang's [143, 134]. Additionally, we compare with 2 state-of-the-art TSC algorithms: NNDTW and the shapelet transform [58].

Four local descriptors: as mentioned, we use the Bag-of-Words classification pipeline, hybrid sampling + local subsequence representation + Fisher Vector time series encoding + linear SVM classification, for the 4 different local subsequence descriptors. The two tunable parameters, K in the GMM and the power normalization factor α, are determined by cross-validation on the training data, and classification performance on the test data is reported.

Two state-of-the-art algorithms: (1) NNDTW: we use DTW with warping window constraints as the distance measure and the label of the nearest neighbor as the predicted label for the test time series. Instead of running NNDTW again, we import results from the UCR time series website [23]. (2) Shapelet Transform [58]: in the original paper, the authors only tested on 29 UCR datasets, but they provide full results on their website [59]. Again, we directly import results from their website.

Beforehand, we define a terminology, the relative accuracy boost: it is defined as the ratio between the accuracy boost achieved by HOG-1D+DTW-MDS and the original accuracy, i.e., rA_β^i = (A_{HOG-1D+DTW-MDS}^i − A_β^i) / A_β^i, where rA_β^i is the relative accuracy boost over algorithm β on dataset i (β ∈ {DFT, Zhang's, NNDTW, shapelet transform, DWT}), and A_β^i and A_{HOG-1D+DTW-MDS}^i are the accuracies of algorithm β and of the reference algorithm HOG-1D+DTW-MDS on dataset i.

Since the shapelet transform [59] does not provide classification accuracies on 2 datasets, '50words' and 'ECG200', all the comparisons (including the plots in Fig. 2.9 and the p-values in Table 2.2) in this section are reported on the remaining 41 datasets. The left two plots in Fig. 2.9 show performance comparisons between HOG-1D+DTW-MDS and NNDTW (Shapelet Transform), and most points lie above the diagonal, showing better performance of our algorithm. The right plot shows the number of datasets on which each algorithm wins (including ties) against the other two. As shown, on 22 out of 41 datasets HOG-1D+DTW-MDS wins against both NNDTW and the Shapelet Transform. By running a Wilcoxon signed rank test on the relative accuracy boost over NNDTW (shapelet transform), we obtain p-values of 0.0017 (0.0452), which indicates that the relative accuracy boost by HOG-1D+DTW-MDS is significant (p < 0.05). We run the Wilcoxon signed rank test on the relative accuracy boost over all 5 algorithms, and report the p-values in Table 2.2. At the 5% significance level, the relative accuracy boost by HOG-1D+DTW-MDS is significant for all 5 algorithms.

Signed rank test on the relative accuracy boost by HOG-1D+DTW-MDS
methods:   DFT     Zhang's  NNDTW   shapelet transform  DWT
p-values:  5.2e-8  5.2e-8   0.0017  0.0452              6.2e-4

Table 2.2: Wilcoxon signed rank test on the relative accuracy boost by HOG-1D+DTW-MDS. At the 5% significance level, all 5 algorithms show significant relative accuracy boosts (p < 0.05).

Figure 2.10: Texas Sharpshooter plot: HOG-1D+DTW-MDS vs. NNDTW. TP: true positives (our algorithm was expected from the training data to outperform NNDTW, and it actually did on the test data). TN: true negatives, FP: false positives, FN: false negatives. Each dot is one UCR dataset (43 dots in total). Our algorithm performed on the test set as predicted from the training performance on nearly 90% of the datasets (38 out of 43 dots are in the TP or TN regions).

Texas Sharpshooter Fallacy: although our algorithm outperforms the other algorithms according to the Wilcoxon signed rank test, knowing this is not useful unless we can tell in advance on which problems it will be more accurate, as stated in [8]. In this section we use the Texas sharpshooter plot [8] to show when our algorithm has superior performance on the test set as predicted from its performance on the training set, compared with NNDTW. We run leave-one-out cross validation on the training data to measure the accuracies of HOG-1D+DTW-MDS and NNDTW, and we calculate the expected gain, accuracy(HOG-1D+DTW-MDS)/accuracy(NNDTW). We then measure the actual accuracy gain using the test data. The results are plotted in Fig. 2.10. Most points (≈88.4%) fall in the TP and TN regions, indicating that we can confidently predict whether our algorithm will be superior or inferior to NNDTW. Only 5 points fall in the FP region, and as seen they represent only minor losses or gains.
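The relative accuracy boost and the expected/actual gains of the Texas sharpshooter analysis are simple ratios; a sketch with placeholder accuracy values:

```python
# Sketch of the quantities behind Table 2.2 and Fig. 2.10; values are placeholders.
acc_ours_cv,   acc_nndtw_cv   = 0.92, 0.88   # leave-one-out accuracies on training data
acc_ours_test, acc_nndtw_test = 0.90, 0.85   # accuracies on test data

relative_boost = (acc_ours_test - acc_nndtw_test) / acc_nndtw_test   # rA of NNDTW on this dataset
expected_gain  = acc_ours_cv / acc_nndtw_cv                          # x-axis of Fig. 2.10
actual_gain    = acc_ours_test / acc_nndtw_test                      # y-axis of Fig. 2.10
quadrant = ('TP' if expected_gain > 1 and actual_gain > 1 else
            'TN' if expected_gain < 1 and actual_gain < 1 else
            'FP' if expected_gain > 1 else 'FN')
```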
For consistency with the convention of reporting error rates in TSC problems, we document the error rates of 8 algorithms in Table 2.3. The lowest error rate on each dataset is highlighted in bold font (HOG-1D and DTW-MDS are excluded from the performance comparison).

Figure 2.11: Empirical time cost of our algorithm, NNDTW and NN-DTW-DDTW. Since our algorithm has two steps, training and test, we report the total running time (including training and test, red curve) and the running time at test only (cyan curve). Although our algorithm in general has a higher time cost than NNDTW and NN-DTW-DDTW, we are usually much more efficient at test, making our algorithm suitable for online prediction.

2.5.4 Empirical time complexity

A practical and important issue is the efficiency of the proposed algorithm. As analyzed theoretically, in general our algorithm has a higher time complexity than NNDTW. However, the majority of our time cost is incurred during training (usually not a problem, as it is done offline), while during test our algorithm is usually much more efficient than NNDTW, especially when the training data is large. We compare the empirical running times of three different algorithms, ours, NNDTW and NN-DTW-DDTW, on the 43 UCR datasets. Here, NN-DTW-DDTW is the nearest neighbor classifier under the fused distance metric (DTW, β·DDTW), where DDTW is derivative dynamic time warping [71] and β is a weighting factor between the two metrics, tuned by cross-validation on the training data. We report running times using the following machine: Ubuntu 12.04 64-bit, Intel i7 CPU 960, 8 cores, 12 GB RAM, Matlab 2015a. As seen in Fig. 2.11, although in general our algorithm has a higher time cost (including training and test) than NNDTW and NN-DTW-DDTW, we are more efficient during test, which is important for real-time prediction.

2.5.5 Sensitivity analysis

The above results are reported under a fixed parameter setting; however, the robustness of performance to different parameter settings is important for an algorithm. In the supplementary materials, we evaluate the performance sensitivity of our pipeline to each parameter, under the philosophy of varying one parameter at a time while keeping the other parameters fixed. Extensive experiments show that our algorithm performs well under wide ranges of subsequence lengths l, strides s and DTW-MDS dimensionalities h, and therefore is largely insensitive to them. We demonstrate as well that (1) dense sampling with a small stride outperforms sparse sampling with a large stride; (2) hybrid sampling outperforms both uniform and random sampling; (3) Fisher Vector encoding is superior to the ordinary Bag-of-Words frequency encoding.
See the supplementary materials for detailed results.

datasets  HOG-1D  DTW-MDS  DFT  Zhang's  NNDTW  Shapelet Transform  DWT  HOG-1D+DTW-MDS
50words  0.419  0.455  0.622  0.536  0.242  –  0.386  0.402
Adiac  0.297  0.355  0.570  0.422  0.391  0.435  0.322  0.320
Beef  0.467  0.400  0.367  0.467  0.467  0.167  0.467  0.367
CBF  0.057  0.000  0.029  0.287  0.004  0.003  0.000  0.000
ChlorineConcentration  0.335  0.285  0.449  0.433  0.350  0.300  0.259  0.307
CinC-ECG-torso  0.275  0.327  0.474  0.365  0.070  0.154  0.323  0.249
Coffee  0.000  0.107  0.179  0.179  0.179  0.000  0.000  0.000
Cricket-X  0.351  0.249  0.510  0.362  0.236  0.218  0.223  0.195
Cricket-Y  0.359  0.223  0.438  0.382  0.197  0.236  0.187  0.205
Cricket-Z  0.323  0.218  0.436  0.385  0.180  0.228  0.190  0.185
DiatomSizeReduction  0.065  0.036  0.085  0.023  0.065  0.124  0.039  0.016
ECG200  0.130  0.100  0.210  0.120  0.120  –  0.090  0.060
ECGFiveDays  0.017  0.103  0.139  0.059  0.203  0.001  0.020  0.012
FaceAll  0.159  0.075  0.436  0.331  0.192  0.263  0.095  0.082
FaceFour  0.102  0.034  0.273  0.136  0.114  0.057  0.034  0.034
FacesUCR  0.228  0.094  0.464  0.314  0.088  0.087  0.128  0.090
Gun-Point  0.013  0.020  0.147  0.067  0.087  0.020  0.007  0.007
Haptics  0.536  0.536  0.614  0.568  0.588  0.523  0.516  0.471
InlineSkate  0.615  0.665  0.695  0.680  0.613  0.615  0.673  0.551
ItalyPowerDemand  0.116  0.036  0.095  0.096  0.045  0.048  0.093  0.070
Lightning2  0.180  0.098  0.230  0.148  0.131  0.344  0.180  0.148
Lightning7  0.260  0.205  0.438  0.301  0.288  0.260  0.219  0.205
MALLAT  0.058  0.046  0.178  0.090  0.086  0.060  0.050  0.035
MedicalImages  0.304  0.236  0.424  0.305  0.253  0.396  0.251  0.230
MoteStrain  0.104  0.130  0.216  0.236  0.134  0.109  0.118  0.090
OSULeaf  0.116  0.161  0.351  0.318  0.384  0.285  0.136  0.120
OliveOil  0.100  0.167  0.267  0.233  0.167  0.100  0.167  0.167
SonyAIBORobotSurface  0.072  0.047  0.276  0.251  0.305  0.067  0.017  0.042
SonyAIBORobotSurfaceII  0.171  0.091  0.255  0.329  0.141  0.115  0.049  0.084
StarLightCurves  0.039  0.048  0.091  0.096  0.095  0.024  0.060  0.040
SwedishLeaf  0.090  0.074  0.366  0.184  0.157  0.093  0.048  0.061
Symbols  0.074  0.071  0.155  0.231  0.062  0.114  0.076  0.036
Trace  0.000  0.000  0.000  0.000  0.010  0.020  0.000  0.000
Two-Patterns  0.010  0.045  0.253  0.367  0.001  0.059  0.046  0.004
TwoLeadECG  0.008  0.022  0.067  0.115  0.132  0.004  0.005  0.007
WordsSynonyms  0.555  0.534  0.688  0.619  0.252  0.403  0.475  0.483
fish  0.274  0.063  0.389  0.360  0.160  0.023  0.097  0.034
synthetic-control  0.097  0.007  0.080  0.403  0.017  0.017  0.007  0.007
uWaveGestureLibrary-X  0.309  0.367  0.479  0.527  0.227  0.216  0.316  0.280
uWaveGestureLibrary-Y  0.440  0.474  0.580  0.594  0.301  0.303  0.448  0.399
uWaveGestureLibrary-Z  0.369  0.398  0.522  0.526  0.322  0.273  0.376  0.321
wafer  0.001  0.009  0.046  0.027  0.005  0.002  0.001  0.001
yoga  0.192  0.240  0.379  0.324  0.155  0.195  0.235  0.182

Table 2.3: Error rates of different algorithms on the UCR time series datasets. The lowest error rate on each dataset is highlighted in bold font (HOG-1D and DTW-MDS are excluded from the performance comparison).

2.6 Conclusions

In this work, we focus on developing better local feature extractors and better local subsequence descriptors of time series data. We introduced a simple but effective feature point detector for real-world time series data, proposed to sample subsequences both from feature points and from flat regions, and experimentally showed that this hybrid sampling method performs significantly better than traditional uniform and random sampling. Furthermore, two novel descriptors, HOG-1D and DTW-MDS, were developed to represent local subsequences.
Experimental results show that both descriptors are quite descriptive of local subsequence shapes and complementary to each other, and that the fused descriptor outperforms the individual descriptors significantly. We tested our fused descriptor extensively on 43 standard UCR time series datasets, compared it with two state-of-the-art competing algorithms, NNDTW and the shapelet transform, and with 3 other local subsequence descriptors, DFT, Zhang's and DWT, and the experimental results showed that our fused descriptor performs significantly better than all competing algorithms and local descriptors. To the best of our knowledge, the feature point detector and the two local descriptors are first introduced here.

Since our system is essentially a Bag-of-Words pipeline, the temporal information in the time series is not encoded by its Fisher Vector representation. In cases where the temporal or phase information is important to distinguish time series from different classes, we could augment the representation of each subsequence with its temporal span, and then process the modified subsequence descriptors with our proposed pipeline.

Chapter 3 Time Series Alignment

3.1 Abstract

Dynamic Time Warping (DTW) is an algorithm to align temporal sequences with possible local non-linear distortions, and has been widely applied to audio, video and graphics data alignment. DTW is essentially a point-to-point matching method under some boundary and temporal consistency constraints. Although DTW obtains a globally optimal solution, it does not necessarily achieve locally sensible matchings. Concretely, two temporal points with entirely dissimilar local structures may be matched by DTW. To address this problem, we propose an improved alignment algorithm, named shape Dynamic Time Warping (shapeDTW), which enhances DTW by taking point-wise local structural information into consideration. shapeDTW is inherently a DTW algorithm, but additionally attempts to pair locally similar structures and to avoid matching points with distinct neighborhood structures. We apply shapeDTW to align audio signal pairs having ground-truth alignments, as well as artificially simulated pairs of aligned sequences, and obtain quantitatively much lower alignment errors than DTW and its two variants. When shapeDTW is used as a distance measure in a nearest neighbor classifier (NN-shapeDTW) to classify time series, it beats DTW on 64 out of 84 UCR time series datasets, with significantly improved classification accuracies. By using a properly designed local structure descriptor, shapeDTW improves accuracies by more than 10% on 18 datasets. To the best of our knowledge, shapeDTW is the first distance measure under the nearest neighbor classifier scheme to significantly outperform DTW, which had been widely recognized as the best distance measure to date. Our code is publicly accessible at: https://github.com/jiapingz/shapeDTW.

3.2 Introduction

Dynamic time warping (DTW) is an algorithm to align temporal sequences, which has been widely used in speech recognition [100], human motion animation [62], human activity recognition [75] and time series classification [23]. DTW allows temporal sequences to be locally shifted, contracted and stretched, and under some boundary and monotonicity constraints it searches for a globally optimal alignment path. DTW is essentially a point-to-point matching algorithm, but it additionally enforces temporal consistency among matched point pairs.
If we distill the matching component from DTW, the matching is executed by checking the similarity of two points based on their Euclidean distance. Yet, matching points based solely on their coordinate values is unreliable and prone to error; therefore, DTW may generate perceptually nonsensible alignments which wrongly pair points with distinct local structures (see Fig. 3.1 (c)). This partially explains why the nearest neighbor classifier under the DTW distance measure is less interpretable than the shapelet classifier [139]: although DTW does achieve a globally minimal score, the alignment process itself takes no local structural information into account, possibly resulting in an alignment with little semantic meaning. In this paper, we propose a novel alignment algorithm, named shape Dynamic Time Warping (shapeDTW), which enhances DTW by incorporating point-wise local structures into the matching process. As a result, we obtain perceptually interpretable alignments: similarly-shaped structures are preferentially matched based on their degree of similarity. We further quantitatively evaluate alignment paths against the ground-truth alignments, and shapeDTW achieves much lower alignment errors than DTW on both simulated and real sequence pairs. An alignment example by shapeDTW is shown in Fig. 3.1 (d).

Figure 3.1: Motivation to incorporate temporal neighborhood structural information into the sequence alignment process. (a) An image matching example: two corresponding points from the image pair are boxed out and their local patches are shown in the middle. Local patches encode image structures around spatial neighborhoods, and therefore are discriminative for points, while it is hard to match two points solely by their pixel values. (b) Two time series with several similar local structures, highlighted as bold segments. (c) DTW alignment: DTW fails to align similar local structures. (d) shapeDTW alignment: we achieve a more interpretable alignment, with similarly-shaped local structures matched.

Point matching is a well studied problem in the computer vision community, widely known as image matching. In order to find corresponding points in two distinct images taken of the same scene, a quite naive way is to compare their pixel values. But the pixel value at a point lacks spatial neighborhood context, making it less discriminative for that point; e.g., a tree leaf pixel from one image may have exactly the same RGB values as a grass pixel from the other image, but these two pixels do not correspond and should not be matched. Therefore, a routine for image matching is to describe points by their surrounding image patches, and then compare the similarities of the point descriptors.

Figure 3.2: Pipeline of shapeDTW. Panels: (a) input time series; (b) sample subsequences; (c) compute shape descriptors; (d) shape descriptor sequences; (e) align descriptor sequences by DTW; (f) transfer the warping path. shapeDTW consists of two major steps: encode local structures by shape descriptors and align the descriptor sequences by DTW. Concretely, we sample a subsequence from each temporal point, and further encode it by some shape descriptor. As a result, the original time series is converted into a descriptor sequence of the same length. Then we align the two descriptor sequences by DTW and transfer the found warping path to the original time series.
Since point descriptors designed in this way encode image structures around local neighborhoods, they are more distinctive and discriminative than single pixel values. In early days, raw image patches were used as point descriptors [1], and now more powerful descriptors like SIFT [89] are widely adopted, since they capture local image structures very well and are invariant to image scale and rotation. Intuitively, local neighborhood patches make points more discriminative from other points, while matching based on RGB pixel values is brittle and results in high false positives.

However, the matching component in the traditional DTW bears the same weakness as image matching based on single pixel values, since similarities between temporal points are measured by their coordinates, instead of by their local neighborhoods. An analogous remedy for temporal matching hence is: first encode each temporal point by some descriptor, which captures the local subsequence structural information around that point, and then match temporal points based on the similarity of their descriptors. If we further enforce temporal consistency among matchings, we obtain the algorithm proposed in this paper: shapeDTW.

shapeDTW is a temporal alignment algorithm which consists of two sequential steps: (1) represent each temporal point by some shape descriptor, which encodes the structural information of local subsequences around that point; in this way, the original time series is converted into a sequence of descriptors; (2) use DTW to align the two sequences of descriptors. Since the first step takes linear time while the second step is a typical DTW, which takes quadratic time, the total time complexity is quadratic, indicating that shapeDTW has the same computational complexity as DTW. However, compared with DTW and its variants (derivative Dynamic Time Warping (dDTW) [71] and weighted Dynamic Time Warping (wDTW) [67]), it has two clear advantages: (1) shapeDTW obtains lower alignment errors than DTW/dDTW/wDTW on both artificially simulated aligned sequence pairs and real audio signals; (2) the nearest neighbor classifier under the shapeDTW distance measure (NN-shapeDTW) significantly beats NN-DTW on 64 out of 84 UCR time series datasets [23]. NN-shapeDTW outperforms NN-dDTW/NN-wDTW significantly as well. Our shapeDTW time series alignment procedure is shown in Fig. 3.2.

Extensive empirical experiments have shown that a nearest neighbor classifier with the DTW distance measure (NN-DTW) is the best choice to date for most time series classification problems, since no alternative distance measure outperforms DTW significantly [131, 101, 99]. However, in this paper, the proposed temporal alignment algorithm, shapeDTW, used as a distance measure under the nearest neighbor classifier scheme, significantly beats DTW. To the best of our knowledge, shapeDTW is the first distance measure that outperforms DTW significantly.

Our contributions are several-fold: (1) we propose a temporal alignment algorithm, shapeDTW, which is as efficient as DTW (dDTW, wDTW) but achieves quantitatively better alignments than DTW (dDTW, wDTW); (2) working as a distance measure under the nearest neighbor classifier to classify 84 UCR time series datasets, shapeDTW, under all tested shape descriptors, outperforms DTW significantly; (3) shapeDTW provides a quite generic alignment framework, and users can design new shape descriptors adapted to their domain data characteristics and then feed them into shapeDTW for alignment.
3.3 Related work

Since shapeDTW is developed for sequence alignment, we first review research related to sequence alignment. DTW is a typical sequence alignment algorithm, and there are many ways to improve DTW to obtain better alignments. Traditionally, we could enforce global warping path constraints to prevent pathological warpings [100]; typical such global warping constraints include the Sakoe-Chiba band and the Itakura parallelogram. Similarly, we could choose different step patterns in different applications: apart from the widely used step pattern "symmetric1", other popular step patterns include "symmetric2", "asymmetric" and "RabinerJuangStepPattern" [45]. However, choosing an appropriate warping band constraint and a suitable step pattern depends on our prior knowledge of the application domain.

There are several recent works that improve DTW alignment. In [71], to obtain the intuitively correct "feature to feature" alignment between two sequences, the authors introduced derivative dynamic time warping (dDTW), which computes the first-order derivatives of the time series and then aligns the two derivative sequences by DTW. In [67], the authors developed weighted DTW (wDTW), which is a penalty-based DTW: wDTW takes the phase difference between two points into account when computing their distance. Batista et al. [8] proposed a complexity-invariant distance measure, which essentially rectifies an existing distance measure (e.g., Euclidean, DTW) by multiplying it with a complexity correction factor. Although they achieve improved results on some datasets by rectifying the DTW measure, they do not modify the original DTW algorithm. In [78], the authors proposed to learn a distance metric and then align temporal sequences by DTW under this new metric. One major drawback is the requirement of ground-truth alignments for metric learning, because in reality true alignments are usually unavailable. In [16], the authors proposed to utilize time series local structure information to constrain the search for the warping path. They introduce a SIFT-like feature point detector and descriptor to detect and match salient feature points from the two sequences first, and then use the matched point pairs to regularize the search scope of the warping path. Their main goal is to improve the computational efficiency of dynamic time warping by enforcing band constraints on the potential warping paths, so that they do not have to compute the full accumulative distance matrix between the two sequences. Our method differs from theirs in the following aspects: first, we have no notion of feature points, while feature points are key to their algorithm, since feature points help to regularize the downstream DTW; second, our algorithm aims to achieve better alignments, while their algorithm attempts to improve the computational efficiency of the traditional DTW. In [99], the authors focus on improving the efficiency of the nearest neighbor classifier under the DTW distance measure, but they keep the traditional DTW algorithm unchanged.

Our algorithm, shapeDTW, is different from the above works in that we measure similarities between two points by computing similarities between their local neighborhoods, while all the above works compute the distance between two points based on their single-point y-values (or derivatives). Since shapeDTW can be applied to classify time series (e.g., NN-shapeDTW), we also review representative time series classification algorithms.
In [84], the authors use the popular Bag-of-Words to represent time series instances, and then classify the representations under the nearest neighbor classifier. Concretely, they discretize time series into local SAX [83] words, and use the histogram of SAX words as the time series representation. In [102], the authors developed an algorithm to first extract class-membership discriminative shapelets, and then learn a decision tree classifier based on distances between shapelets and time series instances. In [113], the authors first represent time series by recurrence plots, and then measure the similarity between recurrence plots using the Campana-Keogh (CK-1) distance (PRCD). The PRCD distance is used as the distance measure under a one-nearest-neighbor classifier to do the classification. In [10], a bag-of-features framework to classify time series is introduced. It uses a supervised codebook to encode time series instances, and then uses a random forest classifier to classify the encoded time series. In [50], the authors first encode time series as a bag-of-patterns, and then use a polynomial kernel SVM to do the classification. Zhao and Itti [146] proposed to first encode time series by a 2nd-order encoding method, Fisher Vectors, and then classify the encoded time series by a linear kernel SVM; in their paper, subsequences are sampled from both feature points and flat regions.

shapeDTW is different from the above works in that shapeDTW is developed to align temporal sequences, but can further be applied to classify time series, whereas all the above works are developed to classify time series and are incapable of aligning temporal sequences in their current form. Since time series classification is only one application of shapeDTW, we compare NN-shapeDTW against the above time series classification algorithms in the supplementary materials.

The paper is organized as follows: the detailed algorithm for shapeDTW is introduced in Sec. 3.4, and in Sec. 3.5 we introduce several local shape descriptors. We then extensively test shapeDTW for both sequence alignment and time series classification in Sec. 3.7, and conclusions are drawn in Sec. 3.8.

3.4 shape Dynamic Time Warping

In this section, we introduce the temporal alignment algorithm shapeDTW. First, we briefly introduce DTW.

3.4.1 Dynamic Time Warping

DTW is an algorithm to search for an optimal alignment between two temporal sequences. It returns a distance measure for gauging the similarity between them. Sequences are allowed to have local non-linear distortions in the time dimension, and DTW handles such local warpings to some extent. DTW is applicable to both univariate and multivariate time series; here, for simplicity, we introduce DTW in the case of univariate time series alignment.

A univariate time series T is a sequence of real values, i.e., T = (t_1, t_2, ..., t_L)^T. Given two sequences P and Q of possibly different lengths L_P and L_Q, namely P = (p_1, p_2, ..., p_{L_P})^T and Q = (q_1, q_2, ..., q_{L_Q})^T, let D(P, Q) ∈ R^{L_P×L_Q} be the pairwise distance matrix between P and Q, where D(P, Q)_{i,j} is the distance between p_i and q_j. One widely used pairwise distance measure is the Euclidean distance, i.e., D(P, Q)_{i,j} = |p_i − q_j|. The goal of temporal alignment between P and Q is to find two sequences of indices α and β of the same length l (l ≥ max(L_P, L_Q)), which match index α(i) in the time series P to index β(i) in the time series Q, such that the total cost along the matching path, \sum_{i=1}^{l} D(P, Q)_{\alpha(i), \beta(i)}, is minimized.
The alignment path (α, β) is constrained to satisfy boundary, monotonicity and continuity conditions [108, 70, 44]:

\alpha(1) = \beta(1) = 1, \quad \alpha(l) = L_P, \; \beta(l) = L_Q, \quad (\alpha(i+1), \beta(i+1)) - (\alpha(i), \beta(i)) \in \{(1, 0), (1, 1), (0, 1)\}    (3.1)

Given an alignment path (α, β), we define two warping matrices W_P ∈ {0, 1}^{l×L_P} and W_Q ∈ {0, 1}^{l×L_Q} for P and Q respectively, such that W_P(i, α(i)) = 1 (and W_P(i, j) = 0 otherwise), and similarly W_Q(i, β(i)) = 1 (and W_Q(i, j) = 0 otherwise). Then the total cost along the matching path, \sum_{i=1}^{l} D(P, Q)_{\alpha(i), \beta(i)}, equals ‖W_P·P − W_Q·Q‖_1, and searching for the optimal temporal matching can thus be formulated as the following optimization problem:

\arg\min_{l, \, W_P \in \{0,1\}^{l \times L_P}, \, W_Q \in \{0,1\}^{l \times L_Q}} \| W_P \cdot P - W_Q \cdot Q \|_1    (3.2)

Program 3.2 can be solved efficiently in O(L_P × L_Q) time by a dynamic programming algorithm [39]. Various moving patterns and temporal window constraints [108] can be enforced, but here we consider DTW without warping window constraints and take the moving patterns as in (3.1).
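For concreteness, the dynamic program that solves Program 3.2 under the step pattern of Eq. 3.1 (the "symmetric1" pattern used throughout) can be sketched as follows; this is an illustrative implementation, not our released code.

```python
# Compact dynamic-programming sketch of DTW (Program 3.2, step pattern of Eq. 3.1).
import numpy as np

def dtw(P, Q):
    """Align univariate sequences P and Q; return (DTW distance, alignment path)."""
    LP, LQ = len(P), len(Q)
    D = np.abs(np.subtract.outer(P, Q))          # pairwise distances D(P,Q)_{i,j}
    acc = np.full((LP, LQ), np.inf)
    acc[0, 0] = D[0, 0]
    for i in range(LP):                          # accumulate costs
        for j in range(LQ):
            if i == j == 0:
                continue
            prev = min(acc[i-1, j-1] if i and j else np.inf,
                       acc[i-1, j]   if i else np.inf,
                       acc[i, j-1]   if j else np.inf)
            acc[i, j] = D[i, j] + prev
    # backtrack from (LP-1, LQ-1) to (0, 0) to recover the path (alpha, beta)
    path, i, j = [(LP - 1, LQ - 1)], LP - 1, LQ - 1
    while i or j:
        steps = [(i-1, j-1), (i-1, j), (i, j-1)]
        i, j = min((s for s in steps if s[0] >= 0 and s[1] >= 0), key=lambda s: acc[s])
        path.append((i, j))
    return acc[-1, -1], path[::-1]
```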
3.4.2 shape Dynamic Time Warping

DTW finds a globally optimal alignment under certain constraints, but it does not necessarily achieve locally sensible matchings. Here we incorporate local shape information around each point into the dynamic programming matching process, resulting in more semantically meaningful alignments, i.e., points with similar local shapes tend to be matched, while those with dissimilar neighborhoods are unlikely to be matched. shapeDTW consists of two steps: (1) represent each temporal point by some shape descriptor; and (2) align the two sequences of descriptors by DTW. We first introduce the shapeDTW alignment framework, and in the next section we introduce several local shape descriptors.

Given a univariate time series T = (t_1, t_2, ..., t_L)^T, T ∈ R^L, shapeDTW begins by representing each temporal point t_i by a shape descriptor d_i ∈ R^m, which encodes the structural information of the temporal neighborhood around t_i; in this way, the original real-valued sequence T = (t_1, t_2, ..., t_L)^T is converted into a sequence of shape descriptors of the same length, i.e., d = (d_1, d_2, ..., d_L)^T, d ∈ R^{L×m}. shapeDTW then aligns the transformed multivariate descriptor sequences by DTW, and finally the alignment path between the descriptor sequences is transferred to the original univariate time series.

We now give the implementation details of shapeDTW. Given a univariate time series of length L, e.g., T = (t_1, t_2, ..., t_L)^T, we first extract a subsequence s_i of length l from each temporal point t_i. The subsequence s_i is centered on t_i, with its length l typically much smaller than L (l ≪ L). Note that we have to pad both ends of T with ⌊l/2⌋ duplicates of t_1 (t_L) to make the subsequences sampled at the endpoints well defined. We thus obtain a sequence of subsequences, i.e., S = (s_1, s_2, ..., s_L)^T, s_i ∈ R^l, with s_i corresponding to the temporal point t_i. Next, we design shape descriptors to express the subsequences, with the goal that similarly-shaped subsequences have similar descriptors while differently-shaped subsequences have distinct descriptors. The shape descriptor of subsequence s_i naturally encodes the local structural information around the temporal point t_i, and is also referred to as the shape descriptor of the temporal point t_i. Designing a shape descriptor boils down to designing a mapping function F(·) which maps a subsequence s_i ∈ R^l to a shape descriptor d_i ∈ R^m, i.e., d_i = F(s_i), so that the similarity between descriptors can be measured simply with the Euclidean distance. Different mapping functions define different shape descriptors; one straightforward mapping function is the identity function I(·), in which case d_i = I(s_i) = s_i, i.e., the subsequence itself acts as the local shape descriptor. Given a shape descriptor computation function F(·), we convert the subsequence sequence S into a descriptor sequence d = (d_1, d_2, ..., d_L)^T, d_i ∈ R^m, i.e., d = F(S) = (F(s_1), F(s_2), ..., F(s_L))^T. Finally, we use DTW to align the two descriptor sequences and transfer the warping path to the original univariate time series.

Given two univariate time series P = (p_1, p_2, ..., p_{L_P})^T, P ∈ R^{L_P}, and Q = (q_1, q_2, ..., q_{L_Q})^T, Q ∈ R^{L_Q}, let d_P = (d_1^P, d_2^P, ..., d_{L_P}^P)^T, d_i^P ∈ R^m, d_P ∈ R^{L_P×m}, and d_Q = (d_1^Q, d_2^Q, ..., d_{L_Q}^Q)^T, d_i^Q ∈ R^m, d_Q ∈ R^{L_Q×m}, be their shape descriptor sequences respectively. shapeDTW alignment is then equivalent to solving the optimization problem:

\arg\min_{l, \, \tilde{W}_P \in \{0,1\}^{l \times L_P}, \, \tilde{W}_Q \in \{0,1\}^{l \times L_Q}} \| \tilde{W}_P \cdot d_P - \tilde{W}_Q \cdot d_Q \|_{1,2}    (3.3)

where W̃_P and W̃_Q are the warping matrices of d_P and d_Q, and ‖·‖_{1,2} is the ℓ1/ℓ2-norm of a matrix, i.e., ‖M_{p×n}‖_{1,2} = \sum_{i=1}^{p} ‖M_i‖_2, where M_i is the i-th row of matrix M. Program 3.3 is a multivariate time series alignment problem, and can be solved effectively by dynamic programming in O(L_P × L_Q) time. The key difference between DTW and shapeDTW is that DTW measures the similarity between p_i and q_j by their Euclidean distance |p_i − q_j|, while shapeDTW uses the Euclidean distance between their shape descriptors, ‖d_i^P − d_j^Q‖_2, as the similarity measure. shapeDTW essentially handles local non-linear warping, since it is inherently DTW; on the other hand, it prefers matching points with similar neighborhood structures over points with merely similar values. The shapeDTW algorithm is summarized in Algorithm 1.

Algorithm 1: shape Dynamic Time Warping
Inputs: univariate time series P ∈ R^{L_P} and Q ∈ R^{L_Q}; subsequence length l; shape descriptor function F.
shapeDTW:
1. Sample subsequences: S_P ← P, S_Q ← Q;
2. Encode subsequences by shape descriptors: d_P ← F(S_P), d_Q ← F(S_Q);
3. Align the descriptor sequences d_P and d_Q by DTW.
Outputs: warping matrices W̃_P* and W̃_Q*; shapeDTW distance ‖W̃_P*·d_P − W̃_Q*·d_Q‖_{1,2}.
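Algorithm 1 can be sketched in a few lines; the snippet below uses the Raw-Subsequence descriptor by default and a plain multivariate DTW with Euclidean distances between descriptor rows, and is an illustration rather than the released shapeDTW code.

```python
# Sketch of Algorithm 1: pad, sample centered subsequences, encode, align by DTW.
import numpy as np

def to_descriptor_sequence(T, l=30, descriptor=lambda s: s):
    """Steps 1-2: one descriptor per temporal point (Raw-Subsequence by default)."""
    T = np.asarray(T, dtype=float)
    padded = np.concatenate([np.full(l // 2, T[0]), T, np.full(l // 2, T[-1])])
    subs = np.stack([padded[i:i + l] for i in range(len(T))])     # L x l subsequences
    return np.stack([descriptor(s) for s in subs])                # L x m descriptors

def shape_dtw(P, Q, l=30, descriptor=lambda s: s):
    """Step 3: DTW on descriptor rows, with Euclidean distance between descriptors."""
    dP = to_descriptor_sequence(P, l, descriptor)
    dQ = to_descriptor_sequence(Q, l, descriptor)
    D = np.linalg.norm(dP[:, None, :] - dQ[None, :, :], axis=2)   # L_P x L_Q costs
    acc = np.full(D.shape, np.inf)
    acc[0, 0] = D[0, 0]
    for i in range(D.shape[0]):
        for j in range(D.shape[1]):
            if i or j:
                acc[i, j] = D[i, j] + min(acc[i-1, j-1] if i and j else np.inf,
                                          acc[i-1, j] if i else np.inf,
                                          acc[i, j-1] if j else np.inf)
    return acc[-1, -1]      # shapeDTW distance; the path is recovered by backtracking
```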
3.5 Shape descriptors

shapeDTW provides a generic alignment framework, and users can design shape descriptors adapted to their domain data characteristics and feed them into shapeDTW for alignment. Here we introduce several general shape descriptors, each of which maps a subsequence s_i to a vector representation d_i, i.e., d_i = F(s_i). The length l of the subsequences defines the size of the neighborhood around each temporal point. When l = 1, no neighborhood information is taken into account. With increasing l, larger neighborhoods are considered, and in the extreme case when l = L (L is the length of the time series), the subsequences sampled from different temporal points all become the same, i.e., the whole time series; in that case, the shape descriptors of different points resemble each other too much, making temporal points hard to distinguish by their shape descriptors. In practice, l is set to some appropriate value, but in this section we let l be any positive integer (l ≥ 1), which does not affect the definition of the shape descriptors. In Sec. 3.7, we experimentally explore the sensitivity of NN-shapeDTW to the choice of l.

3.5.1 Raw-Subsequence

The raw subsequence s_i sampled around point t_i can be used directly as the shape descriptor of t_i, i.e., d_i = I(s_i) = s_i, where I(·) is the identity function. Although simple, it inherently captures the local subsequence shape and helps to disambiguate points with similar values but different local shapes.

3.5.2 PAA

Piecewise aggregate approximation (PAA) was introduced in [69, 140] to approximate time series. Here we use it to approximate subsequences. Given an l-dimensional subsequence s_i, it is divided into m (m ≤ l) equal-length intervals; the mean value of the temporal points falling within each interval is calculated, and the vector of these mean values gives the approximation of s_i and is used as its shape descriptor d_i, i.e., F(·) = PAA, d_i = PAA(s_i).

3.5.3 DWT

The Discrete Wavelet Transform (DWT) is another widely used technique to approximate time series instances. Again, here we use DWT to approximate subsequences. Concretely, we use a Haar wavelet basis to decompose each subsequence s_i into 3 levels. The detail wavelet coefficients of all three levels and the approximation coefficients of the third level are concatenated to form the approximation, which is used as the shape descriptor d_i of s_i, i.e., F(·) = DWT, d_i = DWT(s_i).

3.5.4 Slope

All three shape descriptors above encode local shape information inherently. However, they are not invariant to y-shift: concretely, given two subsequences p, q of exactly the same shape, where p is y-shifted relative to q, e.g., p = q + Δ·1 with Δ the magnitude of the y-shift, their shape descriptors under Raw-Subsequence, PAA and DWT differ approximately by Δ as well, i.e., d(p) ≈ d(q) + Δ·1. Although magnitudes do help time series classification, it is also desirable that similarly-shaped subsequences have similar descriptors. We therefore further exploit three shape descriptors in the experiments, Slope, Derivative and HOG1D, which are invariant to y-shift.

Slope was extracted as a feature and used in time series classification in [10, 30]. Here we use it to represent subsequences. Given an l-dimensional subsequence s_i, it is divided into m (m ≤ l) equal-length intervals. Within each interval, we employ the total least squares (TLS) line fitting approach [42] to fit a line to the points falling within that interval. By concatenating the slopes of the fitted lines from all intervals, we obtain an m-dimensional vector representation, which is the Slope representation of s_i, i.e., F(·) = Slope, d_i = Slope(s_i).

3.5.5 Derivative

Similar to Slope, Derivative is y-shift invariant when used to represent shapes. Given a subsequence s, its first-order derivative sequence is s′, where s′ is the first-order derivative with respect to time t. To stay consistent with the derivatives used in derivative Dynamic Time Warping (dDTW) [71], we follow their formula to compute numerical derivatives.
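Minimal sketches of the PAA, Slope and Derivative descriptors just defined follow; the interval splitting, the SVD-based total-least-squares fit and the endpoint handling of the derivative are simplifications of the exact procedures used in our experiments.

```python
# Illustrative implementations of three shape descriptors (Secs. 3.5.2, 3.5.4, 3.5.5).
import numpy as np

def paa(s, m=5):
    """PAA: mean of each of m (nearly) equal-length intervals."""
    return np.array([chunk.mean() for chunk in np.array_split(np.asarray(s, float), m)])

def slope(s, m=5):
    """Slope: total-least-squares line slope within each of m intervals."""
    s = np.asarray(s, float)
    slopes = []
    for idx in np.array_split(np.arange(len(s)), m):
        pts = np.column_stack([idx, s[idx]])
        pts = pts - pts.mean(axis=0)                   # center, then take the principal axis
        _, _, vt = np.linalg.svd(pts, full_matrices=False)
        vx, vy = vt[0]
        slopes.append(vy / vx if abs(vx) > 1e-12 else np.sign(vy) * 1e12)
    return np.array(slopes)

def derivative(s):
    """Derivative estimate in the spirit of dDTW; assumes len(s) >= 3."""
    s = np.asarray(s, float)
    d = (s[1:-1] - s[:-2] + (s[2:] - s[:-2]) / 2.0) / 2.0
    return np.concatenate([[d[0]], d, [d[-1]]])        # replicate the endpoint values
```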
In the experiments, we divide a subsequence into 2 non-overlapping intervals, compute gradient histograms (with 8 bins) in each interval, and concatenate the two histograms as the HOG1D descriptor (a 16D vector) of that subsequence. We refer interested readers to [146] for the computation details of HOG1D. We emphasize that in [146] the authors introduce a global scaling factor σ and tune it using all training sequences; here, instead, we fix σ to 0.1 in all our experiments, so that the HOG1D computation for one subsequence takes only linear time O(l), where l is the length of that subsequence. See our published code for details.

3.5.7 Compound shape descriptors

Shape descriptors like HOG1D, Slope and Derivative are invariant to y-shift. However, when matching two subsequences, y-magnitudes may sometimes be important cues as well; e.g., DTW relies on point-wise magnitudes for alignment. Shape descriptors like Raw-Subsequence, PAA and DWT encode magnitude information and thus complement the y-shift-invariant descriptors. By fusing a pure-shape-capturing descriptor with a magnitude-aware one, the compound descriptor has the potential to become more discriminative of subsequences. In the experiments, we generate compound descriptors by concatenating two complementary descriptors, i.e., d = (d_A, γ·d_B), where γ is a weighting factor that balances the two simple descriptors and d is the resulting compound descriptor.

3.6 Alignment quality evaluation

Here we adopt the "mean absolute deviation" measure used in the audio literature [72] to quantify the proximity between two alignment paths. "Mean absolute deviation" is defined as the mean distance between two alignment paths, which is positively proportional to the area between the two paths. Intuitively, two spatially proximate paths enclose a small area and therefore have a low "mean absolute deviation". Formally, given a reference sequence P, a target sequence Q and two alignment paths α, β between them, the mean absolute deviation between α and β is calculated as δ(α, β) = A(α, β)/L_P, where A(α, β) is the area between α and β and L_P is the length of the reference sequence P. Fig. 3.3 shows two alignment paths α, β (blue and red curves) between P and Q. A(α, β) is the area of the hatched region, and in practice it is computed by counting the number of cells falling within it. Here a cell (i, j) refers to position (i, j) in the pairwise distance matrix D(P, Q) ∈ R^{L_P × L_Q} between P and Q.

Figure 3.3: "Mean absolute deviation", which measures the proximity between alignment paths. The red and blue curves are two alignment paths between the reference sequence P and the target sequence Q, and the "mean absolute deviation" between these two paths is defined as the area of the hatched region divided by the length of the reference sequence P.

3.7 Experimental validation

We test shapeDTW for sequence alignment and time series classification extensively on 84 UCR time series datasets [23] and the Bach10 dataset [38]. For sequence alignment, we compare shapeDTW against DTW and its variants both qualitatively and quantitatively: specifically, we first visually compare alignment results returned by shapeDTW and DTW (and its variants), and then quantify their alignment path qualities on both synthetic and real data.
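Before moving to the experiments, a small sketch of the "mean absolute deviation" of Sec. 3.6; as a simplification, the area between the two paths is approximated column-wise over the reference axis, which is one way of counting the cells between them (our own illustrative code, not the published implementation).

```python
import numpy as np

def path_upper_envelope(path, LP):
    """For each reference index i, the largest target index j matched on the path."""
    env = np.zeros(LP, dtype=int)
    for i, j in path:
        env[i] = max(env[i], j)
    return env

def mean_absolute_deviation(path_a, path_b, LP):
    """Approximate area between two alignment paths (counted column-wise over the
    reference axis), divided by the reference length LP."""
    ea, eb = path_upper_envelope(path_a, LP), path_upper_envelope(path_b, LP)
    return np.abs(ea - eb).sum() / LP

if __name__ == "__main__":
    # two toy paths over a 5x5 distance matrix
    a = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
    b = [(0, 0), (1, 0), (2, 1), (3, 2), (4, 3), (4, 4)]
    print(mean_absolute_deviation(a, b, LP=5))
```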
Concretely, we simulate aligned pairs by artificially scaling and stretching original time series, align those pairs by shapeDTW and DTW (and its variants), and then evaluate the alignment paths against the ground-truth alignments. We further evaluate the alignment performance of shapeDTW and DTW (and its variants) on audio signals, which come with ground-truth point-to-point alignments. For time series classification, since it is widely recognized that the nearest neighbor classifier with the DTW distance measure (NN-DTW) is very effective and hard to beat [131, 3], we use the nearest neighbor classifier as well to test the effectiveness of shapeDTW (NN-shapeDTW), and compare NN-shapeDTW against NN-DTW. We further compare NN-shapeDTW against six other state-of-the-art classification algorithms in the supplementary materials.

3.7.1 Sequence alignment

We evaluate sequence alignments qualitatively in Sec. 3.7.1 and quantitatively in Sec. 3.7.1 and Sec. 3.7.1. We compare shapeDTW against DTW, derivative Dynamic Time Warping (dDTW) [71] and weighted Dynamic Time Warping (wDTW) [67]. dDTW first computes derivative sequences and then aligns them by DTW. wDTW uses a weighted ℓ2 distance, instead of the regular ℓ2 distance, to compute distances between points, with the weight accounting for the phase difference between the points; wDTW is essentially a DTW algorithm. Both dDTW and wDTW are thus variants of the original DTW. Before the evaluation, we briefly introduce some popular step patterns in DTW.

Step pattern in DTW

The step pattern in DTW defines the allowed transitions between matched pairs, and the corresponding weights. In both Program 4.1 (DTW) and Program 3.3 (shapeDTW), we use the default step pattern, whose recursion formula is D(i, j) = d(i, j) + min{D(i−1, j−1), D(i, j−1), D(i−1, j)}. In the following alignment experiments, we also try other well-known step patterns, and we follow the naming convention in [45]. Five popular step patterns, "symmetric1", "symmetric2", "symmetric5", "asymmetric" and "rabinerJuang", are shown in Fig. 3.4. Step pattern (a), "symmetric1", is the one used by shapeDTW in all the following alignment and classification experiments, and we will not mention this explicitly again.

Figure 3.4: Five step patterns. Numbers on the transitions indicate the multiplicative weight for the local distance d(i, j). Step pattern (a), "symmetric1", is the default step pattern for DTW, and (b) gives more penalty to the diagonal direction, so that the warping favors stair-stepping paths. Step patterns (a) and (b) yield a continuous warping path, while step patterns (c), (d) and (e) may skip elements, i.e., some temporal points from one sequence are not matched to any point from the other sequence, and vice versa.
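To illustrate how a step pattern changes the recursion, here is a sketch of the "symmetric2" accumulation, which weights the diagonal transition by 2 (contrast with the "symmetric1" recursion above); illustrative code only.

```python
import numpy as np

def dtw_symmetric2(dist):
    """DTW accumulation under the 'symmetric2' step pattern:
    D(i,j) = min(D(i-1,j-1) + 2*d(i,j), D(i-1,j) + d(i,j), D(i,j-1) + d(i,j))."""
    LP, LQ = dist.shape
    D = np.full((LP + 1, LQ + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, LP + 1):
        for j in range(1, LQ + 1):
            d = dist[i - 1, j - 1]
            D[i, j] = min(D[i - 1, j - 1] + 2 * d,
                          D[i - 1, j] + d,
                          D[i, j - 1] + d)
    return D[LP, LQ]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, q = rng.standard_normal(50), rng.standard_normal(60)
    dist = (p[:, None] - q[None, :]) ** 2      # squared point-wise distances
    print(dtw_symmetric2(dist))
```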
Qualitative alignment assessment

We plot alignment results of shapeDTW and DTW/dDTW and evaluate them visually. shapeDTW under 5 shape descriptors, Raw-Subsequence, PAA, DWT, Derivative and HOG1D, obtains similar alignment results, so we choose Derivative as a representative to report results, with the subsequence length set to 30. Here, shapeDTW, DTW and dDTW all use step pattern (a) in Fig. 3.4.

Time series with rich local features: time series with rich local features, such as those in the "OSUleaf" dataset (bottom row in Fig. 3.5), have many bumps and valleys, and DTW becomes quite brittle when aligning such sequences, since it matches two points based only on their single-point y-magnitudes. Because a single magnitude value does not incorporate local neighborhood information, it is hard for DTW to discriminate a peak point p from a valley point v with the same magnitude, even though p and v have dramatically different local shapes. dDTW bears a similar weakness, since it matches points based on their derivative differences and does not take the local neighborhood into consideration either. On the contrary, shapeDTW distinguishes peaks from valleys easily by their highly different local shape descriptors. Since shapeDTW takes both non-linear warping and local shapes into account, it gives more perceptually interpretable and semantically sensible alignments than DTW (dDTW). Some typical alignment results of time series from the feature-rich datasets "OSUleaf" and "Fish" are shown in Fig. 3.5.

Figure 3.5: Alignments between time series with rich local features. Time series in the top and bottom rows are from the "Fish" (train-165, test-1) and "OSUleaf" (test-114, test-134) datasets respectively. In each pair of time series, temporal points with similar local structures are boxed out by rectangles. Perceptually, shapeDTW aligns these corresponding points better than both DTW and dDTW.

Simulated sequence-pair alignment

We simulate aligned sequence pairs by scaling and stretching original time series. Then we run shapeDTW and DTW (and its variants) to align the simulated pairs, and compare their alignment paths against the ground truth. In this section, shapeDTW is run under fixed settings: (1) the subsequence length is fixed to 30, (2) Derivative is used as the shape descriptor, and (3) "symmetric1" is used as the step pattern.

Aligned-pairs simulation algorithm: concretely, given a time series T of length L, we simulate a new time series by locally scaling and stretching T. The simulation consists of two sequential steps: (1) scaling: scale T point-wise, resulting in a new time series T̂ = T ⊗ S, where S is a positive scale vector of the same length as T and ⊗ is the point-wise multiplication operator; (2) stretching: randomly choose α percent of the temporal points from T̂, stretch each chosen point by a random length τ, and obtain a new time series T′. T′ and T form a simulated alignment pair, with the ground-truth alignment known from the simulation process. The simulation algorithm is described in Alg. 2.

One caveat is that scaling an input time series by a random scale vector can make the resulting time series perceptually quite different from the original one, so that the simulated alignment pairs make little sense. Therefore, in practice, a scale vector S should be smooth, i.e., adjacent elements of S should not be arbitrary; instead, they should be similar in magnitude, so that adjacent temporal points from the original time series are scaled by a similar amount. In the experiments, we first use a random process similar to Brownian motion to initialize the scale vectors, and then recursively smooth them. The scale vector generation algorithm is shown in Alg. 2.
As seen, adjacent scales are initialized to differ by at most 1 (i.e., s(t+1) = s(t) + sin(π × randn)), so that the first-order derivatives are bounded and the initialized scale vectors do not change abruptly. Initialized scale vectors usually have local bumps, and we further recursively use cumulative summation and sine-squashing, as described in the algorithm, to smooth the scale vectors. Finally, the smoothed scale vectors are linearly squashed into a positive range [a b]. After non-uniformly scaling an input time series by a scale vector, we obtain a scale-transformed new sequence; we then randomly pick α percent of its points and stretch each of them by some random amount τ. Stretching at a point p by an amount τ means duplicating p τ times.

Aligned-pairs simulation: using the training data from each UCR dataset as the original time series, we simulate their alignment pairs by running Alg. 2. Since there are 27,136 training time series instances in the 84 UCR datasets, we simulate 27,136 aligned pairs in total. We fix most simulation parameters as follows: [a b] = [0.5 1], Γ = 5, τ ∈ {1, 2, 3}; the stretching percentage α is the only parameter we vary, e.g., when α = 15%, each original input time series is on average stretched by 30% (in length). Typical scale vectors and simulated alignment pairs are shown in Fig. 3.6: the scale vectors are smooth, and the simulated time series are both scaled and stretched compared with the original ones.

Algorithm 2: simulate alignment pairs

Simulate an alignment pair:
Inputs: a time series instance T; scale vector range [a b]; smoothing iterations Γ; stretching percentage α; stretching amount τ.
1. Simulate a scale vector S.
2. Scale T point-wise: T̂ ← T ⊗ S.
3. Stretch α percent of the points of T̂ by a random amount τ, resulting in a simulated time series T′.
Outputs: T′.

Simulate a scale vector:
Inputs: length L, iterations Γ, range [a b].
1. Initialize: s(1) = randn; s(t+1) = s(t) + sin(π × randn), t ∈ {1, 2, ..., L−1}.
2. Smoothing: while iteration < Γ
   a. set the cumulative sum up to t as the scale at t: s(1) ← s(1); s(t+1) ← s(t+1) + s(t), t ∈ {1, 2, ..., L−1};
   b. squash the scale at t into the range [−1, 1]: s(t) ← sin(s(t)), t ∈ {1, 2, ..., L}.
   end
3. Squash the elements of the scale vector S into the range [a b] by linear scaling.
Outputs: a scale vector S = {s(1), s(2), ..., s(L)}.
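For concreteness, a minimal Python sketch of the simulation in Alg. 2 (our reading of the algorithm; parameter names follow the text, and the ground-truth alignment is returned as the list of (original, simulated) index pairs).

```python
import numpy as np

def simulate_scale_vector(L, gamma=5, a=0.5, b=1.0, rng=None):
    """Alg. 2 (scale vector): Brownian-motion-like initialization, then 'gamma'
    rounds of cumulative-sum + sine squashing, then linear squash into [a, b]."""
    rng = rng or np.random.default_rng()
    s = np.empty(L)
    s[0] = rng.standard_normal()
    for t in range(L - 1):
        s[t + 1] = s[t] + np.sin(np.pi * rng.standard_normal())
    for _ in range(gamma):
        s = np.sin(np.cumsum(s))                 # cumulative sum, squashed into [-1, 1]
    return a + (b - a) * (s - s.min()) / (s.max() - s.min() + 1e-12)

def simulate_pair(T, alpha=0.15, taus=(1, 2, 3), rng=None):
    """Alg. 2 (alignment pair): point-wise scaling by a smooth scale vector, then
    stretching alpha percent of the points by duplicating each tau extra times."""
    rng = rng or np.random.default_rng()
    T_hat = T * simulate_scale_vector(len(T), rng=rng)        # scaling step
    chosen = set(rng.choice(len(T_hat), size=int(alpha * len(T_hat)), replace=False).tolist())
    out, gt = [], []                                          # gt: (original idx, new idx)
    for i, v in enumerate(T_hat):
        rep = 1 + (int(rng.choice(taus)) if i in chosen else 0)
        for _ in range(rep):
            gt.append((i, len(out)))
            out.append(v)
    return np.asarray(out), gt

if __name__ == "__main__":
    T = np.sin(np.linspace(0, 6, 200))
    T_prime, gt = simulate_pair(T, alpha=0.15)
    print(len(T), len(T_prime))
```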
Alignment comparison between shapeDTW and DTWs: we run shapeDTW and DTW/dDTW/wDTW to align the simulated pairs, and compare the alignment paths against the ground truth in terms of "Mean Absolute Deviation" scores. DTW and dDTW are parameter-free, but wDTW has one tuning parameter g (see Eq. (3) in their paper), which controls the curvature of the logistic weight function. In the setting of aligning just two sequences, however, g cannot be tuned and has to be pre-defined from experience; here we fix g to 0.1, which is the approximate mean of the optimal g values in the original paper. For the purpose of comparing the alignment qualities of the different algorithms, we use the default step pattern, (a) in Fig. 3.4, for both shapeDTW and DTW/dDTW/wDTW, but we further evaluate the effects of different step patterns in the following experiments.

Figure 3.6: Alignments between simulated time series pairs. (a) Simulated scale vectors: they are smooth and squashed into the range [0.5, 1.0]; (b) a simulated alignment pair, generated by artificially scaling and stretching the original time series; (c) dDTW alignment of the simulated pair; (d) ground-truth alignment; (e) shapeDTW alignment. The plot on the right shows the alignment paths of dDTW, shapeDTW and the ground truth. Visually, the alignment path of shapeDTW is closer to the ground truth, and quantitatively, shapeDTW has an alignment error of 1.1 in terms of the "Mean Absolute Deviation" score, compared with 4.7 for dDTW.

We simulate alignment pairs by stretching the raw time series by different amounts, 10%, 20%, 30%, 40% and 50%, and report the alignment qualities of shapeDTW and DTW/dDTW/wDTW under each stretching amount in terms of the mean of the "Mean Absolute Deviation" scores over the 27,136 simulated pairs. The results are shown in Fig. 3.7: shapeDTW consistently achieves lower alignment errors than DTW/dDTW/wDTW across the different stretching amounts. shapeDTW almost halves the alignment errors of dDTW, even though dDTW already outperforms its two competitors, DTW and wDTW, by a large margin.

Figure 3.7: Alignment quality comparison between shapeDTW and DTW/dDTW/wDTW, under the step pattern "symmetric1". As the stretching amount increases, the alignment qualities of both shapeDTW and DTW/dDTW/wDTW drop. However, shapeDTW consistently achieves lower alignment errors under different stretching amounts, compared with DTW, dDTW and wDTW.

Effects of different step patterns: choosing a suitable step pattern is a traditional way to improve sequence alignments, and it usually requires domain knowledge to make the right choice. Here, instead of choosing an optimal step pattern, we run DTW/dDTW/wDTW under all 5 step patterns in Fig. 3.4 and compare their alignment performance against shapeDTW. As in the above experiments, we simulate aligned pairs under different amounts of stretching, report alignment errors under the different step patterns in terms of the mean of the "Mean Absolute Deviation" scores over the 27,136 simulated pairs, and plot the results in Fig. 3.8. Different step patterns obtain different alignment qualities; in our case, the step patterns "symmetric1" and "asymmetric" have similar alignment performance and reach lower alignment errors than the other 3 step patterns. However, shapeDTW still beats DTW/dDTW/wDTW (under the "symmetric1" and "asymmetric" step patterns) by some margin.

From the above simulation experiments, we observed that dDTW (under the step patterns "symmetric1" and "asymmetric") has the closest performance to shapeDTW. Here we simulate aligned pairs with on average 30% stretching, run dDTW (under the "symmetric1" step pattern) and shapeDTW alignments, and report the "Mean Absolute Deviation" scores in Table 3.1. Table 3.1 lists, for each dataset, the mean and standard deviation of the "Mean Absolute Deviation" scores from the ground-truth alignments; within each dataset block the column order is: shapeDTW mean, dDTW mean, shapeDTW std., dDTW std.
datasets shapeDTW dDTW shapeDTW dDTW datasets shapeDTW dDTW shapeDTW dDTW 50words 1.49 2.85 1.03 2.14 MedicalImages 0.93 2.14 0.66 2.05 Adiac 1.77 5.73 0.61 2.45 MiddlePhalanxOutlineAgeGroup 0.47 0.80 0.18 0.30 ArrowHead 0.94 1.70 0.48 0.91 MiddlePhalanxOutlineCorrect 0.46 0.81 0.18 0.27 Beef 0.85 1.86 0.22 0.83 MiddlePhalanxTW 0.53 0.91 0.27 0.39 BeetleFly 0.69 2.22 0.16 0.80 MoteStrain 0.78 1.07 0.31 0.85 BirdChicken 1.11 2.35 0.85 1.65 NonInvasiveFatalECG- Thorax1 0.65 0.72 0.24 0.49 Car 1.83 6.34 1.74 3.21 NonInvasiveFatalECG- Thorax2 0.80 1.06 0.51 0.89 CBF 0.60 0.13 0.28 0.03 OliveOil 1.89 3.90 0.79 0.69 ChlorineConcentration0.64 0.23 0.18 0.24 OSULeaf 0.69 1.92 0.17 0.94 CinC-ECG- torso 0.69 0.67 0.33 0.92 PhalangesOutlinesCorrect0.62 1.04 0.29 0.49 Coffee 0.69 1.36 0.17 0.41 Phoneme 0.69 0.89 0.52 5.37 Computers 11.18 10.73 12.62 13.10 Plane 0.51 1.44 0.16 0.59 Cricket-X 0.64 0.18 0.17 0.07 ProximalPhalanxOutlineAgeGroup 0.60 1.15 0.29 0.52 Cricket-Y 0.65 0.19 0.16 0.07 ProximalPhalanxOutlineCorrect 0.61 1.14 0.29 0.49 Cricket-Z 0.65 0.19 0.15 0.06 ProximalPhalanxTW 0.56 1.08 0.26 0.49 DiatomSizeReduction 2.21 7.43 1.15 2.82 RefrigerationDevices 1.28 1.21 1.33 1.53 DistalPhalanxOutlineAgeGroup 0.57 0.88 0.27 0.49 ScreenType 11.26 11.00 11.29 11.70 DistalPhalanxOutlineCorrect 0.57 0.85 0.30 0.49 ShapeletSim 0.60 0.20 0.17 0.10 DistalPhalanxTW 0.60 0.85 0.25 0.44 ShapesAll 1.13 3.07 0.79 2.65 Earthquakes 0.97 0.77 0.59 0.76 SmallKitchenAppliances 15.77 15.88 11.67 12.16 ECG200 0.61 0.22 0.25 0.15 SonyAIBORobotSurface 0.63 0.18 0.16 0.11 ECG5000 0.67 0.24 0.25 0.13 SonyAIBORobotSurfaceII0.79 0.16 0.28 0.06 ECGFiveDays 0.78 0.26 0.33 0.15 Strawberry 0.71 1.07 0.17 0.47 FaceAll 0.54 0.23 0.19 0.11 SwedishLeaf 0.65 1.36 0.29 1.25 FaceFour 0.64 0.16 0.10 0.07 Symbols 2.20 6.04 1.73 4.17 FacesUCR 0.55 0.25 0.17 0.12 synthetic-control 0.54 0.12 0.59 0.07 FISH 2.64 10.56 1.16 3.33 ToeSegmentation1 0.69 0.36 0.17 0.19 FordA 0.50 0.59 0.07 0.20 ToeSegmentation2 0.65 0.45 0.22 0.42 FordB 0.50 0.64 0.07 0.15 Trace 0.62 0.25 0.24 0.22 Gun-Point 1.36 5.95 0.65 2.86 TwoLeadECG 0.78 0.78 0.23 0.39 Ham 0.73 0.63 0.13 0.29 Two-Patterns 0.63 0.35 0.22 0.17 HandOutlines 7.08 22.92 6.86 6.92 UWaveGestureLibraryAll1.13 2.45 0.79 2.57 Haptics 1.37 2.00 0.70 1.19 uWaveGestureLibrary- X 1.44 3.69 1.62 4.02 Herring 0.97 3.44 0.40 1.45 uWaveGestureLibrary- Y 1.39 3.98 1.33 4.48 InlineSkate 0.68 0.19 0.16 0.09 uWaveGestureLibrary- Z 1.52 4.07 1.52 4.44 InsectWingbeatSound 1.09 3.06 0.64 3.98 wafer 1.08 5.26 0.58 3.36 ItalyPowerDemand 0.60 0.25 0.39 0.16 Wine 1.25 2.00 0.26 0.45 LargeKitchenAppliances 19.88 20.38 12.03 15.98 WordsSynonyms 1.57 2.89 1.23 2.29 Lighting2 1.71 3.20 0.58 2.72 WordSynonyms 1.48 2.92 1.04 2.50 Lighting7 1.23 1.90 0.40 1.68 Worms 0.65 1.00 0.12 3.48 MALLAT 2.45 3.98 0.50 0.99 WormsTwoClass 0.64 2.77 0.11 18.75 Meat 2.10 3.32 0.72 0.71 yoga 1.23 5.26 0.65 2.92 Table 3.1: Alignment errors of shapeDTW vs dDTW. We use training data from each UCR dataset as the original time series, and simulate alignment pairs by scaling and streching the original time series (stretched by 30%). Then we run shapeDTW and dDTW to align these synthesized alignment pairs, and evaluate the alignment paths against the ground-truth by computing “Mean Absolute Devi- ation” scores. The mean and standard deviation of the “Mean Absolute Deviation” scores on each dataset is documented, with smaller means and stds in bold font. 
shapeDTW achieves lower "Mean Absolute Deviation" scores than dDTW on 56 datasets, showing its clear advantage for time series alignment.

Figure 3.8: Aligning sequences under different step patterns. We align sequence pairs by DTW/dDTW/wDTW under 5 different step patterns (Fig. 3.4), "symmetric1", "symmetric2", "symmetric5", "asymmetric" and "rabinerJuang", and compare their alignment errors against those obtained by shapeDTW: (a) shapeDTW vs. DTW, (b) shapeDTW vs. dDTW, (c) shapeDTW vs. wDTW, each under the different step patterns. Different step patterns usually reach different alignment results, which shows the importance of choosing a step pattern adapted to the application domain. In our case, the "asymmetric" step pattern achieves slightly lower errors than the "symmetric1" step pattern (under DTW, wDTW and dDTW); however, shapeDTW consistently beats DTW/dDTW/wDTW even under the best step pattern, "asymmetric".

shapeDTW has lower "Mean Absolute Deviation" scores on 56 datasets, and the means of the "Mean Absolute Deviation" scores over the 84 datasets are 1.68 for shapeDTW and 2.75 for dDTW, indicating that shapeDTW achieves much lower alignment errors. This shows a clear superiority of shapeDTW over dDTW for sequence alignment. The key difference between shapeDTW and DTW/dDTW/wDTW is whether the neighborhood is taken into account when measuring the similarity between two points. We demonstrate that taking local neighborhood information into account (shapeDTW) does benefit the alignment.

Notes: before running the shapeDTW and DTW-variant alignments, the two sequences in a simulated pair are z-normalized; when computing the "Mean Absolute Deviation", we choose the original time series as the reference sequence, i.e., we divide the area between the two alignment paths by the length of the original time series.

Figure 3.9: Aligning audio and MIDI-to-audio sequences. (a) Top: the audio waveform of the Chorale '05-DieNacht' and its 5D MFCCs features; bottom: the audio waveform converted from the MIDI score of the Chorale '05-DieNacht' and its corresponding 5D MFCCs features. (b) Alignment paths: the two MFCCs sequences are aligned by DTW, dDTW and shapeDTW, and the plot shows their alignment paths together with the ground-truth alignment; the alignment paths of dDTW and shapeDTW are closer to the ground truth than that of DTW.
(c) "Mean Absolute Deviation" from the ground-truth alignment: on 9 (10) out of 10 chorales, shapeDTW achieves smaller alignment errors than dDTW (DTW), showing that shapeDTW also outperforms DTW/dDTW on real sequence pairs.

MIDI-to-audio alignment

We showed the superiority of shapeDTW for aligning synthesized alignment pairs; in this section, we further demonstrate empirically its effectiveness for aligning audio signals, which have ground-truth alignments. The Bach10 dataset [38] consists of audio recordings of 10 pieces of Bach's Chorales, together with their MIDI scores and the ground-truth alignment between the audio and the MIDI score. MIDI scores are symbolic representations of audio files, and by aligning symbolic MIDI scores with audio recordings, we can perform musical information retrieval from MIDI input data [64]. Much previous work used DTW to align MIDI to audio sequences [64, 38, 44]; they typically converted the MIDI data into audio as a first step, so the problem boils down to audio-to-audio alignment, which is then solved by DTW. We follow this convention and convert MIDI to audio first, but run shapeDTW instead for the alignment.

Each piece of music is approximately 30 seconds long. In the experiments, we segment both the audio and the audio converted from the MIDI data into frames of 46 ms length with a hop size of 23 ms and extract features from each 46 ms frame window; in this way, the audio is represented as a multivariate time series whose length equals the number of frames and whose dimension equals the feature dimension. There are many potential choices of frame features, but how to select and combine features optimally to improve the alignment is beyond the scope of this paper; we refer interested readers to [72, 44]. Without loss of generality, we use Mel-frequency cepstral coefficients (MFCCs) as features, due to their common usage and good performance in speech recognition and musical information retrieval [87]. In our experiments, we use the first 5 MFCC coefficients.

After MIDI-to-audio conversion and MFCC feature extraction, the MIDI files and audio recordings are represented as 5-dimensional multivariate time series, of length approximately L ≈ 1300. A typical audio signal, its MIDI-converted counterpart, and their 5D MFCC features are shown in Fig. 3.9. We align the 5D MFCC sequences by shapeDTW: although shapeDTW is designed for univariate time series alignment, it naturally extends to the multivariate case. We first extract a subsequence from each temporal point and then encode the subsequences by shape descriptors, so that the raw multivariate time series is converted into a descriptor sequence. In the multivariate case, each extracted subsequence is multi-dimensional, with the same dimension as the raw time series; to compute the shape descriptor of a multi-dimensional subsequence, we compute the shape descriptor of each dimension independently and concatenate them as the shape representation of that subsequence. We compare the alignments of shapeDTW against DTW/dDTW, all using the "symmetric1" step pattern. The length of subsequences in shapeDTW is fixed to 20 (we tried 5, 10 and 30 as well and achieved quite similar results), and Derivative is used as the shape descriptor. The alignment qualities in terms of "Mean Absolute Deviation" on the 10 Chorales are plotted in Fig. 3.9.
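For concreteness, a sketch of the frame-feature and descriptor computation described above; it assumes librosa is available for MFCC extraction, and the multivariate-descriptor helper is our own illustrative code, not the published implementation.

```python
import numpy as np
import librosa   # assumed available; any MFCC implementation would do

def mfcc_sequence(wav_path, n_mfcc=5, frame_s=0.046, hop_s=0.023):
    """Represent an audio file as a multivariate time series of MFCCs
    (~46 ms frames with a ~23 ms hop, first 5 coefficients)."""
    y, sr = librosa.load(wav_path)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                             n_fft=int(frame_s * sr), hop_length=int(hop_s * sr))
    return m.T                                   # shape (num_frames, n_mfcc)

def multivariate_descriptors(X, l=20, descriptor=np.gradient):
    """Multivariate extension of shapeDTW descriptors for a series X of shape (L, D):
    sample a length-l window per point, apply the univariate descriptor to each
    dimension, and concatenate across dimensions."""
    L, D = X.shape
    pad = l // 2
    Xp = np.pad(X, ((pad, l - pad - 1), (0, 0)), mode="edge")
    desc = []
    for i in range(L):
        window = Xp[i:i + l]                      # (l, D)
        desc.append(np.concatenate([descriptor(window[:, d]) for d in range(D)]))
    return np.stack(desc)                         # (L, l * D) for np.gradient

if __name__ == "__main__":
    # toy multivariate series standing in for real MFCCs
    X = np.random.default_rng(0).standard_normal((100, 5))
    print(multivariate_descriptors(X, l=20).shape)
```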
To be consistent with the convention in the audio community, we actually report the mean-delayed-second between the alignment paths and the ground truth, computed by dividing the "Mean Absolute Deviation" by the sampling rate of the audio signal. shapeDTW outperforms dDTW/DTW on 9/10 MIDI-to-audio alignments, which again shows that taking local neighborhood information into account benefits the alignment.

Figure 3.10: Classification accuracy comparisons between NN-DTW and NN-shapeDTW on 84 UCR time series datasets. shapeDTW under 4 shape descriptors, Raw-Subsequence, PAA, DWT and HOG1D (panels (a)-(d)), outperforms DTW on 64/63/64/61 datasets respectively, and the Wilcoxon signed-rank test shows that shapeDTW under all descriptors performs significantly better than DTW. Raw-Subsequence (as well as PAA and DWT) outperforms DTW on more datasets than HOG1D does, but HOG1D achieves large accuracy improvements on more datasets; concretely, HOG1D boosts accuracies by more than 10% on 18 datasets, compared with 12 datasets for Raw-Subsequence.

3.7.2 Time series classification

We compare NN-shapeDTW with NN-DTW on 84 UCR time series datasets for classification. Since these datasets have standard partitions of training and test data, we experiment with the given partitions and report classification accuracies on the test data. The preceding section explored the influence of different step patterns, but here both DTW and shapeDTW use the widely adopted step pattern "symmetric1" (Fig. 3.4 (a)), with no temporal window constraints, to align sequences.

NN-DTW: each test time series is compared against the training set, and the label of the training time series with the minimal DTW distance to that test time series determines the predicted label. All training and testing time series are z-normalized in advance.

shapeDTW: we test all 5 shape descriptors. We z-normalize the time series in advance, sample subsequences from them, and compute 3 magnitude-aware shape descriptors, Raw-Subsequence, PAA and DWT, and 2 y-shift-invariant shape descriptors, Slope and HOG1D. Parameter settings for the 5 shape descriptors: (1) the length of the subsequences sampled around temporal points is fixed to 30, so that the Raw-Subsequence descriptor is a 30D vector; (2) PAA and Slope use 5 equal-length intervals, and therefore have dimensionality 5; (3) as mentioned, HOG1D uses 8 bins and 2 non-overlapping intervals, with the scaling factor σ fixed to 0.1, so that HOG1D is a 16D vector representation.

NN-shapeDTW: first transform each training/testing time series into a shape descriptor sequence; in this way, the original univariate time series are converted into multivariate descriptor time series. Then apply NN-DTW on the multivariate time series to predict labels.

NN-shapeDTW vs. NN-DTW: we compare NN-shapeDTW, under the 4 shape descriptors Raw-Subsequence, PAA, DWT and HOG1D, with NN-DTW, and plot their classification accuracies on the 84 datasets in Fig. 3.10.
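A minimal sketch of the 1-NN classification protocol described above; the sequence distance is passed in as a function (in practice the shapeDTW distance from the earlier sketch), and a plain Euclidean distance is used in the demo only to keep it self-contained.

```python
import numpy as np

def znorm(x):
    """Z-normalize a univariate time series (done for every series in advance)."""
    return (x - x.mean()) / (x.std() + 1e-12)

def nn_predict(train_X, train_y, test_X, seq_dist):
    """1-NN classification: each test series takes the label of the training
    series with the smallest sequence distance (the shapeDTW distance in the text)."""
    preds = []
    for x in test_X:
        dists = [seq_dist(znorm(x), znorm(t)) for t in train_X]
        preds.append(train_y[int(np.argmin(dists))])
    return np.array(preds)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0, 4 * np.pi, 120)
    train_X = [np.sin(t) + 0.1 * rng.standard_normal(t.size) for _ in range(5)] + \
              [np.sign(np.sin(t)) + 0.1 * rng.standard_normal(t.size) for _ in range(5)]
    train_y = np.array([0] * 5 + [1] * 5)
    test_X = [np.sin(t + 0.1), np.sign(np.sin(t + 0.1))]
    # plug in the shapeDTW distance in practice; plain Euclidean keeps the demo self-contained
    euclid = lambda a, b: float(np.linalg.norm(a - b))
    print(nn_predict(train_X, train_y, test_X, euclid))
```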
shapeDTW outperforms (including ties) DTW on 64/63/64/61 datasets (Raw-Subsequence/PAA/DWT/HOG1D), and by running the Wilcoxon signed-rank test between the performances of NN-shapeDTW and NN-DTW we obtain p-values of 5.5·10⁻⁸ / 5.1·10⁻⁷ / 4.8·10⁻⁸ / 1.7·10⁻⁶, showing that shapeDTW under all 4 descriptors performs significantly better than DTW. Compared with DTW, shapeDTW has a preceding shape descriptor extraction step, which takes approximately O(l·L) time, where l and L are the lengths of the subsequence and the time series respectively. Since generally l ≪ L, the total time complexity of shapeDTW is O(L²), the same as DTW. By trading off a slight amount of time and space, shapeDTW brings large accuracy gains.

Since PAA and DWT are approximations of Raw-Subsequence and have similar performance under the nearest neighbor classifier, we choose Raw-Subsequence as a representative for the following analysis. The Raw-Subsequence shape descriptor loses on 20 datasets; on 18 of them it has minor losses (< 4%), and on the other 2 datasets, "Computers" and "Synthetic-control", it loses by 10% and 6.6%. Time series instances from these 2 datasets either have high-frequency spikes or many abrupt direction changes, making them resemble noisy signals. Possibly, comparing the similarity of two points using their noisy neighborhoods is not as good as using their single coordinate values (DTW), since the temporal neighborhood may accumulate and magnify noise.

HOG1D loses on 23 datasets; on 18 of them it has minor losses (< 5%), and on the other 5 datasets, "CBF", "Computers", "ItalyPowerDemand", "Synthetic-control" and "Wine", it loses by 7.7%, 5.6%, 5.3%, 14% and 11%. By visual inspection, time series from "Computers", "CBF" and "Synthetic-control" are spiky and bumpy, i.e., highly non-smooth, which makes the first-order-derivative-based descriptor HOG1D inappropriate for representing local structures. Time series instances from "ItalyPowerDemand" have length 24, while we sample subsequences of length 30 from each point; this makes the HOG1D descriptors at different local points almost identical, so that HOG1D is no longer discriminative of local structures, and shapeDTW becomes inferior to DTW. Although HOG1D loses on more datasets than Raw-Subsequence, HOG1D boosts accuracies by more than 10% on 18 datasets, compared with 12 datasets for Raw-Subsequence. On the datasets "OSUleaf" and "BirdChicken", the accuracy gains are as high as 27% and 20%. By checking these two datasets closely, we find that different classes have membership-discriminative local patterns (a.k.a. shapelets [139]); however, these patterns differ only slightly among classes. The Raw-Subsequence shape descriptor cannot capture these minor differences well, while HOG1D is more sensitive to shape variations since it calculates derivatives.

Both Raw-Subsequence and HOG1D bring significant accuracy gains, but they boost accuracies to different extents on the same dataset. This indicates the importance of designing domain-specific shape descriptors. Nevertheless, we show that even with simple, dataset-independent shape descriptors, we still obtain significant improvements over DTW. Classification error rates of DTW, Raw-Subsequence and HOG1D on the 84 datasets are documented in Table 3.2.
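The significance tests reported above are paired Wilcoxon signed-rank tests over the per-dataset accuracies; a sketch with scipy, where the accuracy arrays are random placeholders rather than the actual results.

```python
import numpy as np
from scipy.stats import wilcoxon

# acc_shapedtw[k] and acc_dtw[k] would hold the test accuracies of NN-shapeDTW and
# NN-DTW on the k-th UCR dataset; random placeholders stand in for the real numbers.
rng = np.random.default_rng(0)
acc_dtw = rng.uniform(0.5, 0.95, size=84)
acc_shapedtw = np.clip(acc_dtw + rng.normal(0.03, 0.02, size=84), 0.0, 1.0)

# paired, two-sided Wilcoxon signed-rank test over the 84 datasets
stat, p_value = wilcoxon(acc_shapedtw, acc_dtw)
print(f"statistic={stat:.1f}, p-value={p_value:.2e}")
print("better or tied on", int(np.sum(acc_shapedtw >= acc_dtw)), "of 84 datasets")
```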
Superiority of compound shape descriptors: as mentioned in Sec. 3.5, a compound shape descriptor obtained by fusing two complementary descriptors may inherit the benefits of both, and becomes even more discriminative of subsequences. As an example, we concatenate a y-shift-invariant descriptor, HOG1D, and a magnitude-aware descriptor, DWT, with equal weights, resulting in the compound descriptor HOG1D+DWT = (HOG1D, DWT). We then evaluate the classification performance of the 3 descriptors under the nearest neighbor classifier, and plot the comparisons in Fig. 3.11. HOG1D+DWT outperforms (including ties) HOG1D / DWT on 66/51 (out of 84) datasets, and by running the Wilcoxon signed-rank hypothesis test between the performances of HOG1D+DWT and HOG1D (DWT), we obtain p-values of 5.5·10⁻⁵ / 0.0034, showing that the compound descriptor outperforms the individual descriptors significantly at the 5% confidence level. We could also generate compound descriptors by weighted concatenation, with the weights tuned by cross-validation on the training data, but this is beyond the scope of this paper.

Figure 3.11: Performance comparisons between the fused descriptor HOG1D+DWT and the individual descriptors HOG1D/DWT. HOG1D+DWT outperforms HOG1D/DWT on 66/51 (out of 84) datasets, and statistical hypothesis tests show the improvements are significant.

Texas sharpshooter plot: although NN-shapeDTW performs better than NN-DTW, knowing this is not useful unless we can tell in advance on which problems it will be more accurate, as stated in [8]. Here we use the Texas sharpshooter plot [8] to show when NN-shapeDTW has superior performance on the test set as predicted from performance on the training set, compared with NN-DTW. We run leave-one-out cross-validation on the training data to measure the accuracies of NN-shapeDTW and NN-DTW, and we calculate the expected gain: accuracy(NN-shapeDTW)/accuracy(NN-DTW). We then measure the actual accuracy gain on the test data. The Texas sharpshooter plots between Raw-Subsequence/HOG1D and DTW on the 84 datasets are shown in Fig. 3.12. 87%/86% of the points (Raw-Subsequence/HOG1D) fall in the TP and TN regions, which means we can confidently predict when our algorithm will be superior/inferior to NN-DTW. There are 7/7 points falling inside the FP region for the descriptors Raw-Subsequence/HOG1D, but they represent only minor losses, i.e., the actual accuracy gains lie within [0.9, 1.0].

Figure 3.12: Texas sharpshooter plots between Raw-Subsequence/HOG1D and DTW on 84 datasets. TP: true positives (our algorithm was expected, from the training data, to outperform NN-DTW, and it actually did on the test data). TN: true negatives; FP: false positives; FN: false negatives. 87%/86% of the points (Raw-Subsequence/HOG1D vs. DTW) fall in the TP and TN regions, which indicates we can confidently predict whether our algorithm will be superior or inferior to NN-DTW.

3.7.3 Sensitivity to the size of the neighborhood

In the above experiments we showed that shapeDTW outperforms DTW both qualitatively and quantitatively. But we are still left with one free parameter: the size of the neighborhood, i.e., the length of the subsequence sampled around each point. Let t_i be some temporal point of the time series T ∈ R^L, and s_i the subsequence sampled at t_i.
When |s_i| = 1, shapeDTW (under the Raw-Subsequence shape descriptor) degenerates to DTW; when |s_i| = L, subsequences sampled at different points become almost identical, making points un-identifiable by their shape descriptors. This shows the importance of setting an appropriate subsequence length. Without dataset-specific domain knowledge, however, it is hard to determine the length intelligently. Here, instead, we explore the sensitivity of the classification accuracies to different subsequence lengths. We conduct experiments on 42 old UCR datasets, using Raw-Subsequence as the shape descriptor and NN-shapeDTW as the classifier. We let the length of the subsequences vary from 5 to 100 with stride 5, i.e., we repeat the classification experiments on each dataset 20 times, each time setting the length of the subsequences to 5×i, where i is the index of the experiment (1 ≤ i ≤ 20, i ∈ Z). The test accuracies of the 20 experiments are shown as a box plot (Fig. 3.13). On 33 out of 42 datasets, even the worst performance of NN-shapeDTW is better than DTW, indicating that shapeDTW performs well under a wide range of neighborhood sizes.

Figure 3.13: Performances of shapeDTW are insensitive to the neighborhood size. The green stairstep curve shows the dataset-wise test accuracies of NN-DTW, and the box plot shows the performances of NN-shapeDTW under the Raw-Subsequence shape descriptor. On each dataset, we plot a blue box with two tails: the lower and upper edges of each blue box represent the 25th and 75th percentiles of the 20 test accuracies (obtained under different neighborhood sizes, i.e., 5, 10, 15, ..., 100) on that dataset, with the red line inside the box marking the median accuracy and the two tails indicating the best and worst test accuracies. On 36 out of 42 datasets, the median accuracies of NN-shapeDTW are larger than the accuracies obtained by NN-DTW, and on 33 datasets even the worst performance of NN-shapeDTW is better than NN-DTW. All these statistics show that shapeDTW works well under a wide range of neighborhood sizes.
Classification error rates on 84 UCR datasets (dataset, DTW, Raw-Subsequence, HOG1D):
50words 0.310 0.202 0.242
Adiac 0.396 0.335 0.269
ArrowHead 0.297 0.194 0.177
Beef 0.367 0.400 0.267
BeetleFly 0.300 0.300 0.200
BirdChicken 0.250 0.250 0.050
Car 0.267 0.117 0.133
CBF 0.003 0.016 0.080
ChlorineConcentration 0.352 0.355 0.355
CinC-ECG-torso 0.349 0.248 0.209
Coffee 0.000 0.036 0.036
Computers 0.300 0.400 0.356
Cricket-X 0.246 0.221 0.208
Cricket-Y 0.256 0.226 0.226
Cricket-Z 0.246 0.205 0.208
DiatomSizeReduction 0.033 0.039 0.069
DistalPhalanxOutlineAgeGroup 0.208 0.223 0.233
DistalPhalanxOutlineCorrect 0.232 0.247 0.228
DistalPhalanxTW 0.290 0.277 0.290
Earthquakes 0.258 0.183 0.258
ECG200 0.230 0.140 0.100
ECG5000 0.076 0.070 0.071
ECGFiveDays 0.232 0.079 0.057
FaceAll 0.192 0.217 0.238
FaceFour 0.170 0.102 0.091
FacesUCR 0.095 0.034 0.081
FISH 0.177 0.051 0.051
FordA 0.438 0.316 0.279
FordB 0.406 0.337 0.261
Gun-Point 0.093 0.013 0.007
Ham 0.533 0.457 0.457
HandOutlines 0.202 0.191 0.206
Haptics 0.623 0.575 0.562
Herring 0.469 0.375 0.500
InlineSkate 0.616 0.587 0.629
InsectWingbeatSound 0.645 0.533 0.584
ItalyPowerDemand 0.050 0.037 0.103
LargeKitchenAppliances 0.205 0.184 0.160
Lighting2 0.131 0.131 0.115
Lighting7 0.274 0.178 0.233
MALLAT 0.066 0.064 0.062
Meat 0.067 0.067 0.100
MedicalImages 0.263 0.254 0.264
MiddlePhalanxOutlineAgeGroup 0.250 0.260 0.260
MiddlePhalanxOutlineCorrect 0.352 0.240 0.250
MiddlePhalanxTW 0.416 0.429 0.429
MoteStrain 0.165 0.101 0.110
NonInvasiveFatalECG-Thorax1 0.209 0.223 0.219
NonInvasiveFatalECG-Thorax2 0.135 0.110 0.140
OliveOil 0.167 0.133 0.100
OSULeaf 0.409 0.289 0.132
PhalangesOutlinesCorrect 0.272 0.235 0.261
Phoneme 0.772 0.761 0.736
Plane 0.000 0.000 0.000
ProximalPhalanxOutlineAgeGroup 0.195 0.234 0.210
ProximalPhalanxOutlineCorrect 0.216 0.192 0.206
ProximalPhalanxTW 0.263 0.282 0.275
RefrigerationDevices 0.536 0.549 0.507
ScreenType 0.603 0.611 0.525
ShapeletSim 0.350 0.328 0.028
ShapesAll 0.232 0.163 0.112
SmallKitchenAppliances 0.357 0.363 0.301
SonyAIBORobotSurface 0.275 0.261 0.193
SonyAIBORobotSurfaceII 0.169 0.136 0.174
Strawberry 0.060 0.059 0.051
SwedishLeaf 0.208 0.128 0.085
Symbols 0.050 0.031 0.039
synthetic-control 0.007 0.073 0.153
ToeSegmentation1 0.228 0.171 0.101
ToeSegmentation2 0.162 0.100 0.138
Trace 0.000 0.010 0.000
TwoLeadECG 0.096 0.078 0.006
Two-Patterns 0.000 0.000 0.001
UWaveGestureLibraryAll 0.108 0.046 0.058
uWaveGestureLibrary-X 0.273 0.224 0.263
uWaveGestureLibrary-Y 0.366 0.309 0.358
uWaveGestureLibrary-Z 0.342 0.314 0.338
wafer 0.020 0.008 0.010
Wine 0.426 0.389 0.537
WordsSynonyms 0.351 0.245 0.260
WordSynonyms 0.351 0.245 0.260
Worms 0.536 0.503 0.475
WormsTwoClass 0.337 0.293 0.287
yoga 0.164 0.133 0.117

Table 3.2: Error rates of NN-DTW and NN-shapeDTW (under the descriptors Raw-Subsequence and HOG1D) on 84 UCR datasets. In the original typeset table, the error rates on datasets where NN-shapeDTW outperforms NN-DTW are highlighted in bold font, and underscored datasets are those on which shapeDTW improved the accuracies by more than 10%.

3.8 Conclusion

We have proposed a new temporal sequence alignment algorithm, shapeDTW, which achieves quantitatively better alignments than DTW and its variants. shapeDTW is also a quite generic framework: users can design their own local subsequence descriptors and fit them into shapeDTW. We experimentally showed that shapeDTW under the nearest neighbor classifier obtains significantly better classification accuracies than NN-DTW.
Therefore, NN-shapeDTW sets a new accuracy baseline for further comparison.

Chapter 4
Metric Learning in Time Series

4.1 Abstract

Quantifying the similarity between time series remains a central problem in pattern recognition, with applications that range from the analysis of electrocardiographic data to human gesture analysis. We propose to learn multiple local Mahalanobis distance metrics to perform k-nearest neighbor (kNN) classification of temporal sequences. Temporal sequences are first aligned by dynamic time warping (DTW); given the alignment path, the similarity between two sequences is measured by the DTW distance, which is computed as the accumulated distance between matched temporal point pairs along the alignment path. Traditionally, the Euclidean metric is used for the distance computation between matched pairs, which ignores the data regularities and might not be optimal for some applications. Here we propose to learn multiple Mahalanobis metrics, such that the DTW distance becomes a sum of Mahalanobis distances. We adapt the large margin nearest neighbor (LMNN) framework to this purpose, and formulate multiple metric learning as a linear programming problem. Extensive sequence classification results show that our proposed approach is effective, is largely insensitive to the quality of the preceding alignment, and reaches state-of-the-art performance on the UCR time series datasets.

4.2 Introduction

Dynamic time warping (DTW) is an algorithm to align temporal sequences and measure their similarity. DTW has been widely used in speech recognition ([100]), human motion synthesis ([62]), human activity recognition ([75]) and time series classification ([23]). DTW allows temporal sequences to be locally shifted, contracted and stretched, and it calculates a globally optimal alignment path between two given sequences under certain restrictions. Therefore, the similarity between two sequences calculated under the optimal alignment is, to some extent, independent of non-linear variations in the time dimension. The similarity is often quantified by the DTW distance, which is the sum of point-wise distances along the alignment path, i.e., D(P, Q) = Σ_{(i,j)∈p} d(i, j), where p is the alignment path between the two sequences P and Q, (i, j) is a pair of matched points on the alignment path, and d(i, j) is the distance between i and j. The most widely used point-wise distance d(i, j) is the (squared) Euclidean distance.

Figure 4.1: Multiple local distance metric learning in DTW. In this paper, we propose to learn multiple local Mahalanobis distance metrics to perform k-nearest neighbor (kNN) classification of temporal sequences. The similarity between two given input sequences (a) is measured by their DTW distance (f), which is calculated as the accumulated Mahalanobis distances between the matched point pairs along the alignment path. As a preceding step for our metric learning algorithm, DTW is used to compute the alignment path (b, c). Afterwards, we compute the distance between a matched point pair by the distance between their descriptors (d), and if we further partition the descriptor space into k clusters and define an individual metric within each cluster and between any two clusters (e), then the DTW distance takes the form shown in (f). We adapt LMNN ([132]) to formulate our multiple metric learning in DTW.
Since the DTW distance naturally measures the similarity between time series, it is widely used for time series classification. There is increasing acceptance that the nearest neighbor classifier with the DTW distance as the similarity measure (1NN-DTW) is the method of choice for most time series classification problems, and it remains contemporary and competitive ([99, 131, 3, 101, 4]). For example, Bagnall et al. ([4]) compared 19 time series classification algorithms to 1NN-DTW and concluded that "many of the algorithms are in fact no better than 1NN-DTW". Yet, to the best of our knowledge, the DTW distance is computed as the sum of point-wise (squared) Euclidean distances along the matching path, i.e., D(P, Q) = Σ_{(i,j)∈p} d(i, j), where d(i, j) = ‖i − j‖² is the (squared) Euclidean distance between the matched points i and j. Although the Euclidean distance is simple and sometimes effective, it is agnostic of domain knowledge and data regularities. Extensive research has shown that kNN performance can be greatly improved by learning a proper distance metric (e.g., a Mahalanobis distance) from labeled examples ([132, 11]). This motivates us to learn local distance metrics and to calculate the DTW distance as the sum of point-wise learned distances, i.e., D̂(P, Q) = Σ_{(i,j)∈p} d̂(i, j), where d̂(i, j) = (i − j)ᵀ M_{ij} (i − j), M_{ij} is a positive semidefinite matrix to be learned, and d̂(i, j) is the squared Mahalanobis distance. In this paper, instead of learning one uniform distance metric, we partition the feature space and learn individual metrics within and between subspaces. When the DTW distance calculated under the learned metrics is used as the similarity measure, we show that the 1NN classifier achieves improved performance.

We follow the Large Margin Nearest Neighbor (LMNN) ([132]) approach to formulate local metric learning in DTW. In ([132]), the Mahalanobis metric is learned with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. Mathematically, the authors formulate the metric learning as a semidefinite programming problem. In our case, we use the same max-margin framework, with the important difference that the examples in ([132]) are feature points in some fixed-dimension space and the distances between examples are squared Mahalanobis distances, while in our case the examples are temporal sequences and the distances between examples are DTW distances. We term the local metric learning in DTW metricDTW. We emphasize that although the learned local distances are metric distances, the DTW distance under those metrics is generally not a metric distance.

Before computing the DTW distance, we have to align sequences, and the DTW distance is defined along the alignment path. In our work, we do not aim to learn to align sequences; instead, we use existing DTW techniques, which we enhance by including local shape descriptors during the alignment process, to align sequences first, and we treat these alignment paths as known. Therefore, the metric learning in DTW is independent of the preceding alignment process; in principle, any sequence alignment algorithm can be used before local metric learning.

In this paper, differently from the tradition, we compute the distance between a matched point pair as the distance between their descriptors, as opposed to the Euclidean distance between the matched points. The descriptor of a temporal point is a representation of the subsequence centered on that point, and it represents the structural information around that point (see Fig. 4.2).
In this way, the DTW distance is computed as the accumulated descriptor distances along the alignment path. In our case, the descriptors are further clustered into groups by k-means, and multiple local distance metrics are then learned within individual clusters and between any two clusters, such that DTW distances calculated under the learned metrics make the kNN neighbors of temporal sequences always come from the same class, while sequences from different classes are separated by a large margin. The intuition behind this clustering is to allow different weights to be assigned to different local shapes. For example, in EKG data, our algorithm might learn that the alignment accuracy of the ventricular contraction should be assigned a higher weight than that of the ventricular relaxation. From this perspective, our local metric learning framework is essentially learning the importance of subsequences of different shapes in an automatic and principled way. Our approach is depicted in Fig. 4.1.

We extensively test the performance of metricDTW for time series classification on 70 UCR time series datasets ([23]), and the experimental results show that (1) the learned local metrics, compared with the default Euclidean metric, yield significantly higher 1NN classification accuracies; (2) given alignment paths of different qualities, the subsequent metric learning consistently boosts classification accuracies significantly, showing that the proposed metric learning approach yields a benefit over a wide range of preceding alignment qualities; and (3) our metric learning algorithm significantly outperforms the reference time series classification algorithm (1NN-DTW) on the UCR datasets, as well as 15 of the 19 state-of-the-art algorithms benchmarked in ([4]).

4.3 Related work

As mentioned, our local metric learning framework is essentially learning the importance of different subsequences in an automatic and principled way. There are several prior works focusing on mining representative and discriminative subsequences (image patches) from sequences (images). The time series shapelet was introduced in ([139]); it is a time series subsequence which is discriminative of class membership. The authors propose to enumerate all possible candidate subsequences, evaluate their quality using information gain, and build a decision tree classifier out of the top-ranked shapelets. Mining shapelets, in their case, amounts to searching for the more important subsequences while discarding the less important ones. In the vision community, there are several related works ([117, 36, 35]), all of which are devoted to discovering mid-level visual patches from images. A mid-level visual patch is conceptually similar to a shapelet in time series: it is an image patch which is both representative and discriminative for scene categories. They ([117, 36]) pose the discriminative patch search procedure as a discriminative clustering process, in which they selectively keep important patches while discarding the other, common patches. We differ from the above work in that we never have to greedily select important subsequences; instead, we take all subsequences into account and automatically learn their importance through metric learning.

Our work is most similar to and largely inspired by LMNN ([132]). In ([132]), Weinberger and Saul extend LMNN to learn multiple local distance metrics, which is exploited in our work as well.
However, we are still sufficiently different: first, the labeled examples in our case are temporal sequences; second, the DTW distance between two examples is jointly defined by multiple metrics, while in ([132]) the distance between two examples is determined by a single metric. In ([44]), Garreau et al. propose to learn a Mahalanobis distance metric to perform DTW sequence alignment. First, they need ground-truth alignments, which are not required in our case, and second, they focus on alignment instead of kNN classification.

4.4 Local distance metric learning in DTW

As mentioned above, local metric learning needs sequence alignments as inputs. Since in most scenarios ground-truth sequence-to-sequence alignments are expensive or impossible to label, in the experiments we use DTW to align sequences first, and use the computed alignments for the subsequent metric learning. In this section, we first briefly review the DTW algorithm for sequence alignment, and then introduce our multiple local metric learning algorithm for time series classification.

4.4.1 Dynamic Time Warping

DTW is an algorithm to align temporal sequences under certain restrictions. Given two sequences P and Q of possibly different lengths L_P and L_Q, namely P = (p_1, p_2, ..., p_{L_P})ᵀ and Q = (q_1, q_2, ..., q_{L_Q})ᵀ, let d(P, Q) ∈ R^{L_P × L_Q} be the pairwise distance matrix, where d(i, j) is the distance between points p_i and q_j. The goal of temporal alignment between P and Q is to find an alignment path p such that the total cost along the matching path, Σ_{(i,j)∈p} d(i, j), is minimized. The alignment path p is constrained to satisfy the boundary, monotonicity and step-pattern conditions ([108, 70, 44]). Searching for an optimal alignment path p under the above restrictions is equivalent to solving the following recursive formula:

\begin{equation}
D(i, j) = d(i, j) + \min\{D(i-1, j-1),\; D(i, j-1),\; D(i-1, j)\}
\tag{4.1}
\end{equation}

where D(i, j) is the accumulated distance from the matched point pair (p_1, q_1) to the matched point pair (p_i, q_j) along the alignment path, and d(i, j) is the distance between points p_i and q_j. In all the following alignment experiments, we use the squared Euclidean distance to compute d(i, j). The above formula is a typical dynamic programming recursion and can be solved efficiently in O(L_P × L_Q) time ([39]). The alignment path p is obtained by back-tracking.

Notes: traditionally, when building the cumulative distance matrix D, one computes d(i, j), the distance between a pair of points p_i and q_j, as ‖p_i − q_j‖²; e.g., in the case of 1D time series, d(i, j) is equal to the (squared) scalar difference between the two points p_i and q_j. Here we enhance this by using descriptors (see Fig. 4.2) to compute d(i, j), i.e., d(i, j) = ‖\vec{p}_i − \vec{q}_j‖², where \vec{p}_i and \vec{q}_j are the descriptors of points p_i and q_j respectively. By back-tracking the enhanced cumulative distance matrix, we usually achieve better alignments ([147]). We emphasize that during DTW sequence alignment we always use the Euclidean metric to compute descriptor-to-descriptor distances; there is no metric learning during alignment.
4.4.2 Local distance metric learning

After obtaining the alignment path p by DTW, we can compute the DTW distance between P and Q in two ways: (1) directly return the DTW distance as the accumulated distances between matched pairs along p, i.e., Σ_{(i,j)∈p} d(p_i, q_j); or (2) measure the distance between a matched pair (p_i, q_j) by the distance between their descriptors, i.e., d(\vec{p}_i, \vec{q}_j), where \vec{p}_i and \vec{q}_j are the descriptors of points p_i and q_j respectively, so that the DTW distance between P and Q is calculated as the accumulated descriptor distances along p, i.e., Σ_{(i,j)∈p} d(\vec{p}_i, \vec{q}_j). Here, the descriptor at a point is a feature vector representation of the subsequence centered at that point, and it is supposed to capture the neighborhood shape information around the temporal point (see Fig. 4.2 for an illustration). Using the descriptor distance to measure the similarity (distance) of two points makes much sense, since the similarity of two points is usually better represented by the structural similarity of their neighborhoods than by their single point-to-point distance.

Figure 4.2: Descriptor of a temporal point. As shown, p_i and q_j are temporal points on the sequences, and the descriptor of a point is defined as a representation of the subsequence centered on that point; e.g., the bold cyan subsequence (\vec{p}) around p_i is its descriptor, and any representation of \vec{p}, such as HOG-1D or the derivative sequence, is called a descriptor of p_i as well.

In the following experiments, we always adopt the second way to define the DTW distance, and we use three shape descriptors, namely the raw-subsequence, HOG-1D ([145]) and the gradient sequence ([71]). If the squared Euclidean distance is used, the DTW distance is calculated as D(P, Q) = Σ_{(i,j)∈p} ‖\vec{p}_i − \vec{q}_j‖², which is essentially an equally weighted sum of distances between descriptors (subsequences). However, as shown in ([139]), some subsequences are more predictive of class membership, while others are less discriminative. Therefore, it makes more sense to calculate the DTW distance as a weighted sum of distances between subsequences, i.e., D(P, Q) = Σ_{(i,j)∈p} ω_{ij} ‖\vec{p}_i − \vec{q}_j‖², where ω_{ij} indicates the importance of the subsequences \vec{p}_i and \vec{q}_j. Generalizing further, the DTW distance can be calculated as a sum of squared Mahalanobis distances between subsequences, i.e., D(P, Q) = Σ_{(i,j)∈p} (\vec{p}_i − \vec{q}_j)ᵀ M_{c_i c_j} (\vec{p}_i − \vec{q}_j), where M_{c_i c_j} is a positive semidefinite Mahalanobis matrix to be learned from the labeled data. Note that, instead of learning a global metric matrix, we learn multiple local metric matrices simultaneously. The intuition is that differently-shaped subsequences have different importance for classification, and therefore their between-distances should be computed under different metrics. In the experiments, we first partition the descriptors from all training sequences into k clusters by k-means, and then learn Mahalanobis distance metrics within individual clusters and between any two different clusters. Let M_{c_i c_i} and M_{c_i c_j} denote the metrics within cluster c_i and between the two clusters c_i and c_j respectively; then the distance between any two descriptors \vec{p}_i and \vec{q}_j is (\vec{p}_i − \vec{q}_j)ᵀ M_{c_i c_j} (\vec{p}_i − \vec{q}_j), where c_i and c_j are the clusters to which \vec{p}_i and \vec{q}_j belong respectively.
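To make the learned-metric DTW distance concrete, here is a small sketch (our own illustrative code) that accumulates weighted squared descriptor distances along a precomputed alignment path, given k-means cluster assignments and a symmetric matrix of scalar weights ω.

```python
import numpy as np

def metric_dtw_distance(path, descP, descQ, clusterP, clusterQ, omega):
    """DTW distance under learned scalar metrics:
    D(P,Q) = sum over matched pairs (i,j) of omega[c_i, c_j] * ||p_i - q_j||^2,
    where c_i, c_j are the cluster ids of the two descriptors."""
    total = 0.0
    for i, j in path:
        diff = descP[i] - descQ[j]
        total += omega[clusterP[i], clusterQ[j]] * float(diff @ diff)
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k, m = 5, 16                                   # 5 clusters, 16-D descriptors
    descP, descQ = rng.standard_normal((40, m)), rng.standard_normal((50, m))
    clusterP = rng.integers(0, k, 40)              # stand-ins for k-means assignments
    clusterQ = rng.integers(0, k, 50)
    omega = np.ones((k, k))                        # all weights 1 recovers the Euclidean case
    omega = (omega + omega.T) / 2                  # keep the weight matrix symmetric
    path = [(min(i, 39), min(i, 49)) for i in range(50)]   # a toy alignment path
    print(metric_dtw_distance(path, descP, descQ, clusterP, clusterQ, omega))
```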
To learn these local metrics from labeled sequence data, we follow LMNN ([132]) closely and pose our problem as a max-margin problem: the local Mahalanobis metrics are trained such that the k nearest neighbors of any sequence always belong to the same class, while sequences of different classes are separated by a large margin. We use the exact notation of LMNN; the only change is to replace the squared Mahalanobis point-to-point distance in ([132]) by the DTW distance. The adapted LMNN is:

$$\begin{aligned}
\text{Minimize:}\quad & (1-\mu)\sum_{i,\; j \rightsquigarrow i} D(x_i, x_j) + \mu \sum_{i,\; j \rightsquigarrow i,\; l} (1 - y_{il})\,\xi_{ijl} \\
\text{Subject to:}\quad & (1)\; D(x_i, x_l) - D(x_i, x_j) \ge 1 - \xi_{ijl} \\
& (2)\; \xi_{ijl} \ge 0 \\
& (3)\; M_{c_i c_j} \equiv M_{c_j c_i},\; M_{c_i c_j} \succeq 0,\; c_i, c_j \in \{1, 2, \ldots, k\}
\end{aligned} \quad (4.2)$$

Note that we enforce the learned matrices between two clusters $c_i$ and $c_j$ to be identical, i.e., $M_{c_i c_j} \equiv M_{c_j c_i}$, which makes the distance mapping between $c_i$ and $c_j$ a metric. We refer readers to ([132]) for the meaning of the notation. In our experiments, we further simplify the form of the Mahalanobis matrices and constrain them to be not only diagonal but also with a single repeated element on the diagonal, i.e., $M_{c_i c_j} = \omega_{c_i c_j} \cdot I$. Under this simplification, learning a Mahalanobis matrix reduces to learning a scalar, resulting in $(k^2 + k)/2$ unknown scalars, and the original semidefinite programming problem (4.2) reduces to a linear programming problem. In experiments, the balancing factor $\mu$ is tuned by cross-validation.

Figure 4.3: Effectiveness of multiple local metric learning. Three scatter plots compare 1NN classifier accuracies under the Euclidean metric (x-axis) and the learned metrics (y-axis), with alignment and metric learning performed under the gradient, HOG-1D and raw-subsequence descriptors. Under all three descriptors we obtain significantly improved accuracies, indicating that our proposed multiple local metric learning approach is effective.

Constraining the Mahalanobis matrix to a single scalar makes our model more interpretable than using a full or generic diagonal Mahalanobis matrix: the DTW distance is the accumulated descriptor distance along the alignment path, and we hypothesize that some descriptors are more class-discriminative than others. If we can magnify their contribution, classification performance should increase. Learning scalar metrics between descriptors amounts to weighing different shapes differently, which is highly interpretable. Since the shape of a descriptor is determined by all of its dimensions, we use a single weighting factor. We also tried constraining the Mahalanobis matrix to be a generic diagonal matrix and achieved results similar to the scalar constraint, but the former is not as interpretable.
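Under the scalar simplification, problem (4.2) is linear in the weights, so it can be written as a standard linear program. The sketch below is one way to set it up; the thesis solves the LP with CVX, whereas this illustration uses scipy.optimize.linprog, and the construction of target-neighbor pairs and impostor triplets (following LMNN) is assumed to have been done beforehand. All names here are ours.

```python
import numpy as np
from scipy.optimize import linprog

def pair_feature(P_desc, Q_desc, path, P_cluster, Q_cluster, k):
    """phi(P, Q): accumulated squared descriptor distances along the alignment
    path, binned by unordered cluster pair, so that D(P, Q) = omega . phi."""
    phi = np.zeros(k * (k + 1) // 2)
    for i, j in path:
        a, b = sorted((P_cluster[i], Q_cluster[j]))
        m = a * k - a * (a - 1) // 2 + (b - a)       # flat index of pair (a, b), a <= b
        diff = P_desc[i] - Q_desc[j]
        phi[m] += float(diff @ diff)
    return phi

def learn_scalar_metrics(target_phis, triplet_phis, k, mu=0.5):
    """Solve the scalar-metric version of problem (4.2) as a linear program.
    `target_phis`: list of phi(x_i, x_j) for target-neighbor pairs;
    `triplet_phis`: list of (phi(x_i, x_j), phi(x_i, x_l)) for impostor triplets."""
    M = k * (k + 1) // 2
    T = len(triplet_phis)
    # objective: (1 - mu) * sum_j omega . phi_ij  +  mu * sum_t xi_t
    c = np.concatenate([(1 - mu) * np.sum(target_phis, axis=0), mu * np.ones(T)])
    # margin constraints: omega . (phi_il - phi_ij) + xi_t >= 1
    A = np.zeros((T, M + T))
    for t, (phi_ij, phi_il) in enumerate(triplet_phis):
        A[t, :M] = -(phi_il - phi_ij)
        A[t, M + t] = -1.0
    res = linprog(c, A_ub=A, b_ub=-np.ones(T),
                  bounds=[(0, None)] * (M + T), method="highs")
    return res.x[:M]                                  # learned weights omega_{c_i c_j}
```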
4.5 Experiments

In this section, we evaluate the performance of the proposed local metric learning method for time series classification on 70 UCR datasets ([23]), which provide standard training/test partitions for performance evaluation. Note that our work uses local sequence descriptors in two components of the overall approach: (1) during sequence alignment, and (2) during local metric learning.

We empirically study: (1) whether multiple local metric learning boosts the time series classification accuracy of a 1NN classifier; (2) how the quality of the preceding alignments affects the subsequent metric learning performance; and (3) the influence of hyper-parameter settings on metric learning performance.

4.5.1 Experimental settings

Temporal point descriptors: the descriptor at a temporal point represents its neighborhood structure. During DTW alignment, we compute point-to-point distances as the squared Euclidean distance between descriptors; the resulting alignment path therefore depends on the descriptor used. Descriptors are also used in the subsequent metric learning to define the DTW distance (see Sec. 4.4.2). In experiments, we use three subsequence descriptors: raw subsequence, HOG-1D ([145]) and the derivative sequence ([71]). (1) The raw subsequences extracted at temporal points are fixed to length 30; (2) HOG-1D is a representation of the raw subsequence, computed with two non-overlapping intervals, 8 bins and $\sigma = 0.1$, resulting in a 16D descriptor; (3) the derivative descriptor is simply the first-order derivative sequence of the raw subsequence. We follow ([71]) exactly to compute the derivative at each point, and the derivative descriptor is 30D by definition.

Metric learning: $k$ in kNN is set to 3. For each training time series, we compute its 3 nearest neighbors of the same class based on DTW distances computed under the default Euclidean metric. We set $k$ in k-means to 5, partition the training descriptors into 5 clusters, and define local distance metrics within and between these 5 clusters. The linear program (4.2) is solved with the CVX package ([52, 51]). At test time, we use the label of the nearest training neighbor as the predicted label, consistent with the convention in the time series community of using a 1NN classifier.
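As a reference for these settings, below is a minimal numpy sketch of a HOG-1D-style descriptor (non-overlapping intervals, 8 bins, Gaussian-smoothed orientation voting), following the construction detailed in Sec. 5.4.1 and ([145]). The angle units (radians via arctan of the centered gradient) and endpoint handling are our assumptions for illustration, not details fixed by the text.

```python
import numpy as np

def hog1d(s, n_intervals=2, n_bins=8, sigma=0.1):
    """Sketch of a HOG-1D descriptor for a 1D subsequence `s`:
    concatenation of per-interval histograms of oriented gradients."""
    s = np.asarray(s, dtype=float)
    # centered gradients g_t = (p_{t+1} - p_{t-1}) / 2, endpoints replicated
    padded = np.pad(s, 1, mode='edge')
    g = 0.5 * (padded[2:] - padded[:-2])
    angles = np.arctan(g)                     # orientation of each gradient, in (-pi/2, pi/2)
    mags = np.abs(g)                          # voting magnitude |g_t|
    bins = np.linspace(-np.pi / 2, np.pi / 2, n_bins + 2)[1:-1]   # evenly spaced bin centers
    descriptor = []
    for interval in np.array_split(np.arange(len(s)), n_intervals):
        hist = np.zeros(n_bins)
        for t in interval:
            # kernel-smoothed voting: each gradient votes for every orientation bin
            hist += mags[t] * np.exp(-0.5 * (angles[t] - bins) ** 2 / sigma ** 2)
        descriptor.append(hist)
    return np.concatenate(descriptor)         # n_intervals * n_bins dimensional (16D here)
```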
4.5.2 Effectiveness of local distance metric learning

First, we fix the alignment and explore the performance of local metric learning. Then, we analyze the influence of the preceding alignment quality on the performance of the subsequent metric learning.

We align time series by DTW under three descriptors, derivative, HOG-1D and raw-subsequence, respectively. Given the computed alignments, we learn local distance metrics under the same descriptor as used in the alignment by solving the LP problem (4.2), and plot 1NN classification accuracies in Fig. 4.3. The plots in Fig. 4.3 are scatter plots comparing 1NN classifier performance under the Euclidean metric and under the learned metrics. Each red dot indicates one UCR dataset, whose x- and y-coordinates are the accuracies under the Euclidean metric and the learned metrics respectively. Running the signed-rank Wilcoxon test gives p-values of 0.015/0.027/0.003 for the HOG-1D/raw-subsequence/gradient descriptor, showing that our proposed metric learning improves the 1NN classifier significantly at the 5% confidence level.

Since the alignment path is the input to the metric learning step, bad alignments may affect performance. Nevertheless, we empirically show this is not the case over a range of alignment qualities. We perform metric learning under different alignments, and evaluate whether significant improvements are achieved in all cases. In experiments, we align time series under three descriptors, and then learn metrics under the gradient descriptor. We use a boxplot to show the performance improvements over the default Euclidean metric in Fig. 4.4 (left). The lower and upper edges of each blue box represent the 25th and 75th percentiles, the red line inside the box marks the median improvement, and the two whiskers indicate the worst and best improvements. Under the three different alignments, the median improvements are all greater than 0, and the majority of improvements are above 0. Running the signed-rank Wilcoxon test between 1NN performance under the Euclidean metric and under the learned metrics gives p-values of 0.029/0.007/0.003 under alignments by the HOG-1D/raw-subsequence/gradient descriptor. This empirically indicates that the subsequent metric learning is robust to the quality of the preceding alignments.

Figure 4.4: Influence of alignment quality on metric learning performance (a). Right: DTW under different descriptors (raw-subsequence, HOG-1D, gradient) yields different alignment errors under 30%/40%/50% stretching. Left: given alignment paths returned by different descriptors, we perform the subsequent metric learning under the gradient descriptor, and plot the 1NN improvement of the learned metrics over the Euclidean metric. Even when the preceding alignments differ in quality, the subsequent metric learning always improves 1NN performance significantly (p-values = 0.029/0.007/0.003 under alignments by the HOG-1D/raw-subsequence/gradient descriptor).

To show that different descriptors do yield different alignment qualities, we would ideally compare DTW alignment paths under different descriptors against ground-truth alignments. However, the UCR datasets do not provide ground-truth alignments, so we simulate aligned time series pairs by manually scaling and stretching time series; the ground-truth alignment between the original time series and the stretched one is then known by construction. We run DTW alignment under different descriptors, evaluate the alignment error against the ground truth, and plot the results in Fig. 4.4 (right). The results show that different descriptors do perform differently. We refer readers to the supplementary materials for simulation details.

In the above experiments, we always align time series by DTW under descriptors. As shown in ([147]), DTW alignment under descriptors almost consistently outperforms ordinary DTW alignment, in which the point-to-point distance is computed as the squared Euclidean distance between the points themselves rather than between descriptors. To further show that the second-step metric learning is largely independent of the quality of the first-step alignment, we align sequences by ordinary DTW first, and then perform the downstream local metric learning. Given fixed ordinary-DTW alignments, we learn local metrics under three different descriptors, HOG-1D, raw-subsequence and gradient. The 1NN classification accuracy differences after versus before metric learning are plotted in Fig. 4.5. The Wilcoxon test returns p-values of 0.019/0.034/0.006 for the HOG-1D/raw-subsequence/gradient descriptors, showing significant improvements at the 5% confidence level.
These results further show that even when ordinary DTW does not achieve high-quality alignments, the subsequent metric learning consistently improves 1NN classification accuracy.

4.5.3 Effects of hyper-parameters

There is one important hyper-parameter in the metric learning: the number of descriptor clusters. In experiments, we align and learn local metrics under the gradient descriptor, and during metric learning we vary the number of descriptor clusters, $k \in \{5, 10, 15, 20, 25, 30\}$, learn metrics by solving (4.2), and plot the 1NN performance improvements in Fig. 4.6. Under the different k's, the majority of improvements are above 0, and the signed-rank Wilcoxon test returns p-values of 0.003/0.026/0.005/0.021/0.002/0.017 under k = 5/10/15/20/25/30, showing significant improvements across varied k's.

Figure 4.5: Influence of alignment quality (b): given alignments by ordinary DTW, we learn local metrics under three different descriptors. The boxplot shows the 1NN classification accuracy differences after versus before metric learning. The Wilcoxon test returns p-values of 0.019/0.034/0.006 for the HOG-1D/raw-subsequence/gradient descriptors, showing significantly improved accuracies. Although ordinary DTW usually does not achieve alignments as good as DTW under descriptors, the downstream metric learning still consistently improves classification accuracy.

Figure 4.6: Effect of the number of descriptor clusters on metric learning performance. The boxplot shows the improvements after local metric learning under different k's. All median improvements are above 0, and the majority of improvements in each boxplot lie above 0 as well. The signed-rank Wilcoxon test shows significantly improved performance under all k's.

4.5.4 Comparison with state-of-the-art algorithms

As shown in ([99, 131, 3, 101, 4]), a 1NN classifier with the ordinary DTW distance as the similarity measure (1NN-DTW) is very hard to beat. Here we use 1NN-DTW as the baseline and compare our algorithms to it. In 1NN-DTW, the alignment is also computed by DTW, but no descriptor is used: the point-to-point distance is computed directly as the squared Euclidean distance between the two points, and the DTW distance between two aligned sequences is the accumulated squared Euclidean point-to-point distance, with no descriptor used anywhere in the approach. In our case, we use the HOG-1D descriptor to align sequences and learn local metrics. We plot the time series classification performances in Fig. 4.7: our algorithm with (without) metric learning wins/draws/loses against the baseline on 48/3/19 (47/3/20) datasets, and the signed-rank Wilcoxon test returns p-values of $1.1\cdot10^{-4}$ ($1.8\cdot10^{-5}$), showing significant accuracy improvement over 1NN-DTW. We document the classification error rates of the three algorithms in the supplementary materials.
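The 1NN-DTW baseline just described can be written directly on top of the DTW routine sketched in Sec. 4.4.1 (reusing `dtw_align` from that sketch): each raw point is passed as its own one-dimensional "descriptor", so the accumulated cost is exactly the squared point-to-point Euclidean distance.

```python
import numpy as np

def one_nn_dtw(train_series, train_labels, test_series):
    """Ordinary 1NN-DTW baseline: no descriptors, no metric learning."""
    predictions = []
    for q in test_series:
        q = np.asarray(q, float).reshape(-1, 1)
        # DTW cost against every training series; dtw_align is the earlier sketch
        costs = [dtw_align(np.asarray(p, float).reshape(-1, 1), q)[0]
                 for p in train_series]
        predictions.append(train_labels[int(np.argmin(costs))])
    return predictions
```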
We further compare our algorithm to 19 state-of-the-art algorithms surveyed in ([4]). In ([4]), every algorithm is compared with the baseline (ordinary 1NN-DTW) by computing two scores: the percentage of datasets on which it wins over 1NN-DTW, and the mean of the classification accuracy differences (see Table 4 in ([4])). Our algorithm beats 1NN-DTW on 72.86% of the datasets, with a mean accuracy difference of 4.37%. From Table 4 in Bagnall et al. ([4]), we thus outperform 15 of the 19 algorithms they test. We do not outperform all 19, but we believe our approach still has merits: it is a simple nearest-neighbor classifier with no parameters to tune, while the other algorithms improve accuracy using complicated data preprocessing (e.g., Fourier transform, symbolization, feature crafting) and complicated classifiers (e.g., random forests, neural networks, ensemble methods). Specifically, the 4 methods that perform better than ours, COTE ([5]), EE ([85]), BOSS ([109]) and ST ([15]), are all ensemble classifiers; e.g., COTE pools 35 classifiers into a single ensemble. Our method is a single classifier, and on some datasets it does not perform as well as these 4 ensembles; however, compared with ensembles, our method is more interpretable and more time efficient. To conclude, our algorithm achieves the best accuracy among single classifiers.

Figure 4.7: Comparison with 1NN-DTW on the 70 UCR time series datasets. Left: baseline versus our algorithm without metric learning; Right: baseline versus our algorithm with metric learning. The hypothesis tests show that our algorithm beats the baseline significantly, and our algorithm with metric learning achieves the best performance among single classifiers.

4.6 Discussion and Conclusion

In this paper, we propose to learn multiple local Mahalanobis distance metrics to perform k-nearest neighbor (kNN) classification of temporal sequences. We showed empirically that the metric learning process consistently improves 1NN time series classification accuracy, and is robust to the quality of the preceding DTW alignments. Our algorithm beats the ordinary 1NN-DTW algorithm significantly on 70 UCR time series datasets, and sets a record for further comparison.

DTW time series classification has two consecutive steps: time series alignment and then classification. In this paper, metric learning happens after the alignment finishes. A natural extension would be to perform alignment and metric learning in an iterative process; however, in our experiments this deteriorated classification performance. We hypothesize this is because we did not combine alignment and metric learning into one objective function, so the classification error from the metric learning step cannot be directly propagated back into the alignment step. A future research direction is thus how to perform alignment and learn metrics in an integrated fashion.

Chapter 5

Time Series Decomposition

5.1 Abstract

We propose a novel univariate time series decomposition algorithm to partition temporal sequences into homogeneous segments. Unlike most existing temporal segmentation approaches, which generally build statistical models of the temporal observations and then detect change points using inference or hypothesis-testing techniques, our algorithm requires no domain knowledge, is insensitive to the choice of design parameters, and has low time complexity.
Our algorithm first symbolizes the time series into a string, and then decomposes the string recursively, similarly to the construction of a decision-tree classifier. We extend this univariate decomposition algorithm to multivariate cases by decomposing each dimension as a univariate time series and then searching for temporal transition points in a coarse-to-fine manner. We evaluate and compare our algorithm to two state-of-the-art approaches on synthetic data, CMU motion capture data, and action videos. Experimental results demonstrate the effectiveness of our approach, which yields both significantly higher precision and recall of temporal transition points.

5.2 Introduction

Temporal segmentation of time series aims to partition a sequence of observations into several homogeneous and semantically meaningful segments. It is an important step in building intelligent systems, with wide applications in speaker segmentation [33], indexing of music signals [54], action recognition from videos [133, 43], and human motion retrieval [32]. Previous work on time series segmentation is mainly divided into two categories. Many studies attempt to build a statistical model of the observations and use inference techniques or hypothesis testing to detect temporal change points. Commonly used observation models include parametric models [19], non-parametric models (typically kernel-based [55, 48] or stochastic-process models [107]) and probabilistic sequence models [29]. These methods have statistically grounded models; however, they usually make certain assumptions and have many hyper-parameters that require sound domain knowledge to initialize properly. Moreover, change-point inference algorithms are not straightforward and have high time complexity. Another family of temporal segmentation methods adopts clustering techniques to segment motion sequences into homogeneous segments, e.g., [151]; however, this requires knowing the ground-truth number of clusters, which is unavailable in real situations.

Figure 5.1: Univariate time series decomposition: we decompose a time series instance recursively, resulting in a binary tree with homogeneous leaf segments. The value at each split node shows its confidence to be a true transition point.

Motivated by these difficulties, we first propose a univariate time series decomposition algorithm, which partitions the time series recursively into homogeneous parts. Our algorithm deviates from previous methods primarily in that it requires neither statistical assumptions on the observations nor domain knowledge to set hyper-parameters. Our method first symbolizes the time series as a string, and then decomposes the string recursively in a decision-tree-construction fashion. This yields a set of splitting points with weights indicating how likely they are to be transition points. We then extend to multivariate cases by decomposing each dimension as a univariate time series, resulting in sets of split candidates, and then searching for temporal transition points by post-processing the split candidates. We test our decomposition algorithm on CMU 3D human motion capture data and 2D action videos [13], and obtain significantly higher precision and recall than two state-of-the-art segmentation algorithms: probabilistic Principal Component Analysis (PPCA) [7] and Hierarchical Aligned Cluster Analysis (HACA) [151].
Our contributions are several-fold: (1) we introduce a local subsequence descriptor, the Histogram of Oriented Gradients for 1D signals (HOG-1D), which is shown qualitatively to capture local shapes very well; (2) we develop a highly generic temporal segmentation framework for univariate time series, which is easily extensible to multivariate cases; different dimensions of a multivariate time series are allowed to be multi-sourced and multi-modal; (3) our method is almost parameter-free: we do not require domain knowledge of the statistical properties of the observations, and the algorithm itself is highly insensitive to the choice of its one major design parameter; (4) extensive temporal segmentation experiments on synthetic and real data show significantly better precision and recall than two state-of-the-art approaches.

5.3 Related work

Temporal sequence segmentation research falls into two major categories: change-point detection and clustering. Most studies belong to the first category. Torre et al. [29] proposed to segment facial expressions by combining spectral clustering techniques and probabilistic sequence modelling approaches. Harchaoui and Bach [55] introduced a kernel-based method for change-point analysis: they applied the Kernel Fisher Discriminant Ratio to measure the homogeneity between segments, and built a hypothesis test to detect change points. Gong et al. [48] proposed Kernelized Temporal Cut to segment temporal sequences; they use kernel tricks to map the distributions into a Reproducing Kernel Hilbert Space (RKHS). In [19], parametric distribution assumptions are made on the temporal observations, whereas in most cases the assumed distribution may deviate substantially from the underlying true distribution. Saatci et al. [107] combine Bayesian online change-point detection with Gaussian Processes (GP) to create a non-parametric time series model that detects change points in an online manner. Xuan and Murphy [136] extend the Bayesian change-point detection techniques of Fearnhead [40] to multivariate time series. They model the joint density of the observations with an undirected Gaussian graphical model, and estimate its structure and the temporal segmentation jointly. In [7], the authors propose an online algorithm to segment motion sequences into distinct activities; their method is a temporal extension of probabilistic principal component analysis (PPCA) for change-point detection. All of the above work attempts to build a statistical model of the temporal observations, either by learning joint distributions [29, 136] or by exploiting predefined parametric [19, 7] or non-parametric models [48, 55, 107], and then uses inference techniques or statistical hypothesis testing to detect change points.

In the second category, Zhou et al. [151] developed hierarchical aligned cluster analysis (HACA), which combines kernel k-means with the dynamic time alignment kernel (DTAK) [94] to cluster time series data into disjoint segments. They formulate temporal segmentation as a clustering process. One major drawback is that the ground-truth number of clusters has to be given, which is unrealistic in practice.

Our method is fundamentally different from all the previous work: we do not need to know the statistical properties of the observations, since we do not model them statistically, and we do not need to know the true number of clusters either.
We handle temporal segmentation in a novel way: we first symbolize the time series into a string, and then decompose the string recursively into homogeneous segments.

5.4 Methodology

Here we develop a univariate time series decomposition algorithm, and then extend it to multivariate time series. First, we introduce a local descriptor, the Histogram of Oriented Gradients for 1D signals (HOG-1D), to represent subsequences sampled from time series instances. HOG-1D is used to symbolize time series instances. Technical details of the HOG-1D descriptor and of the univariate and multivariate decomposition algorithms are given below.

5.4.1 HOG-1D descriptor

The Histogram of Oriented Gradients (HOG) was first introduced by Dalal and Triggs [26] for object detection in 2D images. Local object appearance and shape are well captured by HOG descriptors, and HOG has been used successfully for object detection and recognition [26, 41]. Based on this success, we introduce a HOG-1D descriptor for 1D time series data, inheriting the key concepts from HOG and adapting them to 1D temporal data.

Given a subsequence $s = (p_1, p_2, \ldots, p_l)$ of length $l$, divide it into $n$ constant-length overlapping or non-overlapping intervals $I = \{I_1, I_2, \ldots, I_n\}$. Within each interval $I_i$, a 1D histogram of oriented gradients is accumulated over all temporal points in $I_i$. Concatenating the $n$ interval histograms forms the descriptor of the subsequence $s$, which we term the HOG-1D descriptor (Fig. 5.2). The statistical nature of histograms makes HOG-1D less sensitive to observation noise, while the concatenation of sequential histograms captures temporal information well.

Figure 5.2: HOG-1D descriptor: a subsequence s is shown as a green line. At each temporal point $p_i$ on s, the centered gradient is estimated, with the blue arrow indicating its direction and magnitude. The subsequence is divided into 3 overlapping intervals, boxed by magenta, red and cyan rectangles. In each interval, a histogram of oriented gradients (HOG) is accumulated over all temporal points and shown under that interval. The concatenation of the 3 HOGs gives the HOG-1D descriptor of the subsequence s. Gradient orientations lie within $(-90^\circ, 90^\circ)$; in this figure 8 evenly spaced orientation bins are used, resulting in a 24D HOG-1D descriptor.

Given an interval $I = (p_{t_1}, p_{t_1+1}, \ldots, p_{t_2})$, first compute the centered gradient at each temporal point $p_t$, i.e., $g_t = \frac{1}{2}(p_{t+1} - p_{t-1})$. The orientation of $g_t$ lies within $(-90^\circ, 90^\circ)$. Then accumulate gradient votes within orientation bins, which are evenly spaced over $(-90^\circ, 90^\circ)$. We use a kernel-smoothed voting strategy: each gradient votes for all orientation bins, with the voting magnitude for the $i$th bin $b_i$ being $|g_t| \cdot \exp\{-\tfrac{1}{2}(\angle(g_t) - \angle(b_i))^2 / \hat{\sigma}^2\}$, where $\angle(g_t)$ and $\angle(b_i)$ are the orientation angles of $g_t$ and $b_i$, and $\hat{\sigma}$ is the decay factor of the Gaussian smoothing kernel. In experiments, we use 8 bins and 2 non-overlapping intervals; thus HOG-1D is a 16D vector.

5.4.2 Univariate Time Series Decomposition

Given a univariate time series $T = t_1 t_2 \ldots t_L$ ($t_i \in \mathbb{R}$) of length $L$, the goal of temporal segmentation is to partition $T$ into segments such that each segment is homogeneous while consecutive segments are heterogeneous. In this paper, we propose to partition the time series recursively, resulting in a binary partition tree whose terminal nodes are homogeneous segments. The decomposition process bears much similarity to the construction of a decision-tree classifier: both grow a tree greedily and terminate growing when certain pre-defined conditions are satisfied. The decomposition algorithm consists of two sequential steps: (1) time series symbolization, and (2) string decomposition. We first symbolize a time series $T$ into a string $C$, and then decompose $C$ recursively into substrings.

Symbolization: symbolization is a functional process that converts a time series $T$ to a string $C$, i.e., $\varphi: T \rightarrow C$. One widely used symbolization technique is Symbolic Aggregate approXimation (SAX) [83]. However, alphabetic symbols in the SAX representation do not contain any local subsequence shape information, so two entirely different subsequences can have the same symbol representation. Here we introduce a time series symbolization approach in which local subsequence shape information is captured and expressed by the symbols, which makes the downstream decomposition easier.

Given a time series $T = t_1 t_2 \ldots t_L$ of length $L$, we symbolize it into a discrete string $C = c_1 c_2 \ldots c_L$ of the same length. First, we extract a subsequence $s_i$ of length $l$ at each temporal point $t_i$. The subsequence $s_i$ is centered on $t_i$, with its length $l$ typically much smaller than $L$ ($l \ll L$). Note that we pad both ends of $T$ with $\lfloor l/2 \rfloor$ duplicates of $t_1$ ($t_L$) so that subsequences sampled at the endpoints are well defined. We thus obtain an intermediate subsequence representation of $T$: $S = s_1 s_2 \ldots s_L$. We then compute the HOG-1D descriptors of the subsequences $s_i$ ($1 \le i \le L$) and cluster the $L$ HOG-1D descriptors into $K$ clusters by k-means. By assigning a unique label $k$ ($1 \le k \le K$) to each cluster and mapping each subsequence to its cluster label, we convert the subsequence representation $S = s_1 s_2 \ldots s_L$ into a string representation $C = c_1 c_2 \ldots c_L$, where $c_i \in \{1, 2, \ldots, K\}$. Since HOG-1D is descriptive of local shapes (shown qualitatively by the t-SNE plot [123] in Fig. 5.8), the shape information of the subsequences in $S$ is encoded by their symbols in $C$. The string $C$ is the input of the downstream decomposition algorithm. Since each symbol $c_i$ corresponds to the temporal point $t_i$ on the original time series $T$, the decomposition structure found on the string $C$ transfers onto the raw time series $T$ without any changes.

Decomposition: given a symbolized string $C = c_1 c_2 \ldots c_L$, we split it recursively into two substrings until a termination criterion is triggered, resulting in a directed binary tree. Intuitively, we wish all terminal substrings in the binary tree to be as pure as possible, so that their corresponding segments in the raw time series are as homogeneous as possible. This splitting process shares much commonality with the construction procedure of a decision-tree classifier. The major difference is that a decision tree searches for the best splitting rule to split the sample space into smaller parts with maximal homogeneity, while our decomposition algorithm searches for the best split point such that the left and right substrings have maximal homogeneity. Our decomposition process involves two major elements: (1) split goodness, and (2) a termination rule, detailed below.

Goodness of split: a parent string $C_p$ can be split into left and right substrings $C_l$ and $C_r$ at any temporal point $s$ ($s \in C_p$). Let $\phi(s, C_p)$ be the goodness of the split point $s$; searching for the optimal split point boils down to finding the point with maximal goodness, i.e., $s^* = \arg\max_{s \in C_p} \phi(s, C_p)$.
Here we use information gain to evaluate split goodness. The information gain at a potential split point $s$ is measured by the drop in entropy due to the split:

$$\phi(s, C_p) = \Delta E(s, C_p) = E(C_p) - [p_l E(C_l) + p_r E(C_r)]$$

where $E(C_p)$, $E(C_l)$ and $E(C_r)$ are respectively the entropies of the parent string $C_p$ and its two child substrings $C_l$ and $C_r$, and $p_l$ and $p_r$ are the length proportions of the left/right substrings, i.e., $p_l = \frac{|C_l|}{|C_p|}$, $p_r = \frac{|C_r|}{|C_p|}$ ($p_l + p_r = 1$). The entropy $E(C)$ of a string $C$ is defined as $E(C) = -\sum_{i=1}^{K} p_i^C \log p_i^C$ (with $p_i^C \log p_i^C = 0$ if $p_i^C = 0$), where $p_i^C$ is the frequency of symbol $i$ in the string $C$. Therefore, searching for the optimal split point $s^*$ is equivalent to solving the following maximization problem:

$$\arg\max_{s \in C_p}\; \{E(C_p) - [p_l E(C_l) + p_r E(C_r)]\} \quad (5.1)$$

This is solved by evaluating every point on $C_p$, which takes $O(|C_p|)$ time.

Weight of the optimal split point: we define the weight $\omega^C_{s^*}$ of the optimal split point $s^*$ of the string $C$ to be the information gain at $s^*$ weighted by the length of $C$, i.e.,

$$\omega^C_{s^*} = |C| \cdot \Delta E(s^*, C) \quad (5.2)$$

Intuitively, the weight of the optimal split point of a parent string tends to be larger than the weights of the optimal split points of its left and right child substrings. The weight measures how likely a potential split point is to be a ground-truth transition point.

Termination rule: the growth of the decomposition tree continues until a certain termination criterion is triggered. Just as in decision-tree construction, typical stopping conditions include: the maximum tree depth has been reached, the substrings are absolutely pure, or the best splitting gain is less than a certain threshold. For motion segmentation, a practical termination rule is to stop splitting when the length of the substring is shorter than the minimal length of a homogeneous motion segment. The univariate time series decomposition algorithm is given in Alg. 3, which returns a set of split points $\mathcal{S}$ with weights $\mathcal{W}$.

Algorithm 3: Univariate time series decomposition
Inputs: a time series $T = t_1 t_2 \ldots t_L$; subsequence length $l$; number of clusters $K$; minimal length of terminal substrings $\ell$.
Symbolization:
  sample a subsequence at each $t_i$: $S = s_1 s_2 \ldots s_L$;
  compute the HOG-1D descriptor of each $s_i$: $D = d_1 d_2 \ldots d_L$;
  cluster the $d_i$ ($i \in \{1, 2, \ldots, L\}$) into $K$ clusters by k-means;
  symbolize $T$: $C = c_1 c_2 \ldots c_L$ ($c_i \in \{1, 2, \ldots, K\}$).
Decomposition:
  initialize: $Q = \{C\}$, $\mathcal{W} = \emptyset$, $\mathcal{S} = \emptyset$
  while $Q \ne \emptyset$:
    $c_p \leftarrow Q$.dequeue();
    if $|c_p| < \ell$: continue;
    else:
      search for $s^*$ of $c_p$ using Eq. 5.1, $\mathcal{S} \leftarrow \mathcal{S}$.add($s^*$);
      compute $\omega^{c_p}_{s^*}$ using Eq. 5.2, $\mathcal{W} \leftarrow \mathcal{W}$.add($\omega^{c_p}_{s^*}$);
      split $c_p$ at $s^*$ into left and right substrings $c_l$ and $c_r$;
      $Q \leftarrow Q$.enqueue($c_l$), $Q$.enqueue($c_r$)
  output: $\mathcal{S}$, $\mathcal{W}$
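As a concrete illustration of Eqs. 5.1-5.2 and the decomposition stage of Alg. 3, below is a minimal Python sketch of the entropy-based split search and the recursive splitting of an already-symbolized integer string. The names are ours, and the split search is written in the naive quadratic form for clarity; incremental symbol counts would make each search linear, as noted in the text.

```python
import numpy as np
from collections import deque

def entropy(c):
    """Entropy E(C) of a symbol string, from symbol frequencies."""
    _, counts = np.unique(c, return_counts=True)
    p = counts / len(c)
    return -np.sum(p * np.log(p))

def best_split(c):
    """Return (s*, information gain) maximizing Eq. 5.1 over all split points."""
    best_s, best_gain = None, -np.inf
    e_parent = entropy(c)
    for s in range(1, len(c)):                       # split into c[:s] and c[s:]
        p_l, p_r = s / len(c), (len(c) - s) / len(c)
        gain = e_parent - (p_l * entropy(c[:s]) + p_r * entropy(c[s:]))
        if gain > best_gain:
            best_s, best_gain = s, gain
    return best_s, best_gain

def decompose(C, min_len):
    """Decomposition stage of Alg. 3: recursively split the string C, returning
    split points (in global coordinates) and their weights (Eq. 5.2)."""
    splits, weights = [], []
    queue = deque([(0, np.asarray(C))])              # (offset of substring, substring)
    while queue:
        offset, c_p = queue.popleft()
        if len(c_p) < min_len:
            continue
        s_star, gain = best_split(c_p)
        splits.append(offset + s_star)
        weights.append(len(c_p) * gain)              # weight = |C| * dE(s*, C)
        queue.append((offset, c_p[:s_star]))
        queue.append((offset + s_star, c_p[s_star:]))
    return splits, weights
```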
5.4.3 Multivariate Time Series Decomposition

An m-dimensional multivariate time series $MT$ is defined as $m$ synchronized univariate time series, i.e., $MT = \{T_1; T_2; \ldots; T_m\}$. Different dimensions of $MT$ may be correlated or independent, depending on the source of the data [112, 63]. Correlation analysis is beyond the scope of this paper; here, we process different dimensions independently. This makes the univariate decomposition algorithm easily generalizable to multivariate time series coming from different sources, and we still achieve quite promising temporal segmentation results on multivariate time series in experiments.

Multivariate time series decomposition consists of two major steps: (1) treat each dimension $T_i$ ($i \in \{1, 2, \ldots, m\}$) as a univariate time series and decompose it with Alg. 3, which returns a set of split candidates $S_i$; (2) search for transition points of the multivariate time series by post-processing the split candidates $S = \{S_1, S_2, \ldots, S_m\}$. Here we propose a coarse-to-fine search strategy.

Coarse search: as observed in row b of Fig. 5.3, split candidates tend to group into clusters, which shows decomposition consistency among different dimensions. We therefore adopt a straightforward way to detect transition points: first detect clusters in the split-candidate space and then use the cluster centers as transition points. In practice, we threshold to retain only the top $k^*$ most probable transition points.

Cluster split candidates by mean-shift: one major issue with directly applying mean-shift in the raw candidate space is the difficulty of bandwidth selection; using a uniform bandwidth for all sample points often fails [93, 24]. Here, we first estimate a Gaussian kernel density of the split candidates, and then apply mean-shift with a uniform bandwidth to points sampled from that density. This works around the difficulty of adaptively choosing the bandwidth, while still achieving good clustering performance. Let $S = \{s_1, s_2, \ldots, s_N\}$ be the set of split candidates on the 1D temporal axis, with $W = \{\omega_1, \omega_2, \ldots, \omega_N\}$ their associated weights. The Gaussian kernel density estimator $f_\sigma(s)$ of $S$ is:

$$f_\sigma(s) = \frac{1}{Z} \sum_{s_i \in S} \omega_i \cdot \mathcal{N}(s_i, \sigma) \quad (5.3)$$

where $\mathcal{N}(s_i, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\{-0.5\,(s - s_i)^2/\sigma^2\}$ is a Gaussian kernel with standard deviation $\sigma$, and $Z = \sum_{s_i \in S}\omega_i$ is a normalization factor making $f_\sigma(s)$ a probability density. The Gaussian kernel density of the split candidates is shown in row c of Fig. 5.3. We then sample a large number of points from the distribution $f_\sigma(s)$ and apply the mean-shift algorithm with a uniform bandwidth $\zeta$ to cluster the sampled points, resulting in $K$ clusters. Ideally, we could use all $K$ cluster centers as transition points; in practice, we prune weaker clusters, which are less likely to contain transition points, and use the centers of the retained strong clusters as transition points.

Prune weaker clusters: let $C_i$, $M_i$ and $I_i$ be the $i$th ($i \in \{1, 2, \ldots, K\}$) cluster, its center and its temporal span. The temporal span of a cluster is defined by the temporal span of its member points. By the nature of mean-shift clustering, distinct clusters have non-overlapping temporal spans, i.e., $I_i \cap I_j = \emptyset$. We define the energy of a cluster $C_i$ as the sum of the weights of the split candidates falling within its temporal span $I_i$:

$$E_i = \sum_{s_k \in S \,\wedge\, s_k \in I_i} \omega_{s_k} \quad (5.4)$$

Intuitively, the energy of a cluster indicates how likely it is to contain a transition point. The total energy of the $K$ clusters is $\sum_{i=1}^{K} E_i$. Given an energy cut-off threshold $\tau$ ($0 < \tau \le 1$), we retain only the first $k^*$ strong clusters whose cumulative energy accounts for $\tau$ of the total energy, i.e., $k^* = \arg\min_k \{\sum_{i=1}^{k-1} E_i < \tau \sum_{i=1}^{K} E_i \,\wedge\, \sum_{i=1}^{k} E_i \ge \tau \sum_{i=1}^{K} E_i\}$, where the $K$ clusters are sorted in descending order of energy. Weaker clusters, which are less likely to contain transition points, are pruned by this thresholding. Let the retained clusters be $C = \{C_1, C_2, \ldots, C_{k^*}\}$; their centers are returned as the transition points $T$ of the multivariate time series $MT$, i.e., $T = \{M_1, M_2, \ldots, M_{k^*}\}$. The coarse search procedure is illustrated in Fig. 5.3.
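The coarse search just described (Eqs. 5.3-5.4) can be sketched with standard tools as below. This is an illustration under assumptions of ours: the values of sigma, the mean-shift bandwidth and the number of samples are illustrative, and scikit-learn's MeanShift stands in for the mean-shift step.

```python
import numpy as np
from sklearn.cluster import MeanShift

def coarse_search(splits, weights, sigma=10.0, bandwidth=20.0,
                  n_samples=5000, tau=0.85, rng=np.random.default_rng(0)):
    """Coarse transition-point search over pooled split candidates and weights."""
    splits = np.asarray(splits, float)
    weights = np.asarray(weights, float)
    # sample from the weighted Gaussian kernel density f_sigma(s) of Eq. 5.3
    centers = rng.choice(splits, size=n_samples, p=weights / weights.sum())
    samples = (centers + rng.normal(0.0, sigma, n_samples)).reshape(-1, 1)
    ms = MeanShift(bandwidth=bandwidth).fit(samples)   # uniform-bandwidth mean-shift
    transitions, energies = [], []
    for k in range(len(ms.cluster_centers_)):
        span = samples[ms.labels_ == k]
        lo, hi = span.min(), span.max()                # temporal span of cluster k
        in_span = (splits >= lo) & (splits <= hi)
        energies.append(weights[in_span].sum())        # cluster energy, Eq. 5.4
        transitions.append(float(ms.cluster_centers_[k, 0]))
    # keep the strongest clusters whose cumulative energy reaches tau of the total
    order = np.argsort(energies)[::-1]
    cum = np.cumsum(np.asarray(energies)[order]) / np.sum(energies)
    k_star = int(np.searchsorted(cum, tau)) + 1
    return sorted(transitions[i] for i in order[:k_star])
```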
Fine search: in most cases, mean-shift cluster centers overlap well with the ground truth, and using the centers as transition points is adequate. However, in rare scenarios a mean-shift cluster center may drift from the ground truth (e.g., the 3rd cluster in Fig. 5.4). Here we introduce a dynamic programming (dp) algorithm to fine-search the transition points. As a first step, we retain the top $k^*$ strong mean-shift clusters by thresholding, and then fine-search starting from these $k^*$ clusters.

Figure 5.3: Coarse search procedure: (a) an action video sequence. (b) split candidates; for visualization we add a random y-value to each point, and darker points have higher weight. Split candidates tend to cluster into groups, showing decomposition consistency among different dimensions. (c) Gaussian kernel density of the split points. (d) mean-shift clusters (red vertical lines indicate cluster centers) obtained by clustering points sampled from the density. (e) pruning of the weaker clusters (e.g., the 2nd one); centers of the retained clusters are returned as transition points. (f) ground-truth transition points.

Transition candidates: for each retained cluster $C_i$, we first estimate its Gaussian kernel density $f_\sigma(C_i)$ from the split candidates falling inside its temporal span, following Eq. 5.3; we then approximate $f_\sigma(C_i)$ by a Gaussian distribution $\mathcal{N}(\mu_i, \sigma_i)$, and finally treat all points within the span $[\mu_i - 2\sigma_i, \mu_i + 2\sigma_i]$ as the transition candidates $P_i$. Repeating this process for all $k^*$ clusters gives $k^*$ sets of transition candidates $P = \{P_1, P_2, \ldots, P_{k^*}\}$, with different sets disjoint from each other, e.g., $P_1 \cap P_2 = \emptyset$. Row d in Fig. 5.4 shows the transition candidates for each cluster.

DP fine search: we augment $P$ by adding the starting and ending points (1 and $L$) of $MT$ as transition candidates, i.e., letting $P_0 = \{1\}$ and $P_{k^*+1} = \{L\}$, then $P \leftarrow \{P_0, P, P_{k^*+1}\}$. We now have $k^* + 2$ disjoint transition candidate sets, and the objective of the fine search is: select one transition point from each candidate set, yielding $k^* + 2$ transition points that partition $MT$ into $k^* + 1$ segments, such that the sum of the homogeneities of the $k^* + 1$ segments is maximized. Note that 1 and $L$ are always returned as transition points, since they are the only points in $P_0$ and $P_{k^*+1}$.

Figure 5.4: Fine search procedure: (a) a CMU motion sequence. (b) and (c) split candidates and their Gaussian kernel density. (d) transition candidates from each cluster, and (e) predicted transition points from the dynamic programming algorithm. (f) ground-truth transitions.

Reorder the $k^* + 2$ candidate sets by their temporal order, and let the reordered candidate sets be $\tilde{P} = \{\tilde{P}_0, \tilde{P}_1, \tilde{P}_2, \ldots, \tilde{P}_{k^*}, \tilde{P}_{k^*+1}\}$; the fine search is then formulated as:

$$\arg\max_{p_0 \in \tilde{P}_0,\, p_1 \in \tilde{P}_1,\, \ldots,\, p_{k^*+1} \in \tilde{P}_{k^*+1}} \;\sum_{t=0}^{k^*} H(p_t, p_{t+1}) \quad (5.5)$$

where $H(p_t, p_{t+1})$ is the homogeneity of the segment $[p_t, p_{t+1}]$. We define the homogeneity of a segment of a univariate time series to be the entropy of its corresponding string, weighted by the string length; the homogeneity of a segment $S_{ij} = [p_i, p_j]$ of an m-dimensional multivariate time series is then the sum of homogeneities over the $m$ dimensions: $H(S_{ij}) = H(p_i, p_j) = |C_{ij}| \cdot (-E(C_{ij})) = (p_j - p_i) \cdot \sum_{d=1}^{m} -E(C^d_{ij})$, where $C^d_{ij}$ is the corresponding string in the $d$th dimension and $E(C^d_{ij})$ is its entropy. A brute-force search to solve Prob.
5.5 becomes infeasible when $k^*$ is large, so we introduce a dp algorithm to exhaustively search all possible $(k^*+1)$-segment partitions in polynomial time. Let $H(p_i) = H(p_1, p_i)$ be the accumulated homogeneity from the beginning $p_1$ of $MT$ to $p_i$; the forward dp process then proceeds by computing:

$$H(p_j) = \max_{i}\;\{H(p_i, p_j) + H(p_i)\} \quad (5.6)$$

To enforce the constraint that exactly one point is chosen from each candidate set, $H(p_i, p_j)$ is set to $-\infty$ if $p_i$ and $p_j$ come from the same or from two non-consecutive candidate sets; i.e., $H(p_i, p_j) = (p_j - p_i)\cdot\sum_{d=1}^{m} -E(C^d_{ij})$ if $\exists t,\; p_i \in \tilde{P}_t \wedge p_j \in \tilde{P}_{t+1}$, and $H(p_i, p_j) = -\infty$ otherwise. The forward process builds the accumulated homogeneity ending at each point, and $H(p_e)$ ($p_e = L$) is the homogeneity of the multivariate time series $MT$. By backtracking, we obtain an optimal sequence of transition points $\{1, p_1, p_2, \ldots, p_{k^*}, L\}$, which partitions $MT$ into $k^* + 1$ segments and maximizes their total homogeneity. The fine search procedure is shown in Fig. 5.4, and the multivariate time series decomposition algorithm is given in Alg. 4.

Algorithm 4: Multivariate time series decomposition
Inputs: $MT = \{T_1; T_2; \ldots; T_m\}$, energy cutoff threshold $\tau$.
Decompose each dimension separately:
  initialize $\mathcal{S} = \emptyset$, $\mathcal{W} = \emptyset$;
  for $i = 1:m$: decompose $T_i$ by running Alg. 3, $\mathcal{S} \leftarrow \mathcal{S}$.add($S_i$), $\mathcal{W} \leftarrow \mathcal{W}$.add($W_i$).
Coarse search for transition points:
  estimate the kernel density of $\mathcal{S}$ by Eq. 5.3: $f_\sigma(\mathcal{S})$;
  run mean-shift clustering on $f_\sigma(\mathcal{S})$: $K$ clusters $C_i$;
  retain the strong clusters: $C = \{C_1, C_2, \ldots, C_{k^*}\}$;
  return the centers of $C$ as transition points.
Fine search for transition points:
  from the retained clusters $C = \{C_1, C_2, \ldots, C_{k^*}\}$, generate transition candidates $P = \{P_1, P_2, \ldots, P_{k^*}\}$;
  augment and reorder $P$: $\tilde{P} = \{\tilde{P}_0, \tilde{P}_1, \tilde{P}_2, \ldots, \tilde{P}_{k^*}, \tilde{P}_{k^*+1}\}$;
  build the accumulated homogeneities using Eq. 5.6: $H(p_j)$;
  backtrack to obtain the transition points.

5.5 Experiments

We validate our time series decomposition methods in three experimental settings: (1) synthetic time series, (2) CMU motion sequences and (3) video sequences of human actions [13]. We compare with two state-of-the-art approaches: HACA [151] and PPCA [7].

With noisy data, split points chosen by an algorithm may not coincide with the ground truth exactly. To count the number of retrieved split points in a sensible way, we assign a numeric value $\aleph_s$ ($\aleph_s \in [0, 1]$) to each split point $s$:

$$\aleph_s = \exp\{-0.5 \cdot (s - g_s)^2 / \sigma^2\} \quad (5.7)$$

where $g_s$ is the ground-truth point closest to $s$ and $\sigma$ is a decay factor. $\aleph_s$ measures the closeness of $s$ to the ground truth and gives a soft count of $s$; the total number of retrieved points is then $\aleph = \sum_{s_i} \aleph_{s_i}$. In all experiments, we compute the soft count of each point by Formula 5.7.

Synthetic time series: we simulate a univariate time series by concatenating $n$ different periodic curve segments, and run Alg. 3 to decompose it until leaf nodes have unit length. We rank the returned split points by their weights in descending order, retain the top $n-1$, compare them against the $n-1$ ground-truth points, and compute a recall score. We run two kinds of simulations: in setting one, the elementary periodic segments are sine or cosine curves, while in setting two they are superpositions of several sine and cosine curves with different frequencies.

Setting one: stimuli are concatenations of sine and cosine segments with different frequencies or magnitudes, with i.i.d. Gaussian noise added (see supplementary material). We simulate 50 time series instances with lengths 2713~4483.
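Recall in these simulations is computed from the soft counts of Eq. 5.7. A minimal sketch of that scoring is below; the value of the decay factor sigma is illustrative, not taken from the text.

```python
import numpy as np

def soft_counts(split_points, ground_truth, sigma=15.0):
    """Soft count (Eq. 5.7): each predicted split s is scored by its closeness
    to the nearest ground-truth transition point g_s."""
    s = np.asarray(split_points, float)
    g = np.asarray(ground_truth, float)
    nearest = g[np.argmin(np.abs(s[:, None] - g[None, :]), axis=1)]   # g_s for each s
    return np.exp(-0.5 * (s - nearest) ** 2 / sigma ** 2)

# Example usage: recall of the retained top n-1 split points
# recall = soft_counts(top_splits, gt_points).sum() / len(gt_points)
```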
When applying Alg. 3, we have only one free parameter: $K$, the number of k-means clusters used during symbolization. In experiments, instead of setting $K$ carefully, we explore how sensitive our decomposition algorithm is to the choice of $K$. We let $K$ vary from 10 to 1350 (with stride 20), compute recall, and plot the results in Fig. 5.5. The blue curve shows the mean recall over the 50 stimuli; we obtain high recall values ($\ge 80\%$) when $K$ is within [10, 190], showing that our algorithm is insensitive to the choice of $K$.

Figure 5.5: K-insensitivity. Left: mean recall under different numbers of clusters; in both simulation settings, our algorithm achieves high recall over a wide range of K values, showing it is K-insensitive. Middle: (top) a real y-acceleration time series of length 3000, with a salient transition point near the middle; (bottom) information gain curves under different K values (from 10 to 2000); the peak of each curve (except K = 2000) consistently finds that salient transition point. Right: a decomposition example in setting two.

Setting two: this setting employs a similar approach but uses a superposition of several sine and cosine curves in each segment (see supplementary material). Concatenating $n = 5$ segments yields a time series instance with 4 ground-truth split points. We again simulate 50 time series instances, with lengths 2456~7106, and run the decomposition algorithm under different $K$ values. The recall is plotted in Fig. 5.5 (green curve). Again, our decomposition algorithm is highly insensitive to the choice of $K$, and maintains a high recall score over a wide range of $K$. The middle plot in Fig. 5.5 gives an example of K-insensitivity: under different $K$ values, the peaks of the information gain curves almost consistently indicate the ground-truth transition point.

CMU motion capture data: the CMU human motion data were captured at 120 Hz. The motion data consist of absolute root positions and orientations and the relative angles of 29 joints at each time stamp. To be consistent with the settings in [7, 151], we use the relative angles of 14 joints and transform them into 4D quaternions; motion at each time stamp is therefore represented by a 56D vector. A motion sequence $M$ is represented by a sequence of motion vectors changing over time, i.e., $M = \{m_1, m_2, \ldots, m_t\}$ ($m_i \in \mathbb{R}^{56}$). If each dimension of the 56D motion vector is treated as an attribute, then $M$ can be reformulated as a 56D multivariate time series $M = \{T_1, T_2, \ldots, T_{56}\}$, with each univariate time series $T_i$ recording the $i$th attribute over time. We can then run Alg. 4 to decompose $M$ and search for its transition points.

We use 14 motion sequences performed by subject 86. Each motion sequence contains several different activities (walking, running, punching, stretching, sitting, dragging, ...). Parameter settings for our algorithm: (1) $K = 40$ in the k-means clustering during symbolization; since we showed the decomposition results are K-insensitive, we fix $K = 40$ for all 14 sequences; (2) $\ell = 200$ in Alg. 3, the minimal length of a pure motion; (3) energy cutoff threshold $\tau = 0.85$ to prune weaker clusters. For HACA [151] and PPCA [7], we use the parameters given by the authors in their papers (see supplementary material).
Precision and recall for the three algorithms are documented in Table 5.1; our algorithm obtains higher precision and recall on most sequences. A Wilcoxon signed-rank test gives p-values of 0.041/0.003 (HACA/PPCA) on recall and 0.037/0.0002 (HACA/PPCA) on precision, showing that our algorithm is significantly better at the 5% confidence level. Both HACA and PPCA have many hyper-parameters, which require dataset-specific knowledge to set properly.

Table 5.1: Precision and recall on 14 CMU motion sequences. Our algorithm outperforms the other two significantly.

Precision    1     2     3     4     5     6     7     8     9     10    11    12    13    14
HACA [151]   0.81  0.52  1.00  0.53  0.87  0.82  0.97  0.72  0.75  0.57  0.73  0.55  0.71  0.66
PPCA [7]     0.46  0.47  0.51  0.50  0.76  0.54  0.34  0.55  0.79  0.49  0.80  0.44  0.54  0.43
Ours         0.83  0.98  0.67  0.71  0.99  0.78  0.98  0.87  0.74  0.88  0.99  0.82  0.73  0.83
Recall
HACA [151]   0.81  0.58  1.00  0.83  0.78  0.82  0.97  0.80  0.75  0.57  0.73  0.55  0.62  0.55
PPCA [7]     0.46  0.68  0.51  0.56  0.76  0.64  0.38  0.61  0.98  0.79  0.80  0.69  0.61  0.43
Ours         0.96  0.98  0.92  0.71  0.99  0.71  0.98  0.87  0.74  0.88  0.99  0.82  0.64  0.83

Weizmann video data: we show that our decomposition algorithm can also be applied to segment a video sequence containing different actions, and test it on the Weizmann action database [13].

The Weizmann database contains 90 videos of 10 subjects, each performing 9 actions. Each action video has a length of 25~130 frames. Since the foreground mask is given for every frame, we extract the minimal bounding box containing the actor, use it to crop the foreground mask, and re-normalize the cropped binary mask image to size 64×32. We then compute the Euclidean distance transform [90, 150] of the normalized mask and use it as the feature for the raw frame. The Euclidean distance transform of a sample frame is shown in Fig. 5.6.

Figure 5.6: Video feature: (a) a sample frame. (b) the foreground mask. (c) the normalized binary mask. (d) the Euclidean distance transform image.

Since each clip contains one pure action, we synthesize test videos by concatenating pure clips. We generate 30 test video sequences, each containing 30 random clips, under the constraint that two consecutive clips must show different actions. Since each frame is represented by a Euclidean distance transform feature vector $F$ ($F \in \mathbb{R}^{2048}$), a test video $M$ of length $t$ is represented by a sequence of distance transform feature vectors, i.e., $M = \{F_1, F_2, \ldots, F_t\}$. We further use PCA to reduce the feature dimension to 50, i.e., $M = \{\hat{F}_1, \hat{F}_2, \ldots, \hat{F}_t\}$, $\hat{F}_i \in \mathbb{R}^{50}$. As before, we reformulate $M$ as a 50D multivariate time series and decompose it with Alg. 4.

Parameter settings for our algorithm: (1) $K = 40$, fixed because of its wide applicable range; (2) $\ell = 25$ in Alg. 3, the minimal length of an action video; (3) $\tau = 0.95$, pruning very few weaker clusters. We tune the parameters of both HACA and PPCA, and use the optimal parameters to report their precision and recall (see supplementary material for parameter tuning). Precision and recall on the 30 synthesized sequences are shown in Fig. 5.7, and our algorithm almost consistently outperforms the others on both metrics. The Wilcoxon signed-rank test gives p-values of $1.7\cdot10^{-6}/2.0\cdot10^{-6}$ (HACA/PPCA) on recall and $2.4\cdot10^{-6}/1.7\cdot10^{-6}$ (HACA/PPCA) on precision, showing significantly better performance of our algorithm. Some temporal segmentation results on the Weizmann and CMU datasets are shown in Fig. 5.8; see the supplementary material for full results.
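The per-frame feature used in these Weizmann experiments (Euclidean distance transform of the cropped, normalized foreground mask, followed by PCA) can be computed with standard tools. The sketch below is one way to do so; the nearest-neighbor resizing is an implementation choice of ours, not prescribed by the text.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from sklearn.decomposition import PCA

def frame_features(masks, out_shape=(64, 32), n_components=50):
    """Crop each binary foreground mask to the actor's bounding box, resize to
    64x32, take its Euclidean distance transform, flatten to 2048-D, then
    reduce the whole sequence to 50-D with PCA."""
    feats = []
    for m in masks:                                   # m: 2D binary foreground mask
        ys, xs = np.nonzero(m)
        crop = m[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        # nearest-neighbor resize of the cropped mask to the canonical size
        ri = np.linspace(0, crop.shape[0] - 1, out_shape[0]).round().astype(int)
        ci = np.linspace(0, crop.shape[1] - 1, out_shape[1]).round().astype(int)
        norm = crop[np.ix_(ri, ci)]
        feats.append(distance_transform_edt(norm).ravel())   # 64 * 32 = 2048-D
    F = np.vstack(feats)
    return PCA(n_components=n_components).fit_transform(F)   # t x 50 feature matrix
```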
5.6 Conclusion

In this paper, we formulate temporal segmentation as a recursive decomposition process, which results in a directed binary tree with homogeneous leaf nodes. Our method is domain-knowledge free, insensitive to design parameters, has low time complexity, and is applicable to both univariate and multivariate time series data. Extensive results show the efficacy of our method: it obtains both high precision and high recall scores.

Figure 5.7: Recall and precision on the Weizmann synthetic sequences: across the 30 synthesized sequences, our algorithm performs significantly better than HACA and PPCA.

Figure 5.8: Temporal segmentation comparison. GT marks the ground-truth transitions, and red vertical bars indicate predicted transitions. Left: segmentation results for 2 CMU motion sequences (top) and 2 Weizmann video sequences (bottom); full results on the 14 CMU motion sequences and 30 Weizmann videos are in the supplementary material. Right: (top) t-SNE [123] visualization of HOG-1D descriptors, generated by placing the original subsequences at their t-SNE-reduced 2D coordinates; similarly shaped subsequences are close to each other, showing the shape descriptiveness of HOG-1D. (bottom) a typical motion sequence decomposition example.

Chapter 6

Object Recognition: multi-task learning

6.1 Abstract

Despite significant recent progress, the best available computer vision algorithms still lag far behind human capabilities, even for recognizing individual discrete objects under various poses, illuminations, and backgrounds. Here we present a new approach that uses object pose information to improve deep network learning. While existing large-scale datasets, e.g. ImageNet, do not have pose information, we leverage the newly published turntable dataset, iLab-20M, which has ~22M images of 704 object instances shot under different lightings, camera viewpoints and turntable rotations, to perform more controlled object recognition experiments. We introduce a new convolutional neural network architecture, the what/where CNN (2W-CNN), built on a linear-chain feedforward CNN (e.g., AlexNet), augmented by hierarchical layers regularized by object poses. Pose information is used only as a feedback signal during training, in addition to category information, and is not needed during testing. To validate the approach, we train both 2W-CNN and AlexNet using a fraction of the dataset, and 2W-CNN achieves a 6% improvement in category prediction. We show mathematically that 2W-CNN has inherent advantages over AlexNet under the stochastic gradient descent (SGD) optimization procedure. Furthermore, we fine-tune object recognition on ImageNet using the 2W-CNN and AlexNet features pretrained on iLab-20M; the results show significant improvement compared with training AlexNet from scratch. Moreover, fine-tuning 2W-CNN features performs even better than fine-tuning the pretrained AlexNet features.
These results show that features pretrained on iLab-20M generalize well to natural image datasets, and that 2W-CNN learns better features for object recognition than AlexNet.

Figure 6.1: 2W-CNN architecture. The orange architecture is AlexNet (input, conv1-conv5, fc6, fc7), and we build two what/where convolutional neural network architectures from it: (a) 2W-CNN-I: object pose information (where) is linked to the top fully connected layer (fc7) only; (b) 2W-CNN-MI: object pose labels have direct pathways to all convolutional layers. The appended pose architectures (green in (a) and blue in (b)) are used during training to regularize the deep feature learning process; at test time we prune them and use the remaining AlexNet for object recognition (what). Hence, although feedforward connection arrows are shown, all blue and green connections are used only for backpropagation.

6.2 Introduction

Deep convolutional neural networks (CNNs) have achieved great success in image classification [74, 121], object detection [111, 46], image segmentation [22], activity recognition [68, 114] and many other tasks. Typical CNN architectures, including AlexNet [74] and VGG [115], consist of several stages of convolution, activation and pooling, in which pooling subsamples feature maps, making representations locally translation invariant. After several stages of pooling, the high-level feature representations are invariant to object pose over some limited range, which is generally a desirable property. Thus, these CNNs preserve only “what” information and discard “where” or pose information through the multiple stages of pooling. However, as argued by Hinton et al. [61], artificial neural networks could use local “capsules” to encapsulate both “what” and “where” information, instead of summarizing the activity of a neuron with a single scalar. Neural architectures designed in this way have the potential to disentangle visual entities from their instantiation parameters [103, 148, 49].

In this paper, we propose a new deep architecture built on a traditional ConvNet (AlexNet, VGG), but with two label layers, one for category (what) and one for pose (where; Fig. 6.1). We name it the what/where convolutional neural network (2W-CNN). Here, object category is the class an object belongs to, and pose denotes any factor causing objects from the same class to have different appearances in images, including camera viewpoint, lighting, intra-class shape variation, etc. By explicitly adding pose labels to the top of the network, 2W-CNN is forced to learn multi-level feature representations from which both object categories and pose parameters can be decoded. 2W-CNN differs from traditional CNNs only during training: two streams of error are backpropagated into the convolutional layers, one from category and the other from pose, and they jointly tune the feature filters to simultaneously capture variability in both category and pose. When training is complete, we prune all the auxiliary layers in 2W-CNN, leaving only the base architecture (a traditional ConvNet with the same number of degrees of freedom as the original), and use it to predict the category label of a new input image. By explicitly incorporating “where” information to regularize the feature learning process, we experimentally show that the learned feature representations are better delineated, resulting in better categorization accuracy.
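To make the training-time idea concrete, below is a minimal PyTorch-style sketch of a network with a category head and an auxiliary pose head attached to a shared trunk, in the spirit of 2W-CNN-I (2W-CNN-MI instead connects pose to every convolutional layer). The tiny trunk, layer sizes, pose-class count and loss weighting are illustrative assumptions of ours, not the architecture or training recipe used in the thesis.

```python
import torch
import torch.nn as nn

class TwoWCNN(nn.Module):
    """Sketch: shared trunk + category (what) head + auxiliary pose (where) head."""
    def __init__(self, n_categories=10, n_poses=88):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 128), nn.ReLU())          # stands in for the fc layers
        self.what = nn.Linear(128, n_categories)    # category head (kept at test time)
        self.where = nn.Linear(128, n_poses)        # pose head (pruned after training)

    def forward(self, x):
        h = self.trunk(x)
        return self.what(h), self.where(h)

model = TwoWCNN()
ce = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def train_step(images, category_labels, pose_labels, pose_weight=1.0):
    # Two error streams are backpropagated into the shared trunk: category + pose.
    cat_logits, pose_logits = model(images)
    loss = ce(cat_logits, category_labels) + pose_weight * ce(pose_logits, pose_labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def predict_category(images):
    # At test time only the "what" pathway is used; the pose head is discarded.
    with torch.no_grad():
        cat_logits, _ = model(images)
    return cat_logits.argmax(dim=1)
```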
6.3 Related work

This work is inspired by the concept revived by Hinton et al. [61]. They introduced “capsules” to encapsulate both “what” and “where” into a highly informative vector, and then feed both to the next layer. In their work, they directly fed translation/transformation information between input and output images as known variables into the auto-encoders, and this essentially fixes “where” and forces “what” to adapt to the fixed “where”. In contrast, in our 2W-CNN, “where” is an output variable, which is only used to back-propagate errors; it is never fed forward into other layers as a known variable. In [148], the authors proposed ‘stacked what-where auto-encoders’ (SWWAE), which consist of a feed-forward ConvNet (encoder), coupled with a feed-back DeConvNet (decoder). Each pooling layer of the ConvNet generates two sets of variables: “what”, which records the features in the receptive field and is fed into the next layer, and “where”, which remembers the position of the interesting features and is fed into the corresponding layer of the DeConvNet. Although they explicitly build “where” variables into the architecture, the “where” variables are always complementary to the “what” variables and only help to record the max-pooling switch positions. In this sense, “where” is not directly involved in the learning process. In 2W-CNN, we do not have explicit “what” and “where” variables; instead, they are implicitly expressed by neurons in the intermediate layers. Moreover, “what” and “where” variables from the top output layer are jointly engaged to tune filters during learning. A recent work [49] proposes a deep generative architecture to predict video frames. The representation learned by this architecture has two components: a locally stable “what” component and a locally linear “where” component. Similar to [148], “what” and “where” variables are explicitly defined as the output of ‘max-pooling’ and ‘argmax-pooling’ operators, as opposed to our implicit 2W-CNN approach.

In [135], the authors propose to learn image descriptors to simultaneously recognize objects and estimate their poses. They train a linear-chain feed-forward deep convolutional network by including relative pose and object category similarity and dissimilarity in their cost function, and then use the top layer output as an image descriptor. However, [135] focuses on learning image descriptors, then recognizing category and pose through a nearest neighbor search in descriptor space, while we investigate how explicit, absolute pose information can improve category learning. [6] introduces a method to separate manifolds from different categories while being able to predict object pose. It uses HOG features as image representations, which are known to be suboptimal compared to statistically learned deep features, while we learn deep features with the aid of pose information.

Figure 6.2: Left: turntable setup; Right: one exemplar car shot under different viewpoints.

In sum, our architecture differs from the above in several aspects: (1) 2W-CNN is a feed-forward discriminative architecture as opposed to an auto-encoder; (2) we do not explicitly define “what” and “where” neurons; instead, they are implicitly expressed by intermediate neurons; (3) we use explicit, absolute pose information, only during back-propagation, and not in the feed-forward pass.
Our architecture, 2W-CNN, also belongs to the framework of multi-task learning (MTL), where the basic notion is that using a single network to learn two or more related tasks yields better performance than using one dedicated network for each task [17, 9, 95]. Recently, several efforts have explored multi-task learning using deep neural networks, for face detection, phoneme recognition, scene classification and pose estimation [144, 142, 110, 65, 120]. All of them use a similar linear feed-forward architecture, with all task label layers appended onto the last fully connected layer. In the end, all tasks in these applications share the same representations. Although some [110, 144] do differentiate principal and auxiliary tasks by assigning larger/smaller weights to principal/auxiliary task losses in the objective function, they never make a distinction between tasks when designing the deep architecture. Our architecture 2W-CNN-I (see 6.5.1 for the definition) is similar to theirs; however, 2W-CNN-MI (see 6.5.1 for the definition) is very different: pose is the auxiliary task, and it is designed to support the learning of the principal task (object recognition) at multiple levels. Concretely, the auxiliary labels (pose) have direct pathways to all convolutional layers, such that features in the intermediate layers can be directly regularized by the auxiliary task. We experimentally show that 2W-CNN-MI, which embodies a new kind of multi-task learning, is superior to 2W-CNN-I for object recognition, and this indicates that 2W-CNN-MI is advantageous over previously published deep multi-task learning architectures.

6.4 A brief introduction of the iLab-20M dataset

iLab-20M [14] was collected by hypothesizing that training can be greatly improved by using many different views of different instances of objects in a number of categories, shot in many different environments, and with pose information explicitly known. Indeed, biological systems can rely on object persistence and active vision to obtain many different views of a new physical object. In monkeys, this is believed to be exploited by the neural representation [82], though the exact mechanism remains poorly understood.

iLab-20M is a turntable dataset, with the following settings: the turntable consists of a 14”-diameter circular plate actuated by a robotic servo mechanism. A CNC-machined semi-circular arch (radius 8.5”) holds 11 Logitech C910 USB webcams which capture color images of the objects placed on the turntable. A micro-controller system actuates the rotation servo mechanism and switches on and off 4 LED lightbulbs. Lights are controlled independently, in 5 conditions: all lights on, or one of the 4 lights on.

Objects were mainly Micro Machines toys (Galoob Corp.) and N-scale model train toys. These objects present the advantage of small scale, yet demonstrate a high level of detail and, most remarkably, a wide range of shapes (i.e., many different molds were used to create the objects, as opposed to just a few molds and many different painting schemes). Backgrounds were 125 color printouts of satellite imagery from the Internet. Every object was shot on at least 14 backgrounds, in a relevant context (e.g., cars on roads, trains on railtracks, boats on water).

In total, 1,320 images were captured for each object and background combination: 11 azimuth angles (from the 11 cameras), 8 turntable rotation angles, 5 lighting conditions, and 3 focus values (-3, 0, and +3 from the default focus value of each camera).
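For concreteness, the per-combination image count quoted above is simply the product of these four factors:

    11 azimuths × 8 rotations × 5 lighting conditions × 3 focus values = 1,320 images.

With 704 instances each shot on at least 14 backgrounds, this gives a lower bound of roughly 704 × 14 × 1,320 ≈ 13M images, with the additional backgrounds per object bringing the total toward the ∼22M figure quoted for the full dataset.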
Each image was saved with lossless PNG compression (∼1 MB per image). The complete dataset hence consists of 704 object instances (15 categories), each shot on 14 or more backgrounds, with 1,320 images per object/background combination, or almost 22M images. The dataset is freely available and distributed on several hard drives. One exemplar car instance shot under different viewpoints is shown in Fig. 6.2 (right).

6.5 Network Architecture and its Optimization

In this section, we introduce our new architecture, 2W-CNN, and some properties of its critical points achieved under stochastic gradient descent (SGD) optimization.

6.5.1 Architecture

Our architecture, 2W-CNN, can be built on any CNN, and here, without loss of generality, we use AlexNet [74] (but see Supplementary Materials for results using VGG as well). iLab-20M has detailed pose information for each image. In the testing presented here, we only consider 10 categories, and the 8 turntable rotations and 11 camera azimuth angles, i.e., 88 discrete poses. It would be straightforward to use more categories and take light source, focus, etc. into account as well. Our building base, AlexNet, is adapted here to suit our dataset; two changes are made compared to AlexNet in [74]: (1) we change the number of units on fc6 and fc7 from 4096 to 1024, since we only have ten categories here; (2) we append a batch normalization layer after each convolution layer (see the supplementary materials for architecture specifications).

We design two variants of our approach: (1) 2W-CNN-I (with I for injection), a what/where CNN with both pose and category information injected into the top fully connected layer; (2) 2W-CNN-MI (multi-layer injection), a what/where CNN with category still injected at the top, but pose injected into the top and also directly into all 5 convolutional layers. Our motivation for multiple injection is as follows: it is generally believed that in CNNs, low- and mid-level features are learned mostly in lower layers, while, with increasing depth, more abstract high-level features are learned [141, 149]. Thus, we reasoned that detailed pose information might also be used differently by different layers. “Multi-layer injection” in 2W-CNN-MI is similar to skip connections in neural networks. Skip connection is a more generic terminology, while 2W-CNN-MI uses a specific pattern of skip connections designed specifically to make pose errors back-propagate directly into the lower layers.

Our architecture details are as follows. 2W-CNN-I is built on AlexNet, and we further append a pose layer (88 neurons) to fc7. The architecture is shown in Fig. 6.1. 2W-CNN-I is trained to predict both what and where. We treat both prediction tasks as classification problems, and use softmax regression to define the individual losses. The total loss is the weighted sum of the individual losses:

    L = L(object) + λ · L(pose)    (6.1)

where λ is a balancing factor, set to 1 in experiments. Although we do not have explicit what and where neurons in 2W-CNN-I, feature representations (neuron responses) at fc7 are trained such that both object category (what) and pose information (where) can be decoded; therefore, neurons in fc7 can be seen to have implicitly encoded what/where information.
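A minimal PyTorch sketch of this two-head objective (Eq. 6.1). The backbone below is only a placeholder standing in for the adapted AlexNet described above, not the actual training code, but the structure — one shared feature extractor, a category head, and an auxiliary pose head whose loss is weighted by λ — mirrors 2W-CNN-I:

    import torch
    import torch.nn as nn

    class TwoWCNN_I(nn.Module):
        """Toy stand-in for 2W-CNN-I: a shared backbone with a category head (what)
        and an auxiliary pose head (where) appended to the top features."""
        def __init__(self, feat_dim=1024, n_categories=10, n_poses=88):
            super().__init__()
            # placeholder backbone producing fc7-like features
            self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
            self.category_head = nn.Linear(feat_dim, n_categories)  # what
            self.pose_head = nn.Linear(feat_dim, n_poses)           # where (pruned at test time)

        def forward(self, x):
            feats = self.backbone(x)
            return self.category_head(feats), self.pose_head(feats)

    def loss_2wcnn(obj_logits, pose_logits, obj_labels, pose_labels, lam=1.0):
        # Eq. 6.1: total loss = category softmax loss + lambda * pose softmax loss
        ce = nn.CrossEntropyLoss()
        return ce(obj_logits, obj_labels) + lam * ce(pose_logits, pose_labels)

At test time, only the backbone and the category head would be kept, mirroring the pruning of the auxiliary pose machinery shown in Fig. 6.1.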
Similarly to fc7, neurons in the intermediate layers also implicitly encapsulate what and where, since both pose and category errors at the top are back-propagated consecutively into all layers, and features learned in the low layers are adapted to both what and where.

2W-CNN-MI is built on AlexNet as well, but in this variant we add direct pathways from each convolutional layer to the pose layer, such that feature learning at each convolutional layer is directly affected by pose errors (Fig. 6.1). Concretely, we append two fully connected layers to each convolutional layer, including pool1, pool2, conv3 and conv4, and then add a path from the 2nd fully connected layer to the pose category layer. The fully connected layers appended to pool1 and pool2 have 512 neurons, and those appended to conv3 and conv4 have 1024 neurons. Finally, we directly add a pathway from fc7 to the pose label layer. The reason we do not append two additional fully connected layers to pool5 is that the original AlexNet already has fc6 and fc7 on top of pool5; thus, our fc6 and fc7 are shared by the object category layer and the pose layer. The loss function of 2W-CNN-MI is the same as that of 2W-CNN-I (Eq. 6.1). In 2W-CNN-MI, activations from 5 layers, namely pool1fc2, pool2fc2, conv3fc2, conv4fc2 and fc7, are all fed into the pose label layer, and the responses at the pose layer are the accumulated activations from all 5 pathways (a minimal code sketch of this accumulation is given at the end of this subsection), i.e.,

    a(L_P) = Σ_l a(l) · W_{l→L_P}    (6.2)

where l is one of those 5 layers, W_{l→L_P} is the weight matrix between l and the pose label layer L_P, and a(l) are the feature activations at layer l.

6.5.2 Optimization

We use stochastic gradient descent (SGD) to minimize the loss function. In practice, it either finds a local optimum or a saddle point [28, 96] for non-convex optimization problems. How to escape a saddle point or reach a better local optimum is beyond the scope of our work here. Since both 2W-CNN and AlexNet are optimized using SGD, readers may worry that object recognition performance differences between 2W-CNN and AlexNet might be merely incidental and dependent on initializations; here we show theoretically that it is easier to find a better critical point in the parameter space of 2W-CNN than in the parameter space of AlexNet by using SGD.

Figure 6.3: A simplified CNN used in the proof (the figure shows the input and intermediate representations x_0, x_1, the object output x_2 and pose output x_3, and the weights ω_0, ω_1, ω_2, ω_3).

We prove that, in practice, a critical point of AlexNet is not a critical point of 2W-CNN, while a critical point of 2W-CNN is a critical point of AlexNet as well. Thus, if we initialize the weights in a 2W-CNN from a trained AlexNet (i.e., we initialize ω_1, ω_2 from the trained AlexNet, while initializing ω_3 by random Gaussian matrices in Fig. 6.3), and continue training 2W-CNN by SGD, the parameter solutions will gradually step away from the initial point and reach a new (better) critical point. However, if we initialize the parameters in AlexNet from a trained 2W-CNN and continue training, the parameter gradients in AlexNet at the initial point are already near zero and no better critical point is found. Indeed, in the next section we verify this experimentally.
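Returning to the multi-layer injection of Eq. 6.2, the sketch referenced above accumulates pose logits from several injected layers through one weight matrix per pathway; the layer names and dimensions follow the description of 2W-CNN-MI, but this is an illustrative sketch, not the exact implementation:

    import torch
    import torch.nn as nn

    class MultiInjectionPoseHead(nn.Module):
        """Accumulate pose-layer responses from several layers (Eq. 6.2):
        a(L_P) = sum_l a(l) @ W_{l->L_P}, one weight matrix per injected layer."""
        def __init__(self, layer_dims, n_poses=88):
            super().__init__()
            # one linear pathway (weight matrix) per injected layer
            self.pathways = nn.ModuleDict(
                {name: nn.Linear(dim, n_poses, bias=False) for name, dim in layer_dims.items()})

        def forward(self, activations):
            # activations: dict mapping layer name -> flattened feature tensor of shape (B, dim)
            logits = 0.0
            for name, feats in activations.items():
                logits = logits + self.pathways[name](feats)
            return logits  # summed responses at the pose label layer

    # Dimensions of the injected layers as described above (illustrative usage).
    head = MultiInjectionPoseHead({"pool1fc2": 512, "pool2fc2": 512,
                                   "conv3fc2": 1024, "conv4fc2": 1024, "fc7": 1024})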
Let L = f(ω_1, ω_2) and L̂ = f(ω_1, ω_2) + g(ω_1, ω_3) be the softmax loss functions of AlexNet and 2W-CNN respectively, where f(ω_1, ω_2) in L and L̂ are exactly the same. We show, for practical cases, that: (1) if (ω_1′, ω_2′) is a critical point of L, then (ω_1′, ω_2′, ω_3) is not a critical point of L̂; (2) on the contrary, if (ω_1″, ω_2″, ω_3″) is a critical point of L̂, then (ω_1″, ω_2″) is a critical point of L as well. Here we prove (1) and refer the reader to the supplementary materials for the proof of (2).

    ∂L/∂ω_1′ = ∂f/∂ω_1′ = 0,    ∂L/∂ω_2′ = ∂f/∂ω_2′ = 0,
    ∂L̂/∂ω_1′ = ∂f/∂ω_1′ + ∂g/∂ω_1′ ≠ 0,    ∂L̂/∂ω_2′ = ∂f/∂ω_2′ = 0,    ∂L̂/∂ω_3 = ∂g/∂ω_3 ≠ 0    (6.3)

Proof: assume there is at least one non-zero entry in x_1 ∈ R^1024 (in practice x_1 ≠ 0 and x_0 ≠ 0, i.e., each has at least one non-zero entry; see supplementary materials), and let x_1^{nz} be that non-zero element (nz is its index, nz ∈ {1, 2, ..., 1024}). Without loss of generality, we initialize all entries of ω_3 ∈ R^{88×1024} to be 0, except one entry ω_3^{nz,1}, which is the weight between x_1^{nz} and x_3^1. Since x_3 = ω_3 x_1 (x_3 ∈ R^88), we have x_3^1 = ω_3^{nz,1} · x_1^{nz} ≠ 0, while x_3^i = 0 (i ≠ 1). In the case of softmax regression, we have ∂L̂/∂x_3^c = 1 − e^{x_3^c} / Σ_i e^{x_3^i} and ∂L̂/∂x_3^i = −e^{x_3^i} / Σ_i e^{x_3^i} when i ≠ c, where x_3^i is the i-th entry of the vector x_3 and c is the index of the ground-truth pose label. Since x_3^1 ≠ 0 and x_3^i = 0 (i ≠ 1), we have ∂L̂/∂x_3^i ≠ 0 (i ∈ {1, 2, ..., 88}). By the chain rule, ∂L̂/∂ω_3^{mn} = (∂L̂/∂x_3^n) · (∂x_3^n/∂ω_3^{mn}) = (∂L̂/∂x_3^n) · x_1^m, and therefore, as long as not all entries of x_1 are 0, i.e., x_1 ≠ 0, we have ∂L̂/∂ω_3 = (∂L̂/∂x_3) · x_1 ≠ 0.

To show ∂L̂/∂ω_1′ = ∂f/∂ω_1′ + ∂g/∂ω_1′ ≠ 0, we only have to show ∂g/∂ω_1′ ≠ 0 (since ∂f/∂ω_1′ = 0). By the chain rule, ∂g/∂ω_1′ = (∂g/∂x_1) · (∂x_1/∂ω_1′), where ∂g/∂x_1^i = 0 when i ≠ nz, and ∂g/∂x_1^{nz} = ω_3^{nz,1} − ω_3^{nz,1} · e^{x_3^1} / Σ_i e^{x_3^i} ≠ 0. Let the weight matrix between x_0 and x_1 be ω_0′; by definition, ω_0′ ∈ ω_1′. Now we have ∂x_1^{nz}/∂ω_0′ ≠ 0, otherwise x_0 = 0. Therefore ∂g/∂ω_1′ = (∂g/∂x_1) · (∂x_1/∂ω_1′) ≠ 0, since ∂g/∂x_1 ≠ 0 and ∂x_1/∂ω_1′ ≠ 0.

However, a critical point of 2W-CNN is a critical point of AlexNet, i.e., Eq. 6.4 (see the supplementary materials for the proof, and the next section for experimental validation):

    ∂L̂/∂ω_1″ = ∂f/∂ω_1″ + ∂g/∂ω_1″ = 0,    ∂L̂/∂ω_2″ = ∂f/∂ω_2″ = 0,    ∂L̂/∂ω_3″ = ∂g/∂ω_3″ = 0,
    ∂L/∂ω_1″ = ∂f/∂ω_1″ = 0,    ∂L/∂ω_2″ = ∂f/∂ω_2″ = 0    (6.4)

6.6 Experiments

In experiments, we demonstrate the effectiveness of 2W-CNN for object recognition against linear-chain deep architectures (e.g., AlexNet) using the iLab-20M dataset. We do both quantitative comparisons and qualitative evaluations. Furthermore, to show that the features learned on iLab-20M are useful for generic object recognition, we adopt the “pretrain - fine-tuning” paradigm, and fine-tune object recognition on the ImageNet dataset [31] using the pretrained AlexNet and 2W-CNN-MI features from the iLab-20M dataset.

6.6.1 Dataset setup

Object categories: we use 10 (out of 15) categories of objects in our experiments (Fig. 6.6), and, within each category, we randomly use 3/4 of the instances as training data and the remaining 1/4 of the instances for testing. Under this partition, instances in test are never seen during training, which minimizes the overlap between training and testing.
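A minimal sketch of this instance-level split (the instance identifiers below are hypothetical; the actual partition is part of the experimental setup and is not reproduced here):

    import random

    def split_instances(instances_by_category, train_frac=0.75, seed=0):
        """Split at the instance level so that test instances are never seen in training."""
        rng = random.Random(seed)
        train, test = [], []
        for category, instances in instances_by_category.items():
            instances = list(instances)
            rng.shuffle(instances)
            k = int(round(train_frac * len(instances)))
            train += [(category, inst) for inst in instances[:k]]
            test += [(category, inst) for inst in instances[k:]]
        return train, test

    # e.g. train_ids, test_ids = split_instances({"car": ["car-001", "car-002"], "tank": ["tank-001"]})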
Pose: here we take images shot under one fixed light source (with all 4 lights on) and camera focus (focus = 1), but all 11 camera azimuths and all 8 turntable rotations (88 poses). We end up with 0.65M (654,929) images in the training set and 0.22M (217,877) in the test set. Each image is associated with 1 (out of 10) object category label and 1 (out of 88) pose label.

6.6.2 CNNs setup

We train 3 CNNs, AlexNet, 2W-CNN-I and 2W-CNN-MI, and compare their performances on object recognition. We use the same initialization for their common parameters: we first initialize AlexNet with random Gaussian weights, and re-use these weights to initialize the AlexNet component in 2W-CNN-I and 2W-CNN-MI. We then randomly initialize the additional parameters in 2W-CNN-I / 2W-CNN-MI.

No data augmentation: to train AlexNet for object recognition, in practice one often takes random crops and also horizontally flips each image to augment the training set. However, to train 2W-CNN-I and 2W-CNN-MI, we could take random crops but we should not horizontally flip images, since flipping creates a new, unknown pose. For a fair comparison, we do not augment the training set, such that all 3 CNNs use the same amount of images for training.

Optimization settings: we run SGD to minimize the loss function, but use different starting learning rates and dropout rates for different CNNs. AlexNet and 2W-CNN-I have similar amounts of parameters, while 2W-CNN-MI has 15 times more parameters during training (but remember that all three models have the exact same number of parameters during test). To control overfitting, we use a smaller starting learning rate (0.001) and a higher dropout rate (0.7) for 2W-CNN-MI, while for AlexNet and 2W-CNN-I we set the starting learning rate and dropout rate to 0.01 and 0.5. Each network is trained for 30 epochs, approximately 150,000 iterations. To further avoid any training setup differences, within each training epoch we fix the image order. We train the CNNs using the publicly available MatConvNet [124] toolkit on an Nvidia Tesla K40 GPU.

6.6.3 Performance evaluation

In this section, we evaluate the object recognition performance of the 3 CNNs. As mentioned, for both 2W-CNN-I and 2W-CNN-MI, object pose information is only used in training, and the associated machinery is pruned away before test (Fig. 6.1). Since all three architectures are trained by SGD, the solutions depend on initializations. To alleviate the randomness of SGD, we run SGD under different initializations and report the mean accuracy. We repeat the training of the 3 CNNs under 5 different initializations, and report their mean accuracies and standard deviations in Table 6.1. Our main result is: (1) 2W-CNN-MI and 2W-CNN-I outperform AlexNet by 6% and 5%; (2) 2W-CNN-MI further improves the accuracy by 1% compared with 2W-CNN-I. This shows that, under the regularization of additional pose information, 2W-CNN learns better deep features for object recognition.

                 AlexNet            2W-CNN-I           2W-CNN-MI
    accuracy     0.785 (±0.0019)    0.837 (±0.0022)    0.848 (±0.0031)
    mAP          0.787              0.833              0.850

Table 6.1: Object recognition performances of AlexNet, 2W-CNN-I and 2W-CNN-MI on the iLab-20M dataset. 2W-CNN-MI performed significantly better than AlexNet (t-test, p < 1.6 × 10^-5), and so did 2W-CNN-I (p < 6.3 × 10^-5). 2W-CNN-MI was also significantly better than 2W-CNN-I (p < .013).

As proven in Sec. 6.5.2, in practice a critical point of AlexNet is not a critical point of 2W-CNN, while a critical point of 2W-CNN is a critical point of AlexNet.
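This asymmetry can be sanity-checked numerically on a toy stand-in for the simplified network of Fig. 6.3, before turning to the full-scale verification described next; the dimensions, single training sample and plain gradient descent below are illustrative assumptions only, not the actual experiment:

    import torch

    torch.manual_seed(0)
    x0 = torch.randn(5)                                 # toy input
    w1 = torch.randn(4, 5, requires_grad=True)          # shared weights (omega_1)
    w2 = torch.randn(3, 4, requires_grad=True)          # object head (omega_2)
    w3 = torch.randn(2, 4, requires_grad=True)          # pose head (omega_3)
    obj_y, pose_y = torch.tensor([1]), torch.tensor([0])
    ce = torch.nn.functional.cross_entropy

    def f():  # AlexNet-style loss: object softmax loss only
        return ce((w2 @ (w1 @ x0)).unsqueeze(0), obj_y)

    def g():  # auxiliary pose softmax loss, sharing w1
        return ce((w3 @ (w1 @ x0)).unsqueeze(0), pose_y)

    # Drive f(w1, w2) toward a critical point with plain gradient descent.
    for _ in range(5000):
        grad_w1, grad_w2 = torch.autograd.grad(f(), [w1, w2])
        with torch.no_grad():
            w1 -= 0.1 * grad_w1
            w2 -= 0.1 * grad_w2

    grad_f = torch.autograd.grad(f(), w1)[0]          # near zero: (near) critical point of f
    grad_fg = torch.autograd.grad(f() + g(), w1)[0]   # clearly non-zero: not critical for f + g
    print(grad_f.norm().item(), grad_fg.norm().item())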
We verify the critical points of the two networks experimentally: (1) we use the trained AlexNet parameters to initialize 2W-CNN, and run SGD to continue training 2W-CNN; (2) conversely, we initialize AlexNet from the trained 2W-CNN and continue training for some epochs. We plot the object recognition error rates on test data against training epochs in Fig. 6.4. As shown, 2W-CNN clearly reaches a new and better critical point distinct from the initialization, while AlexNet stays around the same error rate as the initialization.

6.6.4 Decoupling of what and where

2W-CNNs are trained to predict what and where. Although 2W-CNNs do not have explicit what and where neurons, we experimentally show that what and where information is implicitly expressed by neurons in different layers.

Figure 6.4: Critical points of CNNs. We initialize one network from the other, trained network, continue training, and record test errors after each epoch. Starting from a critical point of AlexNet, 2W-CNN-MI steps away from it and reaches a new and better critical point, while AlexNet initialized from 2W-CNN-MI fails to further improve on test performance.

One might speculate that in 2W-CNN, more than in standard AlexNet, different units in the same layer might become either pose-defining or identity-defining. A pose-defining unit should be sensitive to pose, but invariant to object identity, and conversely. To quantify this, we use entropy to measure the uncertainty of each unit with respect to pose and identity. We estimate the pose and identity entropies of each unit as follows: we use all test images as inputs, and we calculate the activation (a_i) of each image for that unit. Then we compute histogram distributions of the activations against object category (10 categories in our case) and pose (88 poses), and let the two distributions be P_obj and P_pos respectively. The entropies of these two distributions, E(P_obj) and E(P_pos), are defined to be the object and pose uncertainty of that unit. For units to be pose-defining or identity-defining, one entropy should be low and the other high, while for units with identity and pose coupled together, both entropies are high.

Assume there are n units on some layer l (e.g., 256 units on pool5), each with identity and pose entropies E_i(obj) and E_i(pos) (i ∈ {1, 2, ..., n}), and we organize the n identity entropies into a vector E(obj) = [E_1(obj), E_2(obj), ..., E_n(obj)] and the n pose entropies into a vector E(pos) = [E_1(pos), E_2(pos), ..., E_n(pos)]. If the n units are pose/identity decoupled, then E(obj) and E(pos) are expected to be negatively correlated. Concretely, for entries at corresponding locations, if one is large, the other should be small. We define the correlation coefficient between E(obj) and E(pos) in Eq. 6.5 to be the decouple-ness of the n units on layer l; the more negative it is, the better the units are pose/identity decoupled.

    γ = corrcoef(E(obj), E(pos))    (6.5)

We compare the decouple-ness of units from our 2W-CNN architecture against those from AlexNet. We take all units from the same layer, including pool1, pool2, conv3, conv4, pool5, fc6 and fc7, compute their decouple-ness and plot them in Fig. 6.5.
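A minimal NumPy sketch of this decouple-ness measure (Eq. 6.5). The arrays activations (one row per test image, one column per unit, assumed non-negative post-ReLU responses), category_labels and pose_labels are assumed inputs, and the histogram-style entropy estimate follows the verbal description above rather than the exact implementation:

    import numpy as np

    def label_entropy(unit_act, labels, n_labels):
        """Entropy of one unit's activation mass distributed over label bins."""
        mass = np.zeros(n_labels)
        for lab in range(n_labels):
            mass[lab] = unit_act[labels == lab].sum()
        p = mass / (mass.sum() + 1e-12)
        p = p[p > 0]
        return -(p * np.log(p)).sum()

    def decoupleness(activations, category_labels, pose_labels, n_cat=10, n_pose=88):
        """gamma = corrcoef(E(obj), E(pos)) over the units of one layer (Eq. 6.5)."""
        n_units = activations.shape[1]
        e_obj = np.array([label_entropy(activations[:, u], category_labels, n_cat)
                          for u in range(n_units)])
        e_pos = np.array([label_entropy(activations[:, u], pose_labels, n_pose)
                          for u in range(n_units)])
        return np.corrcoef(e_obj, e_pos)[0, 1]   # more negative = better decoupled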
Fig. 6.5 reveals that: (1) units in 2W-CNN from different layers are better what/where decoupled; some learn to capture pose, while others learn to capture identity; (2) units in the earlier layers (e.g., pool2, conv3, conv4) are better decoupled in 2W-CNN-MI than in 2W-CNN-I, which is expected since pose errors are directly back-propagated into these earlier layers in 2W-CNN-MI. This also indicates that pose and identity information are implicitly expressed by different units, although we have no explicit what/where neurons in 2W-CNN.

6.6.5 Feature visualizations

We extract 1024-dimensional features from fc7 of 2W-CNN and AlexNet as image representations, and use t-SNE [123] to compute their 2D embeddings and plot the results in Fig. 6.6. Seen qualitatively, object categories are better separated by 2W-CNN representations: for example, “military car” (magenta pentagram) and “tank” (purple pentagram) representations under 2W-CNN have a clear boundary, while their AlexNet representation distributions penetrate into each other. Similarly, “van” (green circle) and “car” (brown square) are better delineated by 2W-CNN as well.

Figure 6.5: Decoupling of what and where. This figure shows the pose/identity decouple-ness of units from the same layer. 2W-CNN makes pose and identity better decoupled than AlexNet, which indicates that neurons at intermediate layers of 2W-CNN implicitly segregate what and where information.

Figure 6.6: t-SNE visualization of fc7 features. The learned deep features at fc7 are better delineated by 2W-CNN than AlexNet.

We further visualize the receptive fields of units at different layers of AlexNet and 2W-CNN. The filters of conv1 can be directly visualized, while to visualize RFs of units on other layers, we adopt the method used in [149]: we use all test images as input, compute their activation responses for each unit on each layer, and average the top 100 images with the strongest activations as a receptive field visualization of each unit. Fig. 6.7 shows the receptive fields of units on conv1, pool2, conv3 and pool5 of the two architectures, AlexNet on top and 2W-CNN-MI on bottom. It suggests qualitatively that: (1) 2W-CNN has more distinctive and fewer dead filters on conv1; (2) AlexNet learns many color filters, which can be seen especially on conv1, pool2 and conv3. While color benefits object recognition in some cases, configural and structural information is more desirable in most cases; 2W-CNN learns more structural filters.

Figure 6.7: Visualization of receptive fields of units at different layers. The top (bottom) row shows receptive fields of AlexNet (2W-CNN).

6.6.6 Extension to ImageNet object recognition

ImageNet has millions of labeled images, and thus pre-training a ConvNet on another dataset has been shown to yield insignificant effects [60, 80]. To show that the pretrained 2W-CNN-MI and AlexNet on iLab-20M learn useful features for generic object recognition, we fine-tune the learned weights on ImageNet when we can only access a small amount of labeled images. We fine-tune AlexNet using 5, 10, 20, 40 images per class from the ILSVRC-2010 challenge.
AlexNet is trained and evaluated on ImageNet under three cases: (1) from scratch (using random Gaussian initialization), (2) from the pretrained AlexNet on iLab-20M, (3) from the pretrained 2W-CNN-MI on iLab-20M. When we pretrain 2W-CNN-MI on the iLab-20M dataset, we set the units on fc6 and fc7 back to 4096. The AlexNet used in pretraining and fine-tuning follows exactly the one in [74]. We report top-5 object recognition accuracies in Table 6.2.

    # of images/class              5      10     20     40
    AlexNet (scratch)              1.47   4.15   16.45  25.89
    AlexNet (AlexNet-iLab20M)      7.74   12.54  19.42  28.75
    AlexNet (2W-CNN-MI-iLab20M)    9.27   14.85  23.14  31.60

Table 6.2: Top-5 object recognition accuracies (%) on the test set of ILSVRC-2010, with 150 images per class and a total of 150K test images. First, fine-tuning AlexNet from the pretrained features on the iLab-20M dataset clearly outperforms training AlexNet from scratch, which shows that features learned on the iLab-20M dataset generalize to ImageNet as well. Second, fine-tuning from the pretrained 2W-CNN-MI (2W-CNN-MI-iLab20M) performs even better than from the pretrained AlexNet (AlexNet-iLab20M), which shows that our 2W-CNN-MI architecture learns even more effective features for object recognition than AlexNet.

Quantitative results: we have two key observations. (1) When a limited number of labeled images is available, fine-tuning AlexNet from the pretrained features on the iLab-20M dataset outperforms training AlexNet from scratch; e.g., the relative improvement is as large as ∼530% when we have only 5 samples per class. When more labeled images are available, the improvement decreases, but we still achieve ∼22% improvement when 40 labeled images per class are used. This clearly shows that features learned on the iLab-20M dataset generalize well to the natural image dataset ImageNet. (2) Fine-tuning from the pretrained 2W-CNN-MI on iLab-20M performs even better than from the pretrained AlexNet on iLab-20M, and this shows that 2W-CNN-MI learns even better features for general object recognition than AlexNet. These empirical results show that training object categories jointly with pose information makes the learned features more effective.

Figure 6.8: Pose estimation of ImageNet images using the trained 2W-CNN-MI on the iLab-20M dataset. Given a test image, 2W-CNN-MI trained on the iLab-20M dataset can predict one discrete pose (out of 88). In the figure, each row shows the top 10 vehicle images which have the same predicted pose label. Qualitatively, images on the same row do have similar viewpoints, showing that 2W-CNN-MI generalizes well to natural images, even though it is trained on our turntable dataset.

Qualitative results: the trained 2W-CNN-MI on iLab-20M can predict object pose as well; here, we directly use the trained 2W-CNN-MI to predict the pose of each test image from ILSVRC-2010. Each test image is assigned a pose label (one out of 88 discrete poses, in our case) with some probability. For each discrete pose, we choose 10 vehicles whose prediction probabilities at that pose are among the top 10, and visualize them in Fig. 6.8. Each row in Fig. 6.8 shows the top 10 vehicles whose predicted pose labels are the same, and, as observed, they do have very similar camera viewpoints. This qualitative result shows that pose features learned by 2W-CNN-MI generalize to ImageNet as well.

6.7 Conclusion

Although in experiments we built 2W-CNN on AlexNet, we could use any feed-forward architecture as a base (e.g., see results using VGG in Supplementary Materials).
Our results show that better training can be achieved when explicit absolute pose information is available. We further show that the pretrained AlexNet and 2W-CNN features on iLab-20M generalize to the natural image dataset ImageNet, and, moreover, the pretrained 2W-CNN features are shown to be advantageous over the pretrained AlexNet features on real datasets as well. We believe that this is an important finding when designing new datasets to assist the training of object recognition algorithms, in complement to the existing large test datasets and challenges.

Chapter 7
Object Recognition: disentangling CNN

7.1 Abstract

Most ConvNets formulate object recognition from natural images as a single-task classification problem, and attempt to learn features useful for object categories, but invariant to other factors of variation such as pose and illumination. They do not explicitly learn these other factors; instead, they usually discard them by pooling and normalization. Here, we take the opposite approach: we train ConvNets for object recognition by retaining other factors (pose in our case) and learning them jointly with object category. We design a new multi-task learning (MTL) ConvNet, named disentangling CNN (disCNN), which explicitly enforces disentangled representations of object identity and pose, and is trained to predict object categories and pose transformations. disCNN achieves significantly better object recognition accuracies than the baseline CNN trained solely to predict object categories on the iLab-20M dataset, a large-scale turntable dataset with detailed pose and lighting information. We further show that the features pretrained on iLab-20M generalize to both the Washington RGB-D and ImageNet datasets, and the pretrained disCNN features are significantly better than the pretrained baseline CNN features for fine-tuning on ImageNet.

7.2 Introduction

Images are generated under factors of variation, including pose, illumination, etc. Recently, deep ConvNet architectures have learned rich and high-performance features by leveraging millions of labelled images, and they have achieved state-of-the-art object recognition performance. Contemporary CNNs, such as AlexNet [74], VGG [116], GoogLeNet [122] and ResNet [57], pose object recognition as a single-task learning problem, and learn features that are sensitive to object categories but invariant to other nuisance information (e.g., pose and illumination) [119] as much as possible. To achieve this, current CNNs usually stack several stages of subsampling/pooling [81] and apply normalization operations [74, 66] to make representations invariant to small pose variations and illumination changes. However, as argued by Hinton et al. [61], to recognize objects, neural networks should use “capsules” to encode both identity and other instantiation parameters (including pose, lighting and shape deformations). In [12, 104], the authors argue as well that image understanding requires teasing apart these factors, instead of emphasizing one and disregarding the others.

In this work, we formulate object recognition as a multi-task learning (MTL) problem by taking images as inputs and learning both object categories and other image generating factors (pose in our case) simultaneously.
Thanks to the availability of both identity and 3D pose labels in the iLab-20M dataset of 22 million images of objects shot on a turntable, we use object identity and pose during training, and then investigate further generalization to other datasets which lack pose labels (Washington RGB-D and ImageNet). Contrary to the usual approach of learning representations invariant to pose changes, we take the opposite route by retaining the pose information and learning it jointly with object identities during the training process.

We leverage the power of ConvNets for high-performance representation learning, and build our MTL framework on them. Concretely, our architecture is a two-stream ConvNet which takes a pair of images as inputs and predicts both the object category and the pose transformation between the two images. Both streams share the same CNN architecture (e.g., AlexNet) with the same weights and the same operations on each layer. Each stream independently extracts features from one image. In the top layer, we explicitly partition the representation units into two groups, with one group representing object identity and the other its pose. Object identity representations are passed on to predict object categories, while the two pose representations are concatenated to predict the pose transformation between images (Fig. 7.1). By explicitly partitioning the top CNN layer units into groups, we learn the ConvNet in such a way that each group extracts features useful for its own task and explains one factor of variation in the image.

We refer to our architecture as the disentangling CNN (disCNN), with disentangled representations for identity and pose. During training, disCNN takes a pair of images as inputs, and learns features by using both object categories and pose transformations as supervision. The goal of disCNN is to recognize objects; therefore, at test time, we take only one stream of the trained disCNN, use it to compute features for the test image, and only the identity representations in the top layer are used and fed into the object category layer for categorization. In other words, pose representations are not used at test time, and the pose-transformation prediction task during training is auxiliary to the object recognition task, but essential for better feature learning.

7.3 Related work

ConvNets: over the past several years, convolutional neural networks [81] have pushed forward the state-of-the-art in many vision tasks, including image classification [74, 116, 122, 57], object detection [111, 46], image segmentation [22, 88], activity recognition [114, 47], etc. These tasks leverage the power of CNNs to learn rich features useful for the target tasks, and [2] show that features learned by CNNs on one task can be generalized to other tasks. We aim to learn feature representations for different image generating factors, and we employ ConvNets as our building base.

Multitask learning: several efforts have explored multi-task learning using deep neural networks, for face detection, phoneme recognition, and scene classification [110, 144, 142, 65]. All of them use a similar linear feed-forward architecture, with all task label layers directly appended onto the top layer. In the end, all tasks in these applications share the same representations. More recently, Su et al. [120] use a CNN to estimate the camera viewpoint of the input image.
They pose their problem as MTL by assuming that the viewpoint estimate is object-class-dependent, and stack class-specific viewpoint layers on top of the CNN. Our work differs from the above in that we use two-stream CNNs and we explicitly partition the top layer representation into groups, with each group representing one task; therefore we have task-exclusive representations, while in the above works all tasks share the same top layer representations.

Disentangling: as argued by Bengio [12], one of the key challenges in understanding images is to disentangle the different factors, e.g., shape, texture, pose and illumination, that generate natural images. Reed et al. [104] proposed the disentangling Boltzmann Machine (disBM), which augments the regular RBM by partitioning the hidden units into distinct factors of variation and modelling their high-order interactions. In [152], the authors build a stochastic multi-view perceptron to factorize the face identity and its view representations into different sets of neurons, in order to achieve view-invariant face recognition. Our work is similar to the above two in that we explicitly partition the representations into distinct groups to force different factors to be disentangled; however, our model is deterministic and scales to large datasets, while the above methods are restricted to small datasets and often require expensive sampling inference.

Dosovitskiy et al. [37] proposed to use a CNN to generate images of objects given object style, viewpoint and color. Their model essentially learns to simulate the graphics rendering process, but does not directly apply to image interpretation. Kulkarni et al. [76] presented the Inverse Graphics Network (IGN), an encoder-decoder that learns to generate new images of an object under varying poses and lighting. The encoder of IGN learns a disentangled representation of transformations including pose, light and shape. Yang et al. [138] proposed a recurrent convolutional encoder-decoder network to render 3D views from a single image. They explicitly split the top layer representations of the encoder into identity and pose units. Our work is similar to [76, 138] in using distinct units to represent different factors, but we differ in that: (1) our architecture is an MTL CNN which maps images to discrete labels, while theirs are autoencoders mapping images to images; (2) our model directly applies to large numbers of categories with complex images, while [76, 138] only tested their models on face and chair datasets with plain backgrounds.

Our work is most similar to [2], which shows that freely available egomotion data of mobile agents provides as good supervision as expensive class labels for CNNs learning useful features for different vision tasks. Here, we also use stereo pairs of images as inputs to learn the camera motions; however, we differ in that: (1) our architecture is an MTL framework, in which the task of camera-motion (pose-transformation) prediction serves as an auxiliary task to help object recognition; (2) our network is more flexible, and can take in either one image or a stereo pair; (3) our MTL disCNN learns much better features for object recognition than the baseline CNN using only class labels as supervision, while their single-task two-stream CNNs only learn comparable features.

Figure 7.1: Architecture of our disentangling CNN (disCNN).
disCNN is a two-stream CNN, which takes in an image pair and learns to predict the object category and the pose transformation jointly. In experiments, we use AlexNet in both streams to extract features, and explicitly partition the top layer representations fc7 into two groups: identity and pose. We further enforce the two identity representations to be similar; one identity representation is used for object category prediction, and the two pose representations are concatenated to predict the pose transformation.

7.4 Method

Object identity is here defined to be the identity of one instance. Distinct instances, whether or not they belong to the same class, have different object identities. Object pose refers to the extrinsic parameters of the camera taking the image, but given a single natural image taken by a consumer camera, it is hard to obtain the camera extrinsics; therefore ground-truth object poses are expensive and sometimes impossible to collect in real cases. Camera extrinsics are known when we render 2D images from 3D models [120], but rendered 2D images are very different from natural images. Although single-camera extrinsics are hard to get in real cases, the relative translation and orientation (a.k.a. camera motion), represented by an essential matrix E, between a pair of images is relatively easier to compute: e.g., for calibrated cameras, first find 8 pairs of matched points, then use the normalized 8-point algorithm [56] to estimate it. The camera motion between an image pair captures the pose transformation between the objects in the two images.
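As an illustration of that computation, the OpenCV-based sketch below estimates the relative rotation and translation between two calibrated views from matched points; the chapter cites the normalized 8-point algorithm [56], whereas this sketch uses OpenCV's RANSAC-based estimator as a practical stand-in, and pts1, pts2 and K are assumed inputs rather than part of the actual pipeline:

    import cv2
    import numpy as np

    def relative_pose(pts1, pts2, K):
        """Estimate camera motion (R, t up to scale) between two calibrated views
        from matched image points pts1, pts2 (Nx2 float arrays, N >= 8) and intrinsics K."""
        E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                          prob=0.999, threshold=1.0)
        _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
        return R, t   # t is only defined up to scale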
After getting fc7 representations, we explicitly partition the fc7 units into two groups, with one group representing object identity and the other representing object pose in a single image. Since object instances in an image pair are the same, we enforce the two identity repre- sentations to be similar by penalizing their ` 2 -norm differences, i.e.kid 1 −id 2 k 2 , where id 1 and id 2 are identity representations of two stereo images. One identity representation (either id 1 or id 2 ) is further fed into object-category label layer for object-category prediction. Two pose representations,pose 1 andpose 2 , are fused to predict the relative pose transformation, i.e., under which camera-pair the image- pair is taken. Our objective function is therefore the summation of two soft-max losses and one ` 2 loss: L = L(object)+ λ 1 L(posetransformation)+ λ 2 kid 1 −id 2 k 2 (7.1) We follow AlexNet closely, which takes a 227× 227 image as input, and has 5 convolutional layers and 2 fully connected layers. ReLU non-linearities are used after every convolutional/fully-connected layer, and dropout is used in both fully connected layers, with dropout rate 0.5. The only change we make is to change the number of units on both fc6 and fc7 from 4096 to 1024, and one half of the units (512) are used to represent identity and the other half to represent pose. If we use abbreviations Cn, Fn, P, D, LRN, ReLU to represent a convolutional layer with n filters, a fully connected layer with n filters, a pooling layer, a dropout layer, a local responsenormalizationlayerandaReLUlayer, thentheAlexNet-typearchitecture 156 used in our experiments is: C96-P-LRN-C256-P-LRN-C384-C384-C256-P-F1024- D-F1024-D (we omit ReLU to avoid cluttering). If not explicitly mentioned, this is the baseline architecture for all experiments. (a) Case One Case Two Case Three Case Four (b) Figure 7.2: Exemplar iLab-20M images and camera pairs. (a) images of the same object instance taken by different cameras under different rotations, each row is takenunderthesamecameraunderdifferentrotations, andeachcolumnistakenby different cameras under the same rotation; (b) camera pairs used in experiments. Notes: (1) the proposed two-stream CNN architecture is quite flexible in that: it could either take a single image or an image-pair as inputs. For a single image input, no pose transformation label is necessary, while for an image-pair input, it is not required to have an object-category label. For a pair of images without the object label, its loss reduces to two terms: λ 1 L(pose−transformation) +λ 2 k id 1 −id 2 k 2 , the soft-max loss of the predicted pose-transformation and the ` 2 loss of two identity representations. Given a single image with a object label, the loss incurred by it reduces to only one term: the soft-max loss of the predicted category labelL(object). (2) Scaling the same image-pair by different scales does not change its pose transformation label. In our case, each camera-pair has a unique essential matrix (up to some scale), and defines one pose-transformation label. By up/down scaling both images in a pair, the estimated essential matrix differs only by a scale factor. Since the essential matrix estimated from the raw image-pair is already uncertain 157 up to a scale factor (e.g. using the eight-point method for estimation [56]), the essential matrix estimated from the scaled pairs is equivalent to that estimated from the raw pair. 
This is useful when objects have large scale differences: we could scale them differently to make them have similar scales (see experiments on Washington RGB-D dataset). 7.5 Experiments In experiments, we first show the effectiveness of disCNN for object recognition against AlexNet on both iLab-20M and Washington RGB-D datasets. We further demonstrate that the pretrained disCNN on the iLab-20M dataset learns useful features for object recognition on the ImageNet dataset [31]: a AlexNet initialized with disCNN weights performs significantly better than a AlexNet initialized with random Gaussian weights. 7.5.1 iLab-20M dataset The iLab-20M dataset [14] is a controlled, parametric dataset collected by shooting images of toy vehicles placed on a turntable using 11 cameras at differ- ent viewingpoints. There are totally 15 object categories with each object having 25∼160 instances. Each object instance was shot on more than 14 backgrounds (printed satellite images), in a relevant context (e.g., cars on roads, trains on rail- tracks, boats on water). In total, 1,320 images were captured for each instance and background combinations: 11 azimuth angles (from the 11 cameras), 8 turntable rotation angles, 5 lighting conditions, and 3 focus values (-3, 0, and +3 from the default focus value of each camera). The complete dataset consists of 704 158 object instances, with 1,320 images per object-instance/background combination, or almost 22M images. Trainingandtestinstances: weuse10(outof15)objectcategoriesinourexper- iments (Fig. 7.4), and, within each category, we randomly choose 3/4 instances as training and the remaining 1/4 instances for testing. Under this partition, instances in test are never seen during training. Image-pairs: we only take images shot under one fixed lighting condition (with all 4 lights on) and camera focus (focus = 0), but all 11 camera azimuths and all 8 turntable rotations as training and test images, equivalent to 88 virtual cameras on a semi-sphere. In principle, we can take image-pairs taken under any camera-pairs (e.g. any pair fromC 2 88 combinations), however, one critical problem is that image- pairs taken under camera-pairs with large viewpoint differences have little overlap, which makes it difficult, or even impossible to predict the pose-transformation (e.g., difficult to estimate the essential matrix). Therefore, in experiments, we only consider image-pairs taken by neighboring-camera pairs. All image-pairs shot under a fixed camera-pair share the same pose-transformation label, and finally the total number of pose-transformation labels is equal to the number of camera-pairs. In experiments, we consider different numbers of camera-pairs, and evaluate the influence on the performance of disCNN. # of camera pairs 7 11 18 56 AlexNet 79.07 78.89 79.60 79.25 disCNN 81.30 83.66 83.60 83.66 Table 7.1: Object recognition accuracies (%) of AlexNet and disCNN on the iLab- 20M dataset. disCNN consistently outperforms AlexNet under different numbers of camera pairs used as supervision, showing the advantage of jointly learning object identity and its pose. We see as well: disCNN performs better when more camera-pairs are used, e.g., the performance of disCNN increases by 2% when≥11 camera pairs are used, compared with 7 camera pairs. 159 Fig. 7.2 shows images of one instance shot under different cameras and rota- tions: each row is shot by the same camera under different turntable rotations, and each column is shot by different cameras under the same turntable rotation. 
In experiments, we use different numbers of camera-pairs as supervision, there- fore, only take image-pairs shot under the chosen camera-pairs as training. Case one (Fig. 7.2 (a) topleft): we take two neighboring cameras as one camera-pair (we skip 1 camera, i.e., C i −C i+2 is a camera-pair), resulting in 7 camera-pairs, therefore 7 pose-transformation labels. Image pairs taken by the same camera-pair under different rotations share the same pose-transformation label. Case two (Fig. 7.2 (b) topright): two images taken by one camera under two adjacent rotations ((C i R j , C i R j+1 ))canbeimaginedtobetakenbyapairofvirtualcameras,resulting in 11 camera-pairs with 1 pair referring to one camera under two adjacent rota- tions. Case three (Fig. 7.2 (c) bottomleft): we combine 7 camera-pairs in case one and 11 camera-pairs in case two, and a total of 18 camera pairs. Case four (Fig. 7.2 (d) bottomright): in addition to take image-pairs taken under neighboring cameras (the same rotation) and neighboring rotations (the same camera), we further take diagonal image-pairs taken under neighboring-cameras and neighboring-rotations (i.e., (C i R j , C i+1 R j+1 ) and (C i R j+1 , C i+1 R j )). At last we have 56 camera-pairs. By taking image-pairs from the chosen camera-pairs, we end up 0.42M, 0.57M, 0.99M and 3M training image-pairs in 4 cases respectively. After training, we take the trained AlexNet-type architecture out and use it to predict the object category of a test image. We have a total of 0.22M test images by split. Implementation details: Since we have prepared training pairs for disCNN, we use the left images of training pairs as the training data for AlexNet. Therefore AlexNet and disCNN have the same number of training samples, with one image in AlexNet corresponding to an image pair in disCNN (Note: duplicate training 160 images exist in AlexNet). To do a fair comparison, we train both AlexNet and disCNN using SGD under the same learning rate, the same number of training epochs and the same training order within each epoch. We setλ 1 = 1 andλ 2 = 0.1 in the objective function 7.1 of disCNN. Practically, λ 1 and λ 2 are set such that the derivatives of three separate loss terms to the parameters are at a similar scale. Both AlexNet and disCNN are trained for 20 epochs under 4 cases. The initial (final) learning rate is set to be 0.01 (0.0001), which is reduced log linearly after each epoch. The ConvNets are trained on one Tesla K40 GPU using the toolkit [125]. Results: the object recognition performances are shown in Table 7.1. We have the following observations: (1) disCNN consistently outperforms AlexNet under differentnumbersofcamerapairs, withtheperformancegainupto∼ 4%; (2)when we have more camera-pairs, the performance gap between disCNN and AlexNet widens, e.g.,∼ 4% gain under 11,18,56 camera pairs compared with∼ 2% gain under 7 camera pairs. One potential reason is that when more camera pairs are used, more views of the same instance are available for training, therefore, a higher recognition accuracy is expected. But as observed, the performances of disCNN flatten when more camera pairs are used, e.g. the same performance under 18 and 56 camera pairs. 
One possible interpretation is: although we have 56 camera pairs, the diagonal camera-pairs in the case of 56 pairs do provide new pose transforma- tion information, since the motion between a diagonal pair could be induced from motions of two camera pairs in the case of 18 pairs, a horizontal camera pair and a vertical camera pair. Qualitative visualizations: Fig. 7.4 (a,b) shows the learned conv1 filters of AlexNet and disCNN in case 3, and (c,d) show their corresponding between-class confusion matrices. As seen, disCNN learns more edge-shaped filters, and disCNN 161 disCNN AlexNet query (a) (b) Figure 7.3: Examples of k nearest neighbors of query images. Images are first representedbyfc7-identity(disCNN,512D)andfc7(AlexNet, 1024D)features, and then 5 nearest neighbors are searched based on ` 2 distances in the representation spaces. On each row, the 1st image is the query image, and the next 5 (the last 5) images are retrieved nearest neighbors by disCNN and AlexNet. In group (a), disCNN always returns the same instance but under different poses as the nearestneighbors, butAlexNetfailstoretrievethesameinstance, insteaditreturns instanceswithdifferentidentitiesbutsimilarposes. Ingroup(b), althoughdisCNN fails to retrieve the right instance, it does find instances with similar shapes to the query image. In this case, AlexNet retrieves the correct instances with the same identity, but again, the poses of the retrieved images are very similar to the query one. This qualitative result shows disCNN disentangles identity from pose, to some extent. improves the recognition accuracies for 8 classes, draws on one and loses one. Fig. 7.3showsknearestneighborsofthequeryimage, basedonthe` 2 distancesbetween their fc7-identity (disCNN, 512D) and fc7 (AlexNet, 1024D) representations. We canseeclearlythatdisCNNsuccessfullyretrievesimagesofthesameinstanceunder different poses as the nearest neighbors (Fig. 7.3 (a)). Although in some cases (Fig. 7.3 (b)), AlexNet find different images of the same instance as the nearest neighbors, these retrieved neighbors clearly share similar poses as the query image. 162 (a) (b) 0.90 0.02 0.00 0.03 0.01 0.48 0.00 0.00 0.00 0.30 0.01 0.94 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.95 0.00 0.00 0.00 0.04 0.00 0.00 0.00 0.02 0.01 0.00 0.70 0.00 0.06 0.00 0.00 0.10 0.00 0.00 0.02 0.00 0.00 0.85 0.09 0.01 0.00 0.01 0.00 0.02 0.00 0.00 0.04 0.12 0.29 0.01 0.05 0.00 0.04 0.00 0.00 0.05 0.00 0.00 0.03 0.95 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.94 0.00 0.00 0.00 0.00 0.00 0.22 0.00 0.00 0.00 0.01 0.88 0.01 0.04 0.01 0.00 0.01 0.01 0.04 0.00 0.00 0.00 0.63 car f1car helicopter plane pickup military monster semi tank van car f1car helicopter plane pickup military monster semi tank van 0.95 0.02 0.00 0.04 0.01 0.52 0.00 0.00 0.00 0.25 0.00 0.96 0.00 0.00 0.00 0.06 0.00 0.00 0.00 0.00 0.00 0.00 0.96 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.72 0.00 0.01 0.00 0.00 0.07 0.00 0.00 0.01 0.00 0.00 0.85 0.08 0.01 0.00 0.00 0.01 0.01 0.00 0.00 0.04 0.11 0.26 0.00 0.01 0.01 0.01 0.00 0.00 0.04 0.00 0.00 0.01 0.98 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.96 0.00 0.00 0.00 0.00 0.00 0.20 0.00 0.00 0.00 0.02 0.91 0.00 0.02 0.00 0.00 0.01 0.03 0.06 0.00 0.00 0.00 0.73 car f1car helicopter plane pickup military monster semi tank van car f1car helicopter plane pickup military monster semi tank van (c) (d) Figure 7.4: Learned filters and between-class confusion matrix. 
Figure 7.4: Learned filters and between-class confusion matrices. (a) learned conv1 filters of AlexNet; (b) learned conv1 filters of disCNN; between-class confusion matrices of (c) AlexNet and (d) disCNN (rows and columns: car, f1car, helicopter, plane, pickup, military, monster, semi, tank, van). disCNN learns more edge-shaped filters, and improves the recognition accuracies for 8 categories (out of 10). These qualitative results show that disCNN disentangles the representations of identity from pose, to some extent.

# of camera pairs            3      6      9      12
AlexNet (scratch)           71.2   72.8   72.1   72.9
disCNN (scratch)            75.0   75.1   77.0   78.6
AlexNet (AlexNet-iLab20M)   76.2   77.3   79.9   79.6
disCNN (AlexNet-iLab20M)    78.9   80.8   81.5   82.7

Table 7.2: Object recognition accuracies (%) of AlexNet and disCNN on the Washington RGB-D dataset. The "scratch" rows compare disCNN and AlexNet trained from scratch; the "AlexNet-iLab20M" rows compare them when fine-tuned from AlexNet features pretrained on the iLab-20M dataset. As seen, by fine-tuning CNNs from features learned on iLab-20M, large performance gains are achieved, e.g., ∼4.5% (∼5.5%) for disCNN (AlexNet). This shows that features learned from iLab-20M are effective for, and generalizable to, object recognition on the RGB-D dataset. Both comparisons show that disCNN outperforms AlexNet, by ∼3.5% (scratch) and ∼2% (fine-tune), which shows the advantage of our disentangled architecture. Furthermore, as the number of camera-pairs increases, the performance of disCNN increases as well.

Figure 7.5: Learned filters and between-class ℓ2 distances. (a) and (b) show the learned filters of disCNN trained from scratch and of disCNN fine-tuned from the AlexNet pretrained on the iLab-20M dataset; (c) and (d) show the between-class ℓ2 distances of the fc7 representations from AlexNet (1024D) and disCNN (512D). Training disCNN from scratch learns only color blobs (a). Panels (c,d) show visually that disCNN representations have smaller within-category distances and larger between-category distances; the ratio between the mean between-category distance and the mean within-category distance is 7.7 for disCNN versus 5.7 for AlexNet.

7.5.2 Washington RGB-D dataset

The RGB-D dataset [77] depicts 300 common household objects organized into 51 categories. It was recorded using a Kinect-style 3D camera that records synchronized and aligned 640x480 RGB and depth images at 30 Hz. Each object was placed on a turntable and video sequences were captured for one whole rotation. For each object, there are 3 video sequences, each recorded with the camera mounted at a different height so that the object is viewed from different angles with respect to the horizon. The dataset has a total of 250K images from different views and rotations. Two adjacent frames have small motions and are therefore visually very similar, so in the experiments we pick one frame out of every 5 consecutive frames, resulting in ∼50K image frames. Since the scale of this dataset does not match the data requirements of ConvNets, we adopt the "pretrain-finetune" paradigm for object recognition on this dataset, using the ConvNet weights pretrained on the iLab-20M dataset as initializations.

Training and test sets: [77] provided 10 training/test partition lists, constructed in a leave-one-out manner: randomly choose 1 instance within a category as test, and use the remaining instances for training. Due to training time limitations, we evaluate performances using the first 3 partitions and report the mean accuracies. We use the provided object masks to crop the objects from the raw frames and resize them to 227x227. Since objects are located at the image center, first cropping and then rescaling an image-pair does not change the pose transformation of the raw pair.
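A minimal sketch of this preprocessing is given below, assuming the frames and their object masks are available as NumPy arrays; this is illustrative code, not the thesis pipeline. It keeps one frame out of every five, crops each kept frame to the bounding box of its mask, and rescales the crop to 227x227.

```python
import numpy as np
import cv2  # used only for resizing

def preprocess_sequence(frames, masks, step=5, size=227):
    """frames: list of HxWx3 uint8 arrays; masks: list of HxW {0,1} arrays.
    Returns a list of size x size crops, one per retained frame."""
    out = []
    for frame, mask in list(zip(frames, masks))[::step]:   # one frame per `step`
        ys, xs = np.where(mask > 0)                        # mask bounding box
        crop = frame[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        out.append(cv2.resize(crop, (size, size)))         # rescale to 227x227
    return out
```

Because the object stays centered, applying the same crop-and-rescale to both images of a pair leaves the pose transformation between them unchanged, as noted above.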
Camera pairs: similarly to the iLab-20M experiments, we take different numbers of camera-pairs and evaluate their influence on performance. Within one video sequence, every frame-pair with a fixed temporal gap can be imagined to be taken by a virtual camera-pair, so all such pairs share the same pose-transformation label. For example, two pairs F_i-F_{i+Δ} and F_j-F_{j+Δ}, whose temporal gaps are both Δ, have the same pose-transformation label. One Δ defines one camera-pair, and in the experiments we let Δ = {5, 10, 15, 20}. Case one: we take image-pairs with Δ = {5} from each video sequence; all such pairs from one video sequence can be thought of as taken by one virtual camera-pair and therefore share one pose-transformation label. Since there are 3 video sequences, all pairs have in total 3 pose-transformation labels, i.e., 3 virtual camera-pairs. Case two: image-pairs with Δ = {5, 10}, giving 6 camera-pairs. Case three: Δ = {5, 10, 15}, giving 9 camera-pairs. Case four: Δ = {5, 10, 15, 20}, giving 12 camera-pairs. The total number of training image-pairs under each case is 67K, 99K and 131K respectively, and the number of test images in all cases is 6773.

Implementation details: we use the same training settings as in the iLab-20M experiments to train AlexNet and disCNN, i.e., the same learning rates (starting from 0.01 and ending at 0.0001, decreasing log-linearly), the same number of training epochs (15), and the same training order within each epoch. We set λ1 = 1 and λ2 = 0.05 in these experiments.

Results: we do two comparisons: first we compare disCNN (AlexNet) trained from scratch against disCNN (AlexNet) fine-tuned from the weights pretrained on the iLab-20M dataset; then we compare disCNN against AlexNet, both fine-tuned from the CNN features pretrained on iLab-20M. Results are shown in Table 7.2, and our observations are: (1) disCNN (AlexNet) fine-tuned from the AlexNet features pretrained on iLab-20M wins over disCNN (AlexNet) trained from scratch by ∼4.5% (∼5.5%), and their fine-tuned performances are better than the published accuracy of 74.7% in [77] by a large margin. This shows that the features learned from the iLab-20M dataset generalize well to the RGB-D dataset. (2) disCNN outperforms AlexNet in both cases, whether trained from scratch or from the pretrained AlexNet features, which shows the superiority of the disentangling architecture over linear-chain, single-task CNNs. (3) Similarly, we observe that the performance of disCNN increases as the number of camera-pairs increases. We further compute ℓ2 distances between categories using the fc7-identity (disCNN, 512D) and fc7 (AlexNet, 1024D) representations, and plot them in Fig. 7.5. Visually, the off-diagonal elements for disCNN are brighter and the diagonal elements are darker, showing smaller within-category distances and larger between-category distances.
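For reference, the between-category versus within-category distance statistic summarized in Fig. 7.5 can be computed as in the NumPy sketch below. This is illustrative code (run on a subsample of features, since it builds an n-by-n distance matrix), not the thesis implementation; `feats` and `labels` are assumed to hold fc7 or fc7-identity activations and their category indices.

```python
import numpy as np

def between_within_ratio(feats, labels):
    """feats: (n, d) array of features; labels: (n,) integer category ids.
    Returns mean between-category distance / mean within-category distance."""
    d = np.sqrt(((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1))  # (n, n) l2 distances
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(feats), dtype=bool)
    within = d[same & off_diag].mean()    # same category, excluding self-distances
    between = d[~same].mean()             # different categories
    return between / within
```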
7.5.3 ImageNet

ImageNet has millions of labeled images, and training a ConvNet on such a large dataset from pretrained models rather than from scratch has been shown to have insignificant effects [60, 80]. In order to show that the disCNN pretrained on the iLab-20M dataset learns useful features for object recognition, we fine-tune the learned weights on ImageNet when only a small amount of labeled images is available. We fine-tune AlexNet using 5, 10, 20 and 40 images per class (5K, 10K, 20K and 40K training images in total) from the ILSVRC-2010 challenge. AlexNet is fine-tuned under three scenarios: (1) from scratch (random Gaussian initialization), (2) from the pretrained AlexNet on iLab-20M, and (3) from the pretrained disCNN on iLab-20M; top-5 object recognition accuracies are presented in Table 7.3. When we pretrain AlexNet and disCNN on the iLab-20M dataset, we use the AlexNet architecture with the number of units in the last two fully connected layers reset to 4096.

Results: (1) when only a limited number of labeled images is available, fine-tuning AlexNet from the features pretrained on the iLab-20M dataset performs much better than training AlexNet from scratch; e.g., the relative improvement is as large as ∼460% when we have only 5 samples per class, and although the improvement decreases as more labeled images become available, we still gain ∼25% when 40 labeled images per class are available. This clearly shows that features learned on the iLab-20M dataset generalize to ImageNet. (2) Fine-tuning from the pretrained disCNN on iLab-20M performs even better than from the pretrained AlexNet on iLab-20M, which shows that disCNN learns even more effective features for general object recognition than AlexNet. These empirical results show the advantage of our disentangling architecture over the traditional single-task linear architecture.

# of images/class            5      10     20     40
AlexNet (scratch)           1.47   4.15  16.45  25.89
AlexNet (AlexNet-iLab20M)   7.74  12.54  19.42  28.75
AlexNet (disCNN-iLab20M)    8.21  14.19  22.04  30.19

Table 7.3: Top-5 object recognition accuracies (%) on the test set of ILSVRC-2010, with 150 images per class and a total of 150K test images. First, fine-tuning AlexNet from the features pretrained on the iLab-20M dataset clearly outperforms training AlexNet from scratch, which shows that features learned on the iLab-20M dataset generalize to ImageNet as well. Second, fine-tuning from the pretrained disCNN-iLab20M performs even better than from the pretrained AlexNet-iLab20M, which shows that our disentangling architecture learns even better features for object recognition than AlexNet.
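For illustration, the three initialization scenarios compared in Table 7.3 could be set up as in the PyTorch sketch below. The thesis experiments were run in MatConvNet [125], so this is only an analogous sketch; `make_alexnet` and the checkpoint path are hypothetical names, and only tensors whose names and shapes match are copied, so the 1000-way ImageNet classifier stays randomly initialized.

```python
import torch

def init_model(make_alexnet, num_classes=1000, checkpoint=None):
    """Scenario (1): checkpoint=None -> random (scratch) initialization.
    Scenarios (2)/(3): pass weights pretrained on iLab-20M by AlexNet or
    by disCNN; layers with mismatched shapes are left at random init."""
    model = make_alexnet(num_classes=num_classes)          # random init
    if checkpoint is not None:
        state = torch.load(checkpoint, map_location="cpu")
        own = model.state_dict()
        state = {k: v for k, v in state.items()
                 if k in own and v.shape == own[k].shape}  # keep matching layers only
        model.load_state_dict(state, strict=False)
    return model

# e.g. scenario (3); the checkpoint path is hypothetical:
# model = init_model(make_alexnet, checkpoint="discnn_ilab20m.pth")
# opt = torch.optim.SGD(model.parameters(), lr=1e-2)  # lr decayed log-linearly to 1e-4
```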
7.6 Conclusions

In this chapter, we design a multi-task learning ConvNet that learns to predict object categories. Unlike traditional ConvNets for object recognition, which usually have a single-task architecture and learn features sensitive to the current task (i.e., object category) but as invariant as possible to other factors of variation (e.g., pose), disCNN retains all image-generating factors of variation (object category and pose transformation in our case) and learns them simultaneously by explicitly disentangling the representations of the different factors. Experiments on the large-scale iLab-20M dataset show that features learned by disCNN significantly outperform features learned by AlexNet for object recognition. When we fine-tune object recognition on the ImageNet dataset using pretrained disCNN and AlexNet features, disCNN-pretrained features are consistently better than AlexNet-pretrained features. All experiments show the effectiveness of our disentangled training architecture.

As shown in [2], features learned using egomotion as supervision are useful for other vision tasks, including object recognition, and the egomotion-pretrained features compare favorably with features learned using class labels as supervision. In our work, we further showed that when our model has access to both object categories and camera motions, it learns even better features than when using only class labels as supervision. One possible explanation is that, although egomotion supervision yields features useful for object recognition, it does not necessarily guarantee that the feature representations of different instances of the same class are similar, since egomotion has no access to any class-label information. We showed that by feeding ConvNets additional class labels, the feature learning process is further guided so that objects of the same class tend to have similar representations.

Chapter 8

Conclusions

Situation understanding is an active research field, and still largely unsolved. Here, toward this long-term goal, we proposed to solve several existing key issues. Specifically, we made the following contributions:

(1) invented a temporal segmentation algorithm, which decomposes heterogeneous time series into homogeneous segments in a greedy way. Our algorithm is widely applicable to discrete/continuous, univariate/multivariate time series decomposition, including audio, video and motion sequence segmentation.

(2) developed a temporal sequence alignment algorithm, named shapeDTW, which augments the traditional DTW alignment with local shapes. shapeDTW outperforms DTW both qualitatively and quantitatively, and, when used as the distance measure under the nearest neighbor classifier, it significantly improves time series classification accuracy.

(3) proposed a metric learning algorithm to learn distance metrics between temporal sequences; the learned metric outperforms the default Euclidean metric significantly.

(4) proposed a two-stream convolutional neural network to disentangle object identity from its instantiation factors (e.g., pose, lighting), and learned more discriminative identity representations.

(5) formulated visual object recognition in a CNN-based multi-task learning framework, and showed its superiority to single-task learning theoretically and empirically.

Reference List

[1] S. Agarwal and D. Roth. Learning a sparse representation for object detection. In ECCV, pages 113–127. Springer, 2002.
[2] Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision, pages 37–45, 2015.
[3] A. Bagnall and J. Lines. An experimental evaluation of nearest neighbour time series classification. arXiv preprint arXiv:1406.4757, 2014.
[4] Anthony Bagnall, Aaron Bostrom, James Large, and Jason Lines. The great time series classification bake off: An experimental evaluation of recently proposed algorithms. Extended version. arXiv preprint arXiv:1602.01711, 2016.
[5] Anthony Bagnall, Jason Lines, Jon Hills, and Aaron Bostrom. Time-series classification with COTE: the collective of transformation-based ensembles. IEEE Transactions on Knowledge and Data Engineering, 27(9):2522–2535, 2015.
[6] Amr Bakry and Ahmed Elgammal. Untangling object-view manifold for multiview recognition and pose estimation. In Computer Vision–ECCV 2014, pages 434–449. Springer, 2014.
[7] Jernej Barbič, Alla Safonova, Jia-Yu Pan, Christos Faloutsos, Jessica K Hodgins, and Nancy S Pollard. Segmenting motion capture data into distinct behaviors. In Proceedings of the 2004 Graphics Interface Conference, pages 185–194. Canadian Human-Computer Communications Society, 2004.
[8] Gustavo E. A. P. A. Batista, Xiaoyue Wang, and Eamonn J Keogh. A complexity-invariant distance measure for time series. In SDM, volume 11, pages 699–710. SIAM, 2011.
[9] Jonathan Baxter. A model of inductive bias learning. J. Artif. Intell. Res. (JAIR), 12:149–198, 2000.
[10] Mustafa Gokce Baydogan, George Runger, and Eugene Tuv.
A bag-of-features framework to classify time series. PAMI, 35(11):2796–2802, 2013.
[11] Aurélien Bellet, Amaury Habrard, and Marc Sebban. A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709, 2013.
[12] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[13] Moshe Blank, Lena Gorelick, Eli Shechtman, Michal Irani, and Ronen Basri. Actions as space-time shapes. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 2, pages 1395–1402. IEEE, 2005.
[14] Ali Borji, Saeed Izadi, and Laurent Itti. iLab-20M: A large-scale controlled object dataset to investigate deep learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[15] Aaron Bostrom and Anthony Bagnall. Binary shapelet transform for multiclass time series classification. In International Conference on Big Data Analytics and Knowledge Discovery, pages 257–269. Springer, 2015.
[16] K Selçuk Candan, Rosaria Rossini, Xiaolan Wang, and Maria Luisa Sapino. sDTW: computing DTW distances using locally relevant constraints based on salient feature alignments. Proceedings of the VLDB Endowment, 5(11):1519–1530, 2012.
[17] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[18] Ken Chatfield, Victor Lempitsky, Andrea Vedaldi, and Andrew Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. 2011.
[19] Jie Chen and Arjun K Gupta. Parametric statistical change point analysis: With applications to genetics, medicine, and finance. Springer Science & Business Media, 2011.
[20] Lei Chen and Raymond Ng. On the marriage of Lp-norms and edit distance. In VLDB, pages 792–803. VLDB Endowment, 2004.
[21] Lei Chen, M Tamer Özsu, and Vincent Oria. Robust and fast similarity search for moving object trajectories. In ACM SIGMOD, pages 491–502. ACM, 2005.
[22] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062, 2014.
[23] Yanping Chen, Eamonn Keogh, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, and Gustavo Batista. The UCR time series classification archive, July 2015. www.cs.ucr.edu/~eamonn/time_series_data/.
[24] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer. The variable bandwidth mean shift and data-driven scale selection. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 1, pages 438–445. IEEE, 2001.
[25] Marco Cuturi, J-P Vert, Øystein Birkenes, and Tomoko Matsui. A kernel for time series based on global alignments. In ICASSP, volume 2, pages II–413. IEEE, 2007.
[26] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, pages 886–893. IEEE, 2005.
[27] Navneet Dalal, Bill Triggs, and Cordelia Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, pages 428–441. Springer, 2006.
[28] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.
[29] Fernando De la Torre, Joan Campoy, Zara Ambadar, and Jeffrey F Cohn. Temporal segmentation of facial behavior. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8.
IEEE, 2007.
[30] Houtao Deng, George Runger, Eugene Tuv, and Martyanov Vladimir. A time series forest for classification and feature extraction. Information Sciences, 239:142–153, 2013.
[31] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[32] Zhigang Deng, Qin Gu, and Qing Li. Perceptually consistent example-based human motion retrieval. In Proceedings of the 2009 Symposium on Interactive 3D Graphics and Games, pages 191–198. ACM, 2009.
[33] Frédéric Desobry, Manuel Davy, and Christian Doncarli. An online kernel change detection algorithm. Signal Processing, IEEE Transactions on, 53(8):2961–2974, 2005.
[34] Hui Ding, Goce Trajcevski, Peter Scheuermann, Xiaoyue Wang, and Eamonn Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. Proceedings of the VLDB Endowment, 1(2):1542–1552, 2008.
[35] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Mid-level visual element discovery as discriminative mode seeking. In Advances in Neural Information Processing Systems, pages 494–502, 2013.
[36] Carl Doersch, Saurabh Singh, Abhinav Gupta, Josef Sivic, and Alexei Efros. What makes Paris look like Paris? ACM Transactions on Graphics, 31(4), 2012.
[37] Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1538–1546, 2015.
[38] Z. Duan and B. Pardo. Soundprism: An online system for score-informed source separation of music audio. IEEE Journal of Selected Topics in Signal Processing, 5(6):1205–1215, 2011.
[39] D. Ellis. Dynamic time warp (DTW) in Matlab, 2003. www.ee.columbia.edu/~dpwe/resources/matlab/dtw/.
[40] Paul Fearnhead. Exact and efficient Bayesian inference for multiple changepoint problems. Statistics and Computing, 16(2):203–213, 2006.
[41] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. PAMI, 32(9):1627–1645, 2010.
[42] D. Forsyth and J. Ponce. Computer vision: A modern approach. 2003.
[43] Adrien Gaidon, Zaid Harchaoui, and Cordelia Schmid. Activity representation with motion hierarchies. International Journal of Computer Vision, 107(3):219–238, 2014.
[44] D. Garreau, R. Lajugie, S. Arlot, and F. Bach. Metric learning for temporal sequence alignment. In NIPS, pages 1817–1825, 2014.
[45] T. Giorgino. Computing and visualizing dynamic time warping alignments in R: the dtw package. Journal of Statistical Software, 31(7):1–24, 2009.
[46] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 580–587. IEEE, 2014.
[47] Georgia Gkioxari, Ross Girshick, and Jitendra Malik. Contextual action recognition with R*CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1080–1088, 2015.
[48] Dian Gong, Gérard Medioni, Sikai Zhu, and Xuemei Zhao. Kernelized temporal cut for online temporal segmentation and recognition. In Computer Vision–ECCV 2012, pages 229–243. Springer, 2012.
[49] Ross Goroshin, Michael Mathieu, and Yann LeCun. Learning to linearize under uncertainty. arXiv preprint arXiv:1506.03011, 2015.
[50] Josif Grabocka and Lars Schmidt-Thieme. Invariant time-series factorization. Data Mining and Knowledge Discovery, 28(5-6):1455–1479, 2014.
[51] Michael Grant and Stephen Boyd. Graph implementations for nonsmooth convex programs. In V. Blondel, S. Boyd, and H. Kimura, editors, Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pages 95–110. Springer-Verlag Limited, 2008. http://stanford.edu/~boyd/graph_dcp.html.
[52] Michael Grant and Stephen Boyd. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, March 2014.
[53] Steinn Gudmundsson, Thomas Philip Runarsson, and Sven Sigurdsson. Support vector machines and dynamic time warping for time series. In IEEE International Joint Conference on Neural Networks, pages 2772–2776. IEEE, 2008.
[54] Zaïd Harchaoui and Olivier Cappé. Retrospective multiple change-point estimation with kernels. In IEEE Workshop on Statistical Signal Processing, pages 768–772, 2007.
[55] Zaid Harchaoui, Eric Moulines, and Francis R Bach. Kernel change-point analysis. In Advances in Neural Information Processing Systems, pages 609–616, 2009.
[56] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2003.
[57] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[58] Jon Hills, Jason Lines, Edgaras Baranauskas, James Mapp, and Anthony Bagnall. Classification of time series by shapelet transformation. DMKD, 28(4):851–881, 2014.
[59] Jon Hills, Jason Lines, Edgaras Baranauskas, James Mapp, and Anthony Bagnall. https://www.uea.ac.uk/computing/machine-learning/shapelets/shapelet-results. 2014.
[60] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97, 2012.
[61] Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming auto-encoders. In Artificial Neural Networks and Machine Learning–ICANN 2011, pages 44–51. Springer, 2011.
[62] E. Hsu, K. Pulli, and J. Popović. Style translation for human motion. ACM Transactions on Graphics, 24(3):1082–1089, 2005.
[63] Bing Hu, Yanping Chen, Jesin Zakaria, Liudmila Ulanova, and Eamonn Keogh. Classification of multi-dimensional streaming time series by weighting each classifier's track record. In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pages 281–290. IEEE, 2013.
[64] N. Hu, R. Dannenberg, and G. Tzanetakis. Polyphonic audio matching and alignment for music retrieval. Computer Science Department, page 521, 2003.
[65] Yan Huang, Wei Wang, Liang Wang, and Tieniu Tan. Multi-task deep neural network for multi-label learning. In Image Processing (ICIP), 2013 20th IEEE International Conference on, pages 2897–2900. IEEE, 2013.
[66] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[67] Y. Jeong, M. Jeong, and O. Omitaomu. Weighted dynamic time warping for time series classification. Pattern Recognition, 44(9):2231–2240, 2011.
[68] Andrej Karpathy, George Toderici, Sachin Shetty, Tommy Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks.
In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1725–1732. IEEE, 2014.
[69] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. Knowledge and Information Systems, 3(3):263–286, 2001.
[70] E. Keogh and C. Ratanamahatana. Exact indexing of dynamic time warping. Knowledge and Information Systems, 7(3):358–386, 2005.
[71] Eamonn J Keogh and Michael J Pazzani. Derivative dynamic time warping. In SDM, volume 1, pages 5–7. SIAM, 2001.
[72] H. Kirchhoff and A. Lerch. Evaluation of features for audio-to-audio alignment. Journal of New Music Research, 40(1):27–41, 2011.
[73] Alexander Klaser and Marcin Marszalek. A spatio-temporal descriptor based on 3D-gradients. 2008.
[74] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[75] K. Kulkarni, G. Evangelidis, J. Cech, and R. Horaud. Continuous action recognition based on sequence alignment. IJCV, pages 1–25, 2014.
[76] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2530–2538, 2015.
[77] Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox. A large-scale hierarchical multi-view RGB-D object dataset. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 1817–1824. IEEE, 2011.
[78] R. Lajugie, D. Garreau, F. Bach, and S. Arlot. Metric learning for temporal sequence alignment. In NIPS, pages 1817–1825, 2014.
[79] Ivan Laptev. On space-time interest points. IJCV, 64(2-3):107–123, 2005.
[80] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[81] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[82] Nuo Li and James J DiCarlo. Unsupervised natural visual experience rapidly reshapes size-invariant object representation in inferior temporal cortex. Neuron, 67(6):1062–1075, 2010.
[83] Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. A symbolic representation of time series, with implications for streaming algorithms. In ACM SIGMOD Workshop, pages 2–11. ACM, 2003.
[84] Jessica Lin, Rohan Khade, and Yuan Li. Rotation-invariant similarity in time series using bag-of-patterns representation. Journal of Intelligent Information Systems, 39(2):287–315, 2012.
[85] Jason Lines and Anthony Bagnall. Time series classification with ensembles of elastic distance measures. Data Mining and Knowledge Discovery, 29(3):565–592, 2015.
[86] Jason Lines, Luke M Davis, Jon Hills, and Anthony Bagnall. A shapelet transform for time series classification. In ACM SIGKDD, pages 289–297. ACM, 2012.
[87] Beth Logan et al. Mel frequency cepstral coefficients for music modeling. In ISMIR, 2000.
[88] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[89] David G Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[90] Calvin R Maurer Jr, Rensheng Qi, and Vijay Raghavan. A linear time algorithm for computing exact Euclidean distance transforms of binary images in arbitrary dimensions.
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 25(2):265–270, 2003.
[91] Abdullah Mueen, Eamonn Keogh, and Neal Young. Logical-shapelets: an expressive primitive for time series classification. In ACM SIGKDD, pages 1154–1162. ACM, 2011.
[92] Alex Nanopoulos, Rob Alcock, and Yannis Manolopoulos. Feature-based classification of time-series data. International Journal of Computer Research, 10(3), 2001.
[93] Jifeng Ning, Lei Zhang, David Zhang, et al. Scale and orientation adaptive mean shift tracking. Computer Vision, IET, 6(1):52–61, 2012.
[94] Hiroshi Shimodaira and Ken-ichi Noma. Dynamic time-alignment kernel in support vector machine. Advances in Neural Information Processing Systems, 14:921, 2002.
[95] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. Knowledge and Data Engineering, IEEE Transactions on, 22(10):1345–1359, 2010.
[96] Razvan Pascanu, Yann N Dauphin, Surya Ganguli, and Yoshua Bengio. On the saddle point problem for non-convex optimization. arXiv preprint arXiv:1405.4604, 2014.
[97] Xiaojiang Peng, Limin Wang, Xingxing Wang, and Yu Qiao. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. arXiv preprint arXiv:1405.4506, 2014.
[98] Florent Perronnin, Jorge Sánchez, and Thomas Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, pages 143–156. Springer, 2010.
[99] F. Petitjean, G. Forestier, G. Webb, A. Nicholson, Y. Chen, and E. Keogh. Dynamic time warping averaging of time series allows faster and more accurate classification. In ICDM, 2014.
[100] L. Rabiner and B. Juang. Fundamentals of speech recognition, volume 14. PTR Prentice Hall, Englewood Cliffs, 1993.
[101] Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Brandon Westover, Qiang Zhu, Jesin Zakaria, and Eamonn Keogh. Searching and mining trillions of time series subsequences under dynamic time warping. In ACM SIGKDD, pages 262–270. ACM, 2012.
[102] Thanawin Rakthanmanon and Eamonn Keogh. Fast shapelets: A scalable algorithm for discovering time series shapelets. In SDM. SIAM, 2013.
[103] Marc Aurelio Ranzato, Fu Jie Huang, Y-Lan Boureau, and Yann LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pages 1–8. IEEE, 2007.
[104] Scott Reed, Kihyuk Sohn, Yuting Zhang, and Honglak Lee. Learning to disentangle factors of variation with manifold interaction. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1431–1439, 2014.
[105] Juan J Rodríguez, Carlos J Alonso, and Henrik Boström. Boosting interval based literals. Intelligent Data Analysis, 5(3):245–262, 2001.
[106] Juan José Rodríguez, Carlos J Alonso, and José A Maestro. Support vector machines of interval-based features for time series classification. Knowledge-Based Systems, 18(4):171–178, 2005.
[107] Yunus Saatçi, Ryan D Turner, and Carl E Rasmussen. Gaussian process change point models. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 927–934, 2010.
[108] H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 26(1):43–49, 1978.
[109] Patrick Schäfer. The BOSS is concerned with time series classification in the presence of noise. Data Mining and Knowledge Discovery, 29(6):1505–1530, 2015.
[110] Michael L Seltzer and Jasha Droppo. Multi-task learning in deep neural networks for improved phoneme recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6965–6969. IEEE, 2013.
[111] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[112] Mohammad Shokoohi-Yekta, Jun Wang, and Eamonn Keogh. On the non-trivial generalization of dynamic time warping to the multi-dimensional case. In SDM. SIAM, 2015.
[113] D. Silva, V. De Souza, G. Batista, et al. Time series classification using compression distance of recurrence plots. In ICDM, pages 687–696. IEEE, 2013.
[114] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. arXiv preprint arXiv:1406.2199, 2014.
[115] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[116] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[117] Saurabh Singh, Abhinav Gupta, and Alexei Efros. Unsupervised discovery of mid-level discriminative patches. Computer Vision–ECCV 2012, pages 73–86, 2012.
[118] Josef Sivic and Andrew Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, pages 1470–1477. IEEE, 2003.
[119] Stefano Soatto, Alessandro Chiuso, and Pratik Chaudhari. Visual representations: Defining properties and deep approximations. In International Conference on Learning Representations, volume 3, 2016.
[120] Hao Su, Charles R Qi, Yangyan Li, and Leonidas J Guibas. Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views. In Proceedings of the IEEE International Conference on Computer Vision, pages 2686–2694, 2015.
[121] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
[122] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[123] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 9:2579–2605, 2008.
[124] A. Vedaldi and K. Lenc. MatConvNet – convolutional neural networks for MATLAB.
[125] Andrea Vedaldi and Karel Lenc. MatConvNet: Convolutional neural networks for MATLAB. In Proceedings of the 23rd Annual ACM Conference on Multimedia Conference, pages 689–692. ACM, 2015.
[126] Michail Vlachos, George Kollios, and Dimitrios Gunopulos. Discovering similar multidimensional trajectories. In ICDE, pages 673–684. IEEE, 2002.
[127] Heng Wang, Muhammad Muneeb Ullah, Alexander Klaser, Ivan Laptev, Cordelia Schmid, et al. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.
[128] Jin Wang, Ping Liu, Mary FH She, Saeid Nahavandi, and Abbas Kouzani. Bag-of-words representation for biomedical time series classification. Biomedical Signal Processing and Control, 8(6):634–644, 2013.
[129] Jinjun Wang, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang, and Yihong Gong.
Locality-constrained linear coding for image classification. In CVPR, pages 3360–3367. IEEE, 2010.
[130] Xiaolan Wang, K Selçuk Candan, and Maria Luisa Sapino. Leveraging metadata for identifying local, robust multi-variate temporal (RMT) features. In Data Engineering (ICDE), 2014 IEEE 30th International Conference on, pages 388–399. IEEE, 2014.
[131] Xiaoyue Wang, Abdullah Mueen, Hui Ding, Goce Trajcevski, Peter Scheuermann, and Eamonn Keogh. Experimental comparison of representation methods and distance measures for time series data. DMKD, 26(2):275–309, 2013.
[132] Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207–244, 2009.
[133] Daniel Weinland, Mustafa Özuysal, and Pascal Fua. Making action recognition robust to occlusions and viewpoint changes. In Computer Vision–ECCV 2010, pages 635–648. Springer, 2010.
[134] Jens Windau and Laurent Itti. Situation awareness via sensor-equipped eyeglasses. In IROS, pages 5674–5679. IEEE, 2013.
[135] Paul Wohlhart and Vincent Lepetit. Learning descriptors for object recognition and 3D pose estimation. arXiv preprint arXiv:1502.05908, 2015.
[136] Xiang Xuan and Kevin Murphy. Modeling changing dependency structure in multivariate time series. In Proceedings of the 24th International Conference on Machine Learning, pages 1055–1062. ACM, 2007.
[137] Jianchao Yang, Kai Yu, Yihong Gong, and Thomas Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, pages 1794–1801. IEEE, 2009.
[138] Jimei Yang, Scott E Reed, Ming-Hsuan Yang, and Honglak Lee. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In Advances in Neural Information Processing Systems, pages 1099–1107, 2015.
[139] Lexiang Ye and Eamonn Keogh. Time series shapelets: a new primitive for data mining. In ACM SIGKDD, pages 947–956. ACM, 2009.
[140] B. Yi and C. Faloutsos. Fast time sequence indexing for arbitrary Lp norms. VLDB, 2000.
[141] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pages 818–833. Springer, 2014.
[142] Cha Zhang and Zhengyou Zhang. Improving multiview face detection with multi-task deep convolutional neural networks. In Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on, pages 1036–1041. IEEE, 2014.
[143] Mi Zhang and Alexander A Sawchuk. A feature selection-based framework for human activity recognition using wearable multimodal sensors. In Proceedings of the 6th International Conference on Body Area Networks, pages 92–98, 2011.
[144] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial landmark detection by deep multi-task learning. In Computer Vision–ECCV 2014, pages 94–108. Springer, 2014.
[145] Jiaping Zhao and Laurent Itti. Decomposing time series with application to temporal segmentation.
[146] Jiaping Zhao and Laurent Itti. Classifying time series using local descriptors with hybrid sampling. IEEE Transactions on Knowledge and Data Engineering, 2015.
[147] Jiaping Zhao and Laurent Itti. shapeDTW: shape dynamic time warping. arXiv preprint arXiv:1606.01601, 2016.
[148] Junbo Zhao, Michael Mathieu, Ross Goroshin, and Yann LeCun. Stacked what-where auto-encoders. arXiv preprint arXiv:1506.02351, 2015.
[149] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene CNNs. arXiv preprint arXiv:1412.6856, 2014.
[150] Feng Zhou and Fernando De la Torre. Generalized time warping for multi-modal alignment of human motion. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1282–1289. IEEE, 2012.
[151] Feng Zhou, Fernando De la Torre, and Jessica K Hodgins. Hierarchical aligned cluster analysis for temporal clustering of human motion. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(3):582–596, 2013.
[152] Zhenyao Zhu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Multi-view perceptron: a deep model for learning face identity and view representations. In Advances in Neural Information Processing Systems, pages 217–225, 2014.
Abstract
Situation awareness refers to using sensors to observe a user's environment and inferring the user's situation from those observations. Being aware of the user's current situation makes it possible to offer useful cognitive assistance: for example, warning the driver when the sensors detect that his eyes are closing (fatigue), or rejecting incoming phone calls when the camera observes that the user is in a meeting. In this thesis, we develop algorithms that infer a user's situation from sensed data, concretely, which activities the user is performing and what objects are in the scene. We use Google Glass, a wearable mobile device, as an example: it has embedded IMU sensors and a first-person camera; the former capture streams of acceleration and angular speed, and the latter records video. The former streams are multivariate time series, while the latter are sequences of image frames. At the current stage, we analyze time series and image frames separately: we infer the user's current activities from the time series and recognize objects from the images. By knowing the user's activities and the objects in the scene, it becomes possible to infer the user's situation and provide cognitive assistance.

Activity recognition from time series is not as well studied as activity recognition from videos. We first develop a time series classification pipeline, in which we introduce a new feature-point detector and two novel shape descriptors. Our pipeline outperforms state-of-the-art classification approaches significantly, and it applies naturally to activity recognition. We then invent a novel temporal sequence alignment algorithm, named shape Dynamic Time Warping (shapeDTW). We show empirically that shapeDTW outperforms DTW for sequence alignment both qualitatively and quantitatively. When the shapeDTW distance is used as the distance measure under the nearest neighbor classifier for time series classification, it significantly outperforms the DTW distance measure, which is widely recognized as the best distance measure to date; using shapeDTW under the nearest neighbor classifier further improves activity recognition accuracies compared with our earlier recognition pipeline. Finally, we develop a time series decomposition algorithm that splits heterogeneous time series into homogeneous segments. This algorithm facilitates data collection: we can record various activities for hours or days and then automatically segment the sequences, which greatly reduces manual labeling work.

We then turn to object recognition from natural images. Although contemporary deep convolutional networks have advanced object recognition by a large step, the underlying mechanism is still largely unclear. Here, we explore the mechanism of object recognition using a large-scale image dataset, iLab-20M, which contains 20 million images shot under controlled turntable settings. Compared with the ImageNet dataset, iLab-20M is parametric, with detailed pose and lighting information for each image. We show that this auxiliary information can benefit object recognition. First, we formulate object recognition in a CNN-based multi-task learning framework, design a specific skip-connection pattern, and show its superiority to single-task learning theoretically and empirically. Moreover, we introduce a two-stream CNN architecture that disentangles object identity from its instantiation factors (e.g., pose, lighting) and learns more discriminative identity representations. We show experimentally that the features learned from iLab-20M generalize well to other datasets, including ImageNet and Washington RGB-D.