MOCAP DATA COMPRESSION: ALGORITHMS AND PERFORMANCE EVALUATION

by May-chen Kuo

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

December 2010

Copyright 2010 May-chen Kuo

Dedication

This dissertation is dedicated to my beloved parents and grandmother, who raised me with a passion for science and life.

Acknowledgments

"I can no other answer make, but, thanks, and thanks." -- William Shakespeare

I would like to thank my advisor, C.-C. Jay Kuo, for his guidance and support through these years. I appreciate all the knowledge he passed on to me, and his patience and tolerance for all the mistakes I made in this journey. From him, I learned to be a better researcher as well as a better person. I would like to thank Dr. Karen Liu for her inspiration and assistance with this thesis. I would like to thank my family for their unconditional support, and all my group members for their help in my research. Last but not least, I would like to thank my friends; I would never have been able to achieve what I achieved without their company. Thanks to Hsu-Lei Lee, a faithful listener, a dependable safety net, and an adventurous explorer who makes the bad days bearable and the good days splendid. Thanks to Pei-Chi (Celine) Hu, a generous roommate, an honest magic mirror, and an insightful advisor who answers to every detailed aspect of life. I would also like to thank Alyn Hsu, Yuan Wang, Ping-Hsuan Hsieh, Pauline Yu, Ivy Tseng, Yu Hu, Deniz Cakirer... who have lighted a flame in my dark moments. This is a list which will never be long enough to cover everyone I am grateful to, so thank you, thank you, thank you.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Significance of the Research
  1.2 Review of Previous Work
    1.2.1 Mocap Data Compression
    1.2.2 Motion Representation
    1.2.3 Motion Interpolation/Synthesis
  1.3 Contributions of Proposed Research
  1.4 Organization of the Thesis
Chapter 2: Research Background
  2.1 Mocap Data
  2.2 Mocap Data Compression
    2.2.1 Coding Performance Measures
    2.2.2 State-of-the-Art Mocap Data Coding Algorithm
  2.3 Human Perception of Motion
  2.4 Motion Synthesis
    2.4.1 Motion Graph
Chapter 3: Mocap Data Compression Algorithm
  3.1 Introduction
  3.2 Overview of Proposed Mocap Data Compression System
  3.3 Temporal Sampling
  3.4 I-Frame Coding: Vector Quantization
    3.4.1 Decomposition of Dofs
    3.4.2 Vector Quantization (VQ)
  3.5 B-Frame Coding: Interpolation and Residual Coding
    3.5.1 Interpolation Techniques
    3.5.2 Residual Coding
  3.6 Conclusion and Future Work
Chapter 4: Bit Allocation and Performance Evaluation
  4.1 Introduction
  4.2 Bit Allocation
    4.2.1 Problem Statement
    4.2.2 Motion Characterization and Coding Parameters Selection
      4.2.2.1 Spatial Features
      4.2.2.2 Temporal Features
    4.2.3 Case Study: CMU Mocap Database
    4.2.4 Statistics of CMU Mocap Data
  4.3 Performance Evaluation
  4.4 Comparison with Previous Work
  4.5 Conclusion
Chapter 5: Enhanced Mocap Data Compression Algorithm
  5.1 Introduction
  5.2 System Overview
  5.3 Dyadic Temporal Sampling
  5.4 Pre-Processing DoFs
    5.4.1 DoFs versus DoF Differences
    5.4.2 Analysis and Solution of the Error Accumulation Problem
    5.4.3 Other Considerations
  5.5 Experimental Results
  5.6 Conclusion
Chapter 6: Perception-based Mocap Data Coding
  6.1 Introduction
  6.2 Perceptual Mocap Coding Algorithm
    6.2.1 Pre-processing Module
    6.2.2 Segmentation and Normalization Module
    6.2.3 Coding Module
      6.2.3.1 Vector Quantization
      6.2.3.2 Scalar Quantization
      6.2.3.3 Dyadic Encoding
  6.3 Experimental Results
  6.4 Conclusion
Chapter 7: Conclusion and Future Work
  7.1 Conclusion
  7.2 Future Work: Motion Synthesis
    7.2.1 Overview of Proposed Motion Synthesis Solution
    7.2.2 Motion Genes
    7.2.3 Heterogeneous Motion Graph (HMG)
    7.2.4 Open Research Problems
References

List of Tables

3.1 Comparison of different temporal sampling methods, where the distortion is the average MSE per dof per frame.
3.2 Quantization errors for codebooks of different sizes.
3.3 Performance ranking of interpolation using the first, the second and the third order polynomials with or without residual coding.
6.1 Subjective test results on perception-based mocap data compression.

List of Figures

1.1 The flowchart of the mocap encoder.
2.1 Illustration of a sample human skeleton for motion capture.
2.2 Exemplary mocap data as functions of time.
2.3 An exemplary dof curve.
2.4 The dotted line is used as an approximation to the curve in Fig. 2.3.
2.5 An example of a synthesized motion using the MG approach.
3.1 Overview of the mocap data encoding system.
3.2 The solid line is the ground truth, the dotted line is the approximated curve, and circles are sampled points. The predictions in (a) and (b) are worse than (c) and (d). Fixed-interval sampling is used in (a)-(c) and adaptive sampling is used in (d).
3.3 (a) is the Labanotation of a series of motion in (b).
3.4 Illustration of 3-bit quantized shoulder joint dofs and their corresponding Labanotation.
3.5 The distortion as a function of the test data percentage.
3.6 The rate-distortion (R-D) curve for residual coding.
4.1 The procedure to select coding parameters for I-frames, where e is the threshold of tolerable error.
4.2 Both spatial and temporal properties of motion contribute to the selection of I-frames. In this figure, we show how spatial features play a role in this decision process, where e is the threshold of tolerable errors.
4.3 The procedure used to select I-frames.
4.4 The distribution of the variance of sub-poses.
4.5 The distribution of the variance of frames quantized to the same code with 4-bit, 8-bit and 12-bit I-frame codebooks, respectively.
4.6 The distribution of the variance of prediction residuals using 4-bit, 8-bit, and 12-bit I-frame codebooks, respectively.
4.7 The distortion distribution with respect to fixed sampling with intervals equal to 1, 0.5, and 0.25 seconds.
4.8 The plot of the MSE distribution for files in the CMU mocap database with four compression ratios.
4.9 The plot of the compression ratio distribution for files in the CMU mocap database with five maximum error bounds.
5.1 The data structure of a partitioned interval can be stored in the form of a binary tree in (a), which corresponds to the case shown in (b).
5.2 Comparison of three temporal sampling schemes.
5.3 (a) The histogram of dofs and (b) the histogram of dof differences.
5.4 The error histograms after VQ over all frames in the whole CMU database for (a) dofs and (b) dof differences.
5.5 The error histogram after VQ over all frames in the whole CMU database for dof differences with a codebook of one half the size of the case shown in Fig. 5.4.
5.6 Percentage of the data in the CMU mocap database able to achieve the designated MSE at each compression ratio level.
6.1 The block diagram of the perception-based mocap data coding algorithm.
6.2 An example to illustrate the perception-based algorithm.
6.3 An exemplary dof curve.
6.4 The filtering criteria to remove a point in the candidate list.
6.5 Illustration of two filtering scenarios, where the red candidate sample is removed and the curve is modified from the pink slash line to the brown dot-slash line, which still offers a good approximation. (a) and (b) show cases in the green-backslash-shaded area and in the yellow-dotted area in Fig. 6.4, respectively.
6.6 Illustration of the merge scenario, where two candidate points, shown as the two red circles in (a), are replaced by their middle point, shown as the yellow circle in (b).
6.7 This curve will be considered as one positive segment instead of four segments of interleaving positive/negative/positive/negative sign. Segments of the opposite sign (negative in this example, since the overall segment is positive) are bounded by the conditions defined in Fig. 6.4.
6.8 The plot of the MSE as a function of the percentage of the training data size in the CMU database.
6.9 The 16 codewords of the 4-bit VQ codebook.
6.10 The histogram of the normalization parameter for the dof range values.
6.11 The plot of the distortion as a function of the number of bits used to encode the normalization parameter for the dof range value.
6.12 (a) The original dof curve, (b) the coded result after VQ, SQ and dyadic encoding, (c) the interpolated curve based on (b), (d) comparison of the interpolated result (the pink dotted curve) and the original curve (the black solid curve).
6.13 The averaged compression ratios of different motion categories in the CMU database.

Abstract

Mocap data have been widely used in many motion synthesis applications for education, medical diagnosis, entertainment, etc. In the entertainment business, synthesized motion can be easily ported to different models to animate virtual creatures. The richness of a mocap database is essential to motion synthesis applications. In general, the richer the collection, the higher the quality of the synthesized motion.
Since network bandwidth and storage capacity are limited, there are constraints on the size of the mocap collection that can be used. It is desirable to develop an effective compression scheme to accommodate a larger mocap data collection for higher quality motion synthesis. To synthesize natural and realistic motion from an existing motion capture database, particularly in the context of video game applications, an effective compression procedure enables efficient data management.

In this research, we explore the characteristics of mocap data and propose two real-time compression schemes that allow a flexible rate-distortion trade-off. These two compression schemes are designed to optimize different objective functions for different application purposes. Unlike previous work, the proposed schemes do not demand any prior knowledge of the motion type.

The first scheme aims at preserving nearly lossless content. It encodes prediction residuals, which allows more flexibility in bit allocation. We study the relationship between the mocap data, coding parameters, and coding performance. To be more specific, we use temporal and spatial features to characterize a clip of mocap data, and propose a rate control algorithm that adjusts coding parameters adaptively to deliver a result matching the target bit rate or error bound. With proper handling of temporal and spatial information, this scheme can achieve a coding gain of about 45:1 with low coding complexity and good visual quality, which is 2.5 times better than the state-of-the-art techniques.

The second scheme is perception-based compression, which aims at further compression with visually pleasant quality. In this scheme, we partition the motion into segments with positive and negative velocities and then compress each segment separately. This scheme can achieve a coding gain of at least 100:1, and can be used to provide a quick preview of the content of the database.
Chapter 1: Introduction

1.1 Significance of the Research

Motion capture (mocap) data are obtained by recording the temporal trajectories of position sensors mounted on subjects. The 3-dimensional (3D) position of each sensor is tracked and, as the information is mapped to the skeleton of the subject, we can calculate the temporal variation of the rotation of each joint in the skeleton. In general, the Euler representation is used to describe the rotation of each joint, i.e., the rotation angles about the X-, Y-, and Z-axes. The temporal trajectory of each parameter defines a degree-of-freedom (dof) curve. As a result, mocap data can be presented in two formats: i) marker positions (.c3d format) and ii) joint rotations (.amc format). The size of a mocap clip is proportional to the number of markers or the number of joints. To uniquely define the rotation of a joint, three markers are required. Thus, the size of a mocap file in the marker format is three times that in the joint format.

Mocap data have been widely used for many motion synthesis applications in education, medical diagnosis, entertainment, etc. Especially in the entertainment business, synthesized motion can be easily ported to different models to animate virtual creatures. Although quite a few physics-based (as opposed to data-driven) motion synthesis methods have been proposed, the naturalness of synthesized motion is still not well quantified. It is easier to ensure the naturalness of motion using mocap data. As a result, most physics-based motion synthesis methods actually adopt a hybrid approach, which still demands some mocap data to assist. The richness of a mocap database is essential to motion synthesis applications. In general, the richer the collection, the higher the quality of the synthesized motion. However, network bandwidth and storage capacity are limited. In other words, there are constraints on the size of the mocap collection to be used.
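The bookkeeping behind these relative format sizes can be sketched in a few lines. The helper names below are ours, not the thesis's; the sketch assumes the standard skeleton described in Chapter 2 (18 joints with three Euler dofs each, plus six global dofs) and the thesis's observation that three markers are needed to define one joint rotation.

```python
# Back-of-the-envelope comparison of the two mocap formats for one frame.
JOINTS = 18
DOFS_PER_JOINT = 3   # Euler angles about the X-, Y-, and Z-axes
GLOBAL_DOFS = 6      # 3 for global position + 3 for global orientation

def joint_format_values(joints=JOINTS):
    """Values per frame in the joint-rotation (.amc) format."""
    return joints * DOFS_PER_JOINT + GLOBAL_DOFS

def marker_format_values(joints=JOINTS):
    """Values per frame in the marker-position (.c3d) format: three markers
    uniquely define a joint rotation, so this format carries three times
    the data of the joint format."""
    return 3 * joint_format_values(joints)

assert joint_format_values() == 60
assert marker_format_values() == 180
```

This compactness is one reason the research below starts from the joint-rotation format.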
It is desirable to develop an effective compression scheme to accommodate a larger mocap data collection for higher quality motion synthesis. As to mocap database management, it is strongly desired to have a preview feature so that users can take a quick look at the data in the database. This preview does not have to be close to the ground truth on a frame-by-frame basis, but it should be perceptually similar. To meet this demand, we develop two motion compression schemes. One provides a coding result that is close to the ground truth in terms of low mean-squared error (MSE). The other offers a coding result that is perceptually similar to the original motion and achieves a very high compression ratio. With the perception-based compression scheme, users can have an efficient preview of the various motion clips stored in the database.

In this thesis, we focus on techniques for mocap data compression. Several challenging research problems along this line are described below.

Graphic data can be rendered as image and video and compressed accordingly. Image/video coding has been extensively studied in the last three decades. Although some of the coding ideas can be leveraged, most of them are not applicable to mocap data compression. For example, the high frequency information in the spatial domain of image/video may not be visible and can be sacrificed to achieve a higher compression ratio. However, high frequency in the temporal domain of mocap data tends to carry important information that is critical to human perception. For example, a contact with the environment may result in a local extremum in a dof curve, which should be well preserved. The well-known foot-skating artifact [14] is caused by failing to preserve clean contact points.

The format of mocap data is similar to that of 3D animated meshes [12]. The data of 3D animated meshes consist of spatial positions of vertices over time, while mocap data consist of dof values over time.
The spatial correlation of mesh vertices is stronger than that of dofs, so these two data types demand different treatments.

The marker position format (.c3d) provides data in a more linear space than the joint rotation format (.amc). Since the state-of-the-art mocap compression scheme developed by Arikan [2] used Bezier curves to approximate the dof curves by exploiting this linear property, it operates on the marker position format, which is three times as large as the joint rotation format. It is desirable to begin with the joint rotation format instead.

Arikan's method uses principal component analysis (PCA) to analyze the mocap data in the database for compression. Thus, it is an off-line, non-real-time compression algorithm. Besides, the motion clustering step has to be performed manually, which is not practical as the database grows larger. It is desirable to develop an automatic and real-time mocap data coding scheme.

We attempt to address these problems and propose a new mocap data compression scheme in Chapter 3, and we analyze its performance in Chapter 4. In Chapter 5, we further improve the coding methods proposed in Chapter 4 with two major improvements. First, the temporal information of sampled frames is represented on a dyadic grid to save the bits required to store the temporal indices of adaptively-sampled I-frames. Second, the spatial information is converted to the velocity domain in the pre-processing stage, and correlated joints are identified for special treatment. This pre-processing module improves the efficiency of the vector quantization (VQ) module in a later stage. The perception-based compression algorithm is discussed in Chapter 6. Instead of preserving frame-by-frame precision, we aim to preserve perceptual similarity. The perception-based algorithm achieves a very high compression ratio (from 100:1 to 300:1) and allows a quick preview of mocap data in a database. We will focus on techniques for motion synthesis in the future work.
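The velocity-domain conversion mentioned above can be sketched as a simple invertible transform on a single dof curve. The function names are ours; the actual pre-processing module of Chapter 5 additionally handles correlated joints and the error accumulation problem.

```python
# Sketch: code frame-to-frame dof differences (a velocity-like signal that
# clusters tightly around zero and therefore quantizes efficiently)
# instead of raw dof values. Lossless in exact arithmetic.

def to_differences(dof_curve):
    """Keep the first value, then store successive differences."""
    return [dof_curve[0]] + [b - a for a, b in zip(dof_curve, dof_curve[1:])]

def from_differences(diffs):
    """Invert the transform by cumulative summation."""
    out = [diffs[0]]
    for d in diffs[1:]:
        out.append(out[-1] + d)
    return out

curve = [0.0, 0.25, 0.75, 0.5, 0.5]
assert from_differences(to_differences(curve)) == curve
```

Once the differences are quantized, quantization errors propagate through the cumulative sum; this is precisely the error accumulation problem analyzed in Sec. 5.4.2.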
Preliminary treatment of this topic is described in Chapter 7.

1.2 Review of Previous Work

1.2.1 Mocap Data Compression

Previous research on mocap data compression [2], [22], [8], [5] consists of the following three ingredients.

1. Motion categorization. With proper categorization, one can cluster similar motion clips together so that the PCA application is more effective. However, it is difficult to automate the categorization process, and it still demands human intervention nowadays. The categorization step in previous research is considered part of the pre-processing and thus not counted in the total execution time.

2. Principal component analysis (PCA). The PCA technique has been used to represent and encode animation sequences before, e.g., [1, 28]. For mocap data compression, the PCA technique is used for dimension reduction. For the PCA to be effective, it requires a reasonable amount of similar data in the training set.

3. Inverse kinematics (IK) in post-processing. An inverse kinematics mechanism based on [14] can be adopted to encode the environmental contacts as a post-processing step to ensure clean contact motion. However, this mechanism might over-stretch some joints, resulting in an artifact known as the knee-popping artifact.

There are two main shortcomings in the existing mocap data compression algorithms. First, as automatic motion clustering is difficult, it is usually a semi-automatic process that demands human guidance. Second, the amount of similar mocap data required for PCA training may not be available in a general context. To overcome these two shortcomings, the methodology proposed in this research does not demand any prior knowledge of the motion category and can be applied to generic mocap data. Similar to previous work, we include the IK technique in the post-processing step.
1.2.2 Motion Representation

The use of a simplified data set to represent a motion sequence for summary, abstraction and description is important for a diverse spectrum of fields, ranging from the arts to the sciences. Some previous work is briefly summarized below.

Assa et al. [3] introduced a method that created an action synopsis using still images. Their method carefully selected key poses based on an analysis of a skeletal animation sequence to facilitate the expression of complex motions in a single image or a small number of concise views. This approach projected a high-dimensional motion curve into a low-dimensional Euclidean space while preserving the main characteristics of the skeletal action. The lower complexity of the low-dimensional motion curve allows a simple iterative method to analyze the curve and locate significant points that are associated with key poses of the original motion.

Ogale et al. [19] represented human action in terms of atomic body poses. The collection of body poses was stored implicitly as a set of silhouettes seen from multiple viewpoints. No explicit 3D poses or body models were used, and individual body parts were not identified. Actions and their constituent atomic poses were extracted from a set of multi-view multi-person video sequences by an automatic keyframe selection process, and used to automatically construct a probabilistic context-free grammar (PCFG) which encodes the syntax of action.

1.2.3 Motion Interpolation/Synthesis

One can synthesize a sequence of images portraying continuous motion by interpolating between a set of keyframes, a problem that arises in many applications such as keyframe animation. If keyframes are specified by the parameters of moving objects at several instants of time (e.g., position, orientation, velocity), then the goal is to find their values at the intermediate instants of time.
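The baseline operation here, finding a parameter's value between its two surrounding keyframes, can be sketched as a per-parameter linear interpolator. This is a minimal illustration with names of our choosing, not code from any of the cited works.

```python
# Linear interpolation of one motion parameter between keyframes.
from bisect import bisect_right

def interpolate(keys, t):
    """keys: sorted list of (time, value) pairs; t: query time.
    Values outside the keyframe range are clamped to the end keys."""
    times = [k[0] for k in keys]
    i = bisect_right(times, t)
    if i == 0:
        return keys[0][1]
    if i == len(keys):
        return keys[-1][1]
    (t0, v0), (t1, v1) = keys[i - 1], keys[i]
    w = (t - t0) / (t1 - t0)          # fractional position in the interval
    return v0 + w * (v1 - v0)

keys = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0)]
assert interpolate(keys, 0.5) == 1.0  # halfway between the first two keys
assert interpolate(keys, 2.0) == 2.0  # flat segment
```

Interpolating each dof this way, independently of the others, is exactly what the next paragraph criticizes: the physics coupling the parameters is ignored.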
Previous approaches to this problem constructed these intermediate, or in-between, frames by interpolating each of the motion parameters independently. This often produces unnatural motion, since the physics of the problem is not considered and each parameter is obtained independently.

Rose et al. [26] proposed a technique to interpolate between exemplary motions obtained from live motion capture or produced by traditional animation tools. These motions may be characterized by emotional expressiveness or by control behaviors such as turning or going uphill/downhill. They called such parameterized motions "verbs" and the parameters that control them "adverbs." Verbs can be combined with other verbs to form a "verb graph," with smooth transitions between them, allowing an animated figure to exhibit a vast repertoire of expressive behaviors. A combination of radial basis functions and low order polynomials is used to create the interpolation space between exemplary motions.

Brotman et al. [6] modeled the motion of objects and their environment by differential equations based on classical mechanics and proposed a motion interpolation technique for realistic human animation, namely, to blend similar motion samples with weighting functions whose parameters are embedded in an abstract space. Their method is, however, insensitive to statistical properties, such as correlations between motions. In addition, it lacks the capability to evaluate the reliability of synthesized motion quantitatively.

Mukai and Kuriyama [18] proposed a method that treats motion interpolation as statistical prediction of missing data in a parametric space. For a given set of parameters, their method optimizes the interpolation kernel at each frame statistically and uses a pose distance metric to analyze the correlation. Since motions can be well predicted using the spatial constraints in the parametric space, there exist few undesirable artifacts, if any.
This property alleviates the problem of spatial inconsistency, such as foot-sliding, associated with many existing methods. The methods described above can produce good interpolation results at a higher computational cost. However, they are not free from errors.

1.3 Contributions of Proposed Research

In Chapters 3 and 4, we explore the characteristics of mocap data and propose a new compression framework which allows a flexible rate-quality trade-off. A high level overview of the proposed framework is illustrated in Fig. 1.1. The contributions of our research are detailed below.

[Figure 1.1: The flowchart of the mocap encoder, taking the original data through temporal sampling, frame quantization, and residual coding to produce the compressed data.]

- Mocap data format used for compression. Mocap data can be stored in two inter-changeable formats: 1) positions of the markers and 2) degrees of freedom (dofs) of angles. A standard human model used in mocap data has 18 joints, each of which has three dofs; adding three dofs for the global position and three for the orientation, a standard human model has 60 degrees of freedom. To have a one-to-one mapping of a pose between these two formats, the amount of data that describes the marker positions is three times as large as the number of dofs. In other words, the dof format is more compact than the marker position format. Arikan's scheme [2] operates on the marker position format. This format was chosen in [2] to take advantage of the fact that the space of marker positions is more continuous and interpolation is easy. He argued that, for a duration of t, if the foot is on the ground at the beginning and at the end, the interpolated result in the marker position format will lead to the desired positioning: the foot will be on the ground during the whole time. In contrast, interpolated results based on dofs do not guarantee that the foot will always be on the ground. It might slide over, float above, or penetrate into the ground.
There are, however, two main advantages to processing mocap data in the dof format. First, it is a more compact representation, so the amount of data to compress is smaller. Second, it is easier to compare two poses using the dof representation. As a result, our framework operates on dofs directly. However, we rely on a post-processing step to preserve contact points.

- Time sampling. The correlation of consecutive frames in mocap data is high, so it is intuitive to remove the temporal redundancy via adaptive time sampling. Arikan's framework [2] uses a fixed-time interval (anything between 8 and 15 frames, which is equivalent to a sampling rate between 1/16 and 1/8). In our framework, we reduce the dimension of each frame with VQ to form basic elements of the motion and then pick extrema in the motion as key poses. Simply speaking, our sampling scheme is more flexible, and sample points can be saved using an event-triggered approach.

- Computational complexity. Arikan's scheme uses Bezier curves as the approximation, where cubic polynomial computation is inevitable. Instead of only encoding the approximating curves, we perform residual coding as well, which allows more flexibility in bit allocation. With residual coding, we can adopt simple linear interpolation as the approximation.

- Prior knowledge of the data. Arikan's scheme was applied to a database of mocap data by segmenting the motion and approximating the curves in each segment with Bezier curves. Similar motion data have to be clustered together to form a motion category. Our method is illustrated in Fig. 1.1. We take advantage of time-domain redundancy by temporal sampling, and of space-domain redundancy by vector-quantizing the sampled poses. Our method encodes time and space separately and does not need prior knowledge of the motion type.

The algorithm proposed in Chapters 3 and 4 can achieve a compression ratio of 20:1 and still preserve decent quality.
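As a toy illustration of the three stages in Fig. 1.1 applied to a single dof curve, the sketch below picks local extrema as key poses, linearly interpolates the in-between frames, and uniformly quantizes the prediction residuals. This is our simplification with made-up names: the actual system vector-quantizes whole poses and selects its coding parameters adaptively.

```python
# Toy single-curve version of the Fig. 1.1 pipeline:
# temporal sampling (extrema) -> prediction -> residual quantization.

def pick_extrema(curve):
    """Event-triggered sampling: endpoints plus every local max/min."""
    idx = [0]
    for i in range(1, len(curve) - 1):
        if (curve[i] - curve[i - 1]) * (curve[i + 1] - curve[i]) < 0:
            idx.append(i)
    idx.append(len(curve) - 1)
    return idx

def encode(curve, step=0.05):
    idx = pick_extrema(curve)
    keys = [(i, curve[i]) for i in idx]
    residuals = []
    for a, b in zip(idx, idx[1:]):
        for i in range(a + 1, b):
            w = (i - a) / (b - a)
            pred = curve[a] + w * (curve[b] - curve[a])   # linear prediction
            residuals.append(round((curve[i] - pred) / step))  # uniform quantizer
    return keys, residuals

def decode(keys, residuals, step=0.05):
    out = {i: v for i, v in keys}
    r = iter(residuals)
    idx = [i for i, _ in keys]
    for a, b in zip(idx, idx[1:]):
        for i in range(a + 1, b):
            w = (i - a) / (b - a)
            out[i] = out[a] + w * (out[b] - out[a]) + next(r) * step
    return [out[i] for i in range(len(out))]
```

With quantizer step 0.05, every reconstructed frame is within 0.025 of the original, while only the key poses and small integer residuals need to be stored.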
We further improve its performance along two directions in Chapter 5. First, we propose an adaptive temporal sampling algorithm and show that the temporal information of the sampled I-frames can be coded efficiently using the dyadic grid. Second, we apply a pre-processing technique to the spatial information and convert it to the velocity representation, which improves the efficiency of the VQ coding.

In Chapter 6, we propose a perception-based mocap encoding algorithm. This algorithm divides a motion clip into a series of monotonically increasing/decreasing sequences and encodes each segment using normalization and VQ. The spatial normalization parameters are further compressed with scalar quantization. The temporal normalization parameters are compressed using dyadic coding.

1.4 Organization of the Thesis

The rest of this thesis is organized as follows. The background of this research, including a generic introduction to motion capture data, the state-of-the-art mocap data compression and an overview of motion synthesis, is reviewed in Chapter 2. Then, a new mocap data compression algorithm is proposed in Chapter 3. The performance of the proposed method is studied in Chapter 4. An enhanced algorithm based on the proposal in Chapter 3 is presented in Chapter 5. Perception-based mocap data coding is discussed in Chapter 6. Finally, concluding remarks and future work on motion synthesis are given in Chapter 7.

Chapter 2: Research Background

We provide research background on mocap data compression and motion synthesis in this chapter. We first describe the mocap data in Sec. 2.1 and then present the state-of-the-art mocap data compression algorithm in Sec. 2.2. Human perception of motion is reviewed in Sec. 2.3, which will be used to justify the perception-based encoding algorithm proposed in Chapter 6. Last, we briefly discuss motion synthesis in general and the motion graph technique in particular in Sec.
2.4, since the latter will serve as a basis for the proposed motion synthesis scheme in Chapter 7.

2.1 Mocap Data

Motion capture (mocap) data are obtained by capturing the 3D trajectories of markers placed on subjects. They are tailored to the skeleton of the captured subject and stored as the trajectory of each degree of freedom (dof) over time. We show a simplified yet widely-used human skeleton that has 18 joints in Fig. 2.1, where each joint in the skeleton is labeled by a white cross. The rotation of each joint is represented by Euler angles with three dofs. There are six additional dofs used to record global position and orientation.

Figure 2.1: Illustration of a sample human skeleton for motion capture.

Consequently, the captured mocap data for the motion of a person consist of 18 × 3 + 6 = 60 parametric curves (or dof curves) over time. Generally, we can represent these dof curves by q_i(t), i = 1, ..., N, where N = 60 and q_i(t) is the value of dof i at frame t. Some exemplary dof curves are shown in Fig. 2.2.

Figure 2.2: Exemplary mocap data as functions of time.

In the standard mocap data format, each dof is captured at 120 frames per second (fps). The data size of each dof, q_i(t), at each frame is four bytes, so its precision can be as high as 10^-6. 57 out of the 60 dofs are Euler angles, whose ranges are in [-π, π). The three dofs representing the global position have no range limit in principle. However, in practice, they are bounded by the area of the motion capture environment. To store a complete set of mocap data effectively, we may compress these N dof curves by exploiting their spatial and temporal correlations. A typical dof curve as a function of time is shown in Fig. 2.3. We can process the curve and obtain the dynamics of a joint, e.g., by taking the first derivative to get the speed and the second derivative to get the acceleration. As shown in Fig.
2.3, the time axis is divided into five intervals of equal duration. The dof segments behave similarly in intervals 1 and 5, where the value spans a wide range. The segment in interval 3 is relatively flat, indicating that the motion is static in this interval. The segments in intervals 2 and 4 are both jerky, which suggests a more rapid change in the speed and the acceleration. However, since the fluctuation in interval 4 is restricted to a small range, it could be noise in the capturing process.

Figure 2.3: An exemplary dof curve.

If we sample the dof curve adaptively as shown in Fig. 2.4, we can use an interpolation technique to reconstruct the original dof curve. A reconstructed dof curve is shown by the dotted line. Clearly, the reconstructed dof curve is only an approximation of the original one. It is important to consider a good sampling technique in the temporal domain so that the error between the original and the reconstructed curves is minimized with a fixed amount of coding bits.

Figure 2.4: The dotted line is used as an approximation to the curve in Fig. 2.3.

The motion synthesis performance is related to the mocap data coding technique as explained below. A typical duration of mocap data used in motion synthesis is one minute. Without compression, the raw data of a single character motion over one minute have a size of

4 Bytes/dof × 60 dofs/frame × 120 fps × 60 seconds ≈ 1.728 MBytes.

The performance of any generic motion synthesis scheme strongly depends on the resolution of the mocap data. The more we can compress the mocap data, the more data samples can be fed into the motion synthesis module to get more accurate motion trajectories.

2.2 Mocap Data Compression

In this section, we first describe the mocap data compression problem and then present the state-of-the-art mocap data compression scheme.
2.2.1 Coding Performance Measures

To evaluate the effectiveness of a mocap data coding scheme, we consider two factors: the compression ratio (or the bit rate) and the quality of the coded dof curves (or the distortion). The compression ratio is defined as the original file size divided by the compressed file size. For example, a 5:1 compression ratio implies that the final file size is only 20% of the original. Another way to measure the compression performance is the bit rate. In the current context, it is defined as the average number of bits per dof-second. Without compression, the raw bit rate is

4 Bytes/dof × 8 bits/Byte × 120 fps = 3.84 Kbits per dof-second.

Our proposed coding scheme can offer a nearly lossless coding result at the 5:1 compression ratio and deliver decent quality at the 20:1 compression ratio. The quality measure can be examined from two perspectives: 1) the objective measure and 2) the subjective measure. The objective measure is defined as the closeness between the original and the reconstructed dof curves. Their difference is usually measured in terms of the mean squared error (MSE). The subjective measure can be stated as the naturalness of the coded motion, as compared to the original, with respect to human viewers. Unfortunately, there is no commonly agreed way to describe naturalness. In this work, to ensure that naturalness is well preserved, we consider two factors: 1) physical feasibility and 2) avoidance of the foot-skating artifact. Physical feasibility is used to check whether the motion is physically possible (or, in other words, whether the motion violates a certain dof range). For example, the knee joint cannot bend forward. By the foot-skating artifact, we refer to an incomplete and/or unstable foot contact point. When a character is walking, his/her supporting foot on the ground should be steady (rather than moving around). When the mocap data are compressed, the contact point will not be as steady as in the original; this issue is not restricted to foot contacts but applies to all kinds of contact points.
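The rate arithmetic above can be checked in a few lines; the constants come straight from the text (4 bytes per dof, 60 dofs, 120 fps, a one-minute clip):

```python
# Rate arithmetic from Sec. 2.2.1; constants are taken from the text.
BYTES_PER_DOF = 4
FPS = 120
N_DOFS = 60
SECONDS = 60

# Raw bit rate: 4 Bytes x 8 bits x 120 fps = 3840 bits = 3.84 Kbits per dof-second.
raw_bits_per_dof_sec = BYTES_PER_DOF * 8 * FPS

# Raw size of a one-minute, single-character clip (cf. Sec. 2.1): 1.728 MBytes.
raw_clip_bytes = BYTES_PER_DOF * N_DOFS * FPS * SECONDS

def compressed_fraction(ratio):
    """Fraction of the original file size left at a given ratio (5 -> 5:1)."""
    return 1.0 / ratio

print(raw_bits_per_dof_sec, raw_clip_bytes, compressed_fraction(5))
```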
To address this problem, we propose to use a post-processing module to fix artifacts in contact points.

2.2.2 State-of-the-Art Mocap Data Coding Algorithm

Arikan [2] proposed an effective algorithm to compress the mocap data in a database, which is the state-of-the-art mocap data compression scheme. His algorithm can be summarized as follows.

1. Pre-processing Step
The motion database is partitioned into clips, each of which consists of k subsequent frames. For example, the first k frames form the first clip, the next k frames form the second, and so on. The clip size k is a parameter chosen by users; it usually takes a value between 8 and 16. Then, a proper rotation and translation is applied to the target character of each clip so that the character is located at the origin with a standard orientation at the first frame. The absolute position and orientation of the character before this transformation, which is characterized by 6 parameters, is stored.

2. Transformation of Variables
The original dof variables are converted to positional variables via a one-to-one mapping. The positional representation is used because its relationship is close to linear. However, the positional representation has a file size three times as large as the original. Then, cubic Bezier curves are used to fit the positional curves.

3. Coding of Motion Trajectories
In each clip (of size k frames), each marker's trajectory is approximated by a Bezier curve. Since a Bezier curve can be uniquely represented by 4 control points, each marker's trajectory can be represented by a vector of 3 × 4 = 12 dimensions (the x-, y-, z-coordinates of the 4 Bezier control points). There are three markers on each bone, so there are 3 × (number of bones) markers. As a result, each clip can be represented by a vector of dimension 12 × 3 × (number of bones).
Considering a generic human model with 18 bones/joints, the dimension is 648, which is very high; thus, some dimension reduction scheme is desirable. He first groups similar-looking clips into distinct clusters and then performs principal component analysis (PCA) in each cluster. He referred to this as clustered principal component analysis (CPCA). Note that the clustering process and the CPCA have to be performed off-line in the pre-processing stage. Arikan's method has three main parameters:

1. The number of frames in a clip, denoted by k.
If k is too small, one cannot take full advantage of temporal coherence. If k is too large, the relationship between joints is not linear and the CPCA will perform poorly. In [2], the upper bound on k is set to 16 to 32 frames (130-270 milliseconds).

2. The upper bound on the reconstruction error of CPCA.
The smaller this number is, the more coefficients are needed for the coding of each clip.

3. The number of clusters chosen in CPCA.
A more diversified database demands more clusters for the optimal representation. However, the overhead will increase if the number of clusters becomes too large.

These three parameters are selected manually in [2].

2.3 Human Perception of Motion

If motion distortion is not obvious to a human viewer, we may trade numerical accuracy [30], [20] for higher coding performance. There has been some experimental research on human perception in viewing animated motion. Harrison et al. [10] observed that limbs in an animal model, including humans, can be changed to a certain extent without being noticed by a viewer. They concluded that, while a length change of 3% can be perceived when relevant links are given full attention, changes of over 20% can go unnoticed if there is no focused attention. O'Sullivan and Dingliana [20] identified several factors that affect perception: eccentricity, separation, distractions, causality, and accuracy of physical response. They also designed a set of psychophysical experiments to determine thresholds for human sensitivity to dynamic anomalies, including angular, momentum and spatio-temporal distortions [21]. Reitsma and Pollard [24] developed an estimation
They also designed a set of psychophysical experiments to determine thresholds for human sensitivity to dynamic anomalies, including angular, momentum and spatio-temporal distortions [21]. Reitsma and Pollard [24] developed an estimation 20 scheme for perceived errors in animated human motion and observed that velocity errors in the horizontal direction are easier to detect than those in the vertical direction and accelerations are easier to detect than decelerations. We implement a perception-based mocap coding algorithm by leveraging the above-mentioned properties. 2.4 Motion Synthesis Despite recent breakthrough in robotics, the motion of a person and that of a robot are still visually dierent. The subtle dierence, referred to as naturalness, cannot be easily quantied [25]. For this reason, most physics-based motion synthesis methods [9] are actually a hybrid one. That is, they demand a limited amount of mocap data to enhance the naturalness of synthesized motion. On the other hand, to avoid the use a huge amount of mocap data, there is a general trend in data-driven methods, which is also in favor of a hybrid solution [17]. The motion graph (MG) concept [13] has dominated the data-driven methodology in recent years [15] [16] [27]. Until now, there exist no conclusive answers to the following three questions. 1. How to automatically organize motion clips for ecient search? 2. How to evaluate the similarity of two poses [7], [29]? 3. How to properly segment the data in time [11], [4]? We attempt to address these three issues in Chapter 7. Some preliminary results will be presented. That is, we break down a full pose to partial ones, which reduces the 21 Figure 2.5: An example of a synthesized motion using the MG approach. dimension of a database to a manageable scale. The knowledge from the literatures of dancing, robotics, and bio-mechanics helps build up rules to regulate motion synthesis. More investigation will be conducted in the future. 
2.4.1 Motion Graph

Given a set of keyframes of the mocap data, seamless transitions can be added between them to form a directed graph, which is called the motion graph (MG) [13]. Every edge of the MG is a clip of motion, and nodes serve as choice points connecting clips. Any sequence of nodes, or walk, is itself a valid motion. A user may exert control over the database by extracting walks that meet a set of constraints. An example given by [13] is shown in Fig. 2.5.

Chapter 3 Mocap Data Compression Algorithm

3.1 Introduction

Richness of a motion capture (mocap) data collection is essential to motion synthesis applications. Mocap data are obtained by recording the temporal trajectories of position sensors mounted on a subject. The 3-dimensional (3D) position of each sensor is tracked, and the temporal variation of each dimension defines a curve of a degree of freedom (dof). Given a collection of mocap data, we can synthesize new motion for applications in movies or video games. The richer the collection, the higher the quality of the synthesized motion. The volume of mocap data is growing as the price of mocap equipment is dropping. However, due to the limitations on network bandwidth and storage capacity, the size of mocap data available to a specific application is still constrained. It is desirable to develop mocap data compression algorithms to accommodate a larger collection for higher-quality synthesis. Mocap data coding enables an application developer to handle a large amount of data in their retrieval and transmission. Such a management tool allows the re-use of acquired and stored mocap data, avoiding motion capture efforts from scratch. In this chapter, we study the mocap data compression problem and propose a new compression algorithm to meet its requirements. The rest of this chapter is organized as follows. An overview of the mocap data compression system is given in Section 3.2.
The system consists of three main modules: 1) temporal sampling, 2) VQ quantization of selected I-frames, and 3) interpolation and residual coding. These three modules are examined in detail in Sections 3.3, 3.4, and 3.5, respectively. Finally, concluding remarks and future research topics are given in Section 3.6.

3.2 Overview of Proposed Mocap Data Compression System

For a motion clip containing N dofs over L frames, the full representation of this motion is the N × L matrix

( q_1(1)  q_1(2)  ...  q_1(L) )
( q_2(1)  q_2(2)  ...  q_2(L) )
(  ...     ...    ...   ...   )
( q_N(1)  q_N(2)  ...  q_N(L) )

where q_i(k) is the value of dof i at frame k. In practice, we can sample the motion sequence at certain frames and then approximate the motion by interpolating the sampled frames. Mocap data compression can be achieved by reducing their spatial and temporal redundancies. The N dof values at the same time instance define an N-dimensional vector, which is called a frame. Each frame is either directly coded or predicted by a reference frame. Borrowing terms from video coding, the former is called an I-frame while the latter is called a B-frame. The accuracy of I-frames will affect that of B-frames, since B-frames are predicted from I-frames. Generally speaking, it is worthwhile to spend more bits on I-frames than on B-frames. Representative frames are selected to serve as I-frames. The selection of proper I-frames is important since it affects the compression ratio and the quality of the motion. As shown in Fig. 3.1, the general idea of the proposed mocap data compression algorithm is described below. The coding system consists of the following three modules.

1. Temporal Sampling
We will develop a rule to select proper frames to serve as I-frames.

2. Vector quantization of selected I-frames
Each I-frame is vector quantized.
For human mocap data, we can exploit the symmetry of the human skeleton to break down a pose, and apply different codebooks to different parts of the human body.

3. Interpolation and residual coding
We will describe a scheme to predict the B-frames from their adjacent I-frames via interpolation. Furthermore, the prediction error of B-frames will be encoded, which is called residual coding.

The above three coding modules are examined in detail in Sections 3.3, 3.4, and 3.5, respectively. Assigning some bits to residual coding may bound the error more effectively than assigning all bits to the quantized dof vector. B-frames are also preserved with residual coding.

Figure 3.1: Overview of the mocap data encoding system.

For B-frames, the bits spent on residual coding not only enhance the quality, but also allow a simpler interpolation algorithm, which is advantageous for its low complexity. With residual coding, errors introduced in the quantization and prediction stages can be controlled, thus guaranteeing the quality of the compressed mocap data to some degree. As compared to prior art, the proposed coding system has several advantages.

Excellent coding performance
It can reach a compression ratio of 20:1 with decent quality in real time, which outperforms prior art by a significant margin.

Flexibility in rate control
The coding algorithm is flexible to control. Specifically, it allows a trade-off between quality and the bit rate.

No prior knowledge required on motion category
It does not demand knowledge of the motion category. Thus, no human supervision is needed in the coding process.

These points will be elaborated in the next chapter.

3.3 Temporal Sampling

As a dof curve is smooth most of the time, this temporal redundancy can be exploited to achieve higher mocap data compression.
Specifically, we may sample a dof curve at certain time instances and then interpolate the skipped values based on the sampled values. In this section, we compare several different sampling schemes. It could be attractive to sample different dof curves at different temporal locations. However, the coding of these temporal locations demands extra overhead bits. Thus, we sample all dof curves at the same temporal locations in our design. Sampling a curve at a fixed interval is straightforward. It is, however, difficult to pre-select an optimal interval length. If the interval is too long, it is difficult to interpolate the dof values accurately based on the sampled points. If the interval is too short, one may sacrifice the coding performance gain. Consider the example in Fig. 3.2, where T is a pre-selected interval length, circles denote sampled points, and dotted curves are dof curves predicted by interpolating the sampled points. The prediction in the last interval of Fig. 3.2(a) is far from the ground truth. The interval in Fig. 3.2(b) is shorter than T, but the prediction is still not good in some intervals. The predicted curve in Fig. 3.2(c) is satisfactory. However, as compared with Fig. 3.2(a), we see that six out of the thirteen points are redundant.

Figure 3.2: The solid line is the ground truth, the dotted line is the approximated curve, and circles are sampled points. The predictions in (a) and (b) are worse than those in (c) and (d). Fixed-interval sampling is used in (a)-(c) and adaptive sampling in (d).

The above example suggests the use of adaptive sampling. To implement adaptive sampling, one idea is to select the local minima/maxima of the curve as sample points and interpolate them with a piece-wise Hermite spline, as shown in Fig. 3.2(d). The local minima/maxima are locations where a dof curve changes its direction. Thus, such a selection captures the dynamics of the motion.
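This extrema-based selection and the piece-wise Hermite reconstruction of Fig. 3.2(d) can be sketched as follows. The test signal is synthetic, and the finite-difference knot tangents are an assumption of this sketch (the text does not specify how tangents are chosen):

```python
import numpy as np

def extrema_indices(curve):
    """Local minima/maxima (slope sign changes) plus both endpoints."""
    d = np.diff(curve)
    turns = np.where(d[:-1] * d[1:] < 0)[0] + 1
    return np.unique(np.concatenate(([0], turns, [len(curve) - 1])))

def hermite_reconstruct(curve, keys):
    """Piece-wise cubic Hermite through the sampled points; tangents are
    simple finite differences (an illustrative assumption)."""
    x = keys.astype(float)
    y = curve[keys]
    m = np.gradient(y, x)                     # tangent estimates at the knots
    out = np.empty_like(curve, dtype=float)
    for k in range(len(keys) - 1):
        h = x[k + 1] - x[k]
        t = (np.arange(keys[k], keys[k + 1] + 1) - x[k]) / h  # normalized to [0,1]
        h00 = 2 * t**3 - 3 * t**2 + 1         # Hermite basis polynomials
        h10 = t**3 - 2 * t**2 + t
        h01 = -2 * t**3 + 3 * t**2
        h11 = t**3 - t**2
        out[keys[k]:keys[k + 1] + 1] = (h00 * y[k] + h10 * h * m[k]
                                        + h01 * y[k + 1] + h11 * h * m[k + 1])
    return out

t = np.linspace(0, 2 * np.pi, 120)            # one second of one dof at 120 fps
dof = np.sin(t)
keys = extrema_indices(dof)
recon = hermite_reconstruct(dof, keys)
print(len(keys), float(np.max(np.abs(dof - recon))))
```

Note that the reconstruction passes exactly through the sampled points, so only the tangent choice determines the in-between error.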
The other idea comes from the interpolation perspective, which selects a new sample point based on the current interpolation result. As compared to fixed-interval sampling, sampling at the local extrema can preserve the dynamics without over-sampling the motion. By comparing Figs. 3.2(c) and 3.2(d), we see clearly that adaptive sampling demands fewer samples to achieve a good prediction. It is also possible to adopt a hybrid sampling method. That is, it first segments the motion into multiple equal intervals and then performs adaptive sampling in each interval as described above. The segmentation in the first step prevents error propagation, allows a fixed-size buffer, and reduces the computational complexity. The interval length can be relatively long to avoid over-sampling. Although the duration is long, motion with high frequency will also be taken care of, since those points will be selected in the second stage. We compare the performance of the fixed sampling (FS), adaptive sampling (AS) and hybrid sampling (HS) methods in terms of complexity and distortion in Table 3.1. In the complexity evaluation, the averaged sampling interval is n frames for the AS. For the HS, the duration of the equal interval is n frames and we choose at most k samples in each interval.

Table 3.1: Comparison of different temporal sampling methods, where the distortion is the average MSE per dof per frame.

             FS (1s)  FS (0.5s)  FS (0.1s)  AS    HS
complexity   O(1)     O(1)       O(1)       O(n)  O(kn)
distortion   0.29     0.13       0.06       0.15  0.08

3.4 I-Frame Coding: Vector Quantization

Human motion is constrained by the skeleton-muscle system of the captured subject. We observe that some dofs are highly correlated while others are less so. Thus, we divide the dofs into multiple groups and handle them separately according to the group characteristics. Specifically, we adopt the notion of Labanotation and decompose a pose into five sub-poses.
Furthermore, we apply the vector quantization (VQ) technique to the space formed by each sub-pose.

3.4.1 Decomposition of Dofs

Handling all dofs at the same time means dealing with a problem of N dimensions. In the current context, N = 60, which is too high to tackle directly. We may decompose the 60 dofs into several subgroups and treat them separately. Labanotation is the written language of motion; it is used to describe motion as notes are used to describe music. Labanotation segments the human body into 5 main parts: 2 arms, 2 legs and the torso. As illustrated in Fig. 3.3, (a) is the Labanotation of the four frames (A, B, C, D) of the motion shown in (b). Here, we adopt Labanotation to segment the human body into five groups of dofs.

Figure 3.3: (a) is the Labanotation of the series of motion in (b).

After the dof decomposition, we have the following two observations.

1. One limb can move independently of another. For example, if we decompose a human skeleton into two arms, two legs, and the torso, the right arm can move independently of the left arm. Besides, a human skeleton is left-right symmetric. That is, what can be achieved by the right half can be done by the left half in a mirrored manner. For real data, the distributions of the left and right motion data might be slightly different due to the subject's preference. For example, if the subject is right-handed, his right half is in charge of more motion activities than the left half. It is in general a good idea to leverage this symmetry.

2. A skeleton usually has a tree hierarchy. The parent dofs can be independent of the children dofs. For example, how the torso bends (parent dofs) does not clearly suggest how the arms and legs (children dofs) are placed. However, they are sometimes correlated such that the pose looks smooth and natural. For example, the knee joints are usually bent if the orientation of the thigh (depending on the thigh joint) is not perpendicular to the ground, i.e.
when a leg is raised, there are only certain poses which look natural.

For human motion, the left/right symmetry suggests that the motion database of the right half of the body should be a mirrored version of the motion database of the left half. Since the torso is not as flexible as the arms and legs, its main activity region, which is called the region of interest (ROI), is smaller than the ROI of the arms and legs. Furthermore, the legs are often responsible for balancing the body, and our thigh joint is not as flexible as our shoulder joint. Thus, the probability distributions of these dofs should be different. It is worthwhile to point out that the above partitioning may sacrifice potential correlations between limbs. For example, in walking motion, all five dof groups are highly correlated with each other. However, in most situations, this offers a reasonable solution. We can use Huffman or other entropy coding methods to reduce the bit rate. Among the 60 degrees of freedom, 3 global translation dofs are handled separately; we drop the 12 dofs with very small variance and let each group have 9 dofs. Furthermore, the left hand and the right hand share one sub-space, and so do the left leg and the right leg. Among the arms, legs and torso, the motion range of the arms is the widest, next the legs, and finally the torso.

3.4.2 Vector Quantization (VQ)

The basic idea of VQ is to divide a large number of vectors into groups, each of which is represented by its centroid. Since data points are represented by the index of their closest centroid, commonly occurring data have lower errors while rare data have higher errors. VQ is based on the competitive learning paradigm, and it is closely related to the self-organizing map model. It has a certain machine learning flavor, which suggests the hidden distribution of motion categories automatically. In contrast, motion classification by semantic meaning is mostly done manually as a pre-processing step.
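The dof bookkeeping described above can be sketched as follows. The index ranges are placeholders, not the skeleton's actual dof ordering; they only reproduce the accounting (3 global dofs handled apart, 12 dropped, five groups of 9 dofs with left/right limbs sharing a sub-space):

```python
import numpy as np

# Hypothetical dof layout for one 60-dof frame; the real ordering differs.
GLOBAL_DOFS = list(range(0, 3))               # handled separately
TORSO = list(range(3, 12))                    # 9 dofs
LEFT_ARM, RIGHT_ARM = list(range(12, 21)), list(range(21, 30))
LEFT_LEG, RIGHT_LEG = list(range(30, 39)), list(range(39, 48))
DROPPED = list(range(48, 60))                 # 12 near-constant dofs, not coded

def sub_poses(frame):
    """Split a frame into Labanotation groups; left and right limbs are
    routed to the same codebook (mirrored sub-space)."""
    f = np.asarray(frame)
    return {
        "torso": f[TORSO],
        "arms": [f[LEFT_ARM], f[RIGHT_ARM]],  # one shared arm codebook
        "legs": [f[LEFT_LEG], f[RIGHT_LEG]],  # one shared leg codebook
    }

parts = sub_poses(np.arange(60.0))
print(len(parts["torso"]), len(parts["arms"]), len(parts["legs"]))
```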
The proposed algorithm does not demand users in the loop and can be re-trained at finer scales automatically. Each frame is a vector of dimension N. We apply VQ to this vector since we would like to exploit the inter-dof correlation. The quantization error is defined by

Q_Err = Σ_i (q_i − C(q_i))²,

where C(q_i) is the representation of q_i in the VQ codebook. Several codebook sizes have been tested and their quantization errors are reported in Table 3.2. If we use the previously sampled frame to predict the current frame, the average error is 8.2819. If we aim to preserve sampled frames in a lossless manner, VQ can give a better result. If we settle for a lossy scheme, VQ has a smaller error. By adding one bit to the representation of the codeword, we double the codebook size. In Table 3.2, we see that the error can drop very fast with each extra bit. Furthermore, the reduced error might save us more than one bit in compressing the residual. We can further compress the indices of codewords with Huffman codes so that we can use a bigger codebook in practice.

Table 3.2: Quantization errors for codebooks of different sizes.

Rate (bits)  Error
0            12.8186
1            4.2900
2            2.4999
3            1.4691
4            0.9914
5            0.6807
6            0.4188
7            0.2389
8            0.1184
9            0.0396

To reduce the complexity, we follow the idea of dof decomposition in a frame as stated in the last subsection. The 6 global dofs determine the location of the subject and which direction he/she faces, while the remaining dofs determine the pose the subject is in. Since a subject can strike the same pose in various locations and facing various directions, the 6 global dofs are independent of the rest. The 3 dofs for the global orientation are bounded by (−π, π) while the 3 dofs for the global position have no limit. In our coding system, we apply scalar quantization to the 3 global position dofs and VQ to the dofs associated with limbs.
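The trend of Table 3.2 (error dropping sharply with each extra bit) can be reproduced qualitatively with a toy experiment. Plain k-means stands in here for the codebook trainer, and the 9-dimensional Gaussian "sub-poses" are synthetic, not real mocap frames:

```python
import numpy as np

def train_codebook(train, bits, iters=20, seed=0):
    """Plain k-means (Lloyd) codebook of size 2**bits -- a simple stand-in
    for a full VQ codebook trainer."""
    rng = np.random.default_rng(seed)
    code = train[rng.choice(len(train), 2 ** bits, replace=False)].copy()
    for _ in range(iters):
        d = ((train[:, None, :] - code[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)                  # nearest codeword per vector
        for j in range(len(code)):            # move codewords to centroids
            members = train[assign == j]
            if len(members):
                code[j] = members.mean(0)
    return code

def q_err(frames, code):
    """Q_Err = sum_i (q_i - C(q_i))^2, averaged over frames."""
    d = ((frames[:, None, :] - code[None, :, :]) ** 2).sum(-1)
    return float(d.min(1).mean())

rng = np.random.default_rng(1)
poses = rng.normal(size=(500, 9))             # synthetic 9-dof sub-poses
e3 = q_err(poses, train_codebook(poses, 3))   # 8 codewords
e5 = q_err(poses, train_codebook(poses, 5))   # 32 codewords
print(e3, e5)
```

As in the table, the larger codebook yields a strictly smaller quantization error on the same data.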
We have three codebooks: the first one for the torso, the second one for arms (including both the left arm and the right arm) and the third one for legs (including both the left leg and the right leg). We use the CMU mocap database in our experiment. It consists of 2626 trials in 5 categories and 23 subcategories. For mocap data containing multiple subjects, we do not consider the correlation between any two characters; that is, we treat the motion of each subject independently. We randomly selected 10% of the data in each category as the test data, and took different percentages of the remainder as the training data. A user can specify the codebook size and apply a standard VQ training procedure, which is typically the k-means clustering algorithm, to the motion space of each limb. The centroids of all clusters are recorded as the codewords, and all codewords form a codebook. Each dof ranges from −π to π, which affects the absolute differencing, averaging, and interpolation computations. To address this issue, the absolute difference between two angles, θ1 and θ2, is calculated via

absolute_difference(θ1, θ2) = min(|θ1 − θ2|, 2π − |θ1 − θ2|).

Similarly, the proper angle value will be selected among two possible values in the average computation. As to interpolation, there are two directions as well, and the directional information has to be encoded. A well-known VQ training algorithm is tree-structured VQ, which can generate a hierarchically organized codebook of size 2^n, n = 1, 2, .... Such a codebook can facilitate the VQ encoding process. Fig. 3.4 shows how the shoulder joint dofs are quantized with 3 bits or, equivalently, 8 quantization levels. For each quantization level, we show the corresponding Labanotation in Fig. 3.4. We show the effect of the training data size in Fig.
3.5, where the x-axis is the percentage of the data used in the training process and the y-axis is the distortion in terms of the SSD.

Figure 3.4: Illustration of the 3-bit quantized shoulder joint dofs and their corresponding Labanotation.

The blue diamond line is the average SSD over all motions, while the red square line is the SSD for the walking motion. We see that the error decreases quickly until the size of the training data reaches 10%. Then, the curve becomes flatter. It is also observed that the SSD becomes extremely small if more than 60% of the data are selected for training.

Figure 3.5: The distortion as a function of the training data percentage.

3.5 B-Frame Coding: Interpolation and Residual Coding

The dofs of natural human motions are continuous and often smooth in the time domain. This property can be exploited to interpolate B-frames based on sampled I-frames. In this section, we will discuss interpolation techniques, which should be easy to implement, robust, and close to the ground truth. Besides, in contrast with traditional motion interpolation methods, which demand a decent approximation accuracy, we introduce a post-processing step to reduce the residual errors. That is, we allow the coding of interpolation errors to control the degree of interpolation error at different time instances. This procedure is called residual coding. In other words, we can use higher bit rates to trade for better coded dof curves.

3.5.1 Interpolation Techniques

Three interpolation techniques are considered below.

Linear interpolation
Linear interpolation is a method of curve fitting using linear polynomials. It connects two sample points with a straight line. It is the easiest method and has been used to interpolate short periods of motion in previous work.
Spline interpolation
Spline interpolation is the most common interpolation strategy applied to mocap data. A spline is a special function defined piecewise by polynomials. Spline interpolation is often preferred to simple polynomial interpolation since it yields similar results using low-degree polynomials and avoids Runge's phenomenon at higher degrees. Typical choices include the B-spline and the Hermite spline. A B-spline is simply a generalization of a Bezier curve. The Hermite form consists of two control points and two control tangents for each polynomial. For interpolation on a grid with points x_k for k = 1, ..., n, interpolation is performed on one subinterval (x_k, x_{k+1}) at a time (given that tangent values are predetermined), where the subinterval (x_k, x_{k+1}) is normalized to (0, 1).

High-Order Polynomials
Higher-order polynomials usually do not fit the context of our interest since they might introduce extra bumps in the motion curve. An nth-order polynomial passes through exactly n + 1 constraints. Each constraint can be a point, an angle, or a curvature. Angle and curvature constraints are often imposed at the ends of a curve and are called end conditions in such cases. Identical end conditions are frequently used to ensure a smooth transition between polynomial curves contained within a single spline. Generally speaking, if we have n + 1 constraints, we can run a polynomial curve of order n through those constraints. For example, a first-degree polynomial equation can be an exact fit for a single point and an angle; a third-degree polynomial equation can be an exact fit for two points, an angle constraint, and a curvature constraint. Many other combinations of constraints are possible for higher-order polynomial equations. It is well known that high-order polynomials can be highly oscillatory. If we run a curve through two points A and B, we would expect the curve to run somewhat near the midpoint of A and B as well.
This may not happen with high-order polynomial curves; they may even take values of very large positive or negative magnitude. With low-order polynomials, the curve is more likely to fall near the midpoint (it is even guaranteed to run exactly through the midpoint with a first-degree polynomial). The interpolation performance should be evaluated by some metric; the least-squares error provides such a measure.

3.5.2 Residual Coding

The residual error is the difference between the interpolated and the actual values. There are two types of residual errors: errors in I-frames and errors in B-frames. We use VQ to encode these residual errors with two different codebooks, as described below.

I-frame residual coding: For a sub-pose codebook of I-frames, we may train a sub-pose residual codebook correspondingly, since the use of sub-pose codebooks fits each group of dofs better. However, we observe that all sub-pose residuals can share the same codebook. This can be explained by the following argument: the limb-dependent property is primarily captured by the I-frame codebook, so the residual becomes limb independent.

B-frame residual coding: Errors are introduced in the interpolation stage. B-frame residuals can be encoded to compensate for these errors. Another tree-structured codebook can be trained based on the B-frame residuals.

The significance of residual coding is illustrated in Fig. 3.6, where the rate-distortion trade-off curve is plotted. In this figure, the x-axis is the number of bits used in residual coding and the y-axis is the distortion measured in terms of the sum of squared differences (SSD). Note that the curve becomes flatter with 3 or 4 bits used in the residual coding.

Figure 3.6: The rate-distortion (R-D) curve for residual coding.
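As a concrete sketch of the B-frame pipeline (with a uniform scalar quantizer standing in for the thesis's tree-structured residual VQ, and a synthetic dof curve in place of real mocap data):

```python
import numpy as np

def uniform_quantize(residual, bits, max_abs):
    """Quantize residuals to 2**bits uniform levels over [-max_abs, max_abs]."""
    levels = 2 ** bits
    step = 2.0 * max_abs / levels
    idx = np.clip(np.round(residual / step), -(levels // 2), levels // 2 - 1)
    return idx * step

t = np.arange(11)                                # 9 B-frames between two I-frames
truth = np.sin(0.4 * t)                          # a smooth synthetic dof curve
interp = np.linspace(truth[0], truth[-1], 11)    # linear B-frame prediction
residual = truth - interp
coded = interp + uniform_quantize(residual, 3, np.abs(residual).max() + 1e-9)
ssd_before = float(np.sum(residual ** 2))        # SSD with no residual coding
ssd_after = float(np.sum((truth - coded) ** 2))  # SSD after 3-bit residual coding
```

Even this crude 3-bit scalar coder cuts the SSD substantially, which mirrors the flattening of the R-D curve around 3-4 bits observed in Fig. 3.6.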
Next, we run the experiment over the database and rank the distortion and the interpolation complexity before and after residual coding in Table 3.3, where 1 means the smallest and 3 means the largest. We see that, before residual coding, spline interpolation gives the least distortion. However, after applying 3-bit residual coding, the distortion of linear interpolation is almost the same as that of the cubic spline. As to the 2nd-order polynomial, the prediction can sometimes be very bad, so it needs more bits in residual coding. Finally, the interpolation complexity depends on the order of the polynomials: a higher-order polynomial has a higher complexity.

                     spline   linear   2nd
complexity              3        1      2
distortion w/o RC       1        2      3
distortion w/ RC        1        1      3

Table 3.3: Performance ranking of interpolation using the first-, second-, and third-order polynomials with and without residual coding.

3.6 Conclusion and Future Work

In this chapter, we explored the characteristics of mocap data and proposed a new lossy compression algorithm operating in the dof domain. It consists of three main modules: 1) temporal sampling, 2) vector quantization of I-frames, and 3) B-frame interpolation and residual coding. Generally speaking, the coding of an I-frame takes more bits than that of a B-frame. The remaining issue is bit rate allocation. That is, for a given bit budget, how do we allocate bits to the coding of the temporal sampling information, the I-frame VQ indices, and the I-frame/B-frame residual VQ indices? This problem is examined in the next chapter.

Chapter 4
Bit Allocation and Performance Evaluation

4.1 Introduction

The mocap data compression system was presented in the last chapter. There are several design parameters to choose in the proposed mocap data compression system, such as the interval of temporal sampling, the number of bits used in the VQ for I-frames, and the residual coding for I-frames and B-frames.
Given a bit budget, the problem of how to allocate bits to different coding modules is called bit allocation. A good bit allocation scheme results in a visually better coded result. The rest of this chapter is organized as follows. We first study the bit allocation problem in Section 4.2 by performing a qualitative analysis of the correlation between the coding parameters, the characteristics of the motion clip, and the bit budget (or the error bound). Then, we evaluate the performance of the proposed mocap data compression system using the CMU mocap database in Section 4.3. Furthermore, we comment on the novel contributions of the proposed mocap data compression system with respect to prior art in Section 4.4. Finally, concluding remarks are given in Section 4.5.

4.2 Bit Allocation

4.2.1 Problem Statement

We use N to denote the total number of frames, N_I the number of I-frames, N_B the number of B-frames, R_I the bits per I-frame, and R_B the bits per B-frame. Clearly, we have

    N_I + N_B = N,                                    (4.1)

and the compressed size equals

    N_I R_I + N_B R_B.                                (4.2)

Furthermore, let D_I be the distortion of I-frames and D_B the distortion of B-frames. The factors that affect D_I are N_I, R_I, and how I-frames are selected. The factors that affect D_B are the I-frame quality, R_B, and how I-frames are selected and interpolated. We evaluate D_I and D_B using the MSE of dofs. Given a fixed budget for I-frames, the coding gain of using different methods varies from one motion clip to another. In Chapter 3, we showed that hybrid sampling is in general the best in terms of the rate-distortion (R-D) performance. Consider a cost function that is a linear combination of the distortion, rate, and complexity in the form

    J(D, R, C) = D + λR + γC,                         (4.3)

where λ and γ are weighting factors. Our goal is to minimize the cost function in Eq. (4.3) under the fixed budget constraint given in Eq. (4.2).
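A brute-force sketch of this constrained minimization might look as follows (the candidate grid, the weights, and the distortion model are illustrative placeholders, not the thesis's actual models):

```python
from itertools import product

N = 240  # frames in the clip (an assumed value: two seconds at 120 Hz)

def allocate_bits(budget, configs, distortion, complexity, lam=0.1, gamma=0.1):
    """Pick (N_I, R_I, R_B) minimizing J = D + lam*R + gamma*C subject to
    the size constraint of Eq. (4.2): N_I*R_I + N_B*R_B <= budget."""
    best, best_cost = None, float("inf")
    for n_i, r_i, r_b in configs:
        rate = n_i * r_i + (N - n_i) * r_b
        if rate > budget:
            continue  # violates the bit budget of Eq. (4.2)
        j = distortion(n_i, r_i, r_b) + lam * rate + gamma * complexity(n_i)
        if j < best_cost:
            best, best_cost = (n_i, r_i, r_b), j
    return best

# Placeholder distortion model: error falls as more I-frames and bits are spent.
choice = allocate_bits(
    budget=2000,
    configs=product([8, 16, 24], [8, 12], [2, 3, 4]),
    distortion=lambda n, ri, rb: 1e4 / (n * ri * rb),
    complexity=lambda n: n)
```

Exhaustive search is affordable here because the parameter grid is tiny; the qualitative analysis in the rest of this section serves to prune such a grid in practice.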
Generally speaking, two factors affect the selection of I-frames: the bit rate and the complexity of the underlying motion. For a given bit rate, if a motion sequence is more complex, it pays off to use a more complicated sampling scheme; otherwise, a simpler method is preferred. Between these two factors, the bit rate is controlled by users, while the evaluation of motion complexity is less intuitive. Hence, a systematic analysis, and with it concrete quantitative methods, are desired. The complexity of a motion clip can be assessed in both the spatial and the temporal domains. The spatial complexity is determined by the variety of poses. The temporal complexity is measured by the frequency of pose change. It is obvious that the choice of the temporal sampling scheme depends on the temporal complexity. However, the spatial complexity also affects the temporal sampling scheme in an indirect way; that is, the spatial complexity affects R_I and R_B. When the spatial complexity is low, we can pick more I-frames under the bit rate budget. Then, the fixed-interval sampling scheme can be favored for its simplicity. However, if the temporal complexity exceeds a threshold, the coding gain of the hybrid sampling scheme becomes higher, and it is thus favored. There are two design parameters in the hybrid sampling scheme: the block size l, which decides the memory size, and the allowed maximum error, which determines the approximation accuracy. We also compare two schemes for B-frame interpolation: linear interpolation and cubic interpolation. Generally speaking, the linear interpolation method is favored for its low complexity if residual coding is applied. In this subsection, we explained how the spatial and temporal complexities of motion affect bit allocation. In the next subsection, we quantify these complexities using spatial and temporal features.
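As a toy illustration of the two notions (these particular metrics are simplified stand-ins for illustration, not the features defined in the next subsection):

```python
import numpy as np

def spatial_complexity(frames):
    """Pose variety: variance of dof values around the mean pose."""
    return float(np.mean((frames - frames.mean(axis=0)) ** 2))

def temporal_complexity(frames):
    """Pace of pose change: mean frame-to-frame dof difference magnitude."""
    return float(np.mean(np.abs(np.diff(frames, axis=0))))

t = np.linspace(0, 2 * np.pi, 120)[:, None]
slow = np.sin(t)        # smooth, slowly varying dof curve
fast = np.sin(8 * t)    # similar pose range, much faster pose change
```

Both curves sweep the same range of poses (similar spatial complexity), but the second changes pose eight times faster, so a temporal sampler must place I-frames far more densely for it.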
4.2.2 Motion Characterization and Coding Parameter Selection

We introduce six spatial features and four temporal features below and explain how to use them as a guideline to select the following six parameters: 1) the codebook size for I-frames, 2) the codebook size for I-frame residuals, 3) the codebook size for B-frame residuals, 4) the number of I-frames, 5) how I-frames are selected, and 6) how B-frames are interpolated.

4.2.2.1 Spatial Features

The spatial complexities have an impact on the codebook size for I-frames. If the motion variation is large, more bits should be allocated to I-frames. If the motion changes rapidly, more I-frames should be selected to sample the motion better. The variance of the frame difference between two consecutive I-frames affects the quantization parameter for B-frame residual coding. In the following, we present six spatial features, illustrate their physical meaning, and show how they affect the parameter choice.

1. Variance of sub-poses and I-frame codebook design. We measure the overall motion variety by examining the variance of each sub-pose, namely

    (1/N) Σ_{t=0}^{N-1} Σ_{i=0}^{n-1} (q_i(t) − q̄_i)²,    (4.4)

where q_i(t) is dof i at frame t, n is the number of dofs in a sub-pose, N is the number of frames, and q̄_i is the average value of the ith dof. Specifically, in the proposed dof decomposition scheme, there are five sub-poses with n = 9. These five sub-poses use three different codebooks in the I-frame quantization, where the arms share one codebook and the legs share another. It is desirable that all quantized sub-poses have similar distortion. Thus, we can assign bits to each sub-pose in proportion to its variance in the I-frame codebook design.

2. Variance of poses quantized to the same code. After the codebook for I-frames is determined, we have a baseline of how sub-poses will be quantized. Then, we evaluate the variation of the sub-poses quantized to the same code. This is a reliability check.
If the variance is above a certain threshold, we need to assign more bits to I-frames to ensure their quality. This feature is calculated as

    max_{k∈K} ( (1/N_k) Σ_{t∈k} Σ_{i=0}^{n-1} (q_i(t) − q̄_i)² ),    (4.5)

where K is the collection of quantization levels, k is the collection of frame indices sharing the same quantization code, and N_k is the number of frames in k. The threshold can be related to the user-specified upper bound on the error or to a pre-selected value obtained from the training data.

3. Variance of residuals. After the reliability check in Step 2, the next step is to decide the bit budget for I-frame residuals. We can evaluate the variance of the residual errors based on the adjusted bit budget and select the proper residual codebook size. The above discussion is summarized in Figure 4.1.

Figure 4.1: The procedure to select coding parameters for I-frames, where e is the threshold of tolerable error.

4. I-frame insertion or dropping. If the sub-poses of two consecutive I-frames are all quantized to the same codewords, one of the two I-frames can be dropped. If the variance is high, we may insert one more I-frame. This process can be repeated until the variance is under a certain threshold. We show the above procedure in Figure 4.2. In addition to the above considerations, the following two cases deserve special attention.

Interaction with the environment (e.g., an object, the terrain, or another character): In frames where a character is in contact with the environment, the MSE should be reduced, since errors are more visible in these frames.
Given the information of the terrain and the traveling path of the character, we can estimate the time of interaction and increase the approximation accuracy accordingly. This corresponds to modifying the value of e in Figures 4.1 and 4.2.

Correlation of sub-poses: One interesting motion characteristic is the correlation of sub-poses. Although we do not exploit it in the proposed compression system, it serves as a potential direction for future research.

After the coding parameters for I-frames are determined, we perform interpolation and evaluate the variance of the B-frame residuals to determine their codebook size. Note that all sub-poses share one codebook in B-frame residual coding.

4.2.2.2 Temporal Features

Temporal features affect the number of I-frames to be selected and how they are selected. After those parameters are decided, we can determine the interpolation method and the codebook design for B-frame residual coding. We discuss four temporal features of motion clips and relate them to these design choices. Without loss of generality, we use the fixed-interval sampling scheme in the following discussion.

Figure 4.2: Both spatial and temporal properties of motion contribute to the selection of I-frames. In this figure, we show how spatial features play a role in this decision process, where e is the threshold of tolerable errors.

Maximum number of selected I-frames: Using the fixed-interval sampling method as the basis, we get an upper bound on the number of samples. If the budget allows us to use this number, we can go with the fixed-interval sampling method as discussed before. Otherwise, we adopt the adaptive or the hybrid sampling method to reduce the number of samples.

Distance between consecutive I-frames: When the temporal complexity is low, we can increase the distance between two consecutive I-frames.
Variance of the duration of samples: After applying fixed-interval sampling, we may use a simple merge algorithm to merge two adjacent intervals if their spatial differences are small. After the merge, we can calculate the duration between sampled frames and obtain a more straightforward estimate of temporal complexity.

Cumulative errors between sampled frames: With the merge algorithm described above, we can estimate the cumulative errors between sampled frames. If the cumulative errors are too high, we can insert a new I-frame to lower the error.

Both spatial and temporal complexities of motion contribute to the selection of I-frames. Generally speaking, temporal features play the more dominant role. We show how temporal features affect the decision in Figure 4.3, where n is the budget on the number of I-frames, λ is the threshold that determines whether the distortion drops rapidly enough by increasing the number of I-frames, and e·m is the tolerable cumulative error per segment, where m is the length of the segment and e is the pre-determined upper bound. Possible decisions are labeled with double-lined boxes.

4.2.3 Case Study: CMU Mocap Database

The CMU mocap database, which covers a wide range of motion activities, is adopted in our experiment. The CMU mocap lab contained 12 Vicon infrared MX-40 cameras, each capable of recording 4-megapixel images at 120 Hz. The cameras were placed around a rectangular area of approximately 3 m x 8 m in the center of the room. Only motion activities that took place in this rectangle could be captured. To capture a motion clip, small gray markers were placed on the active subject. Human subjects wore a skin-tight black suit with the markers taped on. The suit was skin-tight to
The suit was skin-tight to 49 > n Use Fixed-Interval Sampling No Evaluate Upper Bound of required # of sampls Evaluate tangent of R-D curve Yes > lamda No Blocked based with small block size Merge spatially close, temporally consecutive samples Yes Evaluate accumulative errors between I-frames > e*n Insert an I-frame Yes More segments? Yes Estimate Temporal Randomness, determine block size along with spatial information Figure 4.3: The procedure used to select I-frames 50 avoid confusion of whether the marker moved due to the motion or just rolling of sub- ject's clothes. As compared with markers directly taped on the bare skin, the suit had less re ection, thus introducing less noise in the capturing session. The Vicon cameras ob- served markers in infra-red. The images that various cameras picked up were triangulated to get 3D data. The 3D data were stored in two ways: 1) marker positions in C3D format, and 2) skeleton movement data format (rotational values of dofs). While the two data formats are interchangeable, the size of the skeleton movement format data are 1/3 of the C3D data. The skeleton movement data format is adopted in our work. The database contains motion clips captured from 112 subjects. Some subjects had fewer motion clips such as 10 while others had more such as 70. There are 2626 motion clips in total. These motion clips were categorized into 5 sets: 1) human interactions, 2) interaction with the environment, 3) locomotion, 4) sports, 5) situations and scenarios. In human interactions, the motion clips involve multiple characters and include some static motion clips such as one person pushing another or some behavior motion clips such as two people in conversation. In the interaction with the environment, there are motion clips in dierent playgrounds such as stairs, up-hills, puzzles, etc. Locomotion are walking and running. Sports includes several dierent kinds of sports such as the basketball, dance, etc. 
Situations and scenarios contain fine-detail motion of a limb, such as gestures. The last category is out of our concern since we are interested in whole human motion instead of partial motion. In the experiment, we classify motion clips into two groups depending on whether the motion clip has a high degree of contact interaction. Given the same amount of distortion, a motion clip with a higher degree of contact interaction (with either another character or the environment) tends to have more obvious visual artifacts. As a result, motion clips with more contacts should adopt a stricter distortion constraint.

Figure 4.4: The distribution of the variance of sub-poses (0~0.5: 21%; 0.5~1: 17%; 1~1.5: 15%; 1.5~2: 30%; 2+: 17%).

4.2.4 Statistics of CMU Mocap Data

The spatial features determine the size of the I-frames, which affects the quality of the I-frames and the effectiveness of the interpolation. Thus, the first three spatial features described in Section 4.2.2 are highly relevant. We plot the statistics of these spatial features for mocap data in the CMU database in Figures 4.4-4.6. First, we show the distribution of the variance of sub-poses in Figure 4.4. The distribution coincides with the intuition that motions with relatively static poses, such as one person pushing another, or with repetitive patterns have a lower variance. In contrast, most motions in the sports category are of higher variance, except for the tennis motion.

Figure 4.5: The distribution of the variance of frames quantized to the same code with 4-bit, 8-bit and 12-bit I-frame codebooks, respectively.

Figure 4.6: The distribution of the variance of prediction residuals using 4-bit, 8-bit, and 12-bit I-frame codebooks, respectively.
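The proportional bit assignment suggested by Eq. (4.4) can be sketched as follows (the group layout, frame count, and synthetic dof values are assumptions for illustration, not CMU data):

```python
import numpy as np

def subpose_variances(frames, groups):
    """Eq. (4.4)-style variance, computed per sub-pose group of dofs."""
    return np.array([np.sum((frames[:, g] - frames[:, g].mean(axis=0)) ** 2)
                     / len(frames) for g in groups])

def bits_by_variance(variances, total_bits):
    """Assign I-frame codebook bits to sub-poses in proportion to variance."""
    share = variances / variances.sum()
    return np.round(share * total_bits).astype(int)

rng = np.random.default_rng(0)
frames = np.hstack([rng.normal(0.0, 1.0, (100, 9)),   # lively sub-pose
                    rng.normal(0.0, 0.1, (100, 9))])  # nearly static sub-pose
groups = [list(range(0, 9)), list(range(9, 18))]
bits = bits_by_variance(subpose_variances(frames, groups), total_bits=12)
```

The nearly static sub-pose receives very few bits, which matches the intuition behind Figure 4.4: low-variance motions can be quantized with small codebooks.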
The tennis motion in the CMU database consists of the subject swinging the racket; it does not involve the running of real-world tennis games, so its variance is smaller. The distribution of the variance of frames assigned to the same code with 3 different codebooks is shown in Figure 4.5. For the 4-bit codebook, about 40% of the motion clips can be well quantized (i.e., their variance is lower than 0.05). These motion clips approximately correspond to the group whose sub-pose variance is under 1 (namely, 21% + 17% = 38%). With the 8-bit codebook, more than 80% of the motions can be well quantized. For motions which cannot be properly quantized by the 8-bit codebook (with their variances above 0.15), the situation only improves slightly with the 12-bit codebook. This suggests that, for some motions, it is probably more effective to assign an 8-bit codebook for I-frame coding and a 4-bit codebook for I-frame residual coding, rather than spending all 12 bits on I-frame coding. The distribution of the variance of the residuals of I-frames belonging to the same codeword is shown in Figure 4.6 for 3 different codebooks. With the 4-bit codebook, about 60% of the motions have well-bounded residuals, while some motion clips have residuals of higher variance, which implies that the 4-bit codebook is still not effective for them. With the 8-bit codebook, the residuals of most dof files are small. On the other hand, even if we increase the codebook size to 12 bits, there are motion clips whose residual variance is still high. Furthermore, we examine the statistics of a temporal feature to get a picture of the effectiveness of temporal sampling. We show the distribution of the distortion when the dof curves are sampled at a fixed interval in Figure 4.7, where three sampling intervals (namely, 1, 0.5, and 0.25 seconds) are compared.
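The trend in Figure 4.7 can be reproduced on a synthetic dof curve (a sketch; the 120 frames-per-second rate matches the capture rate reported above, while the curve itself and its frequency are made up):

```python
import numpy as np

def fixed_interval_mse(curve, step):
    """MSE of linearly reconstructing a dof curve from every step-th frame."""
    idx = np.arange(0, len(curve), step)
    if idx[-1] != len(curve) - 1:
        idx = np.append(idx, len(curve) - 1)   # always keep the last frame
    recon = np.interp(np.arange(len(curve)), idx, curve[idx])
    return float(np.mean((curve - recon) ** 2))

fps = 120
t = np.arange(4 * fps) / fps                   # a 4-second clip
curve = np.sin(2 * np.pi * 0.5 * t)            # a 0.5 Hz joint oscillation
mse = {sec: fixed_interval_mse(curve, int(sec * fps))
       for sec in (1.0, 0.5, 0.25)}
```

Halving the interval reduces the reconstruction MSE sharply for this oscillatory curve, consistent with the distortion distributions of Figure 4.7.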
For sequences that get significant improvement in distortion by doubling the interval, we will investigate the potential to drop some points in the flat region of the dof curves.

Figure 4.7: The distortion distribution with respect to fixed sampling with intervals equal to 1, 0.5, and 0.25 seconds.

4.3 Performance Evaluation

In this section, we evaluate the coding performance for two cases: 1) with a fixed bit budget and 2) with a maximum distortion level. For the first case, we plot the distribution of the MSE per dof per frame for all mocap files with four compression ratios, namely 5:1, 10:1, 20:1, and 30:1, in Figure 4.8. For the compression ratio of 5:1, about 98.2% of the mocap files have their MSE less than 0.01, which is nearly lossless. For the compression ratio of 10:1, the MSE is between 0.01 and 0.09. For the compression ratio of 20:1, the MSE ranges from 0.02 to 0.28, which is still of decent quality. For the compression ratio of 30:1, mocap data files can be clustered into 2 groups based on their MSE values. The MSE values of the first group range from 0.95 to 0.30 with reasonable quality. The MSE values of the second group are above 0.5, and there exist obvious artifacts in some frames. However, most mocap data files (about 82%) belong to the first group. For the second case, we specify five upper bounds of error tolerance, namely 0.01, 0.05, 0.1, 0.3, and 0.5, and plot the distribution of compression ratios in Figure 4.9. As shown in this figure, to achieve near-lossless quality with the maximum error bounded by 0.01, the compression ratio is likely to be 5:1. However, for decent coding quality with the maximum error bounded by 0.1, a compression ratio of 20:1 is an achievable goal. It is worthwhile to report the problems associated with the extreme cases observed in some frames under the 20:1 compression ratio. They include the following: 1.
Even with a 12-bit codebook for I-frames, we still observe high variance among frames assigned to the same codeword. 2. Even with a 12-bit codebook for residual coding, we still observe high variance of residuals. 3. We still observe a high distortion value even with a sampling interval as short as 0.25 seconds.

Since locomotion is the dominant type of motion, the VQ training favors motions similar to locomotion. In most motion clips, the pose stands mostly straight and faces forward, while the balance of the body depends on the support of the feet. However, the category of motions that interact with the environment contains other motion types that use a different balancing strategy, such as climbing, which requires both hands and legs as potential supports. These motion clips tend to meet the above three criteria and show more obvious visual artifacts under the proposed mocap data compression algorithm. Importing more similar motion types into the training database may help the coding of these motion types.

Figure 4.8: The plot of the MSE distribution for files in the CMU mocap database with four compression ratios.

Figure 4.9: The plot of the compression ratio distribution for files in the CMU mocap database with five maximum error bounds.

4.4 Comparison with Previous Work

It is worthwhile to point out the differences between the proposed mocap data compression algorithm and the prior art in the following several aspects.

Data format: Mocap data can be stored in two interchangeable formats: positions of the markers and the angles of the degrees of freedom.
Although there exists a one-to-one mapping for a pose between the two formats, the data amount using marker positions is three times that using dofs. In other words, the dof format is more compact than the marker position format. Arikan's scheme [2] operates on the marker position format. Since the space of marker positions is smoother, the interpolation is easier. He argued that, for a given duration, if the foot is on the ground at the beginning and at the end, the interpolated result using the marker position format will give the desired positioning: the foot will be on the ground the whole time. With interpolation in the dof domain, the foot might slide over, float above, or penetrate into the ground. However, we can apply an IK procedure to clean the contact points and use a post-processing step to preserve reasonable and realistic contact points when interpolating data in the dof domain. The advantage of handling mocap data in the dof format is two-fold. First, it is a more compact representation. Second, it is easier to compare two poses using the dof representation than using clouds of markers.

Time sampling: In mocap data, as in video, the correlation between consecutive frames is high. A natural way to remove this redundancy is to perform time sampling. Arikan's framework [2] uses a fixed time interval (anything between 8 and 15 frames, which is equivalent to 1/16 to 1/8 seconds). He stated that, if a longer or shorter interval were chosen, the compression ratio would be sacrificed. However, one may doubt whether it is always necessary to choose such a short duration. The selected time samples offer a summary of the motion. In the proposed coding scheme, we adopt a strategy to sample frames adaptively.

Computational complexity: Arikan's scheme uses the Bezier curve as the approximation, which demands cubic polynomial computation. Since our scheme encodes the interpolation residuals, it gives more flexibility in bit allocation.
By using residual coding to bound the approximation error, we can use simple linear interpolation to lower the time complexity.

Prior knowledge of the data: Arikan [2] attempted to encode the mocap database as a whole. He segmented the motion curves, approximated the curves in each segment with Bezier curves, and then converted each curve to a 4-point control vector. He put every set of four points together to form spatial-temporal vectors. Since these vectors have a high dimension, he analyzed similar data to serve as reference points. As a result, by grouping similar motion data together, they can form a motion category. Arikan's work depends on the motion category information of each mocap clip, and this metadata often needs a human in the loop. Our approach exploits the time-domain redundancy by temporal sampling and the space-domain redundancy by vector-quantizing sampled poses. Our scheme does not demand prior knowledge of the motion type.

4.5 Conclusion

In this chapter, we studied the bit allocation problem and applied the proposed mocap data compression scheme to real-world data. It was demonstrated that the scheme is generic and automated. A set of 6 design choices was examined in detail: 1) the codebook size for I-frames, 2) the codebook size for I-frame residuals, 3) the codebook size for B-frame residuals, 4) the number of I-frames, 5) how I-frames are selected, and 6) how B-frames are interpolated. It was shown by experimental results that the proposed mocap data compression scheme can achieve nearly lossless quality at a compression ratio of 5:1 and decent quality at a compression ratio of 20:1.

Chapter 5
Enhanced Mocap Data Compression Algorithm

5.1 Introduction

In Chapter 3, we proposed a motion capture data compression algorithm based on temporal-domain sampling and spatial-domain vector quantization. The application of this algorithm was discussed and evaluated in Chapter 4.
The algorithm, which will be referred to as the base algorithm for the rest of the chapter, can achieve a compression ratio of 20:1 while preserving decent quality. However, it is not fully optimized and can be further improved in the following two aspects. First, due to the nature of the adaptive temporal sampling scheme, a certain number of bits have to be spent on storing the temporal information of I-frames. In the base algorithm, a byte is allocated to each I-frame to record its temporal information. This allocation is based on the fact that the chosen block size is 2 seconds (or 240 frames), so two consecutive I-frames are at most 240 frames apart; consequently, a byte (8 bits, representing 0 to 255) is sufficient to store this information. This allocation can be expensive if the motion contains significant high-frequency components, which occurs often in general motion. It is desirable to save bits in the temporal information coding. Second, the base algorithm uses vector quantization to capture the correlation between the dofs within the same limb. However, there exists some correlation between the dofs that has not been fully explored. This leads to an interesting question, namely, whether adding a pre-processing stage to remove this redundancy can lead to a higher compression ratio. Here, we compute the forward difference of the dofs as a pre-processing step. Thus, instead of encoding the dofs directly, we encode their differences. In this chapter, we propose a new scheme for temporal information coding and discuss some transform options to enhance the base algorithm; we call the result the enhanced algorithm. The enhanced algorithm can achieve a quality level similar to that of the base algorithm at a higher compression ratio, namely 45:1. Its complexity is slightly higher than that of the base algorithm since it involves the transform operation.
However, encoding and decoding can still be performed in real time (faster than 120 frames per second) on today's CPUs. The rest of this chapter is organized as follows. An overview of the coding system of the enhanced algorithm is presented in Sec. 5.2. The coding of temporal information is discussed in Sec. 5.3, and the spatial-domain coding is presented in Sec. 5.4. Experimental results are shown in Sec. 5.5. Finally, concluding remarks are given in Sec. 5.6.

5.2 System Overview

The framework of the proposed enhanced algorithm is similar to that of the base algorithm except for two main differences. The first lies in the temporal sampling module. The second is the replacement of the vector quantization coding module by another, more effective coding algorithm. Due to these two changes, other modules, including interpolation, residual coding, and post-processing, have to be modified accordingly, although their high-level ideas remain the same. In the following, we briefly describe the two main differences and the corresponding changes needed in the other modules. Then, more details of the two main changes are given in Sec. 5.3 and Sec. 5.4, respectively.

The temporal sampling module. In the base algorithm, samples are selected adaptively by choosing the frame that has the highest MSE from its prediction. The temporal information is stored as frame index differences. For example, if the 6 frame indices (0, 1, 4, 10, 20, 254) are selected, we store (1, 3, 6, 10, 234). By storing the frame difference instead of the frame index itself, we bound the value to a range between 1 and L, where L is the number of frames in a block. For example, if the block size is 2 seconds, we can bound it within the range of 1 to 240 (there are 240 frames in two seconds), and it requires 1 byte per frame to store the information. In the enhanced algorithm, we evaluate the largest MSE between frames and their predictions in an interval. The frame with the highest MSE is selected as the new temporal sampling point. The sample points are selected based on the pace of the motion. To encode temporal sampling positions effectively, we demand that the sampled frames in a block are located in
The frame with the highest MSE is selected as the new temporal sampling point. The sample points are thus selected based on the pace of the motion. To encode temporal sampling positions effectively, we demand that sampled frames in a block be located on a dyadic grid so that the location information can be represented as a binary tree taking 0s and 1s as node values, where 1 means the node contains sampled frames and 0 means it does not. A binary tree can be stored efficiently.

Spatial domain transform. In the base algorithm, we compare two representations of motion capture data: the marker positional format and the joint angle format. The advantage of the marker format is that it is linear. However, when we map from the marker format to the joint format, the file size of the former is three times that of the latter since we need three markers (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) to define the angles of a joint uniquely. Thus, the base algorithm uses the joint angle format of dofs to remove spatial domain redundancy. In the enhanced algorithm, we encode the forward difference of dofs, which is defined as

Δq_i(t) = q_i(t+1) − q_i(t),    (5.1)

where q_i(t) is the value of dof i at frame t. The distribution of Δq_i(t) is more concentrated than that of q_i(t), indicating the possibility of a higher compression ratio. In addition, we implement several prediction schemes to achieve a higher compression ratio. The expression in Eq. (5.1) is the first-order difference of the original dof curve, which makes it closer to a linear curve. This pre-processing step allows us to work in a linear space with the joint angle format.

Interpolation, residual coding and post-processing. The prediction residual of Δq_i(t) is more concentrated than that of q_i(t) in the base algorithm. Linear interpolation can be used, and the residual codebook is more effective than the one in the base algorithm, producing a similar MSE level with fewer bits.
The post-processing step can also be used to remove the foot-skating artifact, the same as in the base algorithm.

5.3 Dyadic Temporal Sampling

This algorithm aims to record the positions of I-frames such that the maximum error of the coded dof curve is bounded by a certain threshold T. Let the I-frame index set be S and the motion duration be L frames. First, we take the first and the last frames as the initial I-frame index set, i.e., S = {1, L}. Then, we interpolate the coded curve using the values at the two end points and evaluate the maximum error over the entire interval. If it is higher than T, we take the middle point as a new sample, i.e., S = {1, L/2, L}, so that there are two sub-intervals. For each sub-interval whose maximum error is higher than T, we add its middle point to the sample point set. Generally speaking, given k samples in S, there are k − 1 regions. Regions with maximum error less than T remain the same. Regions with maximum error larger than T take the center of the sub-interval as a new temporal sampling position.

The previous temporal sampling algorithm takes 2 parameters: the block size and the maximum error. In contrast, the dyadic temporal sampling algorithm takes only 1 parameter: the maximum error. All information in a dyadic temporal sampling scheme can be stored in the form of a binary tree. To give an example, the binary tree in Fig. 5.1(a) is obtained from the example shown in Fig. 5.1(b).

Figure 5.1: The data structure of a partitioned interval can be stored in the form of a binary tree in (a), which corresponds to the case shown in (b).

If the memory size is constrained, we can divide the whole tree into multiple sub-trees and handle each sub-tree separately. For example, if the memory can accommodate only one half of the duration of the motion in Fig. 5.1, we can drop the root node and work on the two sub-trees one by one.
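The recursive midpoint subdivision described above can be sketched as follows. This is a minimal sketch, not the thesis code: it assumes a one-dimensional dof curve given as a list of samples, uses linear interpolation between interval endpoints, and the function name `dyadic_sample` and the `min_len` guard (standing in for the 15-frame lower bound) are illustrative.

```python
def dyadic_sample(curve, T, lo=None, hi=None, min_len=2):
    # Recursively split an interval at its midpoint while the
    # maximum error of linear interpolation between the two
    # endpoint samples exceeds the threshold T.
    if lo is None:
        lo, hi = 0, len(curve) - 1
    samples = {lo, hi}
    if hi - lo < min_len:          # lower bound on region length
        return samples
    # maximum deviation from the straight line through (lo, hi)
    err = max(
        abs(curve[t] - (curve[lo] + (curve[hi] - curve[lo]) * (t - lo) / (hi - lo)))
        for t in range(lo, hi + 1)
    )
    if err > T:
        mid = (lo + hi) // 2
        samples |= dyadic_sample(curve, T, lo, mid, min_len)
        samples |= dyadic_sample(curve, T, mid, hi, min_len)
    return samples
```

A straight-line curve needs only its two endpoints, while a V-shaped curve is split once at its vertex; note that, as the text points out, disjoint sub-intervals can be processed independently (and hence in parallel).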
This configuration can be handled by the program automatically. The implementation details will be given in Section 5.4.

Each region in the algorithm is represented by a node. If the maximum error of the region is within the threshold value T, it will not be split further and the corresponding node has no children. It is a leaf node and is labeled "0". The non-leaf nodes are labeled "1". After labeling, we have a binary tree containing 0s and 1s only. Given the property that nodes of value 1 have exactly two children and nodes of value 0 have none, we can represent this tree by the string resulting from a depth-first search. The binary string can then be encoded using the QM coder.

Figure 5.2: Comparison of three temporal sampling schemes (fixed interval, adaptive, and dyadic sampling): maximum error versus spent bytes.

For a short duration, it is better to let the residual encoding module take care of the residual error, since the coding of an I-frame is more expensive. For this reason, we set a lower bound on the region length; namely, 15 frames, which corresponds to 1/8 second. Unlike the sampling scheme in the base algorithm, where the choice of the next sample depends on the previous sample, the dyadic temporal sampling scheme can handle multiple non-overlapping sub-intervals in parallel.

Although the dyadic temporal sampling scheme may need more I-frames than the sampling module in the base algorithm, it demands fewer bits to store the temporal information. To show this, we compare the total number of bytes for temporal sampling recording (which is equal to the number of I-frames multiplied by the number of bytes needed to store the temporal information of each I-frame) in Fig. 5.2. It is clear that the dyadic sampling scheme outperforms the adaptive and fixed sampling schemes.
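The depth-first serialization of the labeled tree can be sketched as follows. This is an illustrative sketch (the QM entropy-coding stage is omitted); the tree representation — a leaf as `None`, an internal node as a `(left, right)` tuple — and the function names are assumptions for the example.

```python
def serialize(tree):
    # A leaf (region within the error bound) is labeled '0'; an
    # internal node always has exactly two children and is labeled
    # '1'. Pre-order depth-first traversal yields a binary string.
    if tree is None:                 # leaf node
        return "0"
    left, right = tree
    return "1" + serialize(left) + serialize(right)

def deserialize(bits):
    # The '1 has two children, 0 has none' property makes the
    # pre-order string uniquely decodable without length prefixes.
    def parse(i):
        if bits[i] == "0":
            return None, i + 1
        left, i = parse(i + 1)
        right, i = parse(i)
        return (left, right), i
    tree, _ = parse(0)
    return tree
```

For example, a root whose left child is split once more serializes to `"11000"`, and the string decodes back to the same tree.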
Figure 5.3: (a) The histogram of dofs and (b) the histogram of dof differences.

5.4 Pre-Processing DoFs

5.4.1 DoFs versus DoF Differences

In the enhanced algorithm, we perform a forward differencing operation on the dof values to enhance the coding performance. Fig. 5.3 shows the histogram of dofs in (a) and of dof differences in (b). Clearly, the distribution in (b) is more skewed than that in (a). Thus, the entropy of dof differences is lower, which indicates a higher compression ratio if we encode dof differences rather than dof values directly.

To demonstrate the advantage of dof differencing from another angle, we show the maximum absolute error of the residual after VQ in each frame,

D1 = max_i |q̂_i − q_i|,
D2 = max_i |Δq̂_i − Δq_i|,

where q_i is the value of the ith dof, Δq_i is the forward difference of the ith dof, x̂ denotes the VQ-coded value of x, and i = 1, 2, ..., 60. Fig. 5.4 shows the histograms of D1 and D2 of each frame over the entire CMU database. Since the VQ codebook size is the same, the compression ratio is the same. However, the D2 error is about one tenth of the D1 error. Furthermore, the distribution in Fig. 5.4(b) is strongly skewed: most frames are nearly lossless and only a few frames have noticeable error. If the error is in the torso, it is more noticeable since the whole upper body is offset accordingly. On the other hand, if it occurs at other dofs, it is less visible.

Next, we tested codebooks of a smaller size to determine a codebook that produces approximately the same range for D1 and D2. It turns out that if the codebook size for D2 is reduced to about one half of the codebook size for D1, the distortion is about the same. The corresponding histogram of D2 is plotted in Fig. 5.5.
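The forward differencing of Eq. (5.1) and the per-frame maximum-error metric can be sketched as follows; the function names are illustrative, and a frame is assumed to be a flat list of dof values.

```python
def forward_diff(dof):
    # Delta-q_i(t) = q_i(t+1) - q_i(t): the first-order difference of a
    # dof curve, whose distribution is far more concentrated than that
    # of the raw dof values.
    return [b - a for a, b in zip(dof, dof[1:])]

def max_abs_error(coded, original):
    # D = max_i |qhat_i - q_i|, evaluated per frame over all dofs
    # (60 dofs per frame in the thesis setting).
    return max(abs(c - o) for c, o in zip(coded, original))
```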
5.4.2 Analysis and Solution of the Error Accumulation Problem

Since the dof difference is encoded in the enhanced algorithm, the coding error may accumulate along the time axis. This error accumulation effect is analyzed in this subsection.

Figure 5.4: The error histograms after VQ over all frames in the whole CMU database for (a) dofs and (b) dof differences.

Figure 5.5: The error histogram after VQ over all frames in the whole CMU database for dof differences, with a codebook of one half the size of the case shown in Fig. 5.4.

Recall that, in the enhanced algorithm, we encode the following dof difference:

Δq_i(t) = q_i(t+1) − q_i(t),

where q_i(t) is the value of the ith dof at frame t. Let q̄_i(t) be the predicted value of q_i(t) obtained via linear interpolation. Then, we have

q̄_i(t) = ((t − t0)/(t1 − t0)) (q_i(t1) − q_i(t0)) + q_i(t0).

Let ω(t) be the residual error of this prediction in the difference domain, i.e.,

ω(t) = Δq_i(t) − Δq̄_i(t).

The quantization error of ω due to VQ is denoted by ω̃. We can rewrite q_i(t+n) as

q_i(t+n) = q_i(t+n−1) + Δq_i(t+n−1)
         = q_i(t+n−2) + Δq_i(t+n−2) + Δq_i(t+n−1)
         = ...
         = q_i(t0) + Σ_{t=t0}^{t+n−1} Δq_i(t)
         = q_i(t0) + Σ_{t=t0}^{t+n−1} (Δq̄_i(t) + ω(t)),

while the decoded value replaces ω(t) by its VQ reconstruction ω(t) − ω̃(t). Since

q_i(t+n) = q_i(t0) + Σ_{t=t0}^{t+n−1} Δq_i(t)

by definition, the cumulative distortion due to VQ is equal to Σ_{t=t0}^{t+n−1} ω̃(t). Generally speaking, the errors tend to be aligned along similar directions, and they do not cancel each other out over time.

To address this problem, we modify the difference calculation slightly. That is, we compute

Δq_i(t) = q_i(t+1) − q̂_i(t),

where q̂_i(t) is the value of q_i(t) after decoding. Then, we obtain

Δq_i(t) = q_i(t+1) − q̂_i(t) = (q_i(t+1) − q_i(t)) + (q_i(t) − q̂_i(t)).

As a result, the error caused by the VQ in the previous frame contributes to the residual of the current frame.
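The closed-loop fix above — differencing against the decoded value rather than the original — can be sketched as follows. This is a minimal sketch with a generic `quantize` callable standing in for the VQ stage; the function names are illustrative.

```python
def closed_loop_encode(dof, quantize):
    # Instead of q(t+1) - q(t), encode q(t+1) - qhat(t), where qhat
    # is the decoder's reconstruction. The previous frame's
    # quantization error is folded into the next residual, so the
    # reconstruction error stays bounded instead of accumulating.
    decoded = dof[0]                  # first frame sent losslessly
    coded_diffs = []
    for t in range(1, len(dof)):
        d = quantize(dof[t] - decoded)
        coded_diffs.append(d)
        decoded += d                  # mirror the decoder state
    return coded_diffs

def decode(first, coded_diffs):
    out = [first]
    for d in coded_diffs:
        out.append(out[-1] + d)
    return out
```

With a coarse scalar quantizer such as `lambda x: round(x, 1)`, the per-frame reconstruction error never exceeds roughly one quantization step, regardless of sequence length.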
Thus, the error accumulation problem can be resolved.

5.4.3 Other Considerations

There are 6 global dofs. The 3 positional dofs specify the location of the character, while the 3 angular dofs define the pose and facing direction of the character. They provide a big picture of the motion. In the base algorithm, the 3 positional dofs are coded almost losslessly, and each dof takes one byte. The 3 angular dofs are considered part of the torso vector.

To encode the 3 positional dofs (x, y, z) more effectively, we first transform the ground-plane coordinates (x, z) into the corresponding polar expression (r, θ) and then encode the data in the difference domain. This process converts the spatial information into quantities that are more relevant to human perception. Furthermore, since the direction of a human movement is highly correlated with the facing direction, the value can be predicted more easily.

Figure 5.6: Percentage of the data in the CMU mocap database able to achieve the designated MSE at each compression ratio level.

5.5 Experimental Results

We compare the base algorithm and the enhanced algorithm and show their coding performance at two different error levels (i.e., 0.1 and 0.3) in Fig. 5.6. The experiments were conducted on all mocap data in the CMU database. It is clear that the enhanced algorithm is able to achieve a higher compression ratio while maintaining the same level of distortion.

5.6 Conclusion

We improved the coding performance of the base algorithm in Chapter 3 with two new techniques: 1) adaptive temporal domain sampling defined on a dyadic grid and 2) pre-processing of dofs by calculating their differences. The improvement has led to an enhanced algorithm that offers a significant coding gain over the base algorithm.
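The (x, z) to (r, θ) conversion for the global positional dofs can be sketched as follows; the function names are illustrative, and the ground plane is assumed to be spanned by x and z as in the text.

```python
import math

def to_polar(x, z):
    # Transform the ground-plane coordinates (x, z) into (r, theta).
    # theta correlates with the character's facing direction, so its
    # frame-to-frame differences are easier to predict.
    return math.hypot(x, z), math.atan2(z, x)

def from_polar(r, theta):
    # Inverse transform back to Cartesian ground-plane coordinates.
    return r * math.cos(theta), r * math.sin(theta)
```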
With the enhanced algorithm, we can achieve quality similar to that of the base algorithm while increasing the compression ratio from 20:1 to 45:1 on average. Although the complexity of the enhanced algorithm is slightly higher than that of the base algorithm, its encoding and decoding can still be done in real time (faster than 120 frames per second).

Chapter 6

Perception-based Mocap Data Coding

6.1 Introduction

In Chapters 3, 4 and 5, we presented a sequence of mocap data coding algorithms based on temporal sampling and vector quantization (VQ) of dofs at sampled frames as well as dof residuals at all frames. In Chapter 5, we showed that errors can be well compensated when temporal sampling is restricted to the dyadic grid. The performance of these coding algorithms is evaluated in terms of either the maximum error or the mean-squared error (MSE). However, these objective metrics do not correlate well with human subjective evaluation. It was observed in [30] that two animated motion sequences can be perceptually indistinguishable even though their MSE is large. Actually, if there is no abrupt change in energy, human perception can accommodate minor differences. When the maximum error is bounded, two motions are often perceptually similar (even if their MSE is high). Based on the above observation, we propose a perception-based mocap data coding algorithm in this chapter. Although the coded dof values may have a larger MSE, the motion is perceptually similar to the original one.

The following two properties can be exploited to develop the perception-based mocap data coding algorithm. First, if the timing information of the key-poses is shifted slightly, the resulting motion is still perceptually similar to the original one. Second, if the poses in a frame are modified slightly, the resulting motion is still perceptually similar to the original one. The second property was also exploited to justify the coding algorithms developed in Chapters 3, 4 and 5.
The Bezier curve was used in [2] to approximate each motion segment. Consequently, the timing of key-poses can be slightly offset from the original one in the coded result. In this chapter, we exploit the above two properties to achieve a higher compression ratio while preserving perceptual similarity.

The basic idea of the perceptual coding algorithm can be described as follows. We segment a dof curve based on the sign of its velocity. That is, a dof curve is decomposed into a sequence of segments with alternating positive and negative velocities. A region where the velocity is zero is included in the previous segment. We select segments from a training set, normalize them temporally and spatially, and classify them to generate a VQ codebook. Then, in the encoding process, the information to be stored includes the normalization parameters and coded VQ indices. Besides, due to the success of adaptive temporal sampling on a dyadic grid, we adopt the same concept to encode the temporal information. A scalar quantization (SQ) scheme is then applied to the coding of the normalization parameters. It is shown by extensive experimental results that the perception-based coding algorithm can offer a compression ratio of 150:1 on average.

Figure 6.1: The block-diagram of the perception-based mocap data coding algorithm: pre-processing, segmentation and normalization, and a coding stage that applies vector quantization to normalized segments, scalar quantization to the spatial information, and dyadic encoding to the temporal information to produce the compressed data.

The rest of this chapter is organized as follows. The perceptual coding algorithm is presented in Sec. 6.2. Experimental results are shown in Sec. 6.3. Finally, concluding remarks are given in Sec. 6.4.

6.2 Perceptual Mocap Coding Algorithm

The block-diagram of the perception-based mocap data coding algorithm is illustrated in Fig. 6.1. There are three major modules in the block-diagram.
Their functions are briefly described below.

Pre-processing module. The first module performs filtering on the input dof data to produce smooth curves so that they will not be over-segmented due to small noise in the next module.

Segmentation and normalization module. The second module partitions a dof curve into multiple segments by the sign of the velocity. In other words, the motion is segmented at local extrema, and each segment is normalized to a block of height 2π (i.e., the dof range is normalized by 2π) and width 120 frames (i.e., 1 second).

Coding module. The third module consists of three operations which can be executed in parallel:
- VQ applied to the normalized curve;
- SQ applied to the normalization range parameter;
- dyadic coding applied to the temporal information of sampled frames.
Finally, entropy coding is applied to the VQ and SQ results.

To compress a curve like the one in Figure 6.2, the procedure works as follows. First, the segmentation module partitions the curve at local extrema, where each segment can be covered by a bounding box. The width of the bounding box gives the temporal information, and its value is rounded to the closest dyadic number. The height of the bounding box is the range of the dof, and its value is quantized with a SQ. Second, we normalize all bounding boxes to a unit square with their height and width set to 1, and encode all dof curves inside the unit square with the VQ scheme. In other words, each dof curve is approximated by another curve, the closest codeword, from the codebook. Finally, the outputs from SQ and VQ can be further compressed using entropy coding.

Figure 6.2: An example to illustrate the perception-based algorithm.

More details of each module are given in the following three subsections.

6.2.1 Pre-processing Module

In the pre-processing module, we are concerned with two issues.
First, we need to smooth the dof curves so that the segmentation algorithm based on local extrema detection in the second module is robust to the presence of noise. Second, since we need to encode the timing information of sampled points, the dof curve should not be over-segmented. Otherwise, the overhead of time information coding would be high. We propose the following pre-processing steps to achieve the above goals.

If point q_i(t) is a local extremum (a minimum or a maximum), the slope at this point should be equal to zero. If the slope on one side of q_i(t) is zero while that on the other side is non-zero, q_i(t) is called a turning corner point, which means that the curve is entering or leaving a flat zone such as zone 3 in Fig. 6.3. A turning corner point should be sampled if the duration of the flat zone is long. To avoid numerical errors in evaluating the slope, we set up a threshold, ε = 10^−5. If the absolute value of a slope is less than ε, it is treated as zero.

The local extrema and turning corner points form a candidate list for temporal sampling. However, some points in this list may be close to each other in terms of their temporal distances and range values, and it is desirable to remove some of them to save coding bits. For example, the fluctuation caused by idling motion and/or capturing noise, such as the exemplary dof curve in zone 4 in Fig. 6.3, may not be visible to human eyes. It is desirable to filter out these redundant samples to save coding bits.

Figure 6.3: An exemplary dof curve.

We develop a filtering scheme to remove redundant points in the candidate list. The impact of a candidate sampling point on an approximating curve can be estimated from the absolute value difference between two neighboring candidate points. The x-axis and the y-axis in Fig. 6.4 denote the absolute value difference between a sample point and its previous and next sample points, respectively. We may consider the following scenarios as shown in Fig. 6.4.
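The detection of local extrema and turning corner points with the slope threshold ε can be sketched as follows. This is a minimal sketch on a discretely sampled curve; the function name `candidates` is illustrative, and the endpoints are always kept, which is an assumption of the example.

```python
EPS = 1e-5  # slopes with magnitude below this are treated as zero

def candidates(curve):
    # Collect local extrema (the slope changes sign) and turning
    # corner points (the slope is zero on exactly one side) as
    # candidate temporal sampling points.
    picks = [0]
    for t in range(1, len(curve) - 1):
        left = curve[t] - curve[t - 1]      # backward slope
        right = curve[t + 1] - curve[t]     # forward slope
        lz, rz = abs(left) < EPS, abs(right) < EPS
        if lz != rz:                        # entering/leaving a flat zone
            picks.append(t)
        elif not lz and left * right < 0:   # local extremum
            picks.append(t)
    picks.append(len(curve) - 1)
    return picks
```

A peak yields one extremum, while a plateau yields a corner point at each end.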
Figure 6.4: The filtering criteria to remove a point in the candidate list, plotted against the absolute value differences with the previous and the next sample.

A point falls in the green-backslash-shaded area if both absolute value differences are small (less than a threshold δ). One example is given in Fig. 6.5(a). Such a sampling point can be removed since the curve can be well approximated without it. A point in the yellow-dotted area is a little farther away from its neighbors in value (i.e., between δ and 1.5δ). However, it can still be removed if it is temporally close to its neighbors (Δt ≤ τ), as illustrated in Fig. 6.5(b). For a point in the orange-cross-shaded area, where the two consecutive candidates are close either temporally (≤ τ) or in their values (≤ δ), the middle point between them can be used to replace the two candidates, as illustrated in Fig. 6.6.

Figure 6.5: Illustration of two filtering scenarios, where the red candidate sample is removed and the curve is modified from the pink slash line to the brown dot-slash line, which still offers a good approximation. (a) and (b) show cases in the green-backslash-shaded area and in the yellow-dotted area of Fig. 6.4, respectively.

6.2.2 Segmentation and Normalization Module

The second module consists of two sub-modules; namely, segmentation and normalization. The segmentation sub-module uses the local extrema obtained from the pre-processing module to partition each dof curve into multiple segments. Most often, the dof differences of frames in the same segment have the same sign, which means that the segment is monotonically increasing or decreasing. However, it is still possible to have dof differences with mixed signs in a segment. An example is given in Fig. 6.7. This occurs because we allow a margin in the filtering process, as shown in Fig. 6.4.

The normalization sub-module is applied to the y-axis, which corresponds to the range value of a dof curve.
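A single pruning pass over the candidate list, following the removal criteria of Fig. 6.4, can be sketched as follows. This is an illustrative approximation, not the thesis implementation: `delta` and `tau` stand in for the value and temporal thresholds whose exact settings the figure encodes, the merge case is folded into removal for simplicity, and the function name is hypothetical.

```python
def filter_candidates(points, values, delta, tau):
    # points: frame indices of candidates; values: dof values there.
    # Drop an interior candidate if it differs little in value from
    # both neighbours (< delta), or if it differs only moderately
    # (< 1.5 * delta) while being temporally close (<= tau frames).
    # delta and tau are assumed threshold parameters.
    kept = [0]
    for i in range(1, len(points) - 1):
        prev = kept[-1]
        dv_prev = abs(values[i] - values[prev])
        dv_next = abs(values[i + 1] - values[i])
        dt = points[i] - points[prev]
        small = dv_prev < delta and dv_next < delta
        close = dv_prev < 1.5 * delta and dv_next < 1.5 * delta and dt <= tau
        if not (small or close):
            kept.append(i)
    kept.append(len(points) - 1)
    return [points[i] for i in kept]
```

A wiggle well inside the tolerance band is dropped, while a genuine extremum survives.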
Since the dof curve is the temporal evolution of an angular value of a joint, the value is bounded by (−π, π].

Figure 6.6: Illustration of the merge scenario, where the two candidate points shown as red circles in (a) are replaced by their middle point, shown as the yellow circle in (b).

We normalize the range value by 2π and map it into the (0, 1] interval. Furthermore, we apply normalization to the x-axis, which corresponds to time. Instead of encoding the beginning and end points of each segment exactly, we round their values to the closest dyadic grid positions and encode them accordingly.

6.2.3 Coding Module

There are three sub-modules: VQ, SQ and dyadic encoding. For VQ and SQ, the mocap data in the training set are first processed by the pre-processing module and the segmentation and normalization module. The position and the range normalization parameter of each segment are extracted for codebook training.

Figure 6.7: This curve is considered as one positive segment instead of 4 segments of interleaved positive/negative/positive/negative sign. The sub-segments of opposite sign (negative in this example, as the overall segment is positive) are bounded by the conditions defined in Fig. 6.4.

6.2.3.1 Vector Quantization

In the design of the TSVQ codebook, we plot the MSE as a function of the percentage of data in the CMU database used as the training set, parameterized by the number of bits for the codebook, in Fig. 6.8. In this figure, we compare VQ codebooks of four different sizes, i.e., 4, 16, 64 and 256 (corresponding to 2, 4, 6, and 8 coding bits, respectively), with 5%, 10%, 20%, 30%, 40% and 50% of the CMU database as the training data. We see that the performance gain diminishes quickly once the number of coding bits reaches 4 and the training percentage reaches 20%.
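The segment normalization described above can be sketched as follows. This is a minimal sketch under stated assumptions: the range is scaled relative to 2π (the span of an angular dof), the curve shape is mapped into the unit interval for VQ, and the duration is rounded to the nearest power-of-two frame count as a stand-in for dyadic-grid rounding; the function name is hypothetical.

```python
import math

def normalize_segment(frames):
    # frames: dof values of one monotone segment.
    # scale: the SQ-coded range parameter, relative to the full
    #        angular span of 2*pi (values lie in (-pi, pi]).
    # shape: the VQ-coded curve, normalized into [0, 1].
    # dyadic_len: duration rounded to the nearest power of two.
    lo, hi = min(frames), max(frames)
    scale = (hi - lo) / (2 * math.pi)
    span = hi - lo if hi > lo else 1.0
    shape = [(v - lo) / span for v in frames]
    dyadic_len = 2 ** round(math.log2(len(frames)))
    return shape, scale, dyadic_len
```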
Thus, we choose the codebook size to be 16 (or 4 bits) and the training size to be 20% of the data in the CMU database in the experimental section.

Figure 6.8: The plot of the MSE as a function of the percentage of the training data size in the CMU database, for codebooks of 2, 4, 6, and 8 bits.

The 16 codewords of the 4-bit codebook are shown in Fig. 6.9. Actually, this codebook can be shared by up-rising and down-going curves via a mirror about the line y = 0.5 in the figure. Again, it is worthwhile to point out that some codewords are not strictly monotonically increasing due to thresholding in the pre-processing stage.

6.2.3.2 Scalar Quantization

To design the scalar quantizer for the normalization parameter of the dof range value, we plot the histograms of the scaling parameter using 5%, 10%, 20% and 50% of the data in the CMU database as the training set in Fig. 6.10, where the parameters are grouped into 10 bins.

Figure 6.9: The 16 codewords of the 4-bit VQ codebook.

The distributions are very close to each other with 5%, 10%, 20%, and 50% of the data selected for the training set. To be consistent with the VQ codebook training, we use the same training set for the scalar quantizer design, i.e., 20% of the data in the CMU database are used in the training. The R-D performance of the scalar quantizer is shown in Fig. 6.11. In the experimental section, we use 8 bits to encode the normalization parameter for the dof range value.

Due to the lossy coding nature of VQ and SQ, two adjacent segments may not be well connected. To resolve this issue, the boundary frames are re-calculated via interpolation to ensure first-order connectivity. An example is illustrated in Fig. 6.12.
In detail, the last 5 frames of the first segment and the first 5 frames of the second segment are extracted. Then we approximate these 10 frames with a Hermite spline.

Figure 6.10: The histogram of the normalization parameter for the dof range values.

In a unit interval (0, 1], let the first frame at t = 0 have value q(0) and the last frame at t = 1 have value q(1). The starting tangent q̇(0) is defined by the first segment, and the ending tangent q̇(1) is defined by the second segment. The polynomial is then defined by

q(t) = (2t³ − 3t² + 1) q(0) + (t³ − 2t² + t) q̇(0) + (−2t³ + 3t²) q(1) + (t³ − t²) q̇(1),

where t ∈ (0, 1].

Figure 6.11: The plot of the distortion as a function of the number of bits used to encode the normalization parameter for the dof range value.

6.2.3.3 Dyadic Encoding

The positions of segment boundary points are rounded to their closest dyadic grid points. However, since there is no specific relationship between parent and child nodes in a binary tree representation, we directly encode points on the finest grid as a string of 1s and 0s, depending on whether they are segment boundary points or not. After that, we apply the QM coder to the binary string. In detail, the finest grid interval is chosen to be 15 frames (i.e., 1/8 second). A 1-minute motion thus has 8 × 60 = 480 grid points. Without QM coding, each grid index (on or off) demands 1 bit. The percentage of 0s and 1s varies depending on the dynamics of the underlying motion. Generally speaking, since the chosen grid interval is relatively small, there are more 0s than 1s. After QM coding, the average number of bits required to represent each index can be as small as 0.2 bits.
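The boundary-smoothing polynomial above is the standard cubic Hermite interpolant, which can be sketched as follows; the function name is illustrative.

```python
def hermite(q0, q1, dq0, dq1, t):
    # Cubic Hermite interpolation on the unit interval: q0 = q(0)
    # and q1 = q(1) are the boundary values, and dq0, dq1 are the
    # tangents taken from the two adjacent segments.
    h00 = 2 * t**3 - 3 * t**2 + 1     # basis weight for q(0)
    h10 = t**3 - 2 * t**2 + t         # basis weight for qdot(0)
    h01 = -2 * t**3 + 3 * t**2        # basis weight for q(1)
    h11 = t**3 - t**2                 # basis weight for qdot(1)
    return h00 * q0 + h10 * dq0 + h01 * q1 + h11 * dq1
```

The basis functions guarantee that the blended curve matches both endpoint values and both tangents, which is exactly the first-order connectivity required between adjacent segments.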
Figure 6.12: (a) The original dof curve, (b) the coded result after VQ, SQ and dyadic encoding, (c) the interpolated curve based on (b), (d) comparison of the interpolated result (the pink dotted curve) and the original curve (the black solid curve).

6.3 Experimental Results

The coding performance of the perception-based coding algorithm depends on the duration of the partitioned segments. The final size of the coded file is roughly proportional to the number of segments. As a result, a slow motion clip yields a higher compression ratio, while a fast motion clip allows only a lower compression ratio.

The average compression ratios of several motion categories are shown in Fig. 6.13. The jumping motion can be compressed aggressively, with a ratio of 300:1, since the movement is very simple. The motion in the category of acrobatics involves a lot of balancing; however, the poses are similar, so the compression ratio is still high. The motion in martial arts contains repetitive sword plays and Tai-Chi, both of which allow a high compression ratio. The motion clips in the categories of basketball and boxing have a lower compression ratio since the speed of most dofs changes frequently. However, even in the worst cases, the compression ratio is close to 100:1, which is twice that of the approach presented in Chapter 5. The average compression ratio of the perception-based coding algorithm applied to the CMU database is 150:1.

To further justify the perception-based coding result, we conducted a subjective test on perceptual quality evaluation. Ten motion patterns from different categories were selected for this test. For each motion pattern, there were two associated questions. In the first question, each user was asked to compare, side by side, the similarity of two motion patterns synthesized from the original mocap data and the compressed mocap data. The user could give a score from 1 (nothing alike) to 10 (indistinguishable).
In the second question, the user was also asked to give a score on motion naturalness from 1 (extremely unnatural) to 10 (very natural). The reported statistics in Table 6.1 were collected from 20 users.

Table 6.1: Subjective test results on perception-based mocap data compression.

              playground  follow path  run/jog  walk  jump
  Similarity        9.6          9.4       10    9.8   9.8
  Naturalness        10           10       10     10    10

              basketball  dance  cartwheel  Tai-Chi  boxing
  Similarity        9.8     9.6        9.6      9.6     9.8
  Naturalness        10      10         10       10      10

As shown in Table 6.1, every coded motion is still natural (score 10 out of 10) regardless of whether it is similar to the original motion or not. For the similarity comparison, the scores are also very high (at least 9.4 out of 10). The motion with the lowest similarity score is the "follow path" motion. This motion involves a lot of turning, which makes its temporal advances or delays more visible than those in other motion patterns.

6.4 Conclusion

A perception-based mocap data coding algorithm was presented in this chapter. It consists of three modules; namely, the pre-processing module, the segmentation and normalization module, and the coding module. The algorithm partitions the dof curves of a motion clip into alternating segments of positive and negative velocity. Then, it normalizes each segment and applies VQ, SQ and dyadic encoding to encode the dof curves, the dof range normalization parameter, and the temporal positions of segment boundaries. Entropy coding is further applied to the VQ and SQ results to achieve a higher compression ratio.

The performance of the proposed algorithm depends on the speed of the motion and the existence of repeated poses in the motion. Cyclic motion such as walking can reach a compression ratio of about 250:1, while dynamic and fast-changing motion such as basketball can be compressed at a ratio of 100:1. We would like to improve the coding performance of fast-changing motion. Usually, a fast pace implies a smaller range of motion.
The optimal choice of VQ and SQ codebook size can be correlated with the temporal duration of the segment. By taking advantage of this property, we may push for a higher compression ratio when encoding highly dynamic motion such as basketball or boxing. The cross-correlation between different dof curves may be exploited as well.

Figure 6.13: The average compression ratios of different motion categories in the CMU database.

Chapter 7

Conclusion and Future Work

7.1 Conclusion

In this research, we explored the characteristics of mocap data and proposed two compression schemes that allow a flexible rate-quality trade-off. The first one is a precision-based method. It aims to offer a compressed result that is loyal to the ground truth. The second one is a perception-based method. It aims to preserve the perceptual experience while providing a very high compression ratio. We presented the design and the application of the precision-based method in Chapters 3 and 4, respectively. Some further improvement was addressed in Chapter 5. The perception-based method was described in Chapter 6.

The proposed compression scheme, which operates in the domain of dofs, was described in Chapter 3. The dof-based coding scheme has three major advantages. First, it is a more compact representation than the marker position domain. The use of the dof format cuts down the size of the marker position format by two thirds. Second, it is easier to compare two poses using the dof representation. The proposed scheme achieved compression by reducing the time-domain redundancy with temporal sampling, and the space-domain redundancy by applying vector quantization to sampled frames (or I-frames). The dimension of a human pose is 60, which is still high. Consequently, we decomposed a full pose into 5 sub-poses and encoded them separately.
The remaining frames (or B-frames) were predicted by interpolating I-frames, and the prediction error was coded.

The application of the proposed compression scheme to real-world data was discussed in Chapter 4. It was demonstrated that the scheme is generic and automated. Unlike previous work, our scheme does not demand any prior knowledge of the motion type. It uses tree-structured vector quantization to generate an embedded bit stream. In addition, it encodes prediction residuals, which allows more flexibility in bit allocation; by pushing the residual coder to the extreme, it becomes a lossless coding scheme. A set of 6 design choices was used in the encoder, which affect the rate-distortion-complexity control. They were: 1) the codebook size for the I-frames, 2) the codebook size for the I-frame residual, 3) the codebook size for the B-frame residual, 4) the number of I-frames, 5) how the I-frames are selected, and 6) how the B-frames are interpolated. We also investigated the relationship between the data, the design choices, and the coding performance. We defined temporal and spatial features to characterize a clip of mocap data, and proposed a rate control algorithm that adjusts encoder parameters adaptively to deliver a coded result matching the target bit rate or error bound. It was shown by experimental results that the proposed scheme can achieve 20:1 compression with low coding complexity and good quality.

In Chapter 5, we improved the compression ratio of the precision-based algorithm proposed in Chapter 3 from 20:1 to 45:1. The performance improvement was achieved by two techniques: 1) adaptive temporal-domain sampling and 2) a change of coding variables from dofs to the differences of dofs. The temporal positions of I-frames can be encoded efficiently via dyadic coding, and the change of coding variables leads to a more effective VQ codebook.

We proposed a perception-based mocap coding algorithm in Chapter 6.
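The I-frame/B-frame prediction described above can be illustrated with the simplest interpolation choice. This sketch assumes linear interpolation between two I-frames and frames represented as lists of dof values; the thesis treats the interpolation method as a design choice, so linear is only one instance.

```python
# Sketch: predict B-frames between two I-frames by linear interpolation,
# then keep only the prediction residual (what the residual coder encodes).

def predict_b_frames(i0, i1, count):
    """Linearly interpolate `count` B-frames between I-frames i0 and i1."""
    preds = []
    for k in range(1, count + 1):
        t = k / (count + 1)                 # fractional position between I-frames
        preds.append([(1 - t) * a + t * b for a, b in zip(i0, i1)])
    return preds

def residuals(frames, preds):
    """Per-dof prediction error for each B-frame."""
    return [[f - p for f, p in zip(fr, pr)] for fr, pr in zip(frames, preds)]

i0, i1 = [0.0, 10.0], [4.0, 14.0]           # two toy I-frames (2 dofs each)
actual = [[1.5, 11.0], [2.0, 13.5]]         # the real in-between B-frames
preds = predict_b_frames(i0, i1, 2)
res = residuals(actual, preds)
```

When the motion is smooth, the residuals are small and cheap to code; pushing the residual coder to zero error recovers the lossless mode mentioned above.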
Since a minor temporal offset or spatial stretch is not obvious to human eyes, we can leverage this property to achieve a higher compression ratio. The encoder consists of three modules: 1) pre-processing, 2) segmentation and normalization, and 3) coding. In the second module, the encoder partitions a dof curve into segments of positive and negative velocities, normalizes each segment, and applies vector/scalar quantization and dyadic encoding. The performance of the proposed algorithm depends on the motion speed and the existence of pose repetition in the motion. Dynamic and fast-changing motion such as playing basketball can reach a compression ratio of about 100:1, while cyclic motion such as walking may reach a compression ratio of 250:1.

7.2 Future Work: Motion Synthesis

Motion synthesis based on coded mocap data will be an interesting research direction. The general framework and some detailed ideas are described below. Then, we mention some open research issues at the end of this section.

7.2.1 Overview of Proposed Motion Synthesis Solution

The unit used to synthesize motion is called the motion gene. Traditionally, a motion gene denotes a full pose of a human subject [16], [27], and a motion sequence is decomposed in time by the motion graph (MG) algorithm [13], [23]. In this research, we attempt to decompose a motion sequence in both time and space. That is, we are interested in using a smaller motion gene unit to describe a sub-pose of a limb. This concept of a smaller motion gene is useful when two motion sequences do not have a full pose in common but share some similar sub-poses. Besides, for a typical Vicon human model, a full pose consists of 45 dofs while the motion of a limb has only 9 dofs. The dimension of a sub-pose is smaller and, hence, more manageable. When the sub-pose is used as the basic motion synthesis unit, the correlation of two sub-poses may exist in both space and time.
For example, in a walking motion, one foot has to be on the ground as the weight support, which can be either the left foot or the right foot. In the time domain, each sub-pose can be reached by a direct transition from certain other sub-poses. Furthermore, since sub-poses are used as the basic unit to describe the motion of different limbs, we generalize the concept of the MG to a heterogeneous motion graph (HMG). Each motion gene is a node in the HMG, and it can be chosen as a representative from a collection of similar motion sequences of sub-poses. We adopt the vector wavelet representation to describe a set of dof curves, since the wavelet transform can characterize both global and local variations effectively. The vector wavelet representation also allows a fast way to match two motion sequences.

7.2.2 Motion Genes

Motion genes are used as building blocks to construct a complex motion sequence. As described in Chapter 2, the motion of a subject defines a set of dof curves over time. We use Q_{S,t} to denote the motion gene that covers a set of dofs, S, for a duration of t frames. When a motion gene covers the dofs of a limb such as an arm or a leg, the chance of finding another transitable motion gene is usually higher, and it is easier to perform segmentation in time, since selecting the dominant dof of a limb is easier than that of the full body. On the other hand, it is more challenging to assemble these motion genes together in the synthesis stage. We may manipulate motion genes using various tools as given below.

1. Alignment: Align the tempo of two similar motions.
2. Comparison: Two motion genes can be first aligned and then compared frame by frame.
3. Cascade in time: Two motion genes can be cascaded in time if they are transitable.
4. Union in space: Two motion genes can be integrated together in space if they are biologically compatible.
5. Search: A search among motion genes can be conducted by checking the above properties.
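As a minimal stand-in for the vector wavelet representation mentioned above, the sketch below applies a one-level Haar transform to a single dof curve: the approximation coefficients capture the global trend, the detail coefficients the local variations. The thesis does not fix a wavelet family; Haar is chosen here only because it is the simplest, and the curve length is assumed even.

```python
# One-level (unnormalized) Haar transform of a dof curve. The approximation
# band carries the coarse shape used for fast matching; the detail band
# carries the local variation. The inverse shows the split is lossless.

def haar_step(curve):
    """Return (approximation, detail) coefficients for an even-length curve."""
    half = len(curve) // 2
    approx = [(curve[2 * i] + curve[2 * i + 1]) / 2 for i in range(half)]
    detail = [(curve[2 * i] - curve[2 * i + 1]) / 2 for i in range(half)]
    return approx, detail

def haar_inverse(approx, detail):
    """Reconstruct the original curve from the two coefficient bands."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([a + d, a - d])
    return out

curve = [2.0, 4.0, 6.0, 8.0, 7.0, 5.0, 3.0, 1.0]
approx, detail = haar_step(curve)
```

Matching two motion genes on the short approximation band first, and only then on the details, is one way to realize the "fast way to match two motion sequences" noted above.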
7.2.3 Heterogeneous Motion Graph (HMG)

The heterogeneous motion graph (HMG) is used to synthesize long and complicated motion sequences. A node in an HMG corresponds to a motion gene, and different motion genes give rise to different nodes. It is desirable to reduce the number of node types in the HMG to simplify the motion synthesis process. Since the human body has left-right symmetry and our arms and legs have similar bone structures, some motion genes can be shared among multiple limbs. Furthermore, a certain delay in time or space can create more variety in motion, such as in a periodic motion, e.g., walking. Four mirroring effects are considered in the HMG: 1) self-mirroring in time, 2) self-mirroring in space, 3) mirroring to symmetric limbs with delay, and 4) mirroring to symmetric limbs without delay. Self-mirroring in time creates the effect of rewinding the motion. Mirroring can be used to reduce the number of nodes in the HMG. For example, since a lot of daily motions have the left-right symmetric property, this mirroring operation can save about one half of the nodes. On the other hand, it is important to conduct path planning to avoid artifacts such as penetrating other limbs or the environment. Two commonly used checking criteria are given below.

Balance checking: Any natural motion of an object has to meet the balance criterion.

Rotation momentum checking: We should check the rotational momentum to avoid unnaturally continuous spins. Note that rotational motion is more sensitive to time-warping modification; thus, this check serves as the safety net.
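Two of the four mirroring effects listed above can be sketched directly: self-mirroring in time (rewinding) and mirroring to symmetric limbs. The frame representation here, a dict mapping limb names with left_/right_ prefixes to dof lists, is an assumption made for illustration and is not the thesis data format.

```python
# Sketch of two HMG mirroring operations on a toy motion representation:
# a motion is a list of frames; each frame maps limb names to dof lists.

def mirror_in_time(motion):
    """Self-mirroring in time: play the motion backwards."""
    return list(reversed(motion))

def mirror_limbs(frame):
    """Symmetric-limb mirroring: swap left_/right_ limb dofs in one frame."""
    def swap(name):
        if name.startswith("left_"):
            return "right_" + name[5:]
        if name.startswith("right_"):
            return "left_" + name[6:]
        return name
    return {swap(k): v for k, v in frame.items()}

motion = [{"left_arm": [0.1], "right_arm": [0.9]},
          {"left_arm": [0.2], "right_arm": [0.8]}]
rewound = mirror_in_time(motion)                 # time-reversed motion
mirrored = [mirror_limbs(f) for f in motion]     # left/right swapped motion
```

Delayed symmetric mirroring (effect 3) would simply combine mirror_limbs with a frame-offset before cascading, which is how one left-arm gene can stand in for both arms.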
When all poses and transitions in a motion database can be created by an HMG, the HMG serves as a great candidate for indexing this huge database. We may also observe how the limbs are used; this information can be used to prevent an athlete from injury due to over-use of certain joints or limbs, and to suggest the weaknesses of a subject that demand more training. With proper AI plug-ins, the HMG can serve as a beginner's tool for choreography.

7.2.4 Open Research Problems

There are a couple of open research problems to be addressed in the near future, as listed below.

We only outlined the key ideas in motion synthesis in the last section. Their implementation and detailed performance evaluation will be conducted in the second half of the thesis research. Specifically, we would like to synthesize a rich set of motions with a minimal amount of human involvement. Some machine learning techniques could be adopted for this purpose.

In the proposed mocap data coding scheme, we did not explore the potential correlation between sub-poses. The correlation between sub-poses may depend on the motion; for example, the correlation between the two legs in walking is stronger than that in a basketball game. It would be interesting to exploit such a correlation between sub-poses. This subject is closely related to motion segmentation/decomposition. A deeper understanding of this subject will be beneficial to motion synthesis as well, since it may guide us in assembling nodes in a heterogeneous motion graph.

For the coding of mocap data, we used the MSE as the distortion measure. In the proposed motion synthesis scheme, we argue that, if a combination of sub-poses co-exists in other motions, the motion should be considered natural. However, the MSE does not fully reflect the human perception of naturalness. It is possible that a motion with a higher MSE may look more natural.
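For concreteness, the MSE distortion used throughout the coding work can be written out as below; as the discussion above notes, it is a purely objective measure and does not capture perceptual naturalness.

```python
# MSE between an original and a decoded set of dof curves (equal shapes):
# the squared error averaged over every dof of every frame.

def mse(original, decoded):
    """Mean squared error over all dofs and frames."""
    n = 0
    err = 0.0
    for o_curve, d_curve in zip(original, decoded):
        for o, d in zip(o_curve, d_curve):
            err += (o - d) ** 2
            n += 1
    return err / n

true_curves = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]   # two toy dof curves
coded_curves = [[0.0, 1.5, 2.0], [1.0, 0.5, 1.0]]  # their decoded versions
distortion = mse(true_curves, coded_curves)
```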
A potential solution to this is to investigate the salient points of human motion, and to quantify the extraction process as an updated quality measure. It will be challenging yet meaningful to find a better objective measure that can quantify the naturalness of a pose or a motion.

References

[1] M. Alexa and W. Müller, "Representing animations by principal components," Computer Graphics Forum, vol. 19, no. 3, pp. 411-418, Aug. 2000.
[2] O. Arikan, "Compression of motion capture databases," ACM Transactions on Graphics, vol. 25, no. 3, pp. 890-897, Jul. 2006.
[3] J. Assa, Y. Caspi, and D. Cohen-Or, "Action synopsis: pose selection and illustration," in SIGGRAPH '05: ACM SIGGRAPH 2005 Papers. New York, NY, USA: ACM, 2005, pp. 667-676.
[4] J. Barbič, A. Safonova, J.-Y. Pan, C. Faloutsos, J. K. Hodgins, and N. S. Pollard, "Segmenting motion capture data into distinct behaviors," in Proceedings of Graphics Interface 2004, Jul. 2004, pp. 185-194.
[5] P. Beaudoin, P. Poulin, and M. van de Panne, "Adapting wavelet compression to human motion capture clips," in Graphics Interface 2007, May 2007, pp. 313-318.
[6] L. S. Brotman and A. N. Netravali, "Motion interpolation by optimal control," in SIGGRAPH '88: Proceedings of the 15th Annual Conference on Computer Graphics and Interactive Techniques. New York, NY, USA: ACM, 1988, pp. 309-315.
[7] J. Chai and J. K. Hodgins, "Performance animation from low-dimensional control signals," ACM Transactions on Graphics (SIGGRAPH 2005), vol. 24, no. 3, Aug. 2005.
[8] S. Chattopadhyay, S. M. Bhandarkar, and K. Li, "Human motion capture data compression by model-based indexing: A power aware approach," IEEE Transactions on Visualization and Computer Graphics, vol. 13, pp. 5-14, 2007.
[9] S. Cooper, A. Hertzmann, and Z. Popović, "Active learning for real-time motion controllers," ACM Transactions on Graphics, vol. 26, no. 3, pp. 5:1-5:7, Jul. 2007.
[10] J. Harrison, R. A. Rensink, and M. van de Panne, "Obscuring length changes during animated motion," in SIGGRAPH '04: ACM SIGGRAPH 2004 Papers. New York, NY, USA: ACM, 2004, pp. 569-573.
[11] T.-h. Kim, S. I. Park, and S. Y. Shin, "Rhythmic-motion synthesis based on motion-beat analysis," ACM Transactions on Graphics, vol. 22, no. 3, pp. 392-401, Jul. 2003.
[12] L. Ibarria and J. Rossignac, "Dynapack: space-time compression of the 3D animations of triangle meshes with fixed connectivity," in 2003 ACM SIGGRAPH / Eurographics Symposium on Computer Animation, Aug. 2003, pp. 126-135.
[13] L. Kovar, M. Gleicher, and F. Pighin, "Motion graphs," ACM Trans. on Graphics (SIGGRAPH), vol. 21, no. 3, pp. 473-482, Jul. 2002.
[14] L. Kovar, J. Schreiner, and M. Gleicher, "Footskate cleanup for motion capture editing," in ACM SIGGRAPH Symposium on Computer Animation, Jul. 2002, pp. 97-104.
[15] K. H. Lee, M. G. Choi, and J. Lee, "Motion patches: building blocks for virtual environments annotated with motion data," ACM Transactions on Graphics, vol. 25, no. 3, pp. 898-906, Jul. 2006.
[16] J. McCann and N. S. Pollard, "Responsive characters from motion fragments," ACM Transactions on Graphics (SIGGRAPH 2007), vol. 26, no. 3, Aug. 2007.
[17] J. McCann, N. S. Pollard, and S. S. Srinivasa, "Physics-based motion retiming," in 2006 ACM SIGGRAPH / Eurographics Symposium on Computer Animation, Sep. 2006.
[18] T. Mukai and S. Kuriyama, "Geostatistical motion interpolation," ACM Trans. Graph., vol. 24, no. 3, pp. 1062-1070, 2005.
[19] A. S. Ogale, A. Karapurkar, and Y. Aloimonos, View-Invariant Modeling and Recognition of Human Actions Using Grammars. Springer Berlin / Heidelberg, 2007.
[20] C. O'Sullivan and J. Dingliana, "Collisions and perception," ACM Trans. Graph., vol. 20, no. 3, pp. 151-168, 2001.
[21] C. O'Sullivan, J. Dingliana, T. Giang, and M. K. Kaiser, "Evaluating the visual fidelity of physically based animations," ACM Trans. Graph., vol. 22, no. 3, pp. 527-536, 2003.
[22] M. Preda, B. Jovanova, I. Arsov, and F. Prêteux, "Optimized MPEG-4 animation encoder for motion capture data," in Web3D '07: Proceedings of the Twelfth International Conference on 3D Web Technology. New York, NY, USA: ACM, 2007, pp. 181-190.
[23] P. S. A. Reitsma and N. S. Pollard, "Evaluating motion graphs for character navigation," in SCA '04: Proceedings of the 2004 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. Aire-la-Ville, Switzerland: Eurographics Association, 2004, pp. 89-98.
[24] P. S. A. Reitsma and N. S. Pollard, "Perceptual metrics for character animation: sensitivity to errors in ballistic motion," in SIGGRAPH '03: ACM SIGGRAPH 2003 Papers. New York, NY, USA: ACM, 2003, pp. 537-542.
[25] L. Ren, A. Patrick, A. A. Efros, J. K. Hodgins, and J. M. Rehg, "A data-driven approach to quantifying natural human motion," ACM Transactions on Graphics (SIGGRAPH 2005), vol. 24, no. 3, Aug. 2005.
[26] C. Rose, B. Bodenheimer, and M. F. Cohen, "Verbs and adverbs: Multidimensional motion interpolation using radial basis functions," IEEE Computer Graphics and Applications, vol. 18, pp. 32-40, 1998.
[27] A. Safonova and J. K. Hodgins, "Construction and optimal search of interpolated motion graphs," ACM Transactions on Graphics (SIGGRAPH 2007), vol. 26, no. 3, Aug. 2007.
[28] M. Sattler, R. Sarlette, and R. Klein, "Simple and efficient compression of animation sequences," in 2005 ACM SIGGRAPH / Eurographics Symposium on Computer Animation, Jul. 2005, pp. 209-218.
[29] A. Treuille, Y. Lee, and Z. Popović, "Near-optimal character animation with continuous control," ACM Transactions on Graphics, vol. 26, no. 3, pp. 7:1-7:7, Jul. 2007.
[30] T. Y. Yeh, G. Reinman, S. J. Patel, and P. Faloutsos, "Fool me twice: Exploring and exploiting error tolerance in physics-based animation," ACM Trans. Graph., vol. 29, no. 1, pp. 1-11, 2009.