DEEP REPRESENTATIONS FOR SHAPES, STRUCTURES AND MOTION by Yi Zhou A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE) August 2020 Copyright 2020 Yi Zhou Acknowledgments My deepest gratitude goes to my advisor, Prof. Hao Li, for the insightful guidance during my Ph.D. program and the generous support for me to freely explore my interested areas. I would also like to thank my committee members: Prof. Laurent Itti and Prof. Andrew Nealen, for the constructive comments and feedbacks. I feel grateful for being able to collaborate with many talented people from Adobe Research, Facebook Reality Lab, Microsoft Research and USC: Connelly Barnes, Chen Cao, Weikai Chen, Liwen Hu, Hanyuan Hsiao, Zeng Huang, Han-wei Kong, Jun Xing, Zimo Li, Cynthia Lu, Jason Saraghi, Yaser Sheikh, Xin Tong, Chenglei Wu, Sitao Xiang, Jimei Yang and Yuting Ye, who also contributed to the work in this thesis. I cherish the wonderful time spent with my labmates and friends at USC, Mingming He, Liwen Hu, Zeng Huang, Jiaman Li, Qiangeng Li, Ruilong Li, Tianye Li, YijingLi, ZimoLi, ShichenLiu, KyleOlszeweski, ShunsukeSaito, BohanWang, Weiyue Wang, Lingyu Wei, Pengda Xiang, Sitao Xiang, Yuliang Xiu, Harry Yang, Ronald Yu, Danyong Zhao, Yajie Zhao, Mianlun Zheng and Yiqi Zhong, for the dinners we shared, the whales we watched, the board games we played and the sleepless nights before dealines we went through together. ii I owe my gratitue to my family, and particularly Mr and Mrs Courtright who took care of me like their niece in this foreign country. Their whole-hearted love encouraged me to overcome many difficulties. I also thank Haoqi Li for his accom- pany. Last but not least, I acknowledge the USC scholarship and the Annenberg Fellowship for sponsoring my Ph.D. program and allowing me to study in diversed disciplines and take professional classes in music. iii Contents Acknowledgments ii List of Tables vii List of Figures ix Abstract xvi Chapter 1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2.1 Mesh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2.2 Strands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.3 Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.4 Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Chapter 2 Mesh 8 2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1.1 Graph Convolution . . . . . . . . . . . . . . . . . . . . . . . 10 2.1.2 CNNs with Local Awareness . . . . . . . . . . . . . . . . . . 12 2.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.1 Graph Sampling . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2.2 vcConv and vcTransConv . . . . . . . . . . . . . . . . . . . 16 2.2.3 vdPool, vdUnpool, vdUpRes, and vdDownRes . . . . . . . . 18 2.2.4 Fully Convolutional Auto-Encoder . . . . . . . . . . . . . . . 19 2.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.3.1 Comparison of Network Architectures . . . . . . . . . . . . . 21 2.3.2 Comparison of Network Layers . . . . . . . . . . . . . . . . 22 2.3.3 High Resolution Mesh . . . . . . . . . . . . . . . . . . . . . 24 2.3.4 Localized Interpolation . . . . . . . . . . . . . . . . . . . . . 
26 2.3.5 N-D Manifold and Non-Manifold Meshes . . . . . . . . . . . 26 2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Chapter 3 3D Hair 30 3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 iv 3.2.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.2.2 Data Generation . . . . . . . . . . . . . . . . . . . . . . . . 35 3.2.3 Hair Prediction Network . . . . . . . . . . . . . . . . . . . . 36 3.2.4 Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.3.1 Quantitative Results and Ablation Study . . . . . . . . . . . 42 3.3.2 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . 43 3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Chapter 4 Rotation Representations 49 4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.2 Definition of Continuous Representation . . . . . . . . . . . . . . . 53 4.3 Rotation Representation Analysis . . . . . . . . . . . . . . . . . . . 57 4.3.1 Discontinuous Representations . . . . . . . . . . . . . . . . . 57 4.3.2 Continuous Representations . . . . . . . . . . . . . . . . . . 60 4.4 Empirical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.4.1 Sanity Test . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.4.2 Pose Estimation for 3D Point Clouds . . . . . . . . . . . . . 68 4.4.3 Inverse Kinematics for Human Poses . . . . . . . . . . . . . 70 4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Chapter 5 3D Human Motion Synthesis 73 5.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 5.2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 5.2.2 Method Overview . . . . . . . . . . . . . . . . . . . . . . . . 82 5.2.3 Range-Constrained Motion Representation . . . . . . . . . . 86 5.2.4 Global Path Predictor . . . . . . . . . . . . . . . . . . . . . 88 5.2.5 Local Motion Generation . . . . . . . . . . . . . . . . . . . . 89 5.2.6 Post-processing . . . . . . . . . . . . . . . . . . . . . . . . . 97 5.2.7 Dataset and Training . . . . . . . . . . . . . . . . . . . . . . 97 5.3 Experiments and Evaluation . . . . . . . . . . . . . . . . . . . . . . 99 5.3.1 Effect of Range-Constrained Motion Representation . . . . . 99 5.3.2 Accuracy of the Global Path Predictor . . . . . . . . . . . . 100 5.3.3 Keyframe Alignment . . . . . . . . . . . . . . . . . . . . . . 100 5.3.4 Runtime Performance . . . . . . . . . . . . . . . . . . . . . . 102 5.3.5 Motion Quality . . . . . . . . . . . . . . . . . . . . . . . . . 102 5.3.6 Extension and Comparison with RTN . . . . . . . . . . . . . 103 5.3.7 Variation Control with Motion DNA . . . . . . . . . . . . . 106 5.4 Additional Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 5.4.1 Partial Body Control: Root . . . . . . . . . . . . . . . . . . 110 5.4.2 Partial Body Control: 2D Joints . . . . . . . . . . . . . . . . 111 5.4.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 v 5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 Chapter 6 Conclusion 114 6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
114 6.2 Limitations and Future Work . . . . . . . . . . . . . . . . . . . . . 116 Bibliography 118 Appendix A Mesh Convolution 132 A.1 Method Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 A.1.1 Network Implementation . . . . . . . . . . . . . . . . . . . . 132 A.1.2 Graph Sampling Algorithm . . . . . . . . . . . . . . . . . . 133 Appendix B Hair Synthesis 135 B.1 Collision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 B.2 Hair Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 B.3 Detailed Network Architecture . . . . . . . . . . . . . . . . . . . . . 137 B.4 Results Gallery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 Appendix C Rotation Representations 142 C.1 Overview of the Supplemental Document . . . . . . . . . . . . . . . 142 C.2 6D Representation for the 3D Rotations . . . . . . . . . . . . . . . 142 C.3 Proof that Case 4 gives a Continuous Representation . . . . . . . . 143 C.4 The Unit Quaternions are a Discontinuous Representation for the 3D Rotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 C.5 Interaction Between 5D and 6D Continuous Representations and Discontinuous Ones . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 C.6 Visualizing Discontinuities in 3D Rotations . . . . . . . . . . . . . . 147 C.7 Additional Empirical Results . . . . . . . . . . . . . . . . . . . . . . 148 C.7.1 Visualization of Inverse Kinematics Test Result . . . . . . . 148 C.7.2 Additional Sanity test . . . . . . . . . . . . . . . . . . . . . 148 Appendix D Motion Synthesis 153 D.1 Training Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 153 D.2 Dataset Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 vi List of Tables 2.1 Comparisons between different network architectures. . . . . . . . . 22 2.2 Comparison of using different blocks. . . . . . . . . . . . . . . . . . 24 3.1 Hair classes and the number of hairs in each class. S refers to short, M refers to medium, L refers to long, X refers to very, s refers to straight and c refers to curly. Some hairs are assigned to multiple classes if its style is ambiguous. . . . . . . . . . . . . . . . . . . . . 36 3.2 Reconstruction Error Comparison. The errors are measured in met- ric. The Pos Error refers to the mean square distance error between the ground-truth and the predicted hair. "-VAW" refers to eliminat- ing the visibility-adaptive weights. "-Col" refers to eliminating the collision loss, "-Curv" refers to eliminating the curvature loss. "NN" refers to nearest neighbor query based on the visible part of the hair. 44 3.3 Time and space complexity. . . . . . . . . . . . . . . . . . . . . . . 44 5.1 IK Reconstruction errors using different rotation representations. The quaternion and vanilla Euler angle representations are used for all nodes. Ours uses the range-constrained Euler representation for non-root nodes and the continuous 6D representation [154] for the root node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 vii 5.2 Global path prediction mean errors in centimeters. V n is the mean error of the root (hip) translation differences in the x-z plane for poses predicted at n frames in the future. Y is the mean error of the root (hip) along the y axis. . . . . . . . . . . . . . . . . . . . . 100 5.3 Mean Computation Time for Generating Different Lengths of Mo- tions. s refers to second. . . . . . . . . . . . . . . . . . . . . . . . . 
102 5.4 Evaluation results for motion quality and keyframe alignment. The first eight rows are the user study results collected from 101 human workers. The rows “real" and “synthetic" are the number of workers who chose the real or synthetic motions, respectively. “User pref- erence" is the percentage of synthetic motions chosen out of total pairs, and the margins of error are listed on the next row with confi- dence level at 95%. The last two rows are the mean Euclidean error of the global root positions and the local joint positions of the input keyframes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 B.1 Details of our network architecture. "in_ch" means input channel size. "out_ch" means output channel size. . . . . . . . . . . . . . . 137 D.1 Range and order of the rotations at each joint defined in the CMU Motion Capture Dataset. Note that finger motions are not captured in this dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 viii List of Figures 1.1 Hair reconstruction from a single view image using HairNet. . . . . 3 1.2 Given the same four keyframes (pink), we show an example where our method generates two different and long motion sequences that interpolates these keyframes. Notice that the synthesized frames contain complex natural poses and meaningful variations. A subset of the synthesized frames are rendered from blue to white in the order of time. The synthesized trajectories are rendered as small dots on the ground. . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1 Graph sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2 Residual block for down/upsampling. . . . . . . . . . . . . . . . . . 19 2.3 Encoding and decoding process for DFAUST data. . . . . . . . . . . 20 2.4 Visualization of receptive field in the middle layer. . . . . . . . . . . 20 2.5 Point-wise reconstruction error on D-FAUST meshes. . . . . . . . . 22 2.6 Result on high resolution mesh. . . . . . . . . . . . . . . . . . . . . 25 2.7 Interpolation from source to target using global or local latent codes. 27 2.8 Reconstruction result of 3D tetrahedron meshes. . . . . . . . . . . . 27 2.9 Reconstruction result of a non-manifold triangle mesh. . . . . . . . 28 ix 3.1 Network Architecture. The input orientation image is first encoded into a high-level hair feature vector, which is then decoded to 32×32 individual strand-features. Each strand-feature is further decoded to the final strand geometry containing both sample positions and curvatures via two multi-layer perceptron (MLP) networks. . . . . . 37 3.2 Ellipsoids for Collision Test. . . . . . . . . . . . . . . . . . . . . . . 39 3.3 The orientation image (b) can be automatically generated from a real image (a), or from a synthesized hair model with 9K strands. Theorientationmapandadown-sampledhairmodelwith1Kstrands (c) are used to train the neural network. . . . . . . . . . . . . . . . 40 3.4 Hair strand upsampling in the space of (b) the strand-features and (c) the final strand geometry. (d) shows the zoom-in of (c). . . . . . 41 3.5 Reconstruction with and without using curliness. . . . . . . . . . . 42 3.6 Interpolation comparison. . . . . . . . . . . . . . . . . . . . . . . . 45 3.7 Comparison with Autohair in different views. . . . . . . . . . . . . . 45 3.8 Comparison with Autohair for local details. . . . . . . . . . . . . . 47 3.9 Hair tracking and reconstruction on video. . . . . . . . . . . . . . . 48 3.10 Failure Cases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . 48 4.1 A simple 2D example, which motivates our definition of continuity of representation. See Section 4.2 for details. . . . . . . . . . . . . . 54 4.2 Our definition of continuous representation, as well as how it can apply in a neural network. See the body for details. . . . . . . . . . 55 x 4.3 An illustration of stereographic projection in 2D. We are given as input a point p on the unit sphere S 1 . We construct a ray from a fixedprojectionpointN 0 = (0, 1)throughpandfindtheintersection of this ray with the plane y = 0. The resulting point p 0 is the stereographic projection of p. . . . . . . . . . . . . . . . . . . . . . 62 4.4 An illustration of hown− 2 normalized projections can be made to reducethedimensionalityfortherepresentationofSO(n) fromCase 3 byn− 2. In each row we show the dimensionn, and the elements of the vectorized representation γ(M) containing the first n− 1 columns of M∈SO(n). Each column is length n: the columns are grouped by the thick black rectangles. Each unique color specifies a group of inputs for the “normalized projection" of Equation (4.8). The white regions are not projected. . . . . . . . . . . . . . . . . . 65 4.5 Empirical results. In (b), (e), (h) we plot on the x axis a percentile p and on the y axis the error at the given percentile p. . . . . . . . 67 5.1 Method Overview. Given user-specified keyframes and the corre- spondingmask, wegeneratethelocalmotionofeveryjointandevery frameinarotationrepresentation. WethenuseRange-Constrainted ForwardKinematicsmoduletoobtainthelocaljointpositionsbased on which we use a global path predictor to estimate the global path of the root joint. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 xi 5.2 Network Architecture for Local Motion Generation. As explained in Section 5.2.3, V t contains a 6-D vector that represents the rotation of the root and (M-1) 3-D vectors that represent the rotation of the other joints. R t contain the 3x3 rotation matrices for all the joints. T t and ˆ T t contain the local translation vectors for all the joints except for the root joint. w is concatenated with each z t to be fed into the decoder. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.3 Input Format for Local Motion Generator. The first row contains the frame indices. The second and third rows show the sparse in- put format, and the fourth and fifth rows show the dense input format. 1s and 0s in the second row indicate the presence and ab- sence of keyframes. S 0 φ ,φ = 3, 64, 67 are the poses at user-specified keyframes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5.4 Root translation error calculation. The blue path is the synthesized root path, and blue dots are the synthetic root positions at the keyframes. Red crosses indicate the input root positions at the keyframes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.5 Mean L2 errors (cm) of local joints (left) and root joint (right) at keyframes throughout the training process. Blue lines refer to the result of using the sparse input format. Red lines refer to the dense input format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 5.6 Training and testing setting for different approaches. . . . . . . . . 106 xii 5.7 Examples of motion variety. Within each of the four sub-figures, we visualizefourposes(blue)atthesameframefromamotionsequence synthesized from the same set of keyframes but with four different Motion DNAs. 
The transparent pink skeleton is the target keyframe pose in the near future. . . . . . . . . . . . . . . . . . . . . . . . . . 107 5.8 Examplesofgeneratingonesecondandfoursecondtransitionsgiven the 40 past frames and 2 future frames. The first two rows show se- lected frames of a one-second transition sampled at 30 fps. The last two rows show frames of a four-second transition sampled at 60 fps. Pink skeletons visualize the input keyframes. From left to right, the blue skeletons show how the synthesized motion transitions between two keyframes. Numbers at the top-right corners are the frame in- dices. Corresponding results can be found in the supplementary video. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.9 Examples of inbetweening. The pink skeletons visualize the user- specified keyframes. From left to right, the blue skeletons show how the synthesized motion transitions between two keyframes. The semi-transparent pink skeletons are the keyframes in the near fu- ture within 400 frames. Each group of two rows shows frames from twogenerated2048-frame-longsyntheticmotionsequencesgiventhe same input keyframes but different Motion DNA. The yellow skele- tons are the representative poses for the Motion DNA. . . . . . . . 109 xiii 5.10 Motion Generation given sparse root coordinates. Two synthe- sized motion sequences given the same keyframes (position of the root joints as pink dots on the ground) and different representa- tive frames. Top row: the 100th, 500th, and 700th frame from a synthesized sequence. Bottom row: the corresponding frames from another sequence synthesized with a different set of representative frames. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 5.11 MotionGenerationgivenonly2Dkeyframeposes. Left: 2Dkeyframe inputs given on the x-y plane. Right: the synthesized 3D motion sequence. The pink skeletons visualize the user-specified keyframes. From left to right, the blue skeletons show how the synthesized mo- tion transitions between two keyframes. The semi-transparent pink skeletons are the keyframes in the near future within 400 frames. . . 110 A.1 Steps for graph up and down-sampling. . . . . . . . . . . . . . . . . 134 A.2 More up-sampling examples. . . . . . . . . . . . . . . . . . . . . . . 134 B.1 Without the collision loss in training (c,d), our method could pro- duce less plausible results. . . . . . . . . . . . . . . . . . . . . . . . 135 B.2 Hair interpolation in the latent space of our model. . . . . . . . . . 136 B.3 Our single-view reconstruction results for various hairstyles. . . . . 138 B.4 Our single-view reconstruction results for various hairstyles. . . . . 139 B.5 Our single-view reconstruction results for various hairstyles. . . . . 140 B.6 Our single-view reconstruction results for various hairstyles. . . . . 141 C.1 Visualization of discontinuities in 3D rotation representations. . . . 149 xiv C.2 Attop, weshowIKresultsforthetwoframeswithhighestposeerror from the test set for the network trained using quaternions, and the corresponding results on the same frames for the network trained on the 6D representation. At bottom, we show the two worst frames for the 6D representation network, and the corresponding results for the quaternion network. . . . . . . . . . . . . . . . . . . . . . . . . 151 C.3 Additional Sanity test results. 
“Quat" refers to quaternions, “Quat- hemi" refers to quaternions constrained to one hemisphere, “AxisA" refers to axis angle and “Rodriguez" refers to the 3D Rodriguez-vector.152 xv Abstract In recent years, deep learning has achieved great success in 2D image process- ing. However, its development in the 3D field is still restricted due to the lack of proper implicit representations and neural network architectures for 3D data. For 3D tasks, we face more complex and diversed forms of data such as meshes and hierarchically defined transformation groups. Unlike 2D images which are stored in uniformed grids, 3D data can have uneven parametrization and non-Eulicdean topology. While it is difficult to directly apply existing deep learning models to 3D data, we propose the deep representations and neural network architectures for sev- eral common 3D cases, covering topics from 3D shapes and structures to motion synthesis. We first introduce the convolutional neural network for manifold and non-manifold meshes, then dig into strand structures and address the problem of human hair modeling. After that, we investigate the rotation representations in neural networks, based on which, we further propose the neural network architec- ture for natural human motion synthesis. Enabled by our proposed deep learning models, we are able to conquer challenging reconstruction and synthesis tasks for 3Dmodelingandanimation, andsurpassstate-of-the-artworkswithhigherquality, higher resolution and real-time or near real-time performance. xvi Chapter 1 Introduction 1.1 Motivation In the past decade, 3D modeling and animation are fundamentally changed by several key technologies. On one hand, the advancement in capturing systems and the popularity of social media opens up the possibility of collecting massive appearance and behavior data. On the other hand, effective learning and efficient decoding of implicit representations from massive data are enabled by the advent of deep neural networks. Altogether, these paradigm shifts in technology bring the twilight of deep learning being massively used in social media, movies, games and education and revolutionize the way how people interact with computers. Although deep learning has achieved great success in 2D image processing, it hasn’t been fully developed in the 3D field yet. One critical problem is the lack of proper neural network architectures and deep representations for 3D data. Unlike 2Ddatawithregulargrid-structuressuchasimages, 3Ddatacanhaveiregulardata structures and non-Euclidean topologies that classic neural network architectures cannot be directly applied on. In the computer graphics area, in order to compute complex 3D phenomenons, a variaty of data representations are used for different senarios. For 3D shape modeling, polygon meshes are the most widely used form, but there are also some special cases like hair that has volumetric and filar structures. For animating the 1 avatars, people use the skeleton model and rotations at each joint to abstract the pose, and the temporal change of poses to represent motion. In the dissertation, we propose the deep representions and neural network ar- chitectures for those common 3D representations. We first explore the problem of designing the convolutional neural network for manifold and non-manifold meshes, then we dig into hair modeling. After that, we investigate rotation representa- tions in neural networks, based on which, we further propose the neural network architecture for natural human motion synthesis. 
1.2 Overview

1.2.1 Mesh

Starting from 3D shape, in Chapter 2 we introduce a novel convolutional neural network model for meshes. Convolutional Neural Networks (CNNs) have achieved great success in many computer vision applications. However, it is difficult to directly apply existing CNN architectures for images to irregular graph data such as 3D meshes. To this end, we propose a generalization of operators commonly used in 2D CNNs, namely convolution, transpose convolution, pooling, unpooling, and residual layers, so that they can be applied to any irregular 3D mesh, including manifold or non-manifold meshes. Our convolution operator learns a spatially shared kernel basis together with locally varying coefficients that combine the basis into the actual kernel weights. In this way, we can apply convolution to arbitrary meshes as long as the dataset shares the same topology. We demonstrate the effectiveness of our operators by designing an auto-encoder on 3D meshes that outperforms state-of-the-art methods for the task of mesh reconstruction.

Figure 1.1: Hair reconstruction from a single view image using HairNet.

1.2.2 Strands

In Chapter 3, we study strand modeling through the lens of human hair. Due to its volumetric structure and the deformability of each strand, hair is among the most complex and irregular subjects for 3D digitization. State-of-the-art automatic hair modeling methods rely on queries into large hairstyle collections and heavy computation [93, 21]. To explore a much faster and lighter hair modeling solution, we introduce a deep learning-based method to generate full 3D hair geometry from an unconstrained image, as shown in Figure 1.1.

We propose to embed each hair strand, as a 3D curve, into a latent vector, and to parameterize the latent vectors of the hair strands on a 2D map based on their root positions on the scalp. Built on this, we can apply conventional 2D convolutional neural networks to encode and decode the global hairstyle feature vector. Trained on a large hair dataset that contains 40,000 hair models, the network provides a compact and continuous representation for hairstyles, which allows us to interpolate naturally between hairstyles.

For the single-view reconstruction task, we first feed the network with the 2D orientation field of a hair image to infer the global feature of the hair, and then decode the 3D hair model. We introduce a collision loss to synthesize more plausible hairstyles, and the visibility of each strand is also used as a weight term to improve the reconstruction accuracy. We use a large set of rendered synthetic hair models to train our network. Our method scales to real images because an intermediate 2D orientation field, automatically calculated from the real image, factors out the difference between synthetic and real hair. We demonstrate the effectiveness and robustness of our method on a wide range of challenging real Internet pictures, and show reconstructed hair sequences from videos. Compared to previous methods, our deep learning approach is highly efficient in storage and runs 1000 times faster than the data-query method while generating hair with 30K strands.
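To make the strand representation above concrete, the following is a minimal sketch of the per-strand decoder, assuming a 64-dimensional strand feature, 100 sample points per strand, and two separate MLPs for positions and curvatures. The layer widths, dimensions, and module names here are illustrative assumptions rather than the exact architecture of Chapter 3; the actual layer sizes are listed in Appendix B.3.

```python
import torch
import torch.nn as nn

class StrandDecoder(nn.Module):
    """Illustrative sketch: decode one strand-feature vector into strand
    geometry (per-sample 3D positions and scalar curvatures) via two MLPs,
    as in the description of Fig. 3.1. Sizes are assumptions, not the
    thesis's exact configuration."""
    def __init__(self, feat_dim=64, num_samples=100):
        super().__init__()
        self.num_samples = num_samples
        self.pos_mlp = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, num_samples * 3))      # xyz per sample point
        self.curv_mlp = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_samples))          # curvature per sample point

    def forward(self, strand_feat):               # (num_strands, feat_dim)
        pos = self.pos_mlp(strand_feat).view(-1, self.num_samples, 3)
        curv = self.curv_mlp(strand_feat).view(-1, self.num_samples, 1)
        return pos, curv

# The 32x32 grid of strand features produced by the 2D decoder is flattened
# so every scalp position is decoded independently.
decoder = StrandDecoder()
strand_features = torch.randn(32 * 32, 64)        # one hairstyle
positions, curvatures = decoder(strand_features)  # (1024, 100, 3), (1024, 100, 1)
```

In the full pipeline, a 2D convolutional decoder first produces the 32×32 grid of strand features from the global hairstyle code; this sketch covers only the final, per-strand stage.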
1.2.3 Rotation

Traditionally, a 3D avatar's pose is represented by a group of rotations at each joint. A proper rotation representation of joint motion is critical for deep neural network based pose estimation or motion synthesis frameworks. In neural networks, it is often desirable to work with various representations of the same space. For example, 3D rotations can be represented with quaternions or Euler angles. In Chapter 4, we advance a definition of a continuous representation, which can be helpful for training deep neural networks. We relate this to topological concepts such as homeomorphism and embedding. We then investigate which representations of 2D, 3D, and n-dimensional rotations are continuous and which are discontinuous. We demonstrate that for 3D rotations, all representations are discontinuous in real Euclidean spaces of four or fewer dimensions. Thus, widely used representations such as quaternions and Euler angles are discontinuous and difficult for neural networks to learn. We show that 3D rotations have continuous representations in 5D and 6D, which are more suitable for learning. We also present continuous representations for the general case of the n-dimensional rotation group SO(n). While our main focus is on rotations, we also show that our constructions apply to other groups such as the orthogonal group and similarity transforms. We finally present empirical results, which show that our continuous rotation representations outperform discontinuous ones for several practical problems in graphics and vision, including a simple autoencoder sanity test, a rotation estimator for 3D point clouds, and an inverse kinematics solver for 3D human poses.

1.2.4 Motion

The ability to generate complex and realistic human body animations at scale, while following specific artistic constraints, has been a fundamental goal of the game and animation industry for decades. In Chapter 5, we explore human motion from the following two aspects.

Embedding hard bio-constraints in neural networks. The human body has limited flexibility, and the angles of joint rotations are inherently constrained. Some works address such constraints by learning the distribution of the motion range from a large dataset. However, since the range of a joint rotation does not always form a convex hull, and the noise in captured datasets is not negligible, inferring joint constraints from measurements is not reliable. To solve this problem, we propose using bio-priors to strictly define the range of motion for each joint using Euler angles. To avoid the discontinuity problems of Euler angle representations, we modify the traditional forward kinematics functions with specially rotated local coordinates and Euler angle multiplication orders. As a result, the network can synthesize motions that strictly follow the flexibility constraints while maintaining a continuous representation.

Figure 1.2: Given the same four keyframes (pink), we show an example where our method generates two different, long motion sequences that interpolate these keyframes. Notice that the synthesized frames contain complex natural poses and meaningful variations. A subset of the synthesized frames is rendered from blue to white in the order of time. The synthesized trajectories are rendered as small dots on the ground.

Improvisation and control. Imagine a director who wants a five-minute dance but has only designed some key postures in the middle, relying on the computer to choreograph the rest. Paradoxically, while many possible dances could exist for the sparsely given keyframes, any improper movement in the middle could break the alignment with future keyframes. Such a combination of flexibility and determinism is essential to automatically synthesizing natural, complex and improvisational human motions with respect to the users' input.
To solve this problem, we designed an unsupervised deep generative model that learns the inbetweening function from a large motion dataset. We introduce a scheme called Motion DNA to encourage output variations with respect to interpretable random seeds, so that users can weakly control the content of the synthesized motion, as shown in Figure 1.2. The resulting model is robust, flexible and efficient for generating a variety of highly realistic 3D motions that respect the given keyframes. We believe this work could pave the way for fundamental changes in the pipeline of motion synthesis for animation and games.

Chapter 2 Mesh

Convolutional Neural Networks (CNNs) have been widely used in many visual learning tasks, from image classification to texture synthesis. The characteristic of learning shared weights across different spatial locations enables effective learning without over-fitting. In spite of their success on 2D images, applying the convolution operator to irregular graph data such as 3D meshes remains an unsolved problem. The main difficulty stems from the non-uniform connectivity across nodes in a graph structure, and the non-uniform spatial discretization in the case of 3D meshes. These non-uniformities prevent nodes from sharing the same discretized convolution kernel as in the 2D case. While many specialized convolution operators have been proposed for manifold meshes [95, 15, 51, 41] or general graphs [128, 97, 130, 34], the field has not yet capitalized on the success of CNN architectures for 2D images.

In this chapter, we propose a generalization of operators commonly used in 2D CNNs, so that they can be applied to irregular graph data such as 3D meshes. By unifying neural network operators on irregular topology, our formulation enables graph problems to benefit directly from the advancement of modern deep learning machinery.

To this end, we propose the use of a spatially varying convolution kernel which obtains its weights from a shared kernel basis. Unlike most 2D CNNs, which use the same kernel weights at every pixel location, our network performs convolution with filters which vary across vertex locations. This is often referred to as local connectivity. While locally connected networks have been used in the past [123], we found that the number of training parameters can make such networks difficult to use. For a given layer of the network, if I is the number of input channels and O the number of output channels, there would be I×O parameters per mesh edge (for that layer). A naive implementation of this network structure would therefore require a prohibitive amount of GPU memory, especially when applied to high quality meshes with over 100K vertices.

To reduce the number of parameters and the GPU memory footprint, we hypothesize that the kernel weights for all the edges in a layer should share some commonality. In particular, we employ the prior that all kernel weights (in that layer) lie in the span of a shared kernel basis. We use M basis vectors, each of dimension I×O. By jointly learning these M basis vectors, as well as the per-edge weight coefficients, we are able to create locally connected networks with only M parameters per edge. By setting M to be much smaller than I×O, we greatly reduce the size of the network while maintaining good performance. Our method can be seen as a generalization of traditional CNNs to arbitrary graphs, as long as all the graphs in the dataset share the same topology.
We derive traditional CNN concepts such as convolution, transpose convolution, stride, dilation, pooling and unpooling in a natural way within our method. Indeed, our formulation allows us to construct an analog of almost any 2D CNN architecture on a mesh.

In our experiments, we demonstrate that our proposed operators exceed state-of-the-art performance on the task of mesh reconstruction via an auto-encoder. Our fully convolutional auto-encoder architecture has the additional advantage of semantically meaningful, localized latent codes, which enable interpolation and artistic manipulation. Furthermore, while recent mesh convolution techniques such as [15, 51, 114] apply only to 2D manifolds, our operators are fully compatible with both N-D manifold meshes and non-manifold meshes.

2.1 Related Work

In recent years, many methods have been developed to learn on arbitrary shapes. These include 3D voxel convolution [152], PointNet [110], mapping to UV space [12] and, more recently, deep implicit surfaces [103]. Although our method is related to these in the sense of learning on arbitrary shapes, we will focus on discussing methods which pertain to graph or mesh data.

2.1.1 Graph Convolution

Many graph convolution methods have been developed to deal with the irregularity found in non-grid-structured data (varying vertex degree, varying sampling density, etc.). There are two main trends in addressing this challenge. One strategy is to convert the irregularly sampled domain to a regularly sampled domain, exemplified by the spectral approaches [16, 35, 77, 114]. Instead of learning on graphs directly, these methods work on spectral representations of graphs. While this domain adaptation allows us to leverage the power of standard 2D CNNs directly, it also loses the fidelity of the original signal and the ability to learn directly in the spatial domain. This leads to lower precision for generative models, especially for the task of reconstruction.

The other approach to the graph convolution problem is to define convolution directly on the graph [37, 8, 95, 14, 128, 97, 129, 15, 51, 140]. One major challenge in developing these non-spectral methods is to define an operator that works with different numbers of neighbors, yet maintains the weight sharing property of CNNs. Duvenaud et al. [37] propose to learn a specific weight matrix for each node degree. Masci et al. [95] use convolution on local patches of meshes with filters defined in polar coordinates and resolve the orientation ambiguity of filters with angular max-pooling. Boscaini et al. [14] propose an anisotropic CNN model that orients filters by extracting local patches using the maximum curvature directions. Monti et al. [97] present mixture models of CNNs (MoNet), parameterized in local patches of the graph, so that a unified CNN architecture can be defined. Velickovic et al. [128] introduce an attention-based architecture to perform node classification of graph-structured data. Verma et al. [129] learn soft attention across the weighting matrix with a predefined function relying on input feature maps. While the above methods all define convolution in local patches and model the local variations based on the input features, our method differs from these in that our local awareness relies only on the topology. Our local convolution is thus independent of input features and fully learned to be shared across all the training samples.
Our topology-dependent local variations reflect the design choice that the local coefficients should model only the graph irregularity, such as varying sampling density, while the feature statistics are learned in the shared weight bases. This keeps our method consistent with classic CNNs. In terms of handling the varying sampling density of mesh vertices, Hermosilla et al. [54] propose to use Monte Carlo sampling for non-uniformly sampled point clouds, where vertices are weighted by a local density estimated from their 3D positions. However, this only works when the features used to compute the densities are not the prediction target, e.g. point cloud segmentation, and not for the reconstruction purpose our method is designed for. Our local awareness is also related to locally linear embedding [118], which learns a weight coefficient for each neighbor to reconstruct each point as a weighted sum of its neighbors.

Other methods have been proposed to design graph convolution on different representations or primitives. Bouritsas et al. [15] introduce a novel graph convolutional operator, the spiral operator, which acts directly on meshes using consistent local orderings of the vertices of the graph, achieving anisotropic filters without sacrificing computational complexity. Hanocka et al. [51] develop specialized convolution and pooling layers on edges for triangle meshes (MeshCNN) by leveraging their intrinsic geodesic connections. The defined edge convolution is direction invariant, so it cannot be directly applied for reconstruction. Fey et al. [41] propose to use a spline for kernel generation, which makes the computation time independent of the kernel size but also limits the learning capacity. In addition, this convolution layer requires pseudo-coordinates computed from the 3D positions of the vertices as input, making it infeasible for a reconstruction task using an auto-encoder.

All the graph convolution methods listed above also share a common limitation: they do not support transpose convolution, so tasks that require up-sampling need additional unpooling layers. Our method, in contrast, provides a transpose convolution layer.

2.1.2 CNNs with Local Awareness

Although we are primarily concerned with convolution on irregularly connected graph data, our method can be easily adapted to regular grids as well. One major feature of our method, if extended to regular grids, is the learnable locally varying coefficients, which capture non-stationary features. There are several existing works which also use spatially varying kernels for the 2D case. Jia et al. [69] propose to generate instance-specific filters through a separate network branch. Chen et al. [25] develop Spatial Attention to predict an attention map with the same size as the input feature, in order to reweight the feature map based on location. Jia et al. [69] and Dai et al. [32] both propose to predict a spatial transform that warps the convolution kernel, which can account for global or local geometry distortion. Su et al. [122] propose to employ a fixed function to adjust the shared convolution kernel to adapt to local content, while [132] directly regresses the local multiplier by convolving with the input feature map.

These methods all perform local kernel adaptation in a way that is dependent on the input feature map, thus constraining the possible types of variation. The other extreme is to learn the local convolution kernel independently for every spatial location [123].
This achieves the highest capacity and can perform well when training data is plentiful. However, the drastic increase in the number of parameters makes this method infeasible for high resolution data. In contrast, by constraining each local filter to lie in the span of a kernel basis, our method massively reduces the number of parameters while still keeping the kernel independent of the input feature maps.

2.2 Method

In this section, we introduce how we generalize the traditional CNN operators on regular grids to irregular topology. We will use the 3D mesh as a convenient example, and assume the dataset shares the same mesh topology. In Section 2.2.1, we first explain how we construct the input and output graphs and determine the convolution kernel size. Then, in Section 2.2.2, we propose the new convolution and transpose convolution operators. We assume the convolution weights lie in the span of a shared kernel basis and are sampled by a set of local coefficients per neighbor. These local coefficients are called "variant coefficients" (vc), and we call our convolution operations vcConv and vcTransConv. In Section 2.2.3, for the pooling and unpooling layers, we propose to weight the input vertices with learned densities, which better accounts for the irregularity during downsampling and upsampling. We name the density weights "variant densities" (vd) and name the layers vdPool and vdUnpool. We also define residual layers, denoted vdUpRes and vdDownRes, in a similar manner for up- and down-sampling respectively. Finally, in Section 2.2.4, we construct a fully convolutional mesh auto-encoder with these proposed graph layers.

Figure 2.1: Graph sampling

2.2.1 Graph Sampling

Due to the irregularity of the nodes' connections, it is not trivial to define the sampling method (both downsampling and upsampling) on a graph. A good sampling method should respect the graph topology without over-fitting to one data instance as in [132, 15]. Thus, we design our graph sampling method to depend only on the topology and not on the features defined on a node or an edge. Suppose an output graph Y is sampled from an input graph X. Y has N vertices, and each vertex $y_i$ is computed from a local region $\mathcal{N}(i)$ in X. $\mathcal{N}(i)$ has $E_i$ vertices $x_{i,j}$, $j = 1,\dots,E_i$. We define the distance between two vertices as the number of edges in the shortest path connecting them.

With sampling stride s = 1 and kernel radius r, for each $x_i$ in X we create a $y_i$ in Y and assign Y the same edges as in X. The corresponding $\mathcal{N}(i)$ for $y_i$ is defined by $x_i$'s 0- to r-ring neighbors, including itself. As shown in Figure 2.1 (a), $\mathcal{N}(i)$ is indicated by the dotted lines from $y_i$ (yellow vertex) to the input graph X, where the 1-ring (r = 1) edges are marked in red.

For down-sampling, given stride s > 1, we first select a subset of vertices $\{x_i\}$ from X such that each vertex is at least s away from any other selected vertex, and no more vertices can be added to the set. The breadth-first traversal algorithm for vertex selection is explained in Appendix A.1.2. One can also manually assign certain vertices to be included after down-sampling for application-specific purposes. As shown in Figure 2.1 (b), $\{x_i\}$ are the blue vertices with orange circles, sampled with s = 2. In the output graph Y, we create a vertex $y_i$ (yellow vertices) for each $x_i$. The topology of Y depends purely on the vertex connectivity in X: two vertices are connected in Y if their distance is less than or equal to 2s−1 in X.
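The selection and connectivity rules just described can be sketched as follows, assuming the input graph is given as an adjacency list. The helper names and the simple index-order traversal are our simplifications; the actual breadth-first selection algorithm is the one detailed in Appendix A.1.2.

```python
from collections import deque

def bfs_within(adj, source, max_dist):
    """Vertices within graph distance max_dist of source (edge-count metric)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_dist:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist  # vertex -> distance

def downsample(adj, s, r):
    """Sketch of stride-s graph down-sampling as described above:
    selected vertices are pairwise at least s apart, two output vertices
    are connected if their input distance is <= 2s-1, and N(i) is the
    0..r-ring of the selected vertex."""
    selected, blocked = [], set()
    for v in range(len(adj)):              # any fixed ordering; the thesis uses a BFS order
        if v in blocked:
            continue
        selected.append(v)
        blocked |= set(bfs_within(adj, v, s - 1))
    reach = {xi: set(bfs_within(adj, xi, 2 * s - 1)) for xi in selected}
    out_edges = [(i, j) for i, xi in enumerate(selected)
                 for j, xj in enumerate(selected)
                 if i < j and xj in reach[xi]]
    neighborhoods = [sorted(bfs_within(adj, xi, r)) for xi in selected]
    return selected, out_edges, neighborhoods
```

Since this preprocessing depends only on the topology, it is run once per mesh topology and reused for every data sample.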
With a kernel radius of r, $\mathcal{N}(i)$ contains the r-ring neighborhood of $x_i$. For up-sampling, the input and output graphs are the output and input graphs of down-sampling with the same stride size. Figure 2.1 (c) shows a dual image of Figure 2.1 (b) for up-sampling. To determine $\mathcal{N}(i)$ of $y_i$ for a kernel radius r, we first collect $y_i$'s r-ring neighbors in Y, then locate the sampled vertices (yellow-circled) and include their corresponding vertices in X to construct $\mathcal{N}(i)$. Please find more examples in Appendix A.1.2 for r = 1 and 3. Since the topology is fixed, the sampled graphs and vertex indices are the same for all data.

2.2.2 vcConv and vcTransConv

In a 2D convolution operation, the output feature $y_i \in \mathbb{R}^O$ is computed as

y_i = \sum_{x_{i,j} \in \mathcal{N}(i)} W_j\, x_{i,j} + b,   (2.1)

where $x_{i,j} \in \mathbb{R}^I$ are the input features, $W_j \in \mathbb{R}^{I \times O}$ is the learned weight matrix defined for each neighboring vertex, and $b$ is the learned bias. In a regular grid, as the topology of all neighborhoods is identical, $W_j$ can be defined consistently and shared by all the vertices in the grid. For graphs like meshes, the vertices' density, connectivity, and directions can differ from vertex to vertex, so the same weighting scheme cannot be directly applied. One solution could be to allow the weights to vary spatially [123], so that each vertex defines its own convolution weights. We will call this method LCConv, for locally connected convolution. An LCConv layer has $IO\sum_{i=1}^{N} E_i + O$ training parameters and requires considerable memory. This over-parameterization is also more prone to overfitting.

In this chapter, instead of simply using LCConv, we propose a new strategy for both convolution and transpose convolution, which we call vcConv and vcTransConv. Our design is based on the assumption that a mesh is a discretization of a continuous space. Since a continuous convolution kernel can be shared spatially on the original continuous space, we should be able to discretize the operator in a similar way to the discrete 2D convolution operator. However, unlike an image, meshes are usually discretized non-uniformly: vertices are unevenly distributed in space, and each vertex has a different connectivity. To convert to a discrete convolution operation, we can resample the unique continuous kernel to generate the weights for each neighborhood of a vertex. To achieve that, the sampling functions need to be defined locally per vertex. Rather than using handcrafted sampling functions, we learn them through training.

Specifically, for both vcConv and vcTransConv, we compute the weights per $x_{i,j}$ as the linear combination of kernel basis vectors $B = \{B_k\}_{k=1}^{M}$, $B_k \in \mathbb{R}^{I \times O}$, with locally variant coefficients (vc) $A_{i,j} = \{\alpha_{i,j,k}\}_{k=1}^{M}$, $\alpha \in \mathbb{R}$:

W_{i,j} = \sum_{k=1}^{M} \alpha_{i,j,k} B_k,   (2.2)

and compute the convolution as

y_i = \sum_{x_{i,j} \in \mathcal{N}(i)} W_{i,j}\, x_{i,j} + b.   (2.3)

$A_{i,j}$ is different for each vertex $x_{i,j}$ in $\mathcal{N}(i)$ of each $y_i$, but $B$ is shared globally within one layer. Both are learnable parameters shared across the entire dataset. With this formulation, our parameter count is reduced to $IOM + M\sum_{i=1}^{N} E_i + O$. Empirically, we choose M to be roughly the average size of a neighborhood. Section 2.3.2 shows that vcConv is much lighter in parameter count and less memory intensive than LCConv.

To remove the scaling ambiguity between the kernel basis and the variant coefficients, one can additionally normalize $B_k$ before multiplying with $\alpha$. However, we found in our ablation study that this slows down convergence and leads to higher error.
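A condensed PyTorch sketch of the vcConv forward pass of Equations (2.2) and (2.3) follows. The padded-neighborhood layout (`neighbors`, `mask`) and the initialization scale are our own simplifications, not part of the thesis; the released implementation details are described in Appendix A.1.1.

```python
import torch
import torch.nn as nn

class VcConv(nn.Module):
    """Sketch of vcConv: per-neighbor weights are linear combinations of a
    shared kernel basis (Eq. 2.2), then aggregated over the neighborhood
    (Eq. 2.3). Neighborhoods are padded to a fixed length E_max and masked."""
    def __init__(self, in_ch, out_ch, n_out, e_max, m_basis):
        super().__init__()
        self.basis = nn.Parameter(torch.randn(m_basis, in_ch, out_ch) * 0.02)  # B_k
        self.coeff = nn.Parameter(torch.randn(n_out, e_max, m_basis) * 0.02)   # alpha_{i,j,k}
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x, neighbors, mask):
        # x: (N_in, I); neighbors: (N_out, E_max) indices into x; mask: (N_out, E_max)
        w = torch.einsum('jem,mio->jeio', self.coeff, self.basis)   # W_{i,j}, Eq. 2.2
        x_nb = x[neighbors]                                         # (N_out, E_max, I)
        y = torch.einsum('jei,jeio->jo', x_nb * mask.unsqueeze(-1), w)
        return y + self.bias                                        # Eq. 2.3

# Toy usage: 12 input vertices with 3 channels, 7 output vertices.
conv = VcConv(in_ch=3, out_ch=16, n_out=7, e_max=6, m_basis=4)
x = torch.randn(12, 3)
nbrs = torch.randint(0, 12, (7, 6))
mask = torch.ones(7, 6)
y = conv(x, nbrs, mask)                                             # (7, 16)
```

For example, with I = O = 64 and M = 17, the shared basis costs I·O·M parameters while each neighbor entry adds only M coefficients, which is how the count above stays far below the LCConv count.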
As shown in Figure 2.1, vcConv corresponds to the down-sampling process and vcTransConv to the up-sampling process. The sampling scale and receptive field are determined by the sampling results $\{y_i\}$ and $\{\mathcal{N}(i)\}$ from the preprocessing. Notice that when s = 1, the graph size remains the same.

2.2.3 vdPool, vdUnpool, vdUpRes, and vdDownRes

To be consistent with traditional CNNs, we also need to define our pool and unpool layers. Naively, we could use max or average operations, which work well on regular grids. However, in an arbitrary graph the vertices can be distributed quite unevenly within the kernel radius, and our experiments in Section 2.3.2 show that simply using max or average pooling does not perform well.

Inspired by Hermosilla et al. [54], we apply Monte Carlo sampling for feature aggregation. In [54], the vertex density is estimated from the 3D coordinates of the neighboring vertices. However, in more general cases we do not have such information at each layer. Since it is hard to design a generally rational density estimation function, we let the network learn the optimal variant density (vd) coefficients across all the training samples. Note that vd is defined per node after pooling or unpooling. Specifically, the aggregation function in the vdPool and vdUnpool layers is

y_i = \sum_{j \in \mathcal{N}(i)} \rho'_{i,j}\, x_{i,j}, \qquad \rho'_{i,j} = \frac{|\rho_{i,j}|}{\sum_{j=1}^{E_i} |\rho_{i,j}|},   (2.4)

where $\rho_{i,j} \in \mathbb{R}$ is a training parameter and $\rho'_{i,j}$ is the density value. Due to the vd coefficient normalization, vdPool/vdUnpool neither rescales nor changes the mean value of the input feature map. Similarly, we can define a residual layer as

y_i = \sum_{x_{i,j} \in \mathcal{N}(i)} \rho'_{i,j}\, C x_{i,j}.   (2.5)

When the input and output feature dimensions are the same, C is an identity matrix; otherwise, C is a learned O×I matrix shared across all the graph nodes.

Figure 2.2: Residual block for down/up-sampling.

With the residual layer, we can design a residual block for up- or down-sampling. As illustrated in Figure 2.2, the input passes through the vcConv or vcTransConv layer and the activation layer Elu [29], and is then added to the output of the vdDownRes or vdUpRes layer. The convolution and residual layer should have the same sampling stride. We denote this as vcConv+vdDownRes or vcTransConv+vdUpRes. For simplicity, we do not write out Elu in the rest of the chapter. Please refer to Appendix A.1.1 for implementation details.

2.2.4 Fully Convolutional Auto-Encoder

Based on the network layers explained above, we propose a fully convolutional mesh auto-encoder (AE). The encoder uses down-sampling residual blocks to centralize the whole graph's information into several latent vertices in the bottleneck, and the up-sampling residual blocks in the decoder reconstruct the original graph from the latent codes on these latent vertices. The last block of the auto-encoder has no Elu. Unlike [15, 114, 78], our network has no fully connected layers in the latent space.

Figure 2.3 shows the structure of an AE on D-FAUST meshes. The network is designed to have four downsampling blocks and four upsampling blocks with s = 2 and r = 2 for all layers. It compresses the original mesh to seven vertices with 9 channels per vertex, resulting in a 63-dimensional latent code.

Figure 2.3: Encoding and decoding process for DFAUST data.
Figure 2.4: Visualization of receptive field in the middle layer.
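The down-sampling residual block of Figure 2.2 can be sketched as below, reusing the VcConv sketch above and the same padded-neighborhood assumptions. VdDownRes is an illustrative reading of Equations (2.4)–(2.5), not the released implementation (see Appendix A.1.1); with matching channel counts and C omitted it reduces to the vdPool aggregation of Equation (2.4).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VdDownRes(nn.Module):
    """Residual path of Eq. 2.5: normalized density-weighted sum over each
    output vertex's neighborhood, followed by a shared linear map C when
    the channel count changes (identity otherwise)."""
    def __init__(self, in_ch, out_ch, n_out, e_max):
        super().__init__()
        self.rho = nn.Parameter(torch.ones(n_out, e_max))            # rho_{i,j}
        self.C = None if in_ch == out_ch else nn.Linear(in_ch, out_ch, bias=False)

    def forward(self, x, neighbors, mask):
        w = self.rho.abs() * mask
        w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-8)           # Eq. 2.4 normalization
        y = torch.einsum('je,jei->ji', w, x[neighbors])              # weighted aggregation
        return y if self.C is None else self.C(y)

class DownResBlock(nn.Module):
    """vcDownConv followed by Elu, added to the vdDownRes skip path (Fig. 2.2)."""
    def __init__(self, in_ch, out_ch, n_out, e_max, m_basis):
        super().__init__()
        self.conv = VcConv(in_ch, out_ch, n_out, e_max, m_basis)     # sketch from above
        self.res = VdDownRes(in_ch, out_ch, n_out, e_max)

    def forward(self, x, neighbors, mask):
        return F.elu(self.conv(x, neighbors, mask)) + self.res(x, neighbors, mask)
```

Stacking such blocks with the precomputed neighborhoods from Section 2.2.1 yields the encoder; the decoder mirrors it with transpose-convolution and vdUpRes blocks.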
Localized Latent Feature Interpolation. Using our up/down-sampling scheme, we can manually set the seven latent vertices on the head, hands, feet, and torso. Their receptive fields naturally centralize at these vertices and propagate gradually over the surface, as visualized in Figure 2.4. As a result, the latent code defines a semantically meaningful latent space. For instance, we can interpolate only the latent vertex on the right arm between a source and a target code, to pose only the right arm of the full mesh.

In comparison, the quadric mesh simplification method used in [114] and [15] does not provide such local semantic control in its latent space, as it simplifies the mesh according to the point-to-plane error of a template, so the receptive field tends to respect Euclidean distance rather than geodesic distance. In Figure 2.4, one can see that the receptive field from the quadric mesh simplification covers both the right arm and the right hip, which is less favorable for localized interpolation.

2.3 Experiments

In this section, we first compare our proposed auto-encoder for 3D meshes with state-of-the-art architectures on the D-FAUST dataset; then we compare the performance of different convolution and (un)pooling layers under the same architecture. After that, we show our auto-encoder's performance on a high-resolution 3D hand dataset and a 3D body dataset. Next, we present localized latent code interpolation for 3D hand models. Finally, we demonstrate training for high-dimensional manifold and even non-manifold data. All experiments were trained with the L1 reconstruction loss only and the Adam optimizer [76], and are reported with the point-to-point mean Euclidean distance error unless specified otherwise.

2.3.1 Comparison of Network Architectures

We compare our method with that of Neural3DMM [15], which is the current state-of-the-art auto-encoder for registered 3D meshes, as well as MeshCNN [51]. We choose to work with the D-FAUST human body dataset as it captures both high-frequency variance in poses and low-frequency variance in local details. The D-FAUST dataset contains 140 sequences of registered human body meshes. We use 103 sequences (32,933 meshes in total) for training, 13 sequences (4,023 meshes) for validation and 13 sequences (4,264 meshes) for testing.

We trained our network for 200 epochs with batch size 16, learning rate 0.0001 and learning rate decay 0.9 every epoch, using a GeForce 1080Ti, CUDA 10.0 and PyTorch 1.0. For Neural3DMM, we set its latent space dimension to 63 to match our bottleneck size and trained it for 300 epochs using the exact settings described in its paper. As reported in Table 2.1, our network achieves over 30% and 40% lower errors on the testing and training sets respectively, with fewer training epochs and a similar number of parameters. Moreover, Neural3DMM compresses the input mesh into a fully connected bottleneck, while ours is fully convolutional and directly compresses the mesh to 7 vertices with 9 channels per vertex. A visual comparison between the two methods can be found in Figure 2.5.

Figure 2.5: Point-wise reconstruction error on D-FAUST meshes.

Table 2.1: Comparisons between different network architectures.
             Train Error (mm) | Test Error (mm) | Params
  Ours              3.73      |      5.01       |  1.9m
  SpiralCNN         6.42      |      7.39       |  2m
  MeshCNN          83.27      |    101.77       |  2.2m

For MeshCNN [51], we found it infeasible to train a model on the full-resolution D-FAUST dataset due to memory constraints. Thus, we down-sampled all meshes to 750 vertices, which is the same size used in their experiments.
Because [51] performs convolution on the edges, we use each edge’s two endpoints as the input feature we attempt to reconstruct. We set the input size to 3000 edges and have a bottleneck layer of 150 edges. The number layers and channels is the same as in [51]. We train for 200 epochs. 2.3.2 Comparison of Network Layers In this section, we evaluate our proposed network with other design choices. Based on the network architecture defined in Section 2.2.4, we keep the block 22 number, the input/output channel and vertex number the same 1 , but compare with different convolution and (un)pooling layers. All experiments were trained on the DFAUST dataset with the same setting as described in section (Sec. 2.3.1). Table 2.2 lists all the block designs, the errors, the parameter count and the GPU memory consumption for training with batch size=16. In Group 1 Ablation Study, we compare different attributes and combinations of our proposed layers. From 1.1 to 1.3, we tested the effect of adjusting the kernel basis size M. We denote the encoding or decoding residual blocks as vc- Down/TransConv(s, r,M) + vcDown/UpRes(s). Note that 1.3 is the architecture in the previous section. By increasing M, the network’s capacity increases and achieves lower errors. In 1.4 the residual layers are removed and errors go higher. In 1.5 we replace the convolution and transpose convolution by the combination of stride 1 convolution and stride 2 pooling or unpooling layers, denoted as vcConv- >vdPool and vdUnpool->vcTransConv and the errors increase a bit. In 1.6 we apply normalization on the weight bases and the error further increases. In Group 2, we compare our vcConv with other convolution operators as pro- posed in previous work. Since all the other convolution layers don’t support up or down-sampling, we use our vdPool and vdUnpool layers for sampling, and set s = 1,r = 1 for all convolution layers. From the table, 2.2 Locally Connected convolution layer, which never learns any shared weights, has the lowest error but it has around 135 times more learnable parameters than our vcConv layer and consumes twice the GPU memory for training, preventing it from being applied to bigger network architectures, like with high-resolution meshes. In 2.3, spectral convolution with 6 Chebyshev bases [34] has the least parameters but the test error increases 50%. For 2.4 MoNet [97], we use the pseudo coordinates from its paper, 1 Except for GATConv in Experiment 2.5 which has eight times more channels. 23 Error (mm) Encoder Block Decoder Block Train Test Params Train Mem 1.1 vcDownConv(2,2,37) + vcDownRes(2) vcTransConv(2,2,37) + vcUpRes(2) 3.02 4.56 3.9m 2509Mib 1.2 vcDownConv(2,2,27) + vcDownRes(2) vcTransConv(2,2,27) + vcUpRes(2) 3.29 4.73 2.9m 2493Mib 1.3 vcDownConv(2,2,17) + vcDownRes(2) vcTransConv(2,2,17) + vcUpRes(2) 3.73 5.01 1.9m 2471Mib 1.4 vcDownConv(2,2,17) vcTransConv(2,2,17) 4.02 5.23 1.8m 2123Mib 1.5 vcConv(1,1,9) ->vdPool(2) vdUnpool(2)->vcConv(1,1,9) 4.57 5.63 1.4m 4183Mib 1.6 vcConv(1,1,9)* ->vdPool(2) vdUnpool(2)->vcConv(1,1,9)* 13.25 14.29 1.4m 4183Mib 1. 
Ablation Study 2.1 vcConv(1,1,9) ->vdPool(2) vdUnpool(2)->vcConv(1,1,9) 4.57 5.63 1.38m 4185Mib 2.2 LCConv(1,1) ->vdPool(2) vdUnpool(2)->LCConv(1,1) 2.67 4.23 185.77m 8767Mib 2.3 ChebConv(1,1) ->vdPool(2) vdUnpool(2)->ChebConv(1,1) 7.19 8.59 0.15m 1853Mib 2.4 MoNetConv(1,1) ->vdPool(2) vdUnpool(2)->MoNetConv(1,1) 9.21 10.4 0.61m 5223Mib 2.5 GATConv(1,1) ->vdPool(2) vdUnpool(2)->GATConv(1,1) 11.95 14.28 0.21m 7377Mib 2.6 FeaStConv(1,1) ->vdPool(2) vdUnpool(2)->FeaStConv(1,1) 14.77 17.03 0.76m 7359Mib 2. Comparison of Convolution Layers 3.1 vcConv(1,1) ->vdPool(2) vdUnpool(2)->vcConv(1,1) 4.57 5.63 1.38m 4185Mib 3.2 vcConv(1,1) ->avgPool(2) avgUnpool(2)->vcConv(1,1) 5.93 6.88 1.37m 3651Mib 3.3 vcConv(1,1) ->maxPool(2) maxUnpool(2)->vcConv(1,1) 8.8 13.55 1.37m 3651Mib 3.4 vcConv(1,1) ->qPool(2) qUnpool(2)->vcConv(1,1) 4.94 6.08 1.37m 4731Mib 3. Comparison of Pooling Layers Table 2.2: Comparison of using different blocks. and set the kernel size to 25; For 2.5 GATConv [128], we set the head number being 8, all heads concatenated except for the middle and last layer which were averaged instead; for 2.6 FeaStConv [130], we set the head size as 32. 2.3 to 2.6 were implemented with PyTorch Geometry [42]. They have much higher error than our vcConv and require much more memory. This demonstrates our method achieves the best accuracy and efficiency in terms of memory consumption. In Group 3, we keep the vcConv layers the same but use different (un)pooling layers. From 3.2 and 3.3, we can see that using simple average or max (un)pooling operations increases the error. In 3.4, using the quadric (un)pooling layers(qPool), the error also increases. 2.3.3 High Resolution Mesh Real-world 3D data can have much higher resolution than DFAUST. Here we experimented our network on two high-resolution body datasets. The first dataset 24 Figure 2.6: Result on high resolution mesh. containsfullyalignedhandmeshesreconstructedfromperformancecapturesoftwo people with roughly 200 seconds of 90 poses per person. The mesh has 57k vertices and115kfacets. Werandomlypicked39one-secondclipsfortesting, 39one-second clips for validation, and used the rest for training, resulting in 9376 meshes for training and 1170 for testing. Each vertex is given both the 3D coordinate and RGB color as input. Our auto-encoder network has nine down-sampling residual blocks and nine up-sampling residual blocks with s = 2, r = 2 and M = 17 for all blocks. The latent space has 6 vertices with 64 dimensions per vertex, resulting in a compression rate of 0.11%. The middle 6 vertices were manually selected to be at the fingertips and wrist. We trained the network with point to point L1 loss, L1 laplacian loss and L1 RGB loss. After 205 epochs of training, the mean point to point euclidean distance error dropped to 1.06 mm and the mean L1 RGB error to 0.036 (range 0 to 1). The second dataset contains 24,628 fully aligned body meshes reconstructed from a performance capture of one person conducting different poses. The resolu- tion for each mesh is 154k vertices and 308k triangles. We randomly chose 2, 463 for validation and used the rest for training. Each vertex has 3D coordinates as 25 the feature to reconstruct. Our auto-encoder has 6 down-sampling residual blocks and 6 up-sampling residual blocks, with s = 2,r = 2 and M = 17 for all blocks. The latent space has 18 vertices and 64 dimensions per vertex, resulting in a com- pression rate of 0.25%. We additionally used L1 laplacian loss for training. 
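To make the training objective concrete, the Laplacian term mentioned above can be combined with the point-wise L1 loss roughly as follows. This is a minimal PyTorch sketch assuming a precomputed sparse uniform graph Laplacian built once from the template connectivity; it is illustrative and not the exact loss implementation used in these experiments.

```python
import torch

def mesh_reconstruction_loss(pred, target, laplacian, lap_weight=1.0):
    """L1 point-to-point loss plus an L1 loss on Laplacian (differential)
    coordinates. pred/target: (B, V, 3); laplacian: sparse (V, V) matrix
    built once from the template mesh connectivity (uniform weights here)."""
    point_l1 = (pred - target).abs().mean()
    # Differential coordinates L @ x, computed per batch element.
    delta_pred = torch.stack([torch.sparse.mm(laplacian, p) for p in pred])
    delta_gt = torch.stack([torch.sparse.mm(laplacian, t) for t in target])
    lap_l1 = (delta_pred - delta_gt).abs().mean()
    return point_l1 + lap_weight * lap_l1
```

The Laplacian term penalizes differences in local surface detail rather than absolute position, which is why it helps preserve fine structure on high-resolution meshes.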
After training 100 epochs, the mean point to point euclidean distance error dropped to 2.15 mm for training and 3.01 mm for validation. From Figure 2.6 we can see that the output mesh is quite detailed. Compared with the groundtruth, which is more noisy, the network output is relatively smoothed. Interestingly, from the middle two images, we can see that the network learned to reconstruct the vein in the inner side of the arm from the originally noisy surface. 2.3.4 Localized Interpolation As an additional advantage, the latent codes of our network are localized. We demonstrate this on hand meshes in Figure 2.7. The latent vertices are the tips of the five fingers and the wrist. For interpolation, we first inferred the latent codes from a source mesh and a target mesh, then we linearly interpolated the latent code on each individual latent vertex separately as shown in Figure 2.7. With only two input hand models, we can obtain many more gestures by interpolating a subset of latent vertices instead of the entire code. 2.3.5 N-D Manifold and Non-Manifold Meshes Different from [114, 15, 51], which are limited to 2-D manifold meshes, our method can be applied to N-D manifold or even non-manifold meshes. We first demonstrate 3-D manifold cases using 3D tetrahedron meshes (tet mesh). Tet meshes are commonly used for physical simulation [121]. We used a dataset containing 10k deformed tet meshes of an Asian dragon, and trained an 26 Figure 2.7: Interpolation from source to target using global or local latent codes. Figure 2.8: Reconstruction result of 3D tetrahedron meshes. auto-encoder using 7k for training, 1.3k for validation and 562 for testing. The dragon has 959 vertices and lies in a bounding box of roughly 20×20×20 cm. The auto-encoder has five down-sampling residual blocks and five up-sampling residual blocks. s = 2,r = 2,M = 37 for all blocks. Its bottle neck has two vertices, 16 latent codes per vertex. After 400 epochs of training, the error converged to 0.2 mm. Figure 2.8 shows the cases of input tet meshes, output tet meshes, and the fine surface mesh driven by the output tet meshes. 27 Figure 2.9: Reconstruction result of a non-manifold triangle mesh. To demonstrate our network on non-manifold data, we show our network on a 3D tree model. The tree model has 328 disconnected components. To connect all the components, we added an edge between each pair of close components. We constructed a dataset of 10 sequence of this tree’s animation simulated using random forces, 1000 frames for each clip, and used 2 clips for testing, 2 clips for validation and the rest for training. The auto-encoder had four down-sampling residual blocks to compress the original 20k vertices into 149 vertices, 8 latent codes per vertex and four up-sampling residual blocks. s = 2,r = 2,M = 27 for each block. After 36 epochs of training, the reconstruction error dropped to 4.1 cm. Figure 2.9 shows the ground truth meshes rendered by the receptive field and the output mesh rendered by per vertex error. These two datasets are generated using Vega [9]. 2.4 Discussion We have introduced a novel deep architecture that produces state-of-the-art results for the task of mesh reconstruction. Our method is a generalization of 2D CNN architectures onto arbitrary graph structures, and has natural analogs of the operations which make 2D CNNs so powerful. Furthermore, our method avoids the 28 high parameter cost of naive locally connected networks by using a shared kernel basis, while still maintaining good performance. 
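To illustrate the point about parameter cost, a toy version of such a shared-basis, spatially varying convolution might look like the sketch below. The layer name, tensor shapes, and neighbor-gathering scheme are our own simplification for illustration (sampling stride and neighbor padding are omitted); this is not the actual vcConv/vcTransConv implementation.

```python
import torch
import torch.nn as nn

class SharedBasisConv(nn.Module):
    """Sketch of a spatially varying graph convolution: each output vertex mixes
    M shared basis weight matrices with its own learned coefficients, instead of
    storing an independent weight matrix per (vertex, neighbor) pair."""
    def __init__(self, num_out_verts, max_nbrs, in_ch, out_ch, num_basis):
        super().__init__()
        self.basis = nn.Parameter(torch.randn(num_basis, in_ch, out_ch) * 0.1)
        # One coefficient vector per output vertex and neighbor slot.
        self.coeff = nn.Parameter(torch.randn(num_out_verts, max_nbrs, num_basis) * 0.1)
        self.bias = nn.Parameter(torch.zeros(num_out_verts, out_ch))

    def forward(self, x, nbr_idx):
        # x: (B, V_in, in_ch); nbr_idx: (V_out, max_nbrs) indices into V_in.
        nbr_feat = x[:, nbr_idx]                                   # (B, V_out, K, in_ch)
        weights = torch.einsum('vkm,mio->vkio', self.coeff, self.basis)
        out = torch.einsum('bvki,vkio->bvo', nbr_feat, weights)
        return out + self.bias
```

With V_out output vertices, K neighbors, and M basis matrices, the learnable weights scale as M * C_in * C_out + V_out * K * M rather than V_out * K * C_in * C_out, which is where the memory savings over a naive locally connected layer come from.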
Several future directions are possible with our formulation. For one thing, though our method can learn on arbitrary graphs, all the graphs in the dataset must have the same topology. We plan to extend our work in the future so that it can work on datasets with varying topology. 29 Chapter 3 3D Hair In this chapter, we will discuss how to use deep learning to model strand struc- tures like human hair. Realistic hair modeling is one of the most difficult tasks when digitizing virtual humans [19, 64, 87, 101, 49]. In contrast to objects that are easily parameterizable, like the human face, hair spans a wide range of shape variations and can be highly complex due to its volumetric structure and level of deformabilityineachstrand. Although[102,68,11,93,142]cancreatehigh-quality 3D hair models, but they require specialized hardware setups that are difficult to be deployed and populated. Chai et al. [22, 23] introduced the first simple hair modeling technique from a single image, but the process requires manual input and cannot properly generate non-visible parts of the hair. Hu et al. [62] later ad- dressed this problem by introducing a data-driven approach, but some user strokes were still required. More recently, Chai et al. [21] adopted a convolutional neural network to segment the hair in the input image to fully automate the modeling process, and [150] proposed a four-view approach for more flexible control. However, these data-driven techniques rely on storing and querying a huge hair modeldatasetandperformingcomputationally-heavyrefinementsteps. Thus, they are not feasible for applications that require real-time performance or have limited hard disk and memory space. More importantly, these methods reconstruct the target hairstyle by fitting the retrieved hair models to the input image, which may capture the main hair shape well, but cannot handle the details nor achieve high accuracy. Moreover, since both the query and refinement of hair models are based 30 on an undirected 2D orientation match, where a horizontal orientation tensor can either direct to the right or the left, this method may sometimes produce hair with incorrect growing direction or parting lines and weird deformations in the z-axis. To speed up the procedure and reconstruct hairs that preserve better style w.r.t the input image and look more natural, we propose a deep learning based approach to generate the full hair geometry from a single-view image, as shown in Figure 1.1. Different from recent advances that synthesize shapes in the form of volumetric grids [28] or point clouds [40] via neural networks, our method generates the hair strands directly, which are more suitable for non-manifold structures like hair and could achieve much higher details and precision. Our neural network, which we call HairNet, is composed of a convolutional encoder that extracts the high-level hair-feature vector from the 2D orientation field of a hair image, and a deconvolutional decoder that generates 32× 32 strand- featuresevenlydistributedontheparameterized2Dscalp. Thehairstrand-features could be interpolated on the scalp space to get higher (30K) resolution and further decoded to the final strands, represented as sequences of 3D points. In particular, the hair-feature vector can be seen as a compact and continuous representation of the hair model, which enables us to sample or interpolate more plausible hairstyles efficiently in the latent space. 
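A skeletal version of this encoder-decoder layout is sketched below, assuming the channel sizes indicated later in Figure 3.1 (a 512-dimensional hair feature, 32x32 strand features of dimension 300, and 100 samples per strand with a position and a curvature each). The exact kernel sizes and layer counts here are illustrative guesses, not the released HairNet configuration.

```python
import torch
import torch.nn as nn

class HairNetSketch(nn.Module):
    """Skeleton of the encoder/decoder described above: a convolutional encoder
    maps the 3x256x256 orientation image to a hair feature vector, a
    deconvolutional decoder expands it to 32x32 strand features, and two MLPs
    decode each strand feature to 100 sample positions and curvatures."""
    def __init__(self, feat_dim=512, strand_dim=300, samples=100):
        super().__init__()
        def conv(ci, co):
            return nn.Sequential(nn.Conv2d(ci, co, 4, stride=2, padding=1), nn.ReLU())
        def deconv(ci, co):
            return nn.Sequential(nn.ConvTranspose2d(ci, co, 4, stride=2, padding=1), nn.ReLU())
        self.samples = samples
        self.encoder = nn.Sequential(conv(3, 32), conv(32, 64), conv(64, 128),
                                     conv(128, 256), conv(256, feat_dim))       # -> (B, 512, 8, 8)
        self.decoder = nn.Sequential(deconv(feat_dim, 512), deconv(512, 512),
                                     deconv(512, strand_dim))                   # 4x4 -> 32x32
        self.pos_mlp = nn.Sequential(nn.Linear(strand_dim, 512), nn.ReLU(),
                                     nn.Linear(512, samples * 3))
        self.curv_mlp = nn.Sequential(nn.Linear(strand_dim, 512), nn.ReLU(),
                                      nn.Linear(512, samples))

    def forward(self, orient_img):
        b = orient_img.shape[0]
        h = self.encoder(orient_img)                        # (B, 512, 8, 8) partial features
        z = torch.amax(h, dim=(2, 3), keepdim=True)         # spatial max-pool -> hair feature
        strand_feat = self.decoder(z.expand(-1, -1, 4, 4))  # (B, 300, 32, 32)
        f = strand_feat.flatten(2).transpose(1, 2)          # (B, 1024, 300), one row per strand
        positions = self.pos_mlp(f).view(b, 1024, self.samples, 3)
        curvatures = self.curv_mlp(f)                       # (B, 1024, 100)
        return positions, curvatures
```

Because the strand features form a 32x32 grid over the parameterized scalp, denser hair can be produced at test time by bilinearly interpolating this grid before the MLP decoding step.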
In addition to the reconstruction loss, we also introduce a collision loss between the hair strands and a body model to push the generated hairstyles towards a more plausible space. To further improve the accuracy, we uses the visibility of each strand based on the input image as a weight to modulate its loss. Obtaining a training set with real hair images and ground-truth 3D hair geome- tries is challenging. We can factor out the difference between synthetic and real hair data by using an intermediate 2D orientation field as network input. This 31 enables our network to be trained with largely accessible synthetic hair models and also real images without any changes. For example, the 2D orientation field can be calculated from a real image by applying a Gabor filter on the hair region automatically segmented using the method of [151]. Specifically, we synthesized a hair data set composed of 40K different hairstyles and 160K corresponding 2D orientation images rendered from random views for training. Compared to previous data-driven methods that could take minutes and ter- abytes of disk storage for a single reconstruction, our method only takes less than 1 second and 70 MB disk storage in total. We demonstrate the effectiveness and robustness of our method on both synthetic hair images and real images from the Internet, and show applications in hair interpolation and video tracking. Our contributions can be summarized as follows: 1. We propose the first deep neural network to generate dense hair geometry from a single-view image. To the best of our knowledge, it is also the first work to incorporate both collision and visibility in a deep neural network to deal with 3D geometries. 2. Our approach achieves state-of-the-art resolution and quality, and signifi- cantly outperforms existing data-driven methods in both speed and storage. 3. Our network provides the first compact and continuous representation of hair geometry, from which different hairstyles can be smoothly sampled and interpolated. 4. We construct a large-scale database of around 40K 3D hair models and 160K corresponding rendered images. 32 3.1 Related Work Hair Digitization. A general survey of existing hair modeling techniques can be found in Ward et.al [136]. For experienced artists, purely manual editing from scratchwithcommercialsoftwaressuchasXGenandHairfarmischosenforhighest quality, flexibility and controllability, but the modeling of compelling and realistic hairstyles can easily take several weeks. To avoid tedious manipulations on indi- vidual hair fibers, some efficient design tools are proposed in [27, 75, 43, 148, 137]. Meanwhile, hair capturing methods have been introduced to acquire hairstyle data from the real world. Most hair capturing methods typically rely on high- fidelity acquisition systems, controlled recording sessions, manual assistance such as multi-view stereo cameras[102, 11, 68, 93, 38, 142, 61], single RGB-D camera [63] or thermal imaging [55]. More recently, Single-view hair digitization methods have been proposed by Chai et.al [23, 22] but can only roughly produce the frontal geometry of the hair. Hu et.al [62] later demonstrated the first system that can model entire hairstyles at the strand level using a database-driven reconstruction technique with minimal user interactions from a single input image. A follow-up automatic method has beenlaterproposedby[21], whichusesadeepneuralnetworkforhairsegmentation and augments a larger database for shape retrieval. 
To allow more flexible control of side and back views of the hairstyle, Zhang et.al [150] proposed a four-view image-based hair modeling method to fill the gap between multi-view and single- view hair capturing techniques. Since these methods rely on a large dataset for matching, speed is an issue and the final results depend highly on the database quality and diversity. 33 Single-View Reconstruction using Deep Learning. Generation of 3D data by deep neural networks has been attracting increasing attention recently. Volu- metric CNNs [28, 45, 126, 67] use 3D convolutional neural networks to generate voxelized shapes but are highly constrained by the volume resolution and com- putation cost of 3D convolution. Although techniques such as hierarchical recon- struction [50] and octree [115, 125, 135] could be used to improve the resolution, generating details like hair strands are still extremely challenging. On the other hand, point clouds scale well to high resolution due to their un- structured representation. [111, 113] proposed unified frameworks to learn features from point clouds for tasks like 3D object classification and segmentation, but not generation. Following the pioneering work of PointNet, [48] proposed the PCPNet to estimate the local normal and curvature from point sets, and [40] proposed a network for point set generation from a single image. However, point clouds still exhibit coarse structure and are not able to capture the topological structure of hair strands. 3.2 Method The entire pipeline contains three steps. A preprocessing step is first adopted to calculate the 2D orientation field of the hair region based on the automatically estimated hair mask. Then, HairNet takes the 2D orientation fields as input and generates hair strands represented as sequences of 3D points. A reconstruction step is finally performed to efficiently generate a smooth and dense hair model. 34 3.2.1 Preprocessing We first adopt PSPNet [151] to produce an accurate and robust pixel-wise hair mask of the input portrait image, followed by computing the undirected 2D orientation for each pixel of the hair region using a Gabor filter [93]. The use of undirected orientation eliminates the need of estimating the hair growth direction, which otherwise requires extra manual labeling [62] or learning [21]. However, the hair alone could be ambiguous due to the lack of camera view information and its scale and position with respect to the human body. Thus we also add the segmentation mask of the human head and body on the input image. In particular, the human head is obtained by fitting a 3D morphable head model to theface[64]andthebodycouldbepositionedaccordinglyviarigidtransformation. All these processes could be automated and run in real-time. The final output is a 3×256×256 image, whose first two channels store the color-coded hair orientation and third channel indicates the segmentation of hair, body and background. 3.2.2 Data Generation Similar to Hu et. al [62], we first collect an original hair dataset with 340 3D hair models from public online repositories [7], align them to the same reference head, convert the mesh into hair strands and solve the collision between the hair and the body. We then populate the original hair set via mirroring and pair-wise blending. 
Different from AutoHair [21], which simply uses volume boundaries to avoid unnatural combinations, we separate the hairs into 12 classes based on the styles shown in Table 3.1 and blend each pair of hairstyles within the same class to generate more natural examples. In particular, we cluster the strands of each hair into five central strands, so each pair of hairstyles can generate 2^5 - 2 additional combinations of central strands. The new central strands serve as guidance for blending the detailed hairs. Instead of using all of the combinations, we randomly select a subset of them for each hair pair, leading to a total of over 40K hairs in our synthetic hair dataset.

Table 3.1: Hair classes and the number of hairs in each class. S refers to short, M to medium, L to long, X to very (e.g., XL is very long), s to straight and c to curly. A hair is assigned to multiple classes if its style is ambiguous.

               XS     S     M     L    XL   XXL
straight (s)   20   110    28    29    27     4
curly (c)       0    19    65    27    23     1

In order to obtain the corresponding orientation images of each hair model, we randomly rotate and translate the hair inside the viewport of a fixed camera and render 4 orientation images from different views. The rotation ranges from -90° to +90° for the yaw axis and from -15° to +15° for the pitch and roll axes. We also add Gaussian noise to the orientations to emulate real conditions.

3.2.3 Hair Prediction Network

Hair Representation.

We represent each strand as an ordered 3D point set ζ = {s_i}, i = 0, ..., M, evenly sampled with a fixed number (M = 100 in our experiments) of points from the root to the end. Each sample s_i contains attributes of position p_i and curvature c_i. Although the strands have large variance in length, curliness, and shape, they all grow from fixed roots to flexible ends. To remove the variance caused by root positions, we represent each strand in a local coordinate frame anchored at its root.

The hair model can be treated as a set of N strands, H = ζ^N, with fixed roots, and can be formulated as an N×M matrix A, where each entry A_{i,j} = (p_{i,j}, c_{i,j}) represents the j-th sample point on the i-th strand. In particular, we adopt the method in [134] to parameterize the scalp as a 32×32 grid, and sample hair roots at the grid centers (N = 1024).

Figure 3.1: Network Architecture. The input orientation image is first encoded into a high-level hair feature vector, which is then decoded to 32×32 individual strand features. Each strand feature is further decoded to the final strand geometry, containing both sample positions and curvatures, via two multi-layer perceptron (MLP) networks.

Network Architecture.

As illustrated in Figure 3.1, our network first encodes the input image into a latent vector and then decodes the target hair strands from that vector. For the encoder, we use convolutional layers to extract the high-level features of the image. Different from the common practice of using a fully-connected layer as the last layer, we use 2D max-pooling to spatially aggregate the partial features (an 8×8 grid in total) into a global feature vector z. This greatly reduces the number of network parameters. The decoder generates the hair strands in two steps.
The hair feature vector z is first decoded into multiple strand-feature vectors z_i via deconvolutional layers, and each z_i can be further decoded into the final strand geometry ζ via the same multi-layer fully connected network. This multi-scale decoding mechanism allows us to efficiently produce denser hair models by interpolating the strand features. According to our experiments, this achieves a more natural appearance than directly interpolating the final strand geometry.

It is widely observed that generative neural networks often lose high-frequency details, as the low-frequency components tend to dominate the loss during training. Thus, apart from the 3D positions {p_i} of each strand, our strand decoder also predicts the curvatures {c_i} of all samples. With the curvature information, we can reconstruct the high-frequency strand details.

Loss Functions.

We apply three losses to our network. The first two are the L2 reconstruction losses of the 3D position and the curvature of each sample. The third is the collision loss between the output hair strands and the human body. To speed up the collision computation, we approximate the geometry of the body with four ellipsoids, as shown in Figure 3.2. Given a single-view image, the shape of the visible part of the hair is more reliable than the invisible part, e.g. the inner and back hair. Thus we assign adaptive weights to the samples based on their visibility: visible samples receive higher weights than invisible ones. The final loss function is given by:

\[
L = L_{pos} + \lambda_1 L_{curv} + \lambda_2 L_{col}.
\tag{3.1}
\]

L_{pos} and L_{curv} are the losses on the 3D positions and the curvatures respectively, written as:

\[
L_{pos} = \frac{1}{NM} \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} w_{i,j}\, \| p_{i,j} - p^*_{i,j} \|_2^2,
\qquad
L_{curv} = \frac{1}{NM} \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} w_{i,j}\, ( c_{i,j} - c^*_{i,j} )^2,
\qquad
w_{i,j} = \begin{cases} 10.0 & \text{if } s_{i,j} \text{ is visible}\\ 0.1 & \text{otherwise} \end{cases}
\tag{3.2}
\]

where p^*_{i,j} and c^*_{i,j} are the ground-truth position and curvature corresponding to p_{i,j} and c_{i,j}, and w_{i,j} is the visibility weight.

Figure 3.2: Ellipsoids for Collision Test.

The collision loss L_{col} is written as the sum of the collision errors on the four ellipsoids:

\[
L_{col} = \frac{1}{NM} \sum_{k=0}^{3} C_k
\tag{3.3}
\]

Each collision error is calculated as the sum of the distance of each collided point to the ellipsoid surface, weighted by the length of the strand segment that lies inside the ellipsoid:

\[
C_k = \sum_{i=0}^{N-1} \sum_{j=1}^{M-1} \| p_{i,j} - p_{i,j-1} \| \, \max(0, \mathrm{Dist}_k)
\tag{3.4}
\]

\[
\mathrm{Dist}_k = 1 - \frac{(p_{i,j,0} - x_k)^2}{a_k^2} - \frac{(p_{i,j,1} - y_k)^2}{b_k^2} - \frac{(p_{i,j,2} - z_k)^2}{c_k^2}
\tag{3.5}
\]

where ||p_{i,j} − p_{i,j−1}|| is the L1 distance between two adjacent samples on the strand, and x_k, y_k, z_k, a_k, b_k, and c_k are the model parameters of the k-th ellipsoid.

Training Details.

The training parameters of Equation 3.1 are fixed to λ_1 = 1.0 and λ_2 = 10^-4. During training, we rescale all hair models so that they are measured in metric units. We use ReLU for the nonlinear activations and Adam [76] for optimization, and run the training for 500 epochs with a batch size of 32 and a learning rate of 10^-4 multiplied by 2 after 250 epochs.

Figure 3.3: The orientation image (b) can be automatically generated from a real image (a) or from a synthesized hair model with 9K strands. The orientation map and a down-sampled hair model with 1K strands (c) are used to train the neural network.

3.2.4 Reconstruction

The output strands from the network may contain noise, and sometimes lose high-frequency details when the target hair is curly.
Thus, we further refine the smoothness and curliness of the hair. We first smooth the hair strands by using a Gaussian filter to remove the noise. Then, we compare the difference between the predicted curvatures and the curvatures of the output strands. If the difference is higher than a threshold, we add offsets to the strands samples. In particular, 40 we first construct a local coordinate frame at each sample with one axis along the tangent of the strand, then apply an offset function along the other two axises by applying the curve generation function described in the work of Zhou et. al [147]. Figure 3.4: Hair strand upsampling in the space of (b) the strand-features and (c) the final strand geometry. (d) shows the zoom-in of (c). The network only generates 1K hair strands, which is insufficient to render a high fidelity output. To obtain higher resolution, traditional methods build a 3D direction field from the guide strands and regrows strands using the direction field from a dense set of follicles. However, this method is time consuming and cannot be used to reconstruct an accurate hair model. Although directly interpolating the hair strands is fast, it can also produce an unnatural appearance. Instead, we bilinearly interpolate the intermediate strand featuresz i generated by our network and decode them to strands by using the perceptron network, which enables us to create hair models with arbitrary resolution. Figure 3.4 demonstrates that by interpolating in strand-feature space, we can generate a more plausible hair model. In contrast, direct interpolation of the final strands could lead to artifacts like collisions. This is easy to understand, as the strand-feature could be seen as a non-linear mapping of the strand, and could fall in a more plausible space. 41 Figure 3.5: Reconstruction with and without using curliness. Figure 3.5 demonstrates the effectiveness of adding curliness in our network. Without using the curliness as an extra constraint, the network only learns the dominant main growing direction while losing the high-frequency details. In this paper, we demonstrate all our results at a resolution of 9K to 30K strands. 3.3 Evaluation 3.3.1 Quantitative Results and Ablation Study In order to quantitatively estimate the accuracy of our method, we prepare a synthetic test set with 100 random hair models and 4 images rendered from random views for each hair model. We compute the reconstruction errors on both the visible and invisible part of the hair separately using the mean square distance between points and the collision error using equation 3.3. We compare our result with Chai et al.’s method [21]. Their method first queries for the nearest neighbor in the database and then performs a refinement process which globally deforms the hair using the 2D boundary constraints and the 2D orientation constraints based on the input image. To ensure the fairness and efficiency of the comparison, we use the same database in our training set for the nearest neighbor query of [21] based on the visible part of the hair, and set the resolution at 1000 strands. 42 We also compare with Hu et al.’s method [62] which requires manual strokes for generating the 3D hair model. But drawing strokes for the whole test set is too laborious, so in our test, we use three synthetic strokes randomly sampled from the ground-truth model as input. In Table 3.2, we show the error comparison with the nearest neighbor query results and the methods of both papers. 
We also perform an ablation test by respectively eliminating the visibility-adaptive weights, the collision loss and the curvature loss from our network.

From the experiments, we observe that our method outperforms all the ablated variants and Chai et al.'s method. Without the visibility-adaptive weights, the reconstruction error is about the same for both the visible and invisible parts, while the reconstruction error on the visible hair decreases by around 30% for all the networks that apply the visibility-adaptive weights. The curvature loss also helps decrease the mean square distance error of the reconstruction. The experiments further show that using the collision loss leads to much lower collision error. The nearest-neighbor results have zero collision error because the hairs in the database contain no collisions.

In Table 3.3, we compare the computation time and hard disk usage of our method and the data-driven method at a resolution of 9K strands. Our method is about three orders of magnitude faster and only uses a small amount of storage space. The reconstruction time differs between straight and curly hairstyles because for straight hairstyles, which have little curvature difference, we skip the process of adding curves.

3.3.2 Qualitative Results

To demonstrate the generality of our method, we tested it with different real portrait photographs as input, as shown in Figures B.3, B.4 and B.5.

Table 3.2: Reconstruction error comparison. The errors are measured in metric units. Pos Error refers to the mean square distance error between the ground-truth and the predicted hair. "-VAW" refers to eliminating the visibility-adaptive weights, "-Col" to eliminating the collision loss, and "-Curv" to eliminating the curvature loss. "NN" refers to the nearest-neighbor query based on the visible part of the hair.

                    Visible Pos Error   Invisible Pos Error   Collision Error
HairNet                   0.017               0.027             2.26 x 10^-7
HairNet - VAW             0.024               0.026             3.5 x 10^-7
HairNet - Col             0.019               0.027             3.26 x 10^-6
HairNet - Curv            0.020               0.029             3.3 x 10^-7
NN                        0.033               0.041             0
Chai et al. [21]          0.021               0.040             0
Hu et al. [62]            0.023               0.028             0

Table 3.3: Time and space complexity.

                    Preprocessing   Inference / NN query   Reconstruction / Refinement   Total time      Total space
Ours                    0.02 s            0.01 s                 0.01 - 0.05 s           0.04 - 0.08 s     70 MiB
Chai et al. [21]        3 s               10 s                   40 s                    53 s              1 TiB

Our method can handle different overall shapes (e.g. short hairstyles and long hairstyles). In addition, it can also reconstruct different levels of curliness within hairstyles (e.g. straight, wavy, and very curly) efficiently, since we learn the curliness as curvatures in the network and use them to synthesize the final strands.

In Figures 3.8 and 3.7, we compare our single-view hair reconstruction results with AutoHair [21]. We found that both methods can make a reasonable inference of the overall hair geometry in terms of length and shape, but the hair from our method preserves better local details and looks more natural, especially for curly hair. This is because Chai et al.'s method depends on the accuracy and precision of the orientation field generated from the input image, but the orientation field generated from many curly hair images is noisy and the wisps overlap with each other. In addition, they use helix fitting to infer the depth of the hair, but it may fail for very curly hairs, as shown in the second row of Figure 3.7.
Moreover, Chai et al.’s method can only refine the visible part of the hair, so the reconstructed hair may look unnatural from views other than the view of the in- put image, while the hair reconstructed with our method looks comparatively more coherent from additional views. [Wen et al. 2103] ours hairstyle A hairstyle B interpolation results Figure 3.6: Interpolation comparison. Input Images Ours Chai et al. Figure 3.7: Comparison with Autohair in different views. Figure B.2 show the interpolation results of our method. The interpolation is performed between four different hair styles and the result shows that our method 45 can smoothly interpolate hair between curly or straight and short or long hairs. We also compare interpolation with Weng et al.’s method [137]. In Figure 3.6, Weng et al.’s method produces a lot of artifacts while our method generates more natural and smooth results. The interpolation results indicate the effectiveness of our latent hair representation. We also show video tracking results (see Figure 3.9 and supplemental video). It shows that our output may fail to achieve sufficient temporal coherence. 3.4 Discussion We have demonstrated the first deep convolutional neural network capable of performing real-time hair generation from a single-view image. By training an end-to-end network to directly generate the final hair strands, our method can capture more hair details and achieve higher accuracy than current state-of-the- art. The intermediate 2D orientation field as our network input provides flexibility, which enables our network to be used for various types of hair representations, such as images, sketches and scans given proper preprocessing. By adopting a multi-scaledecodingmechanism, ournetworkcouldgeneratehairstyles of arbitrary resolution while maintaining a natural appearance. Thanks to the encoder-decoder architecture, our network provides a continuous hair representation, from which plausible hairstyles could be smoothly sampled and interpolated. We found that our approach fails to generate exotic hairstyles like kinky, afro or buzz cuts as shown in Figure 3.10. We think the main reason is that we do not have such hairstyles in our training database. Building a large hair dataset that covers more variations could mitigate this problem. Our method would also fail when the hair is partially occluded. Thus we plan to enhance our training in the future by 46 Input Images Ours Chai et al. Figure 3.8: Comparison with Autohair for local details. adding random occlusions. In addition, we use face detection to estimate the pose of the torso in this paper, but it can be replaced by using deep learning to segment the head and body. Currently, the generated hair model is insufficiently temporally coherent for video frames. Integrating temporal smoothness as a constraint for training is also an interesting future direction. Although our network provides a 47 more compact representation for the hair, there is no semantic meaning of such latent representation. It would be interesting to concatenate explicit labels (e.g. color) to the latent variable for controlled training. Frame 005 Frame 094 Frame 133 Frame 197 Frame 249 Figure 3.9: Hair tracking and reconstruction on video. Figure 3.10: Failure Cases. 48 Chapter 4 Rotation Representations Recently, there has been an increasing number of applications in graphics and vision, where deep neural networks are used to perform regressions on rotations. 
This has been done for tasks such as pose estimation from images [36, 139] and from point clouds [44], structure from motion [127], and skeleton motion synthesis, which generates the rotations of joints in skeletons [131]. Many of these works represent 3D rotations using 3D or 4D representations such as quaternions, axis- angles, or Euler angles. However, for 3D rotations, we found that 3D and 4D representations are not ideal for network regression, when the full rotation space is required. Empiri- cally, the converged networks still produce large errors at certain rotation angles. We believe that this actually points to deeper topological problems related to the continuity in the rotation representations. Informally, all else being equal, discon- tinuous representations should in many cases be “harder" to approximate by neural networks than continuous ones. Theoretical results suggest that functions that are smoother [143] or have stronger continuity properties such as in the modulus of continuity [141, 26] have lower approximation error for a given number of neurons. Based on this insight, we first present in Section 4.2 our definition of the con- tinuity of representation in neural networks. We illustrate this definition based on a simple example of 2D rotations. We then connect it to key topological concepts such as homeomorphism and embedding. 49 Next, we present in Section 4.3 a theoretical analysis of the continuity of ro- tation representations. We first investigate in Section 4.3.1 some discontinuous representations, such as Euler angle and quaternion representations. We show that for 3D rotations, all representations are discontinuous in four or fewer di- mensional real Euclidean space with the Euclidean topology. We then investigate in Section 4.3.2 some continuous rotation representations. For the n dimensional rotation groupSO(n), we present a continuousn 2 −n dimensional representation. We additionally present an option to reduce the dimensionality of this representa- tion by an additional 1 to n− 2 dimensions in a continuous way. We show that these allow us to represent 3D rotations continuously in 6D and 5D. While we focus on rotations, we show how our continuous representations can also apply to other groups such as orthogonal groups O(n) and similarity transforms. Finally, inSection4.4wetestourideasempirically. Weconductexperimentson 3D rotations and show that our 6D and 5D continuous representations always out- perform the discontinuous ones for several tasks, including a rotation autoencoder “sanity test," rotation estimation for 3D point clouds, and 3D human pose inverse kinematics learning. We note that in our rotation autoencoder experiments, dis- continuous representations can have up to 6 to 14 times higher mean errors than continuous representations. Furthermore they tend to converge much slower while still producing large errors over 170 ◦ at certain rotation angles even after conver- gence, which we believe are due to the discontinuities being harder to fit. This phenomenon can also be observed in the experiments on different rotation repre- sentations for homeomorphic variational auto-encoding in Falorsi et al. [39], and in practical applications, such as 6D object pose estimation in Xiang et al. [139]. We also show that one can perform direct regression on 3x3 rotation matrices. Empirically this approach introduces larger errors than our 6D representation as 50 shown in Section 4.4.2. 
Additionally, for some applications such as inverse and for- ward kinematics, it may be important for the network itself to produce orthogonal matrices. We therefore require an orthogonalization procedure in the network. In particular, if we use a Gram-Schmidt orthogonalization, we then effectively end up with our 6D representation. Our contributions are: 1) a definition of continuity for rotation representa- tions, which is suitable for neural networks; 2) an analysis of discontinuous and continuous representations for 2D, 3D, and n-D rotations; 3) new formulas for continuous representations of SO(3) and SO(n); 4) empirical results supporting our theoretical views and that our continuous representations are more suitable for learning. 4.1 Related Work In this section, we will first establish some context for our work in terms of neural network approximation theory. Next, we discuss related works that inves- tigate the continuity properties of different rotation representations. Finally, we will report the types of rotation representations used in previous learning tasks and their performance. Neural network approximation theory. We review a brief sampling of results from neural network approximation theory. Hornik [59] showed that neural networks can approximate functions in theL p space to arbitrary accuracy if theL p norm is used. Barron et al. [10] showed that if a function has certain properties in its Fourier transform, then at most O( −2 ) neurons are needed to obtain an order of approximation . Chapter 6.4.1 of LeCun et al. [82] provides a more thorough overviewofsuchresults. Wenotethatresultsforcontinuousfunctionsindicatethat 51 functions that have better smoothness properties can have lower approximation errorforaparticularnumberofneurons[141,26,143]. Fordiscontinuousfunctions, Llanas et al. [92] showed that a real and piecewise continuous function can be approximatedinanalmostuniformway. However, Llanasetal.[92]alsonotedthat piecewisecontinuousfunctionswhentrainedwithgradientdescentmethodsrequire many neurons and training iterations, and yet do not give very good results. These results suggest that continuous rotation representations might perform better in practice. Continuity for rotations. Grassia et al. [46] pointed out that Euler angles and quaternions are not suitable for orientation differentiation and integration operationsandproposedexponentialmapasamorerobustrotationrepresentation. Saxena et al. [120] observed that the Euler angles and quaternions cause learning problems due to discontinuities. However, they did not propose general rotation representations other than direct regression of 3x3 matrices, since they focus on learning representations for objects with specific symmetries. Neuralnetworksfor3Dshapeposeestimation. Deepnetworkshavebeen applied to estimate the 6D poses of object instances from RGB images, depth maps or scanned point clouds. Instead of directly predicting 3x3 matrices that may not correspond to valid rotations, they typically use more compact rotation representations such as quaternion [139, 74, 73] or axis-angle [127, 44, 36]. In PoseCNN [139], the authors reported a high percentage of errors between 90 ◦ and 180 ◦ , and suggested that this is mainly caused by the rotation ambiguity for some symmetric shapes in the test set. However, as illustrated in their paper, the proportion of errors between 90 ◦ to 180 ◦ is still high even for non-symmetric shapes. 
In this paper, we argue that discontinuity in these representations could be one cause of such errors.

Neural networks for inverse kinematics. Recently, researchers have been interested in training neural networks to solve inverse kinematics equations. This is because such networks are faster than traditional methods and differentiable, so that they can be used in more complex learning tasks such as motion re-targeting [131] and video-based human pose estimation [72]. Most of these works represented rotations using quaternions or axis-angle [60, 72]. Some works also used other 3D representations such as Euler angles and Lie algebra [72, 153], and penalized the joint position errors. Csiszar et al. [31] designed networks to output the sine and cosine of the Euler angles for solving inverse kinematics problems in robotic control. Euler angle representations are discontinuous for SO(3) and can result in large regression errors, as shown in the empirical test in Section 4.4. However, those authors limited the rotation angles to a certain range, which avoided the discontinuity points and thus achieved very low joint alignment errors in their test. Many real-world tasks, however, require the networks to output the full range of rotations. In such cases, continuous rotation representations are a better choice.

4.2 Definition of Continuous Representation

In this section, we begin by defining the terminology we will use in the paper. Next, we analyze a simple motivating example of 2D rotations. This allows us to develop our general definition of continuity of representation in neural networks. We then explain how this definition of continuity is related to concepts in topology.

Terminology. To denote a matrix, we typically use M, and M_{ij} refers to its (i,j) entry. We use the term SO(n) to denote the special orthogonal group, the space of n-dimensional rotations. This group is defined on the set of n×n real matrices with MM^T = M^T M = I and det(M) = 1. The group operation is multiplication, which results in the concatenation of rotations. We denote the n-dimensional unit sphere as S^n = {x ∈ R^{n+1} : ||x|| = 1}.

Figure 4.1: A simple 2D example, which motivates our definition of continuity of representation. See Section 4.2 for details.

Motivating example: 2D rotations. We now consider the representation of 2D rotations. Any 2D rotation M ∈ SO(2) can be expressed as the matrix

\[
M = \begin{bmatrix} \cos(\theta) & -\sin(\theta)\\ \sin(\theta) & \cos(\theta) \end{bmatrix}
\tag{4.1}
\]

We can represent any rotation matrix M ∈ SO(2) by choosing θ ∈ R, where R is a suitable set of angles, for example R = [0, 2π]. However, this particular representation intuitively has a problem with continuity. The problem is that if we define a mapping g from the original space SO(2) to the angular representation space R, then this mapping is discontinuous. In particular, the limit of g at the identity matrix, which represents zero rotation, is undefined: one directional limit gives an angle of 0 and the other gives 2π. We depict this problem visually in Figure 4.1. On the right, we visualize a connected set of rotations C ⊂ SO(2) by plotting their first column vector [cos(θ), sin(θ)]^T on the unit sphere S^1. On the left, after mapping them through g, we see that the angles are disconnected.
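The following small NumPy check (ours, purely illustrative) makes the jump explicit: the angle mapping is evaluated on rotations just below and just above the identity, alongside the first-column mapping, which varies smoothly.

```python
import numpy as np

def rot2d(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def g_angle(M):
    # Angle in [0, 2*pi): discontinuous at the identity rotation.
    return np.arctan2(M[1, 0], M[0, 0]) % (2 * np.pi)

def g_column(M):
    # First column [cos(theta), sin(theta)]: continuous everywhere.
    return M[:, 0]

# Walk through rotations slightly below and above zero rotation.
for t in [-0.02, -0.01, 0.0, 0.01, 0.02]:
    M = rot2d(t)
    print(f"theta={t:+.2f}  angle rep={g_angle(M):6.3f}  column rep={g_column(M)}")
# The angle representation jumps from about 6.27 to about 0.01 across the
# identity, while the column representation changes smoothly.
```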
In particular, we say that this representation is discontinuous because the mapping g from the original space to the representation space is discontinuous. We argue 54 Representation Space R Input signal Original Space X Mapping f Neural Network Mapping g Figure 4.2: Our definition of continuous representation, as well as how it can apply in a neural network. See the body for details. that these kind of discontinuous representations can be harder for neural networks to fit. Contrarily, if we represent the 2D rotation M∈ SO(2) by its first column vector [cos(θ), sin(θ)] T , then the representation would be continuous. Continuous representation: We can now define what we consider a contin- uous representation. We illustrate our definitions graphically in Figure 4.2. Let R be a subset of a real vector space equipped with the Euclidean topology. We call R the representation space: in our context, a neural network produces an inter- mediate representation in R. This neural network is depicted on the left side of Figure 4.2. We will come back to this neural network shortly. LetX be a compact topological space. We call X the original space. In our context, any intermediate representation inR produced by the network can be mapped into the original space X. Define the mapping to the original space f :R→X, and the mapping to the representation space g : X → R. We say (f,g) is a representation if for every x∈ X,f(g(x)) = x, that is, f is a left inverse of g. We say the representation is continuous if g is continuous. Connection with neural networks: We now return to the neural network on the left side of Figure 4.2. We imagine that inference runs from left to right. Thus, the neural network accepts some input signals on its left hand side, outputs a representation inR, and then passes this representation through the mapping f to get an element of the original spaceX. Note that in our context, the mapping f is implemented as a mathematical function that is used as part of the forward pass 55 of the network at both training and inference time. Typically, at training time, we might impose losses on the original spaceX. We now describe the intuition behind why we ask that g be continuous. Suppose that we have some connected set C in the original space, such as the one shown on the right side of Figure 4.1. Then if we map C into representation space R, and g is continuous, then the set g(C) will remain connected. Thus, if we have continuous training data, then this will effectively create a continuous training signal for the neural network. Contrarily, if g is not continuous, as shown in Figure 4.1, then a connected set in the original space may become disconnected in the representation space. This could create a discontinuous training signal for the network. We note that the units in the neural network are typically continuous, as defined on Euclidean topology spaces. Thus, we require the representation space R to have Euclidean topology because this is consistent with the continuity of the network units. Domain of the mapping f: We additionally note that for neural networks, it is specifically beneficial for the mappingf to be defined almost everywhere on a set where the neural network outputs are expected to lie. This enables f to map arbitrary representations produced by the network back to the original space X. Connection with topology: Suppose that (f,g) is a continuous representa- tion. Note that g is a continuous one-to-one function from a compact topological space to a Hausdorff space. 
From a theorem in topology [79], this implies that if we restrict the codomain of g to g(X) (and use the subspace topology for g(X)), then the resulting mapping is a homeomorphism. A homeomorphism is a continuous bijection with a continuous inverse. For geometric intuition, a homeomorphism is often described as a continuous and invertible stretching and bending of one space to another, with a finite number of cuts also allowed if one later glues back together the same points. One says that two spaces are topologically equivalent if there is a homeomorphism between them. Additionally, g is a topological embedding of the original space X into the representation space R. Note that we also have the inverse of g: if we restrict f to the domain g(X), then the resulting function f|_{g(X)} is simply the inverse of g. Conversely, if the original space X is not homeomorphic to any subset of the representation space R, then there is no possible continuous representation (f,g) on these spaces. We will return to this later when we show that there is no continuous representation for the 3D rotations in four or fewer dimensions.

4.3 Rotation Representation Analysis

Here we provide examples of rotation representations that could be used in networks. We start by looking in Section 4.3.1 at some discontinuous representations for 3D rotations, then look in Section 4.3.2 at continuous rotation representations in n dimensions, and show how, for the 3D rotations, these become 6D and 5D continuous rotation representations. We believe this analysis can help one choose suitable rotation representations for learning tasks.

4.3.1 Discontinuous Representations

Case 1: Euler angle representation for the 3D rotations. Let the original space X = SO(3), the set of 3D rotations. Then we can easily show discontinuity in an Euler angle representation by considering the azimuth angle θ and reducing this to the motivating example for 2D rotations shown in Section 4.2. In particular, the identity rotation I occurs at a discontinuity, where one directional limit gives θ = 0 and the other directional limit gives θ = 2π. We visualize the discontinuities in this representation, and in all the other representations, in the supplemental Section C.6.

Case 2: Quaternion representation for the 3D rotations. Define the original space X = SO(3) and the representation space Y = R^4, which we use to represent the quaternions. We can now define the mapping to the representation space as

\[
g_q(M) = \begin{cases}
\left[\, M_{32} - M_{23},\; M_{13} - M_{31},\; M_{21} - M_{12},\; t \,\right]^T & \text{if } t \neq 0\\[4pt]
\left[\, \sqrt{M_{11}+1},\; c_2\sqrt{M_{22}+1},\; c_3\sqrt{M_{33}+1},\; 0 \,\right]^T & \text{if } t = 0
\end{cases}
\tag{4.2}
\]

\[
t = \mathrm{Tr}(M) + 1, \qquad
c_i = \begin{cases} 1 & \text{if } M_{1,i} + M_{i,1} > 0\\ -1 & \text{otherwise} \end{cases}
\tag{4.3}
\]

Likewise, one can define the mapping to the original space SO(3) as in [1]:

\[
f_q([x_0, y_0, z_0, w_0]) = \begin{bmatrix}
1 - 2y^2 - 2z^2 & 2xy - 2zw & 2xz + 2yw\\
2xy + 2zw & 1 - 2x^2 - 2z^2 & 2yz - 2xw\\
2xz - 2yw & 2yz + 2xw & 1 - 2x^2 - 2y^2
\end{bmatrix},
\qquad
(x, y, z, w) = N([x_0, y_0, z_0, w_0])
\tag{4.4}
\]

Here the normalization function is defined as N(q) = q/||q||. By expanding in terms of the axis-angle representation of the matrix M, one can verify that f_q(g_q(M)) = M for every M ∈ SO(3).

However, we find that the representation is not continuous. Geometrically, this can be seen by taking different directional limits around the matrices with 180 degree rotations, which are defined by R_π = {M ∈ SO(3) : Tr(M) = −1}. Specifically, in the top case of Equation (4.2), where t ≠ 0, the limit of g_q as we approach a 180 degree rotation is [0, 0, 0, 0], and meanwhile, the first three coordinates of g_q(r) for r ∈ R_π are nonzero.
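A short NumPy sketch of Equations (4.2) to (4.4) (our own transcription, for illustration only) shows this jump numerically: as the rotation angle approaches 180 degrees about the z-axis, the top case of g_q shrinks toward the zero vector, while at exactly 180 degrees the bottom case returns a vector with a nonzero third coordinate.

```python
import numpy as np

def f_q(q):
    """Quaternion (x, y, z, w) -> rotation matrix, Eq. (4.4), after normalization."""
    x, y, z, w = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*y*y - 2*z*z, 2*x*y - 2*z*w,     2*x*z + 2*y*w],
        [2*x*y + 2*z*w,     1 - 2*x*x - 2*z*z, 2*y*z - 2*x*w],
        [2*x*z - 2*y*w,     2*y*z + 2*x*w,     1 - 2*x*x - 2*y*y]])

def g_q(M, eps=1e-12):
    """Rotation matrix -> 4-vector of Eq. (4.2) (unnormalized; f_q normalizes)."""
    t = np.trace(M) + 1.0
    if abs(t) > eps:
        return np.array([M[2, 1] - M[1, 2], M[0, 2] - M[2, 0], M[1, 0] - M[0, 1], t])
    c2 = 1.0 if M[0, 1] + M[1, 0] > 0 else -1.0
    c3 = 1.0 if M[0, 2] + M[2, 0] > 0 else -1.0
    return np.array([np.sqrt(M[0, 0] + 1), c2 * np.sqrt(M[1, 1] + 1),
                     c3 * np.sqrt(M[2, 2] + 1), 0.0])

# Approaching a 180-degree rotation about z, the top case of g_q shrinks toward
# the zero vector; exactly at 180 degrees the bottom case returns a vector whose
# third entry has magnitude sqrt(2).
for angle in [np.pi - 1e-3, np.pi]:
    c, s = np.cos(angle), np.sin(angle)
    M = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    print(np.round(g_q(M), 4))
```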
Note that our definition of continuous representation from Section 4.2 requires a Euclidean topology for the representation space Y, in contrast to the usual topology for the quaternions, the real projective space RP^3, which we discuss in the next paragraph. In a similar way, we can show that other popular representations for the 3D rotations, such as axis-angle, have discontinuities; for example, the axis in axis-angle has discontinuities at the 180 degree rotations.

Representations for the 3D rotations are discontinuous in four or fewer dimensions. The 3D rotation group SO(3) is homeomorphic to the real projective space RP^3. The space RP^n is defined as the quotient space of R^{n+1} \ {0} under the equivalence relation x ∼ λx for all λ ≠ 0. In a graphics and vision context, it may be most intuitive to think of RP^3 as the homogeneous coordinates in R^4, with the previous equivalence relation used to construct an appropriate topology via the quotient space. Based on standard embedding and non-embedding results in topology [33], we know that RP^3 (and thus SO(3)) embeds in R^5 with the Euclidean topology, but does not embed in R^d for any d < 5. By the definition of embedding, there is no homeomorphism from SO(3) to any subset of R^d for any d < 5, but a continuous representation requires this. Thus, there is no such continuous representation.

4.3.2 Continuous Representations

In this section, we develop two continuous representations for the n-dimensional rotations SO(n). We then explain how, for the 3D rotations SO(3), these become 6D and 5D continuous rotation representations.

Case 3: Continuous representation with n^2 − n dimensions for the n-dimensional rotations. The rotation representations we have considered thus far are all not continuous. One possibility to make a rotation representation continuous would be to just use the identity mapping, but this would result in matrices of size n×n for the representation, which can be excessive, and would still require orthogonalization, such as a Gram-Schmidt process in the mapping f to the original space, if we want to ensure that network outputs end up back in SO(n). Based on this observation, we propose to perform an orthogonalization process in the representation itself. Let the original space X = SO(n), and the representation space be R = R^{n×(n−1)} \ D (D will be defined shortly). Then we can define a mapping g_{GS} to the representation space that simply drops the last column vector of the input matrix:

\[
g_{GS}\!\left( \begin{bmatrix} a_1 & \cdots & a_n \end{bmatrix} \right) = \begin{bmatrix} a_1 & \cdots & a_{n-1} \end{bmatrix},
\tag{4.5}
\]

where a_i, i = 1, 2, ..., n are column vectors. We note that the set g_{GS}(X) is a Stiefel manifold [2]. For the mapping f_{GS} to the original space, we can define the following Gram-Schmidt-like process:

\[
f_{GS}\!\left( \begin{bmatrix} a_1 & \cdots & a_{n-1} \end{bmatrix} \right) = \begin{bmatrix} b_1 & \cdots & b_n \end{bmatrix}
\tag{4.6}
\]

\[
b_i = \begin{cases}
N(a_1) & \text{if } i = 1\\[3pt]
N\!\left( a_i - \sum_{j=1}^{i-1} (b_j \cdot a_i)\, b_j \right) & \text{if } 2 \le i < n\\[3pt]
\det \begin{bmatrix}
e_1 & \\
\vdots & \; b_1 \ \cdots \ b_{n-1}\\
e_n &
\end{bmatrix}^{T} & \text{if } i = n
\end{cases}
\tag{4.7}
\]

Here N(·) denotes the same normalization function as before, and e_1, ..., e_n are the n canonical basis vectors of the Euclidean space. The only difference between f_{GS} and an ordinary Gram-Schmidt process is that the last column is computed by a generalization of the cross product to n dimensions. Clearly, g_{GS} is continuous. To check that f_{GS}(g_{GS}(M)) = M for every M ∈ SO(n), we can use induction and the properties of the orthonormal basis vectors in the columns of M to show that the Gram-Schmidt process does not modify the first n − 1 columns. Lastly, we can use theorems for the generalized cross product, such as Theorem 5.14.7 of Bloom [13], to show that the last column of f_{GS}(g_{GS}(M)) agrees with M. Finally, we can define the set D as the set where the above Gram-Schmidt-like process does not map back into SO(n): specifically, this is where the span of the n − 1 input vectors has dimension less than n − 1.

6D representation for the 3D rotations: For the 3D rotations, Case 3 gives us a 6D representation. The generalized cross product for b_n in Equation (4.7) simply reduces to the ordinary cross product b_1 × b_2. We give the detailed equations in Section C.2 of the supplemental document. We specifically note that using our 6D representation in a network can be beneficial because the mapping f_{GS} in Equation (4.7) ensures that the resulting 3x3 matrix is orthogonal. In contrast, suppose a direct prediction of 3x3 matrices is used. Then the orthogonalization can be done either in-network or as a postprocess. If orthogonalization is done in-network, the last 3 components of the matrix will be discarded by the Gram-Schmidt process in Equation (4.7), so the 3x3 matrix representation is effectively our 6D representation plus 3 useless parameters. If orthogonalization is done as a postprocess, then this prevents certain applications such as forward kinematics, and the error is also higher, as shown in Section 4.4.
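For n = 3, the mapping f_GS of Equation (4.7) reduces to two normalizations and one cross product, and g_GS simply drops the third column. Below is a minimal PyTorch sketch of this 6D representation; the function names are ours, and the thesis gives the exact 3D equations in supplemental Section C.2.

```python
import torch

def rotation_matrix_from_6d(x):
    """Sketch: map a batch of 6D representations (two 3D vectors a1, a2, stored
    in the last dimension) to 3x3 rotation matrices via the Gram-Schmidt-like
    process of Eq. (4.7)."""
    a1, a2 = x[..., 0:3], x[..., 3:6]
    b1 = torch.nn.functional.normalize(a1, dim=-1)
    b2 = torch.nn.functional.normalize(
        a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack((b1, b2, b3), dim=-1)  # columns b1, b2, b3

def rep_6d_from_rotation_matrix(R):
    """g_GS for n = 3: drop the last column and flatten the first two."""
    return torch.cat((R[..., 0], R[..., 1]), dim=-1)
```

In a pose-regression network, the final linear layer would simply output six numbers per rotation and rotation_matrix_from_6d would be applied inside the forward pass, so that losses can be placed directly on valid rotation matrices.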
Lastly, we can use theorems for the generalized cross product such as Theorem 5.14.7 of Bloom [13], to show that the last component of f GS (g GS (M)) agrees with M. Finally, wecandefinethesetD asthatwheretheaboveGram-Schmidt-likeprocess does not map back to SO(n): specifically, this is where the dimension of the span of the n− 1 vectors input to g is less than n− 1. 6Drepresentationforthe3Drotations: Forthe3Drotations, Case3gives us a 6D representation. The generalized cross product forb n in Equation (4.7) sim- ply reduces to the ordinary cross product b 1 ×b 2 . We give the detailed equations in Section C.2 in the supplemental document. We specifically note that using our 6D representation in a network can be beneficial because the mapping f GS in Equation (4.7) ensures that the resulting 3x3 matrix is orthogonal. In contrast, suppose a direct prediction for 3x3 matrices is used. Then either the orthogonal- ization can be done in-network or as a postprocess. If orthogonalization is done 61 N 0 p p′ x y Figure 4.3: An illustration of stereographic projection in 2D. We are given as input a point p on the unit sphere S 1 . We construct a ray from a fixed projection point N 0 = (0, 1) through p and find the intersection of this ray with the plane y = 0. The resulting point p 0 is the stereographic projection of p. in network, the last 3 components of the matrix will be discarded by the Gram- Schmidt process in Equation (4.7), so the 3x3 matrix representation is effectively our 6D representation plus 3 useless parameters. If orthogonalization is done as a postprocess, then this prevents certain applications such as forward kinematics, and the error is also higher as shown in Section 4.4. Group operations such as multiplication: Suppose that the original space is a group such as the rotation group, and we want to multiply two representations r 1 ,r 2 ∈R. In general, we can do this by first mapping to the original space, multi- plying the two elements, and then mapping back: r 1 r 2 =g(f(r 1 )f(r 2 )). However, for the proposed representation here, we can gain some computational efficiency as follows. Since the mapping to the representation space in Equation (4.5) drops the last column, when computing f(r 2 ), we can simply drop the last column and compute the product representation as the product of ann×n and ann× (n− 1) matrix. Case4: Furtherreducingthedimensionalityforthendimensionalro- tations. Forn≥ 3dimensions, wecanreducethedimensionfortherepresentation in the previous case, while still keeping a continuous representation. Intuitively, a lower dimensional representation that is less redundant could be easier to learn. However, we found in our experiments that the dimension-reduced representation 62 does not outperform the Gram-Schmidt-like representation from Case 3. However, we still develop this representation because it allows us to show that continuous rotation representations can outperform discontinuous ones. We show that we can perform such dimension reduction using one or more stereographic projections combined with normalization. We show an illustration of a 2D stereographic projection in Figure 4.3, which can be easily generalized to higher dimensions. Let us first normalize the input point, so it projects to a sphere, and then stereographically project the result using a projection point of (1, 0,..., 0). We call this combined operation a normalized projection, and define it as P :R m →R m−1 : P (u) = v 2 1−v 1 , v 3 1−v 1 , ..., vm 1−v 1 T , v =u/||u||. 
Now define a function Q: R^{m−1} → R^m, which performs a stereographic un-projection:

Q(u) = \frac{1}{\|u\|} \left( \tfrac{1}{2}\left(\|u\|^2 - 1\right),\; u_1,\; \ldots,\; u_{m-1} \right)^{T}    (4.9)

Note that the un-projection does not actually map back onto the sphere, but is instead normalized so that coordinates 2 through m form a unit vector. Now we can use between 1 and n − 2 normalized projections on the representation from the previous case, while still preserving continuity and one-to-one behavior.

For simplicity, we will first demonstrate the case of one stereographic projection. The idea is that we can flatten the representation from Case 3 to a vector and then stereographically project the last n + 1 components of that vector. Note that we intentionally project as few components as possible, since we found that nonlinearities introduced by the projection can make the learning process more difficult. These nonlinearities are due to the square terms and the division in Equation (4.9). If u is a vector of length m, define the slicing notation u_{i:j} = (u_i, u_{i+1}, ..., u_j) and u_{i:} = u_{i:m}. Let M_{(i)} be the ith column of matrix M. Define a vectorized representation γ(M) by dropping the last column of M as in Equation (4.5): γ(M) = [M^T_{(1)}, ..., M^T_{(n−1)}]. Now we can define the mapping to the representation space as:

g_P(M) = \left[ \gamma_{1:n^2-2n-1},\; P(\gamma_{n^2-2n:}) \right]    (4.10)

Here we have dropped the implicit argument M to γ for brevity. Define the mapping to the original space as:

f_P(u) = f_{GS}\!\left( \left[ u_{1:n^2-2n-1},\; Q(u_{n^2-2n:}) \right]^{(n\times(n-1))} \right)    (4.11)

Here the superscript (n × (n − 1)) indicates that the vector is reshaped to a matrix of the specified size before going through the Gram-Schmidt function f_GS. We can now see why Equation (4.9) is normalized the way it is: this ensures that projection followed by un-projection preserves the unit length of the basis vector that is a column of M, and also correctly recovers the first component of Q(·), so we get f_P(g_P(M)) = M for all M ∈ SO(n). We can show that g_P is defined on its domain and continuous by using properties of the orthonormal basis produced by the Gram-Schmidt process g_GS. For example, we can show that γ ≠ 0 because components 2 through n + 1 of γ form one of the orthonormal basis vectors, and N(γ) will never be equal to the projection point [1, 0, ..., 0] from which the stereographic projection is made. It can also be shown that f_P(g_P(M)) = M for all M ∈ SO(n). We show some of these details in the supplemental material.

As a special case, for the 3D rotations, this gives us a 5D representation. This representation is made by using the 6D representation from Case 3, flattening it to a vector, and then applying a normalized projection to the last 4 dimensions.
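The normalized projection P of Equation (4.8), the un-projection Q of Equation (4.9), and the resulting 5D representation for SO(3) can be sketched as follows (our own illustrative NumPy code under the same assumptions as the earlier 6D sketch; repr_to_rotation is the f_GS function defined there):

import numpy as np

def normalized_projection(u):
    # P: R^m -> R^(m-1); normalize, then stereographically project from (1, 0, ..., 0).
    v = u / np.linalg.norm(u)
    return v[1:] / (1.0 - v[0])

def unprojection(u):
    # Q: R^(m-1) -> R^m; coordinates 2 through m of the output form a unit vector.
    norm = np.linalg.norm(u)
    return np.concatenate([[0.5 * (norm ** 2 - 1.0)], u]) / norm

def rotation_to_5d(m):
    # g_P for n = 3: flatten the 6D representation and project its last 4 entries.
    gamma = m[:, :2].T.flatten()
    return np.concatenate([gamma[:2], normalized_projection(gamma[2:])])

def five_d_to_rotation(r5):
    # f_P for n = 3: un-project the last 3 entries, then apply f_GS (repr_to_rotation above).
    u = np.concatenate([r5[:2], unprojection(r5[2:])])
    return repr_to_rotation(u)

Because the last four entries of the flattened 6D representation contain the unit second column of M, the normalization built into Q recovers them exactly, so five_d_to_rotation(rotation_to_5d(M)) returns M for rotation matrices M.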
[Figure 4.4: An illustration of how n − 2 normalized projections can be made to reduce the dimensionality of the representation of SO(n) from Case 3 by n − 2. In each row we show the dimension n and the elements of the vectorized representation γ(M), which contains the first n − 1 columns of M ∈ SO(n). Each column has length n, and the columns are grouped by the thick black rectangles. Each unique color specifies a group of inputs for the "normalized projection" of Equation (4.8). The white regions are not projected.]

We can actually make up to n − 2 projections in a similar manner, while maintaining continuity of the representation, as follows. As a reminder, the length of γ, the vectorized result of the Gram-Schmidt process, is n(n − 1): it contains n − 1 basis vectors, each of dimension n. Thus, we can make n − 2 projections, where each projection i = 1, ..., n − 2 selects the basis vector i + 1 from γ(M), prepends to it an appropriately selected element from the first basis vector of γ(M), such as γ_{n+1−i}, and then projects the result. The resulting projections are then concatenated as a row vector, along with the two unprojected entries, to form the representation. Thus, after doing n − 2 projections, we can obtain a continuous representation for SO(n) in n^2 − 2n + 2 dimensions. See Figure 4.4 for a visualization of the grouping of the elements that can be projected.

Other groups: O(n), similarity transforms, quaternions. In this paper, we focus mainly on the representation of rotations. However, we note that the preceding representations can easily be generalized to O(n), the group of orthogonal n × n matrices M with M M^T = M^T M = I. We can also generalize to the similarity transforms, which we denote as Sim(n), defined as the affine maps ρ(x) on R^n, ρ(x) = αRx + u, where α > 0, R is an n × n orthogonal matrix, and u ∈ R^n [5]. For the orthogonal group O(n), we can use any of the representations in Case 3 or 4, but with an additional component in the representation that indicates whether the determinant is +1 or −1. Then the Gram-Schmidt process in Equation (4.7) needs to be modified slightly: if the determinant is −1, the last vector b_n is negated. Meanwhile, for the similarity transforms, the translation component u can be represented as-is. The matrix component αR of the similarity transform can be represented using any of the options in Case 3 or 4. The only needed change is that the Gram-Schmidt process in Equation (4.7) or (4.11) should multiply the final resulting matrix by α. The term α is simply the norm of any of the basis vectors input to the Gram-Schmidt process, e.g. ||a_1|| in Equation (4.7). Clearly, if the projections of Case 4 are used, at least one basis vector must remain unprojected so that α can be determined. In the supplemental material, we also explain how one might adapt an existing network that outputs 3D or 4D rotation representations so that it can use our 6D or 5D representations.

4.4 Empirical Results

We investigated different rotation representations and found that those with better continuity properties work better for learning. We first performed a sanity test and then experimented on two real-world applications to show how the continuity properties of rotation representations influence the learning process.

4.4.1 Sanity Test

We first perform a sanity test using an auto-encoder structure. We use a multi-layer perceptron (MLP) network as an encoder to map SO(3) to the chosen representation R. We test our proposed 6D and 5D representations, quaternions, axis-angle, and Euler angles.

[Figure 4.5: Empirical results. Panels (a), (d), (g) show mean errors during training; panels (b), (e), (h) plot on the x axis a percentile p and on the y axis the error at that percentile; panels (c), (f), (i) report the errors at the end of training, reproduced in the tables below.

Table (c), sanity test, errors at 500k iterations:
                Mean(°)   Max(°)    Std(°)
  6D            0.49      1.98      0.27
  5D            0.49      1.99      0.27
  Quat          3.32      179.93    5.97
  AxisA         3.69      179.22    5.99
  Euler         6.98      179.95    17.31

Table (f), 3D point cloud pose estimation test, errors at 2600k iterations:
                Mean(°)   Max(°)    Std(°)
  6D            2.85      179.83    9.16
  5D            4.78      179.87    12.25
  Quat          9.03      179.66    16.33
  AxisA         11.93     179.7     21.35
  Euler         14.13     179.67    23.8
  Matrix        4.21      180.0     9.44

Table (i), human body inverse kinematics test, errors at 1960k iterations:
                Mean(cm)  Max(cm)   Std(cm)
  6D            1.9       28.7      1.2
  5D            2.0       33.3      1.4
  Quat          3.3       87.1      3.1
  AxisA         3.0       120.0     2.3
  Euler         2.7       48.7      2.1
  Matrix        22.9      53.6      4.0]
The encoder network contains four fully-connected layers, where the hidden layers have 128 neurons and Leaky ReLU activations. The fixed "decoder" mapping f: R → SO(3) is defined in Section 4.3. For training, we compute the loss using the L2 distance between the input SO(3) matrix M and the output SO(3) matrix M'; note that this loss is invariant to the particular representation used, such as quaternions, axis-angle, etc. We use Adam optimization with batch size 64 and a learning rate of 10^{-5} for the first 10^4 iterations and 10^{-6} for the remaining iterations. For sampling the input rotation matrices during training, we uniformly sample axes and angles. We test the networks using 10^5 rotation matrices generated by randomly sampling axes and angles, and calculate geodesic errors between the input and output rotation matrices. The geodesic error is defined as the minimal angular difference between two rotations, written as

L_{angle} = \cos^{-1}\!\left( \frac{M''_{00} + M''_{11} + M''_{22} - 1}{2} \right)    (4.12)

M'' = M M'^{-1}    (4.13)

Figure 4.5(a) shows the mean geodesic errors for different representations as training progresses, and Figure 4.5(b) shows the percentiles of the errors at 500k iterations. The results show that the 6D and 5D representations perform similarly to each other. They converge much faster than the other representations and produce the smallest mean, maximum and standard deviation of errors. The Euler angle representation performs the worst, as shown in Table (c) in Figure 4.5. For the quaternion, axis-angle and Euler angle representations, the majority of the errors fall under 25°, but certain test samples still produce errors of up to 180°. The proposed 6D and 5D representations do not produce errors higher than 2°. We conclude that using continuous rotation representations for network training leads to lower errors and faster convergence. In Appendix C.7.2, we report additional results, where we trained using a geodesic loss, uniformly sampled SO(3), and compared to 3D Rodrigues vectors and quaternions constrained to one hemisphere [73]. Again, our continuous representations outperform common discontinuous ones.
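For reference, the geodesic error of Equations (4.12) and (4.13) can be computed as in the following PyTorch-style sketch (our own code; the clamp guards against values that fall slightly outside [-1, 1] due to floating-point error):

import torch

def geodesic_error(m, m_prime):
    # m, m_prime: batches of rotation matrices with shape (B, 3, 3)
    m_dd = torch.matmul(m, m_prime.transpose(1, 2))           # M'' = M M'^{-1}; for rotations the inverse is the transpose
    trace = m_dd[:, 0, 0] + m_dd[:, 1, 1] + m_dd[:, 2, 2]
    cos_angle = torch.clamp((trace - 1.0) / 2.0, -1.0, 1.0)
    return torch.acos(cos_angle)                               # minimal angular difference, in radians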
4.4.2 Pose Estimation for 3D Point Clouds

In this experiment, we test different rotation representations on the task of estimating the rotation of a target point cloud from a reference point cloud. The inputs to the networks are the reference and target point clouds P_r, P_t ∈ R^{N×3}, where N is the number of points. The network output is the estimated rotation R ∈ R^D between P_r and P_t, where D is the dimension of the chosen representation. We employ a weight-sharing Siamese network where each half is a simplified PointNet structure [112], Φ: R^{N×3} → R^{1024}. The simplified PointNet uses a 4-layer MLP of size 3×64×128×1024 to extract features for each point and then applies max pooling across all points to produce a single feature vector z. One half of the Siamese network maps the reference point cloud to a feature vector z_r = Φ(P_r), and the other half maps the target point cloud to z_t = Φ(P_t). We then concatenate z_r and z_t and pass the result through another MLP of size 2048×512×512×D to produce the D dimensional rotation representation. Finally, we transform the rotation representation to SO(3) with one of the mapping functions f defined in Section 4.3.

We train the network with 2,290 airplane point clouds from ShapeNet [24], and test it with 400 held-out point clouds augmented with 100 random rotations. At each training iteration, we randomly select a reference point cloud and transform it with 10 randomly-sampled rotation matrices to get 10 target point clouds. We feed the paired reference-target point clouds into the Siamese network and minimize the L2 loss between the output and the ground-truth rotation matrices. We trained the network for 2.6×10^6 iterations. Plot (d) in Figure 4.5 shows the mean geodesic errors as training progresses, and plot (e) and Table (f) show the percentile, mean, max and standard deviation of errors. Again, the 6D representation has the lowest mean and standard deviation of errors, with around 95% of errors lower than 5°, while the Euler representation is the worst, with around 10% of errors higher than 25°. Unlike in the sanity test, the 5D representation here performs worse than the 6D representation, but it still outperforms the 3D and 4D representations. We hypothesize that the distortion in the gradients caused by the stereographic projection makes the regression harder for the network. Since the ground-truth rotation matrices are available, we can also directly regress the 3×3 matrix using an L2 loss. During testing, we use the Gram-Schmidt process to transform the predicted matrix into SO(3) and then report the geodesic error (see the bottom row of Table (f) in Figure 4.5). We hypothesize that the worse performance of the 3×3 matrix compared to the 6D representation is due to the orthogonalization post-process, which introduces errors.

4.4.3 Inverse Kinematics for Human Poses

In this experiment, we train a neural network to solve human pose inverse kinematics (IK) problems. Similar to the methods of Villegas et al. [131] and Hsu et al. [60], our network takes the joint positions of the current pose as inputs and predicts the rotations from the T-pose to the current pose. We use a fixed forward kinematics function to transform the predicted rotations back to joint positions and penalize their L2 distance from the ground truth. Previous work for this task used quaternions; we instead test different rotation representations and compare their performance. The input contains the 3D positions of the N joints of the skeleton, P = (p_1, p_2, p_3, ..., p_N), p_i = (x, y, z)^T. The output of the network is the set of joint rotations in the chosen representation, R = (r_1, r_2, r_3, ..., r_N), r_i ∈ R^D, where D is the dimension of the representation. We train a four-layer MLP network with 1024 neurons in the hidden layers using the L2 reconstruction loss L = ||P − P'||_2^2, where P' = Π(T, R). Here Π is the forward kinematics function, which takes as inputs the T-pose of the skeleton and the predicted joint rotations, and outputs the 3D positions of the joints. Due to the recursive computational structure of forward kinematics, the accuracy of the hip orientation is critical for the overall skeleton pose prediction, and thus the joints adjacent to the hip contribute more weight to the loss (10 times higher than other joints).
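The following PyTorch sketch illustrates this training objective (our own simplified code, not the exact implementation; it assumes joints are ordered so that parents precede children, and hip_adjacent is a placeholder for the indices of the joints adjacent to the hip):

import torch

def forward_kinematics(rotations, t_pose_offsets, parents):
    # rotations: (B, N, 3, 3) local joint rotations; t_pose_offsets: (N, 3) bone offsets in the T-pose.
    # parents[i] is the parent index of joint i; the root is joint 0 and is kept at the origin.
    B, N = rotations.shape[:2]
    global_rot = [rotations[:, 0]]
    positions = [torch.zeros(B, 3)]
    for i in range(1, N):
        p = parents[i]
        global_rot.append(torch.matmul(global_rot[p], rotations[:, i]))
        offset = t_pose_offsets[i].expand(B, 3).unsqueeze(-1)
        positions.append(positions[p] + torch.matmul(global_rot[p], offset).squeeze(-1))
    return torch.stack(positions, dim=1)       # (B, N, 3) joint positions

def ik_loss(pred_rotations, gt_positions, t_pose_offsets, parents, hip_adjacent, w=10.0):
    pred_positions = forward_kinematics(pred_rotations, t_pose_offsets, parents)
    per_joint = ((pred_positions - gt_positions) ** 2).sum(dim=-1)    # squared L2 error per joint
    weights = torch.ones(per_joint.shape[1])
    weights[hip_adjacent] = w                                         # joints adjacent to the hip weigh 10x more
    return (per_joint * weights).mean()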
We use the CMU Motion Capture Database [30] for training and testing because it contains complex motions like dancing and martial arts, which cover a wide range of joint rotations. We picked in total 865 motion clips from 37 motion categories and randomly chose 73 clips for testing and the rest for training. We fix the global position of the hip so that we do not need to predict the global translation. The training set contains 1.14×10^6 frames of human poses and the test set contains 1.07×10^5 frames. We train the networks for 1,960k iterations with batch size 64. During training, we augment the poses with random rotations around the y-axis, and we augment each instance in the test set with three random rotations around the y-axis as well. The results, displayed in subplots (g), (h) and (i) of Figure 4.5, show that the 6D representation performs the best, with the lowest errors and fastest convergence. The 5D representation has similar performance to the 6D one. On the contrary, the 4D and 3D representations have higher average errors and higher percentages of large errors exceeding 10 cm. We also test using the 3×3 matrix without orthogonalization during training and using the Gram-Schmidt process to transform the predicted matrix into SO(3) during testing. We find that this method creates huge errors, as reported in the bottom row of Table (i) in Figure 4.5. One possible reason for this poor performance is that the 3×3 matrix may cause the bone lengths to scale during the forward kinematics process. In Appendix C.7.1, we additionally visualize some human body poses for quaternions and our 6D representation.

4.5 Discussion

We investigated the use of neural networks to approximate the mappings between various rotation representations. We found empirically that neural networks can better fit continuous representations. For 3D rotations, the commonly used quaternion and Euler angle representations have discontinuities and can cause problems during learning. We present continuous 5D and 6D rotation representations and demonstrate their advantages using an auto-encoder sanity test, as well as real-world applications such as 3D pose estimation and human inverse kinematics.

Chapter 5

3D Human Motion Synthesis

There is a great demand for producing convincing performances for CG characters in the game and animation industry. In order to obtain realistic motion data, production studios still rely mainly on manually keyframed body animations and professional motion capture systems [81]. Traditional keyframe animation requires very dense pose specifications to ensure the naturalness of the motion, which is known to be a time-consuming and expensive task requiring highly skilled animators. Motion graph methods based on databases [83, 119, 81] can synthesize smooth motions following user constraints by optimizing a path that combines a set of pre-captured motion clips. However, these methods are limited by the motion variations in the database and tend to perform poorly for user constraints that do not match the database motions well. Furthermore, the search space grows exponentially with the size of the dataset, which makes these approaches harder to apply to large databases. Recent deep learning approaches have shown promising results in automatic motion synthesis for locomotion [57, 52] and for more complex performances such as playing basketball [91] and soccer [58].
However, these methods typically respect only high-level controls, such as the path in locomotion synthesis [57], or task-specific goals. In contrast, when producing more expressive or complicated animations, an artist may find it useful to have more precise control over specific frames [65]. In this chapter, we aim to synthesize complex human motions of arbitrary length where keyframes are given by users at sparse and random locations with varying densities. We call this problem long-term inbetweening. Users can iterate on their motion designs by adjusting the keyframes, and with very sparse keyframes our system can improvise a variety of motions under the same keyframe constraints.

The first question is how to design a network that generates realistic motions. The dataset contains a limited number of short motion clips, and each clip performs only one type of motion. But in real application scenarios, one might need to simulate transitions between random poses from the same or different motion clips, or even different motion classes, and the ground truth for such transitions does not exist.

The second question is how to synthesize a sequence that precisely follows the keyframes. To ensure the coherence of the entire sequence, the motion within an interval should be synthesized holistically, considering the global context of all the keyframes that could influence the current interval, instead of only the two keyframes at the endpoints of the interval. Moreover, unlike some works that generate motions based only on initial poses [88, 145], long-term inbetweening requires the generator to perform sophisticated choreography to make sure the synthesized motion not only looks natural but also reaches the keyframe poses at the specified times and locations. A small difference in the middle of a motion can result in large time or space displacements at keyframes: the local movements of the limbs and the body affect the speed and direction of the character, and the integral of the velocities of sequential frames determines the global motion. Therefore, these differences can accumulate throughout the sequence, and any modification to a motion must consider both local and global effects.

Facing the challenges above, we propose a conditional generative adversarial network (GAN) that can learn natural motions from a motion capture database (the CMU dataset) [30]. We propose a two-stage method, where we first synthesize local motion and then predict global motion. Our computation is highly efficient: one minute of motion can be generated automatically in under half a second, and real-time updates are possible for shorter sequences. To avoid synthesizing unrealistic body poses, we embed body biomechanical constraints into the network by restricting the rotation range of local joints with an Euler angle representation.
Following the observations of Zhou et al. [154], for the local joints we carefully choose the order and the ranges of the Euler angles to avoid discontinuities in the angular representation. We call the integration of this angular representation with forward kinematics the Range-Constrained Forward Kinematics (RC-FK) layer. We also develop a design concept called Motion DNA to enable the network to generate a variety of motions satisfying the same keyframe constraints. The Motion DNA seeds are inferred from a set of reference poses which are the representative frames of real motion sequences. By choosing the type of representative frames, the user is able to influence the style of the output motion.

We believe that our approach is a key component for production settings where fast turnaround is critical, such as pre-visualization or the animation of secondary characters in large scenes. Furthermore, we also anticipate potential adoption of our method in non-professional content creation settings such as educational and consumer apps. Here are our main contributions:

• We introduce the first deep generative model that can: 1) synthesize high-fidelity and natural motions between keyframes of arbitrary lengths automatically, 2) ensure exact keyframe interpolation, 3) support motion style variations in the output, and 4) mix characteristics of multiple classes of motions. Existing methods do not support all of these capabilities as a whole.

• We propose a novel GAN architecture and training approach for automatic long-term motion inbetweening. Our two-stage approach makes the problem tractable by first predicting the local and then the global motion.

• We propose a novel Range-Constrained Forward Kinematics (RC-FK) layer to embed body biomechanical constraints in the network.

• We introduce the concept of Motion DNA, which allows the network to generate a variety of motion sequences from a single set of sparse user-provided keyframes.

• Our method is significantly faster than state-of-the-art motion graph techniques, while providing intuitive control and ensuring high-quality output.

5.1 Related Work

Transition Generation. Statistical motion models have been used to generate transitions in human animations. Chai and Hodgins [20] and Min et al. [96] developed MAP optimization frameworks, which can create transitions and also follow other constraints. Wang et al. [133] use Gaussian process dynamical models to create transitions and synthesize motion. Lehrmann et al. [84] use a nonlinear Markov model called a dynamic forest model to synthesize transitions and perform action recognition. Harvey and Pal [52] use a recurrent network model based on long short-term memory (LSTM) to synthesize human locomotion, including transitions. Unlike Harvey and Pal [52], who train on a fixed interval of 60 frames for locomotion, our method can synthesize longer-term motions across different kinds of movement (e.g. walking, dance), can use a wide range of temporal intervals between an arbitrary number of keyframes, and can provide a variety of outputs.

Kernel-based Methods for Motion Synthesis. Approaches using kernels such as radial basis functions (RBFs) and Gaussian process regression have been used for motion synthesis tasks. Rose et al. [116] call parameterized motions such as walking and running "verbs" and use RBFs and polynomials to blend between them. Rose et al. [117] use RBFs to interpolate example motions and positions for inverse kinematics tasks. Kovar and Gleicher [80] create a search algorithm that can identify similar motions in a dataset and use them to create continuous motions by a similar blending procedure. Grochow et al. [47] use Gaussian processes to perform inverse kinematics based on a dataset of human poses. Mukai and Kuriyama [99] use Gaussian process regression to better model correlations between motions, and also improve on artifacts in end effectors such as foot sliding. Levine et al. [86] use a Gaussian process latent variable model to perform animation and control tasks in a low-dimensional space.
Unlike these works, our method can synthesize long-term transitions and is based on a deep generative adversarial architecture.

Motion Graph Methods for Motion Synthesis. From an original dataset of motion clips, one can construct a motion graph where each state refers to a frame of pose and each edge represents the transition between two states. Given the start and end states, as well as the motion graph, the optimal path is searched for to match the input constraints. But the search space grows exponentially with the number of edges. To prune the motion graph, Lee et al. [83] clustered similar states, and Park et al. [104] and Safonova et al. [119] limited the transitions to states with contact changes. Safonova et al. further used an RAR* [89] algorithm to speed up the search process, which takes a few minutes to compute a close-to-optimal 15-second-long motion from a database of 6-7 minutes of motion. However, their algorithm uses a heuristic function aimed at finding the shortest path between the start and end poses, and cannot be applied when the time interval is constrained. In contrast, our method is trained on over 2.5 million frames (around 6 hours) of motion clips and takes only 0.16 seconds to infer a 15-second-long motion.

Neural Network Controllers for Physics-based Motion. Neural network approaches have been used for motion controllers in physically-based animation. Allen and Faloutsos [4] evolve the topology of neural networks to allow for the creation of increasingly complicated motion behavior. Tan et al. [124] apply this to the task of learning controllers for bicycle stunts. Levine and Koltun [85] use a neural network-based policy search to learn policies for complex tasks such as bipedal push recovery and walking on uneven terrain. Mordatch et al. [98] use a recurrent architecture to learn a controller for complex dynamic motions such as swimming, flying, and biped and quadruped walking. More recently, there has been a focus on locomotion, such as learning both low- and high-level controllers for locomotion [109], and achieving more realistic and symmetric [146] or biologically plausible [71] locomotion. Networks have also been used recently to learn controllers from example motion clips in sports such as basketball [91] and soccer [58]. Naderi et al. [100] also address path and movement planning for wall-climbing humanoids. Reinforcement learning has been used recently to learn a variety of motions, such as scheduling the order of a diverse array of "control fragments" for a motion controller [90] and learning a variety of skills such as acrobatics and martial arts by imitation [108]. Unlike these papers, we do not focus on physically-based animation, but instead on the synthesis of perceptually plausible motions: these can be useful for applications such as games or movie scenes where physicality is not a strict requirement.

Deep Learning for Motion Synthesis. Recently, researchers have investigated the use of deep learning approaches for synthesizing plausible but not necessarily physically correct motions. Holden et al. [57] learn a motion manifold from a large motion capture dataset, and then learn a feedforward network that maps user goals, such as a motion path, to the motion manifold. Li et al. [88] introduced the first real-time method for synthesizing indefinite complex human motions using an auto-conditioned recurrent neural network, but only high-level motion styles can be provided as input. Holden et al. [56] later proposed a synthesis approach for character motions using a neural network structure that computes weights as cyclic functions of the user inputs and the phase within the motion.
Zhang et al. [149] synthesize quadruped motion by dynamically blending groups of weights based on the state of the character. Aristidou et al. [6] break motion sequences into motion words and then cluster these to find descriptive representations of the motion, called motion signatures and motifs. In contrast, we focus here on synthesizing plausible long-term motion transitions. Following the work of Cai et al. [18], which focuses on video completion, Yan et al. [144] recently proposed a convolutional network that transforms random latent vectors drawn from a Gaussian process into human motion sequences, and demonstrated the ability to complete motion between disjoint short motion sequences. However, their framework only generates and matches local motion without global motion, and does not apply to the case of sparsely given individual keyframes. Moreover, they do not provide users explicit control over the style of the output motion, nor do they consider body flexibility constraints. We do not compare with their method because it is concurrent work and does not provide open-source code.

Joint Angle Limits and Forward Kinematics. Recently, there has been research into the representation of joint angles and the use of forward kinematics within networks. Akhter and Black [3] learn a pose-dependent model of joint angle limits, which generalizes well while avoiding impossible poses. Jiang and Liu [70] use a physics simulation to accurately simulate human joint limits. We are inspired by these works and represent local joint angles within a limited range in our network. Villegas et al. [131] use a recurrent neural network with a forward kinematics layer to perform unsupervised motion retargeting; we similarly use a forward kinematics layer in our network. Pavllo et al. [106] use a quaternionic representation for joint angles and a forward kinematics layer. Zhou et al. [154] showed recently that when the complete set of 3D rotations needs to be represented, a 6D rotation representation often performs better: we therefore adopt that angle representation for the root node.

5.2 Method

5.2.1 Definitions

In this paper, we use keyframes to refer to the user-specified root position and local pose at each joint at a specific time. We use representative frames to refer to algorithmically extracted representative (or important) poses within a motion sequence. We use Motion DNA to refer to a random sampling of representative frames from the same motion class, embedded into a 1D hidden vector space. The Motion DNA helps characterize important properties of a motion: conditioning our network on the Motion DNA allows the user to create variation in the model's output. This concept is related to the "motion signatures" of Aristidou et al. [6].

We use the following notation throughout the paper:

Root: the hip joint, which is also the root node.
S = {S_t}_{t=1}^{N}: a sequence of synthesized skeleton poses, including the global position of the root and the local positions of the other joints relative to the root.
S_{t_1:t_2}: a sub-sequence of skeleton poses from frame t_1 to t_2.
T̂ = {T̂_t}_{t=1}^{N̂}: a set of representative frames that capture the typical characteristics of a motion class. We rotate the representative frames so that their root joints all face the same direction.
N: total number of frames.
N′: total number of user keyframes.
φ = {φ_t}_{t=1}^{k}: the set of keyframe indices.
N̂: total number of representative frames in the Motion DNA.
M: total number of joints in the skeleton.
T^{root}_t: 3D translation of the root in world coordinates at frame t.
T^{joints}_t: 3D translations of the joints relative to the root node in relative world coordinates at frame t. The dimension is M×3.
R^{root}_t: 3D rotation matrix of the root node in world coordinates.
R^{joints}_t: 3D rotation matrices of the joints relative to their parent joints at frame t. The dimension is M×3×3.

We use a frame rate of 60 fps throughout the paper.

5.2.2 Method Overview

Figure 5.1 shows our system design. At a high level, we use a 1D CNN to first predict the local pose of all frames conditioned on user-provided sparse keyframes, and then from the local pose we use another 1D CNN to predict the global pose. We will first motivate why we use such an approach and then return to a more detailed description of the components.

[Figure 5.1: Method overview. Given user-specified keyframes and the corresponding mask, we generate the local motion of every joint and every frame in a rotation representation. We then use the Range-Constrained Forward Kinematics module to obtain the local joint positions, based on which we use a global path predictor to estimate the global path of the root joint.]

Given user-specified sparse keyframes, the goal of long-term inbetweening is to generate a motion sequence in which the synthesized frames match the user's input keyframes at the specified times, and in which all sub-sequences of the synthesized motion look realistic. For simplicity, we use sub-sequences containing n frames to measure realism. The problem can then be formulated as:

S = G'(S'_{φ})
\min_{G'} \sum_{t=1}^{N-n} L_{realism}(S_{t:t+n}) \quad \text{s.t.} \quad \forall t \in φ,\; S_t = S'_t    (5.1)

where S is the synthesized motion sequence that contains all joint information, S′ is the user-specified pose, which is defined for the keyframes at times φ, G′ is a generative model, and L_realism is some loss that encourages realism. Note that given very sparse keyframes, there might be multiple good solutions.

One approach to measuring realism is to use a generative adversarial network (GAN). A GAN can be conditioned on the user-specified keyframes S′, use a discriminator to measure realism, and use a generator to hallucinate complex motion while simultaneously matching the keyframes and deceiving the discriminator. However, we found empirically that training a GAN to generate global and local motions together is much more unstable than training it to generate local motions only. Fortunately, we found that the global motion can be inferred from the synthesized local motions using a pre-trained and fixed network that we call the global path predictor. Thus, we design a two-stage method that first generates the local motions and then the global path. We reformulate the optimization as:

T^{joints} = G(S'_{φ}), \quad T^{root} = Υ(T^{joints}), \quad S_t = \{T^{root}_t, T^{joints}_t\}
\min_{Υ,G} \sum_{t=1}^{N-n} L_{realism}\!\left(T^{joints}_{t:t+n}\right) + γ \sum_{t\in φ} \|S_t - S'_t\|^2    (5.2)

The first network, G, is a generator that is conditioned on the keyframes and outputs the local motion T^{joints}_t. The second network, Υ, is a global path predictor, which is conditioned on the local motion and predicts the global motion T^{root}_t. For L_realism, we use a GAN loss (Section 5.2.5). Finally, we use an L2 loss to measure the mismatch between the generated frame S_t and the user's keyframe S'_t. In practice, we first train the global path predictor Υ using the motions in the CMU motion dataset [30]. After pretraining the global path predictor, we freeze its weights and train the generator G.
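A high-level sketch of this two-stage inference path (our own pseudocode-style PyTorch outline; generator, rc_fk and global_path_predictor stand for the networks described in the following sections, and the tensor shapes are assumptions):

import torch

def synthesize(keyframe_input, motion_dna, initial_root, generator, rc_fk, global_path_predictor):
    # Stage 1: local motion. The generator predicts per-frame joint rotations,
    # and the RC-FK layer converts them into root-relative joint positions T^joints.
    rotations = generator(keyframe_input, motion_dna)
    local_positions = rc_fk(rotations)                          # (B, N_frames, M, 3)

    # Stage 2: global motion. The pretrained, frozen global path predictor maps the
    # local motion to root velocities dT^root, which are integrated into root positions T^root.
    root_velocities = global_path_predictor(local_positions)    # (B, N_frames - 1, 3)
    root_positions = initial_root.unsqueeze(1) + torch.cumsum(root_velocities, dim=1)
    return local_positions, root_positions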
Most deep-learning-based motion synthesis works [52, 98, 131] use recurrent neural networks (RNNs), which are specifically designed for modeling the dynamic temporal behavior of sequential data such as motion. However, a traditional RNN only uses past frames to predict future frames. Harvey and Pal [52] proposed the Recurrent Transition Network (RTN), a modified version of LSTM, which generates fixed-length transitions based on a few frames in the past and a target frame in the future. However, the RTN cannot process multiple future keyframes spaced out in time, and is therefore only suitable for modeling simple locomotion that does not need to take into account a global context specified by multiple keyframes. In contrast, a 1D convolutional network structure that performs convolutions along the time axis is more suitable for our scenario, because multiple keyframes can influence the calculation of a transition frame given a large enough receptive field. Thus, we use 1D convolutional neural networks for our global path predictor, motion generator and discriminator.

We now discuss the different components of our system illustrated in Figure 5.1. Given the global coordinates of the root and the local positions of the other joints at keyframes (black dots and frames in the leftmost region of Figure 5.1), a generator synthesizes the local pose in a rotation representation for all frames. A Motion DNA vector is additionally fed into the generator to encourage diversity in the output. Next, a Range-Constrained Forward Kinematics (RC-FK) layer translates the local pose from the rotation representation to 3D positions, which are then fed into the global path predictor to generate the trajectory of the root. We developed the RC-FK layer to remove discontinuities in the angle representation, which can harm learning [154]. For the generated local motion, we use a discriminator to check the realism of each sub-sequence. We also apply an L2 reconstruction loss at keyframes for both the local poses and the global root coordinates. Both networks are fully convolutional and are thus able to process motion sequences of arbitrary length.

In the following sections, we start by explaining the two motion representations we use in our method and the RC-FK layer, designed to simultaneously restrict the range of motion according to human body flexibility and ensure continuity of the representation (Section 5.2.3). Then we explain the architecture of the global path predictor (Section 5.2.4). Next, we explain how we design the input format, the local motion generator, the discriminator and the Motion DNA, with the corresponding loss functions (Section 5.2.5). Finally, we introduce an optional post-processing method to exactly enforce keyframe constraints (Section 5.2.6) and discuss the remaining training details (Section 5.2.7).

5.2.3 Range-Constrained Motion Representation

As a reminder, throughout our network, following Zhou et al. [154], we would like to use continuous rotation representations in order to make the learning process converge better.
A skeleton's local pose at a single frame can be defined either by the local joint translations T^{joints} = {T_i}_{i=1}^{M} or hierarchically by the relative rotation of each joint from its parent joint, R^{joints} = {R_i}_{i=1}^{M}. By denoting the initial pose as T^{joints}_0 (usually a T-pose) and the parent index vector as ϕ, with ϕ_{child_id} = parent_id, we can map the rotation representation to the translation representation with a differentiable forward kinematics function FK(R^{joints}, T^{joints}_0, ϕ) = T^{joints} [131].

Many public motion datasets do not capture the joint rotations directly, but instead use markers mounted on human performers to capture the joint coordinates, which are then transformed to rotations. Due to an insufficient number of markers, the imprecision of the capture system, and data noise, it is not always guaranteed that the calculated rotations are continuous in time or within the range of joint flexibility. In contrast, the joint positional representation is guaranteed to be continuous and is thus suitable for deep neural network training [154]. Therefore, we use the joint translations as the input to our generator and discriminator. However, for the motion synthesized by the generator network, we prefer a rotation representation, since we want to guarantee the invariance of the bone lengths and restrict the joint rotation ranges according to the flexibility of the human body. These kinds of hard constraints would be more difficult to enforce with a joint positional representation.

We now explain the rotation representation we use for the local joints. We define a local Euler angle coordinate system for each joint, with the origin at the joint, the x and y axes perpendicular to the bone, and the z axis aligned with the child bone (or, if a joint has multiple child bones, with the mean of the child bones). Due to joint restrictions on the range of motion, the rotation angle around each axis should fall in a range [α, β], where α is the minimal angle and β is the maximal angle. We have the neural network output the Euler angles v = (v_x, v_y, v_z)^T ∈ R^3 for each joint except the root joint and then map them to the feasible range by:

u = \begin{pmatrix} u_x \\ u_y \\ u_z \end{pmatrix} = \begin{pmatrix} \frac{\beta_x - \alpha_x}{2}\tanh(v_x) + \frac{\alpha_x + \beta_x}{2} \\ \frac{\beta_y - \alpha_y}{2}\tanh(v_y) + \frac{\alpha_y + \beta_y}{2} \\ \frac{\beta_z - \alpha_z}{2}\tanh(v_z) + \frac{\alpha_z + \beta_z}{2} \end{pmatrix}    (5.3)

The local rotation matrix R^l_l in the local coordinate system can be computed by multiplying the rotation matrices derived from the rotation angles around each of the three axes. The local rotation matrix R^w_l in the world coordinate system (also centered at the joint) can be computed as R^w_l = M_{lw} R^l_l M_{wl}, where M_{lw} is the transformation matrix from the local coordinate system to the world coordinate system and M_{wl} is the transformation from world to local.

Although the Euler angle representation makes it convenient to enforce the rotation range constraint, it can suffer from a discontinuity problem, which can make learning hard for neural networks [154]. The discontinuity happens when the second rotation angle reaches −90° or 90°, which is known as the Gimbal lock problem. Fortunately, each body joint except the hip has at least one rotation axis along which the feasible rotation range is smaller than (−90°, 90°). Thus, we can avoid Gimbal lock by choosing the order of rotations accordingly. Therefore, we compute the rotation matrices of the left and right forearms in the Y_1X_2Z_3 order and those of the other joints in the X_1Z_2Y_3 order, as defined for the intrinsic rotation representation of Tait-Bryan Euler angles [138]. For the hip, since there is no restriction on its rotation range, we use the continuous 6D rotation representation defined in [154] to avoid discontinuities. Detailed information on the range and order of the rotations at each joint can be found in the appendix.
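A minimal PyTorch sketch of this range-constrained Euler mapping (our own code; the per-joint limits alpha and beta and the per-joint axis orders stand in for the values listed in the appendix, and the exact sign conventions are assumptions):

import torch

def constrain_euler(v, alpha, beta):
    # v: (B, J, 3) unconstrained network outputs; alpha, beta: (J, 3) per-joint angle limits.
    # tanh maps each angle into the open interval (alpha, beta), as in Equation (5.3).
    return 0.5 * (beta - alpha) * torch.tanh(v) + 0.5 * (alpha + beta)

def euler_to_matrix(u, order):
    # Compose the local rotation matrix from the three axis rotations in the given
    # intrinsic order, e.g. "XZY" for most joints and "YXZ" for the forearms.
    def axis_rot(angle, axis):
        c, s = torch.cos(angle), torch.sin(angle)
        zero, one = torch.zeros_like(c), torch.ones_like(c)
        if axis == "X":
            rows = [one, zero, zero, zero, c, -s, zero, s, c]
        elif axis == "Y":
            rows = [c, zero, s, zero, one, zero, -s, zero, c]
        else:
            rows = [c, -s, zero, s, c, zero, zero, zero, one]
        return torch.stack(rows, dim=-1).reshape(*angle.shape, 3, 3)
    m = axis_rot(u[..., "XYZ".index(order[0])], order[0])
    for k in (1, 2):
        m = torch.matmul(m, axis_rot(u[..., "XYZ".index(order[k])], order[k]))
    return m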
5.2.4 Global Path Predictor

We infer the global translations of the root using a 1D fully convolutional network which takes the synthesized local joint translations as input and outputs the velocity sequence of the root joint, dT^{root}_t = T^{root}_{t+1} − T^{root}_t. The network is trained with the ground truth motions in the CMU Motion Capture dataset [30]. Given the initial root position T^{root}_{t_0}, the root position at frame t can be computed as:

T^{root}_{t_0 \to t} = T^{root}_{t_0} + \sum_{i=t_0}^{t-1} dT^{root}_i    (5.4)

To prevent drifting due to error accumulation in the predicted velocities, we penalize the root displacement error over every n-frame interval, n = 1, 2, 4, ..., 128:

L = \frac{1}{8} \sum_{k=0}^{7} \frac{1}{N - 2^k} \sum_{t=1}^{N-2^k} \left\| T^{root}_{t \to t+2^k} - T'^{root}_{t \to t+2^k} \right\|    (5.5)

where T^{root}_{t→t+n} is the root displacement from frame t to frame t + n computed by accumulating the predicted velocities, and T'^{root}_{t→t+n} is the corresponding ground truth displacement.
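A sketch of the velocity integration of Equation (5.4) and the multi-interval displacement penalty of Equation (5.5) (our own simplified PyTorch code; tensor layouts are assumptions):

import torch

def integrate_root(pred_vel, initial_root):
    # pred_vel: (B, N-1, 3) predicted root velocities dT^root_t; initial_root: (B, 3)
    displacements = torch.cumsum(pred_vel, dim=1)
    return torch.cat([initial_root.unsqueeze(1),
                      initial_root.unsqueeze(1) + displacements], dim=1)     # (B, N, 3)

def path_loss(pred_vel, gt_vel):
    # Penalize the accumulated root displacement error over intervals of 1, 2, 4, ..., 128 frames.
    diff = pred_vel - gt_vel                                    # per-frame velocity error
    cum = torch.cumsum(diff, dim=1)
    cum = torch.cat([torch.zeros_like(cum[:, :1]), cum], dim=1)
    loss = 0.0
    for k in range(8):
        n = 2 ** k
        interval_err = cum[:, n:] - cum[:, :-n]                 # displacement error over every n-frame window
        loss = loss + interval_err.norm(dim=-1).mean()
    return loss / 8.0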
We could format these keyframes in either a sparse or dense format: we will explain the sparse format first, because it is easier to explain, but in our experiments, we prefer the dense format, because it accelerates convergence. In the case of the sparse input format, the inputs are: (1) a mask of dimension N× 3M that indicates at each frame and each joint, whether the joint’s coordinate values have been specified by the user, and (2) a joint position tensor S 0 of dimension N× 3M that contains user-specified world- space joint positions at keyframes φ and zero for the rest of the frames. These are shown in the second and third rows in Figure 5.3. This sparse input format can encode the pose and timing information of the users’ keyframes and is flexible for cases of partial pose input, e.g. the user can give the root translation without specifying the local poses. However, since the keyframes are sparse, the input sequence usually contains mostly zeros, which results in useless computation in the encoder. Thus, we prefer the dense input format, which is shown in the fourth and fifth rows of Figure 5.3. In the dense input format, each frame in S 0 gets a copy of the pose from its closest keyframe. The mask is also replaced with a distance transform, so the values at the keyframes are zero, and the values at 90 Figure 5.3: Input Format for Local Motion Generator. The first row contains the frame indices. The second and third rows show the sparse input format, and the fourth and fifth rows show the dense input format. 1s and 0s in the second row indicate the presence and absence of keyframes. S 0 φ ,φ = 3, 64, 67 are the poses at user-specified keyframes. the other frames reflect the absolute difference in index between that frame and a closest keyframe. The dense input format increases the effective receptive field of the generator, since with keyframes duplicated at multiple input locations, each neuron in the bottleneck layer can see a larger number of keyframes. Empirically, this representation also leads to a faster network convergence. Generator and Discriminator Architecture. The encoder contains six con- volutionallayersthatreducetheinputdimensionby 2 6 =64times, andthedecoder contains eight convolutional layers and six transpose convolutional layers to up- sample the latent representation to the original input dimension. The size of the receptive field at the bottleneck layer is 318 (5.3 seconds) frames. With the dense input format introduced in Section 5.2.5, the effective receptive field becomes 636 frames (10.6 seconds): this is theoretically the longest time interval between any neighbouring keyframes, because otherwise the synthesized transition at a partic- ular frame will not be dependent on the two adjacent keyframes. The discriminator takes as input the entire synthesized sequence, passes it to seven convolutonal layers (6 layers perform downsampling and 1 layer maintains the input dimension) and outputs N/64 scalars y t . The discriminator looks at all 91 the overlapping sub-sequences (N/64 sub-sequences of length 64 frames each) and predicts whether each of them looks natural (one output scalar per sub-sequence). The receptive field of the discriminator is 190 frames: this is the length of the sub-sequence that each y t is responsible for. Duringexperiments, wefoundthatgeneratorsusingvanillaconvolutionallayers cannotpreciselyreconstructtheinputkeyframesandtendtotohavemodecollapse, by synthesizing similar motions for different inputs. 
Generator and Discriminator Architecture. The encoder contains six convolutional layers that reduce the input dimension by a factor of 2^6 = 64, and the decoder contains eight convolutional layers and six transposed convolutional layers that upsample the latent representation back to the original input dimension. The receptive field at the bottleneck layer is 318 frames (5.3 seconds). With the dense input format introduced in Section 5.2.5, the effective receptive field becomes 636 frames (10.6 seconds): this is theoretically the longest allowable time interval between neighbouring keyframes, because otherwise the synthesized transition at a particular frame would not depend on both adjacent keyframes. The discriminator takes the entire synthesized sequence as input, passes it through seven convolutional layers (six layers perform downsampling and one layer maintains the input dimension) and outputs N/64 scalars y_t. The discriminator looks at all the overlapping sub-sequences (N/64 sub-sequences of length 64 frames each) and predicts whether each of them looks natural (one output scalar per sub-sequence). The receptive field of the discriminator is 190 frames: this is the length of the sub-sequence that each y_t is responsible for.

During our experiments, we found that generators using vanilla convolutional layers cannot precisely reconstruct the input keyframes and tend to suffer from mode collapse, synthesizing similar motions for different inputs. We hypothesize that this is because the network is too deep and the bottleneck is too narrow. To address this, we introduce a modified residual convolution layer and a batch regularization scheme:

x' = \text{conv}(x)
\text{residual}(x) = \text{PReLU}(\text{affine}(x'))
y = \sqrt{\sigma}\,\text{residual}(x) + \sqrt{1-\sigma}\,x    (5.6)

Here x is the input tensor and y is the output. PReLU is the Parametric Rectified Linear Unit activation [53]. affine() is a learned affine transformation, which multiplies each feature elementwise by a learned scale factor and adds a learned per-feature bias. conv() is a convolutional filter with learnable kernels. The residual ratio σ is a scalar between 0 and 1. In our network design, shown in the appendix, we set σ in a decreasing manner: the intuition is that the layers closer to the output mainly add low-level details, and we therefore want to pass more information directly from the previous layer, especially near keyframes.

Losses for GAN and Regularization. We apply the least-squares GAN losses [94] to our generator and discriminator. We use 1 as the real label and −1 as the fake label, so y^{real}_t should be close to 1 and y^{fake}_t should be close to −1:

L_D = \frac{1}{N/64}\left( \sum_{t=1}^{N/64} (y^{real}_t - 1)^2 + \sum_{t=1}^{N/64} (y^{fake}_t + 1)^2 \right)    (5.7)

For the generator, we want the corresponding y_t to be close to neither −1 nor 1. According to the least-squares GAN paper [94], one can use any value in [0, 1); based on experiments, we chose 0.2361, so we have:

L_{G_{adv}} = \frac{1}{N/64} \sum_{t=1}^{N/64} (y^{fake}_t - 0.2361)^2    (5.8)

Before we feed the synthesized local motion sequences S into the discriminator, we replace the poses at the keyframe locations with the input poses, because we found that this encourages the network to generate smooth movement around the keyframes.

For our problem, we found that standard batch normalization [66] does not work as well as a custom batch regularization loss. This appears to be because, when using batch normalization, different mini-batches can be normalized differently due to different means and standard deviations, and there may also be some difference in these values between training and test time. We apply the following batch regularization loss to each modified residual convolution layer in the generator and discriminator:

L_{br} = \frac{1}{D} \sum_{x' \in X'} \left( \|E(x')\|_2^2 + \|\log(\sigma(x'))\|_2^2 \right)    (5.9)

Here X' is the set of outputs of the convolution layers in the residual blocks, D is the total dimension of X', and E and σ refer to the sample mean and standard deviation of the features. We also found empirically that this loss helps encourage the diversity of the output motions and avoid mode collapse.
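A sketch of the modified residual convolution layer of Equation (5.6) together with the batch regularization loss of Equation (5.9) (our own simplified PyTorch module; layer widths are placeholders and the exact arrangement of conv, affine and PReLU follows our reading of the equations above):

import torch
import torch.nn as nn

class ResidualConv1d(nn.Module):
    def __init__(self, channels, kernel_size, sigma):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.scale = nn.Parameter(torch.ones(channels, 1))     # learned per-feature scale (affine)
        self.bias = nn.Parameter(torch.zeros(channels, 1))     # learned per-feature bias (affine)
        self.act = nn.PReLU(channels)
        self.sigma = sigma                                      # residual ratio in [0, 1]
        self.last_conv_out = None                               # cached for the batch regularization loss

    def forward(self, x):
        x_prime = self.conv(x)
        self.last_conv_out = x_prime
        residual = self.act(x_prime * self.scale + self.bias)
        return (self.sigma ** 0.5) * residual + ((1.0 - self.sigma) ** 0.5) * x

def batch_regularization(conv_outputs):
    # conv_outputs: list of cached (B, C, T) convolution outputs from the residual blocks.
    # Push each layer's features toward zero mean and unit standard deviation.
    loss, total_dim = 0.0, 0
    for x in conv_outputs:
        mean = x.mean(dim=(0, 2))                  # per-feature sample mean
        std = x.std(dim=(0, 2))                    # per-feature sample standard deviation
        loss = loss + mean.pow(2).sum() + torch.log(std).pow(2).sum()
        total_dim += x.shape[1]
    return loss / total_dim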
[Figure 5.4: Root translation error calculation. The blue path is the synthesized root path, and the blue dots are the synthesized root positions at the keyframes. Red crosses indicate the input root positions at the keyframes.]

Losses for Keyframe Alignment. To further enforce the keyframe constraints, we use an L2 reconstruction loss on the position vectors of all non-root joints at the keyframes. However, for the root (hip) node, applying such a naive L2 distance is problematic. We show one example of why this is the case in Figure 5.4: the output root trajectory matches the input well in terms of direction, but a large error is reported because the synthesized and target point sets are not well aligned. To fix this, we recenter the output path by placing the mean T^c of the output root trajectory at the mean T'^c of the input root trajectory, as shown in the right part of Figure 5.4. The updated loss function becomes:

L_{root} = \frac{1}{N'} \sum_{t \in φ} \left\| (T^{root}_t - T^{root}_{t_0}) - (T'^{root}_t - T'^{root}_{t_0}) - (T^c - T'^c) \right\|_2^2    (5.10)

Motion DNA

To encourage the network to hallucinate different motion sequences that follow the same keyframe constraints, we introduce a novel conditioning method that we call Motion DNA. By changing the Motion DNA, users can provide hints about the type of motion to be generated. We were motivated to create such a scheme because we noticed that motion within the same class can exhibit dramatically different styles, and it is impossible to classify all types of motions with a limited set of labels. For example, in the CMU Motion Capture dataset, a class called "animal behaviors" contains human imitations of different animals such as elephants and monkeys, but their motions do not share any common features. Instead of using labels, we found that a small set of representative frames of a motion sequence contains discriminative and consistent features that can represent the characteristics of a particular type of motion well. For example, the yellow skeletons in the fifth row of Figure 5.9 are the representative frames extracted from a motion sequence, and it is not hard to tell that they come from a martial arts sequence. We identify representative frames as those that have the maximum joint rotation changes relative to their adjacent frames. The extracted representative frames can be injected into the generator to control the characteristics of the generated motion. The advantage of using representative frames as the controlling seeds is that they apply to general types of motions, and the user can explicitly define their own combination of representative frames.

To encode the set of representative frames into a Motion DNA, we use a shared multilayer perceptron to process each representative frame T̂_j, j = 1, ..., N̂, and then apply average pooling and an additional linear layer to obtain the final feature vector w. We concatenate w to the end of each z_t and use the result as input to the decoder.

To encourage the local motion generator to adopt the Motion DNA and generate variations without conflicting with the keyframe constraints, we design two losses that encourage the generator to place representative poses from the Motion DNA into the synthesized motion at suitable locations. The first loss, L^G_{DNA_1}, encourages all of the representative poses to show up in the output sequence. For each representative frame T̂_j, we find the most similar frame in the synthesized motion S and compute the error e_j between the pair:

L^G_{DNA_1} = \frac{1}{\hat N} \sum_{j} e_j, \qquad e_j = \min_{S_i \in S} \|\Lambda(S_i) - \hat T_j\|_2^2    (5.11)

Λ(S_i) is the local pose of S_i without root rotation. Likewise, the representative poses T̂_j do not contain root rotations. In this way, we only care about the relative posture and not the direction.
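The first Motion DNA loss can be sketched as follows (our own simplified PyTorch code; poses are assumed to be flattened, root-rotation-free local joint positions):

import torch

def dna_coverage_loss(synth_local, dna_poses):
    # synth_local: (N, P) local poses Lambda(S_i) of the synthesized frames
    # dna_poses:   (N_hat, P) representative poses of the Motion DNA
    # For each representative pose, find its closest synthesized frame and average the errors (Eq. 5.11).
    dists = torch.cdist(dna_poses, synth_local).pow(2)     # pairwise squared L2 distances, (N_hat, N)
    e = dists.min(dim=1).values                            # e_j = min_i ||Lambda(S_i) - T_hat_j||^2
    return e.mean()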
Because L^G_{DNA_1} does not consider how the representative frames are distributed among the generated frames, it is possible that the representative poses are all placed within a small time interval, which is not desirable. Therefore, we use a second loss, L^G_{DNA_2}, to encourage the representative poses to be distributed evenly throughout the entire synthesized sequence. Specifically, we cut the whole synthesized sequence at the keyframe points and divide it into K intervals u_k, k = 0, 1, ..., K−1, each of length N_k, so that S = [S_{u_0}, S_{u_1}, ..., S_{u_{K−1}}]. If N_k is shorter than 3 seconds, we consider it too hard to place representative poses in it; otherwise, we want the synthesized sub-sequence to contain frames that closely resemble N̂_k = ⌊N_k/3⌋ of the input representative frames. In practice, for each frame S_{k,i} in u_k, we look for the representative pose in T̂ that is closest to it and compute their error e'_{k,i}. We consider the N̂_k frames with the lowest errors to be the frames where the representative poses should show up, and collect their errors. L^G_{DNA_2} is defined as the average of those errors:

L^G_{DNA_2} = \frac{1}{\sum_k \hat N_k} \sum_{k} \sum_{e'_{k,i} \in E'_k} e'_{k,i}, \qquad e'_{k,i} = \min_{\hat T_j \in \hat T} \|\Lambda(S_{k,i}) - \hat T_j\|_2^2    (5.12)

where e'_{k,i} is the minimum error between the i-th local pose in the interval u_k and all the representative frames, and E'_k is the set of the N̂_k lowest errors in the interval u_k.

5.2.6 Post-processing

Although we use an L2 reconstruction loss to enforce the keyframe constraints, the output sequence might not exactly match the input at the keyframe locations. When needed, simple post-processing can be applied to achieve exact keyframe matching [52]. With the small keyframe mismatches in our synthesized motion, the adjustments are nearly undetectable.

5.2.7 Dataset and Training

We construct our training dataset from the CMU Motion Capture Dataset [30]. We only use motion clips in which the entire motion is performed on a flat plane, and we automatically filter out noisy frames (see appendix). The final dataset contains 79 classes of motion clips with lengths ranging from 4 seconds to 96 seconds. We use 1220 clips for training and 242 clips for testing.

For training, we first train the Global Path Predictor model (Section 5.2.4), fix its weights, and then train the Local Motion Generator model (Section 5.2.5). For each generator update, we update the discriminator once. We feed fixed-length 2048-frame sequences to the generator with a batch size of 4. We train the discriminator using 16 real sequences of lengths ranging from 256 to 1024 frames in each batch. We use the RMSprop optimizer with the default parameters from PyTorch [105]. Other training parameters can be found in the appendix.

Keyframe Sampling Scheme. As explained in Section 5.2.5, to synthesize natural transitions, the longest allowable time interval between keyframes is 636 frames. Therefore, we first randomly pick the lengths of the intervals to be between 0 and 600 frames (10 seconds). Based on the sampled interval lengths, we sample the keyframes from the beginning of the motion sequence to the end according to two rules. (1) When the current interval is shorter than three seconds, we sample the two keyframes from the same clip so that their temporal order is maintained and the number of frames between their locations in the ground truth clip equals the current interval length. This is because it is hard for the network to improvise smooth transitions in a short time interval if the sampled keyframes have dramatically different poses due to coming from different motion clips or being far apart in the same clip. (2) When the current interval exceeds three seconds, we want to simulate the scenario of transitioning between different motion clips of the same or different classes.
Representative Frames for Motion DNA Sampling. For training the Motion DNA encoder, the number of representative frames we use as input to the network is proportional to the length of the interval between neighboring keyframes. Specifically, we use one representative frame for every 3-second-long interval.

5.3 Experiments and Evaluation

We implemented our system using PyTorch [105] and conducted various experiments to evaluate the different components of our system. We first evaluated the effectiveness of our Range-Constrained motion representation. We then measured the accuracy of the global path predictor by comparing its results to the ground truth. We also evaluated the local motion generator in two ways: (1) quantitatively, we measured how well the generated motion follows the input keyframe constraints, and (2) we conducted user studies to evaluate the quality of the synthesized motion given different keyframe placement strategies, comparing our approach with a prior work [52]. Finally, we examined the effectiveness of using Motion DNA to generate motion variations.

5.3.1 Effect of Range-Constrained Motion Representation

We verify the effect of our Range-Constrained motion representation by training the same inverse kinematics networks as in [154]. We show in Table 5.1 that our representation and RC-FK layer outperform the other rotation representations.

  Representation | Mean (cm) | Max (cm) | Std (cm)
  Quaternion | 9.03 | 179.66 | 16.33
  Vanilla Euler Angles | 2.7 | 48.7 | 2.1
  Range-Constrained Euler + 6D root (Ours) | 2.3 | 34.9 | 1.5
Table 5.1: IK reconstruction errors using different rotation representations. The quaternion and vanilla Euler angle representations are used for all nodes. Ours uses the range-constrained Euler representation for non-root nodes and the continuous 6D representation [154] for the root node.

  V_1 | V_2 | V_4 | V_8 | V_16 | V_32 | V_64 | V_128 | Y
  0.7 | 0.8 | 1.1 | 1.5 | 1.9 | 2.8 | 3.9 | 5.5 | 1.8
Table 5.2: Global path prediction mean errors in centimeters. V_n is the mean error of the root (hip) translation differences in the x-z plane for poses predicted n frames in the future. Y is the mean error of the root (hip) along the y axis.

5.3.2 Accuracy of the Global Path Predictor

We trained the Global Path Predictor model until it reached the precision shown in Table 5.2. The path prediction error is 0.7 cm per frame, and only 5.5 cm after 2 seconds (128 frames) of motion. This shows that our global path prediction is quite accurate for motion prediction over a few seconds.

5.3.3 Keyframe Alignment

We next evaluate how well our network is able to match the keyframe constraints. We first pretrain and fix the Global Path Predictor Network. We then use it to train the in-betweening network for 560,000 iterations and evaluate how well the synthesized motion aligns to the input keyframes. The input keyframes are given as the local joint coordinates plus the global root positions at some random frames. For testing, we generate 200 sets of 2048-frame-long motion sequences using random keyframes and DNA poses with the sampling scheme described in Section 5.2.7. We calculate the mean L2 distance error of the global root joints and the local joints at the keyframes.
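As a concrete illustration, the alignment metric can be computed as follows. The tensor layout and the function name are assumptions made for this sketch, not the original evaluation code.

```python
import torch

def keyframe_alignment_errors(synth_root, synth_local, key_idx, key_root, key_local):
    """Sketch of the keyframe alignment metric (Section 5.3.3), assuming:
    synth_root:  (T, 3) synthesized global root positions
    synth_local: (T, J, 3) synthesized local joint positions
    key_idx:     (K,) frame indices of the input keyframes
    key_root:    (K, 3) keyframe root positions
    key_local:   (K, J, 3) keyframe local joint positions
    Returns the mean Euclidean root error and mean Euclidean local joint error."""
    root_err = (synth_root[key_idx] - key_root).norm(dim=-1).mean()
    joint_err = (synth_local[key_idx] - key_local).norm(dim=-1).mean()
    return root_err, joint_err
```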
As shown in the last two rows of Table 5.4, the alignment error is sufficiently small that it can be easily removed by the post-processing in a way that is hardly perceptible. We show an example rendering of the synthesized frames overlaid with the input keyframes in Figure 5.9. The faint pink skeletons are the input keyframes, while the blue skeletons are synthesized by our method. It can be seen that the synthesized frames smoothly transition between the keyframes, which are specified at frames 4 and 11, and faithfully match the keyframes. More results for longer-term inbetweening tasks can be found in the supplementary video.

Figure 5.5: Mean L2 errors (cm) of local joints (left) and root joint (right) at keyframes throughout the training process. Blue lines refer to the result of using the sparse input format. Red lines refer to the dense input format.

Dense vs. Sparse Input. In Section 5.2.5, we mentioned that using a dense input format can accelerate the training. We show this effect in Figure 5.5 by evaluating the root and local joint alignment errors at the input keyframes throughout the training iterations. For each test, we generate 2048-frame-long motion sequences from 50 sets of randomly sampled keyframes and Motion DNAs, and we calculate the mean L2 errors between the input and output poses. The results show that the dense input format not only converges to lower errors but also converges much faster than the sparse input format.

  Sequence Length (frames) | 512 (8 s) | 1024 (17 s) | 2048 (34 s) | 4096 (68 s)
  Local Motion Generation (s) | 0.019 | 0.023 | 0.021 | 0.022
  Global Path Prediction (s) | 0.033 | 0.061 | 0.131 | 0.237
  Post-processing (s) | 0.008 | 0.017 | 0.036 | 0.070
  Total (s) | 0.059 | 0.101 | 0.188 | 0.329
Table 5.3: Mean computation time for generating motions of different lengths ("s" refers to seconds).

5.3.4 Runtime Performance

We conduct all our experiments on a machine with an Intel Core i9-7900X CPU at 3.30 GHz, 94 GB of memory, and a Titan Xp GPU. Our code is implemented on Ubuntu 16.04 and uses Python 3.6 and Torch 0.4. We tested the computation time for generating motion sequences of different lengths. Table 5.3 gives the average computation time for the network inference (forward pass) and the post-processing (to enforce keyframe matching and smooth the output motion), and shows that our method can achieve real-time performance for generating 15 seconds of motion. For longer motions, the system is still fast, using less than 0.4 seconds to generate 120 seconds of motion.

5.3.5 Motion Quality

We evaluated the quality of the synthesized motions from our model with a user study. We used the trained model to generate 50 34-second-long (2048-frame) motion sequences using the keyframe and DNA sampling scheme described in Section 5.2.7. Next, we randomly and uniformly picked 100 4-second-long subsequences from these sequences. We ran an Amazon Mechanical Turk [17] study. We showed human workers pairs of 4-second real motions and 4-second synthetic motions, with randomization for which motion is on the left vs. right. We asked the user to choose which motion looks more realistic, regardless of the content and complexity of the motion. The real motions were randomly picked from our filtered CMU Motion Capture Dataset. We distributed all of the 100 pairs in 10 Human Intelligence Tasks (HITs); a HIT is a task sent out and completed by one worker. Each worker is allowed to do at most four HITs.
Each HIT contains 10 testing pairs and 3 validation pairs. Each validation pair contains a real motion sequence and a motion sequence with artificially large noise, so as to test whether the worker is paying attention to the task. We only collected data from finished HITs that have correct answers for all three validation pairs. Ideally, if the synthetic motions had the same quality as the real motions, the workers would not be able to tell the real ones from the fake ones, and the rate of picking the synthetic ones out of all motion sequences would be close to 50%.
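The preference rates and margins of error reported in Table 5.4 are consistent with the standard formula for a binomial proportion at a 95% confidence level. The short sketch below reproduces the "All-Random" column as a sanity check; the function name is our own.

```python
import math

def preference_stats(real_votes, synthetic_votes, z=1.96):
    """Preference rate for the synthetic motions and its margin of error
    at the 95% confidence level (z = 1.96), as reported in Table 5.4."""
    n = real_votes + synthetic_votes
    p = synthetic_votes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, margin

# "All-Random" column of Table 5.4: 193 real vs. 124 synthetic choices.
p, m = preference_stats(193, 124)
print(f"{100*p:.1f}% +/- {100*m:.1f}%")   # about 39.1% +/- 5.4%
```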
As listed in Table 5.4 under the train and test settings "All-Random," the results show that around 40% of our synthetic motions are preferred by the human workers, which indicates high quality. More results can be found in the supplemental video.

5.3.6 Extension and Comparison with RTN

The related work Recurrent Transition Networks (RTN) [52] synthesizes motion transitions in a different setting than ours. Based on a fixed number of past frames and two future frames, RTN predicts a short sequence of future motion of a fixed length. Our network, on the other hand, supports arbitrary keyframe arrangements. In this section, we examine how RTN works for short-duration motions from a single class and how it works for long-duration motions mixed from multiple classes. We also show the performance of our network trained under our keyframe arrangement setting but tested under the RTN setting.

As illustrated in the first row of Figure 5.6, our network is designed to generate various outputs with arbitrary length and arbitrary numbers of keyframes. It is trained and tested with sparsely sampled keyframes from 79 classes of motions that contain transitions within or across classes of poses: this scenario is labelled "Ours-All-Random" in the experiment (Table 5.4 and Figure 5.6). Comparatively, RTN, in the original paper, is trained to predict 1 second of walking motion. In our experiment, we also trained and tested RTN's performance for predicting 1 second of walking motion, which we illustrate in Figure 5.6 as "RTN-Walk-1s." Moreover, we tested its performance for predicting longer (4-second) motions trained with all 79 classes: this is labelled "RTN-All-4s." Even though our model is always trained on all 79 classes of the CMU dataset (per Section 5.2.7), we also test our model's generalization abilities by testing in the same scenarios as RTN. The "Ours-Walk-1s" test, as a counterpart to "RTN-Walk-1s," uses one-second motion transitions with keyframes and DNA poses all sampled from the walking class only. Likewise, the "Ours-All-4s" test, as a counterpart to "RTN-All-4s," uses four-second motion transitions with keyframes and DNA poses sampled from all classes.

To evaluate the quality of the synthetic motions, we generated 100 1-second sequences from "RTN-Walk-1s" and "Ours-Walk-1s," and 100 4-second sequences from "RTN-All-4s" and "Ours-All-4s," and performed the same user study on these sequences as described in Section 5.3.5. We also evaluated the keyframe alignment errors for each method based on 200 testing sequences. As shown in Table 5.4, RTN performs well in the "RTN-Walk-1s" case. However, for the 4-second case, RTN fails to generate reasonable results, with the user preference rate dropping to 2.8% and the Joint Error and Root Error increasing to 30.3 cm and 155.4 cm, respectively. On the other hand, our network, although not trained for the RTN tasks, still has a reasonable user preference rate with low keyframe alignment error for both tasks. This indicates that our method not only performs well in the original training scenario but also generalizes well to new and unseen test scenarios. The user preference rates indicate that "Ours-All-4s" has higher motion quality than "Ours-Walk-1s." We believe this is because during training the interval length is between 0 and 10 seconds, so the 4-second interval is closer to the expected interval length during training.

  | Ours | Ours | Ours | RTN | RTN
  Train Setting | All-Random | All-Random | All-Random | Walk-1s | All-4s
  Test Setting | All-Random | Walk-1s | All-4s | Walk-1s | All-4s
  User Study:
  Real | 193 | 216 | 217 | 156 | 314
  Synthetic | 124 | 94 | 163 | 154 | 9
  User Preference | 39.1% | 30.3% | 42.9% | 49% | 2.8%
  Margin of Error | ±5.4% | ±5.1% | ±5.0% | ±5.6% | ±1.8%
  Input Frame Alignment Error:
  Joint Error (cm) | 3.5 | 4.4 | 5.6 | 2.3 | 30.3
  Root Error (cm) | 10.0 | 5.1 | 4.5 | 10.7 | 155.4
Table 5.4: Evaluation results for motion quality and keyframe alignment. The user study results were collected from 101 human workers. The rows "Real" and "Synthetic" give the number of times workers chose the real or the synthetic motion, respectively. "User Preference" is the percentage of synthetic motions chosen out of all pairs, and the margins of error are listed in the next row with a confidence level of 95%. The last two rows are the mean Euclidean errors of the global root positions and the local joint positions at the input keyframes.

In Figure 5.8, we visualize qualitative results of each method. In the first two rows, both the RTN-Walk-1s model and our model are given the same past 40 frames and 2 future frames, and both successfully generate plausible walking transitions for the one-second interval. In the last two rows, given the same input, our model generates a realistic transition while the RTN-All-4s model generates a diverging motion.

Figure 5.6: Training and testing settings for the different approaches (Ours-All-Random, RTN-Walk-1s, Ours-Walk-1s, RTN-All-4s, Ours-All-4s), showing the given keyframes and the blank frames (intervals) to be synthesized.

5.3.7 Variation Control with Motion DNA

To quantitatively evaluate the effect of Motion DNAs, we computed the two DNA errors as defined in Equations 5.11 and 5.12 over the 200 synthetic motion sequences described in Section 5.3.3. The first DNA error is 8.4 cm, which means all of the representative poses in the Motion DNA can be found somewhere in the synthesized motion sequence. The second DNA error is 6.5 cm, which means these representative poses are distributed evenly in the synthesis results.

We also visualize the effect of Motion DNA in Figure 5.9. In the top two rows, we use Motion DNA extracted from salsa poses (top row) and walking poses (second from the top) as input to the network. Under the same keyframe constraints (transparent pink skeletons), the network generates salsa (raised arms in the top row) and walking styles, respectively. In the middle two rows, we use two different martial arts Motion DNAs, and both results resemble martial arts but exhibit some differences in the gestures of the arms. A similar observation applies to the bottom two rows, where Indian dance DNAs are used as the seed of motion.
Figure 1.2 shows another example. Under the same keyframe constraints, the two synthesized motions have different trajectories and local motions, but they both travel to the same spots at the keyframes. Figure 5.7 gives more variation examples. We show the same frame of a synthesized sequence which is generated from the same keyframes but with four different Motion DNAs. We observe that variations happen more frequently in the upper body than in the lower body. Since the foot motion has more impact on the global movement, it is harder to find foot patterns that exactly satisfy the keyframe constraints.

Figure 5.7: Examples of motion variety. Within each of the four sub-figures, we visualize four poses (blue) at the same frame from a motion sequence synthesized from the same set of keyframes but with four different Motion DNAs. The transparent pink skeleton is the target keyframe pose in the near future.

Figure 5.8: Examples of generating one-second and four-second transitions given the 40 past frames and 2 future frames (rows: RTN-Walk-1s, Ours-Walk-1s, RTN-All-4s, Ours-All-4s). The first two rows show selected frames of a one-second transition sampled at 30 fps. The last two rows show frames of a four-second transition sampled at 60 fps. Pink skeletons visualize the input keyframes. From left to right, the blue skeletons show how the synthesized motion transitions between two keyframes. Numbers at the top-right corners are the frame indices. Corresponding results can be found in the supplementary video.

Figure 5.9: Examples of inbetweening. The pink skeletons visualize the user-specified keyframes. From left to right, the blue skeletons show how the synthesized motion transitions between two keyframes. The semi-transparent pink skeletons are the keyframes in the near future within 400 frames. Each group of two rows shows frames from two generated 2048-frame-long synthetic motion sequences given the same input keyframes but different Motion DNA (Salsa DNA, Walking DNA, Martial Arts DNA 1 and 2, Indian Dance DNA 1 and 2). The yellow skeletons are the representative poses for the Motion DNA.

Figure 5.10: Motion generation given sparse root coordinates. Two synthesized motion sequences given the same keyframes (positions of the root joints shown as pink dots on the ground) and different representative frames. Top row: the 100th, 500th, and 700th frames from a synthesized sequence. Bottom row: the corresponding frames from another sequence synthesized with a different set of representative frames.

Figure 5.11: Motion generation given only 2D keyframe poses. Left: 2D keyframe inputs given on the x-y plane. Right: the synthesized 3D motion sequence. The pink skeletons visualize the user-specified keyframes. From left to right, the blue skeletons show how the synthesized motion transitions between two keyframes. The semi-transparent pink skeletons are the keyframes in the near future within 400 frames.

5.4 Additional Features

In addition to the features discussed above, our method can also be used when only part of the body pose is specified at the keyframes. We change the format of the input to the network (Section 5.2.5) so that coordinates are only specified at the desired joints and the rest of the joints are masked out in the input matrix during both training and testing.
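Below is a minimal sketch of how such a partially specified input matrix could be assembled; the exact channel layout and mask convention are assumptions for illustration, not the thesis's actual input format.

```python
import torch

def build_partial_keyframe_input(total_len, key_idx, key_coords, joint_mask):
    """Sketch of a partially specified keyframe input (Section 5.4), assuming:
    total_len:   number of frames T in the output sequence
    key_idx:     (K,) keyframe indices
    key_coords:  (K, J, C) joint coordinates at the keyframes
    joint_mask:  (J,) 1 for joints the user specifies, 0 for joints to mask out
    Returns an input tensor of shape (T, J, C) and a mask of shape (T, J, 1)."""
    K, J, C = key_coords.shape
    inp = torch.zeros(total_len, J, C)
    mask = torch.zeros(total_len, J, 1)
    inp[key_idx] = key_coords * joint_mask.view(1, J, 1)
    mask[key_idx] = joint_mask.view(1, J, 1)
    return inp, mask

# Example: only the root joint (index 0) is specified at two keyframes.
m = torch.zeros(57); m[0] = 1.0
x, mask = build_partial_keyframe_input(2048, torch.tensor([100, 700]),
                                        torch.randn(2, 57, 3), m)
```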
5.4.1 Partial Body Control: Root

Figure 5.10 shows the case where only the root joint positions are specified by the user at sparse locations (pink dots on the ground). The two rows show two synthesized sequences generated with the same root joint positions but different representative frames. From left to right, the three frames are the 100th, 500th, and 700th frames. Note that the two generated sequences exhibit different root trajectories and local motions.

5.4.2 Partial Body Control: 2D Joints

Sometimes, users might want to create 3D skeleton poses by sketching the keyframes in 2D. For this task, we trained a network by providing only the x and y coordinates of the joints and the root. In Figure 5.11, we show that our network is able to lift the 2D keyframes into 3D and synthesize natural 3D motions. The recent work VideoPose3D [107] also predicts 3D motion from 2D motion, but it cannot perform the long-term motion inbetweening task.

5.4.3 Limitations

Although our network can interpolate keyframe poses over a much longer time interval than traditional methods, to guarantee the naturalness of the synthesized transition, the maximum allowable time interval between any neighbouring keyframes is still limited by the size of the receptive field of the generator. A possible solution to support even longer time intervals is hierarchical synthesis, which first synthesizes some critical frames and then, based on the synthesized critical frames, generates motions in shorter intervals.

Our Motion DNA scheme provides only weak control over the style of the output motion. How much it impacts the output style depends on the keyframe constraints. For example, as shown in the supplemental video, when the keyframes and the Motion DNA are both poses from martial arts, the output sequence usually captures the characteristics of martial arts in both arms and legs. However, when the keyframes are walking poses, the result will look less like martial arts: the output has punching and defending gestures in the upper body but walking poses in the lower body. Another limitation is that because it uses static representative frames, the Motion DNA contains only the iconic postures of the motion and might fail to capture the style information reflected in the subtle body dynamics.

5.5 Discussion

We propose a novel deep generative modeling approach that can synthesize extended and complex natural motions between highly sparse keyframes of 3D human body poses. Our key insight is that conditional GANs with a large receptive field and an appropriate input representation are suitable for enforcing keyframe constraints in a holistic way. However, although GANs can generate realistic local motions, they tend to diverge during training when generating global motions. From extensive experiments, we discovered that global motions can be predicted from local motions using neural networks. Thus, we have introduced a two-stage generative adversarial network that first synthesizes local joint rotations and then predicts a global path for the character to follow.

We also propose a novel range-constrained forward kinematics layer that helps the network explore a continuous rotation representation space with biomechanical constraints and thus can produce natural-looking motions. To synthesize different results with the same keyframes, we introduced a new authoring concept, called Motion DNA, which allows users to influence the style of the synthesized output motion.
We compare our method with a closely related state-of-the-art work, RTN [52], and show superior results for our task. Not only does our method support additional capabilities, our trained deep model also generalizes to RTN's task. Compared to previous work, our method handles more complex motions and provides users the flexibility of choosing their preferred sequence among a variety of synthesized results.

Chapter 6 Conclusion

6.1 Summary

In this dissertation, we address the problem of designing deep representations and neural network architectures for complex 3D data, covering 3D shapes, structures, and motion.

We begin with a novel convolutional neural network model for 3D meshes. Our method is a generalization of 2D CNN architectures onto arbitrary graph structures, and it has natural analogs of the operations in 2D CNNs, which allows advanced 2D deep learning mechanisms such as residual modules to be transferred to 3D or even higher-dimensional mesh data. We validate the performance of our method by training autoencoders on surface meshes, volumetric meshes, and non-manifold meshes.

Next, we dive into human hair, a more specialized subject for 3D digitization. To apply deep learning to hair, we propose to embed each hair strand, as a 3D curve, into a low-dimensional space. A global hairstyle feature vector is built from the embeddings of the hair strands, which are parameterized by a 2D convolutional network on the scalp. This novel architecture enables us to interpolate naturally between hairstyles and to infer high-resolution hair models from single-view portrait images. The proposed method runs 1000 times faster than prior art on single-view 3D hair reconstruction.

After exploring these 3D modeling topics, we move on to transformation representations and motion synthesis. For transformations like rotations, our key observation is that all representations of 3D rotations are discontinuous in Euclidean spaces with four or fewer dimensions. Since neural networks are usually defined in Euclidean space, widely used representations such as quaternions and Euler angles are discontinuous in nature and difficult for neural networks to learn. We demonstrated that 3D rotations have continuous representations in 5D and 6D, which are more suitable for learning. By simply replacing the discontinuous rotation representations with the continuous alternatives, the sporadic large errors in previous works can be largely eliminated. The same observation also applies to pose estimation tasks for other 3D structures, e.g., point clouds.

For human poses, we introduce a novel range-constrained forward kinematics layer that helps the network explore a continuous rotation representation space with biomechanical constraints and thus can produce natural-looking motions. To synthesize different results with the same keyframes, we introduce a new authoring concept, called Motion DNA, which allows users to influence the style of the synthesized output motion.

Combined with the RC-FK layer, we propose a novel deep generative modeling approach that can synthesize extended and complex natural motions between highly sparse keyframes of 3D human body poses. Our key insight is that conditional GANs with a large receptive field and an appropriate input representation are suitable for enforcing keyframe constraints in a holistic way. However, although GANs can generate realistic local motions, they tend to diverge during training when generating global motions.
From extensive experiments, we discovered that global motions can be predicted from local motions using neural networks. Thus, we have introduced a two-stage generative adversarial network that first synthesizes local joint rotations and then predicts a global path for the character to follow. We compared our method with a closely related state-of-the-art work, RTN [52], and showed superior results for our task. Not only does our method support additional capabilities, our trained deep model also generalizes to RTN's task. Compared to previous work, our method handles more complex motions and provides users the flexibility of choosing their preferred sequence among a variety of synthesized results.

6.2 Limitations and Future Work

In this section, we discuss the limitations and future directions of the works in this thesis.

For mesh convolution, our method can only work on mesh data that share the same topology, while in our experiments, state-of-the-art mesh convolution methods that handle varying topologies still cannot achieve comparable performance. Several future directions are possible with our formulation. Although our method can learn on arbitrary graphs, all the graphs in the dataset must have the same topology; we plan to extend our work so that it can operate on datasets with varying topology.

For hair modeling, we found that our approach fails to generate exotic hairstyles like kinky, afro, or buzz cuts, as shown in Figure 3.10. We think the main reason is that we do not have such hairstyles in our training database. Building a large hair dataset that covers more variations could mitigate this problem. Our method would also fail when the hair is partially occluded; thus we plan to enhance our training in the future by adding random occlusions. In addition, we use face detection to estimate the pose of the torso in this work, but it could be replaced by using deep learning to segment the head and body. Currently, the generated hair model is insufficiently temporally coherent for video frames; integrating temporal smoothness as a constraint during training is also an interesting future direction. Although our network provides a more compact representation for the hair, the latent representation has no semantic meaning. It would be interesting to concatenate explicit labels (e.g., color) to the latent variable for controlled training.

In terms of motion synthesis, in the near future we plan to train a more flexible system that allows users to control various subsets of joints or specify an arbitrary trajectory. This can be easily achieved by masking out random joints or providing trajectories in the input during training, which is supported by our input format. Another interesting direction is to extend our method to long-term inbetweening of non-human characters, e.g., quadrupeds. While our current approach is highly efficient and can run in real time for short sequences, being able to also update extended motion sequences in real time would enable new possibilities and impact next-generation character animation control systems in gaming and motion planning in robotics.

Bibliography

[1] Rotation matrix. https://en.wikipedia.org/wiki/Rotation_matrix#Quaternion.

[2] Stiefel manifold. https://en.wikipedia.org/wiki/Stiefel_manifold.

[3] Akhter, I., and Black, M. J. Pose-conditioned joint angle limits for 3d human pose reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp.
1446–1455. [4] Allen, B. F., and Faloutsos, P. Evolved controllers for simulated locomotion. In Motion in Games, Second International Workshop, MIG 2009, Zeist, The Netherlands, November 21-24, 2009. Proceedings (2009), pp. 219–230. [5] Allen-Blanchette, C., Leonardos, S., and Gallier, J. Motion interpolation in sim (3). [6] Aristidou, A., Cohen-Or, D., Hodgins, J. K., Chrysanthou, Y., and Shamir, A. Deep motifs and motion signatures. In SIGGRAPH Asia 2018 Technical Papers (2018), ACM, p. 187. [7] Arts, E. The sims resource, 2017. [8] Atwood, J., and Towsley, D. Diffusion-convolutional neural net- works. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 1993–2001. [9] Barbič, J., Sin, F. S., and Schroeder, D. Vega fem library, 2012. http://www.jernejbarbic.com/vega. [10] Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory 39, 3 (1993), 930–945. 118 [11] Beeler, T., Bickel, B., Noris, G., Marschner, S., Beardsley, P., Sumner, R. W., and Gross, M. Coupled 3d reconstruction of sparse facial hair and skin. ACM Trans. Graph. 31 (August 2012), 117:1–117:10. [12] Ben-Hamu, H., Maron, H., Kezurer, I., Avineri, G., and Lipman, Y. Multi-chart generative surface modeling. ACM Trans. Graph. 37 (2018), 215:1–215:15. [13] Bloom, D. M. Linear algebra and geometry. CUP Archive, 1979. [14] Boscaini, D., Masci, J., Rodolà, E., and Bronstein, M. Learning shape correspondence with anisotropic convolutional neural networks. In Advances in neural information processing systems (2016), pp. 3189–3197. [15] Bouritsas, G., Bokhnyak, S., Ploumpis, S., Bronstein, M., and Zafeiriou, S. Neural 3d morphable models: Spiral convolutional networks for 3d shape representation learning and generation. In Proceedings of the IEEE International Conference on Computer Vision (2019), pp. 7213–7222. [16] Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. Spectral net- works and locally connected networks on graphs. International Conference on Learning Representations (ICLR) (2014). [17] Buhrmester, M., Kwang, T., and Gosling, S. D. Amazon’s mechan- ical turk: A new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science 6, 1 (2011), 3–5. PMID: 26162106. [18] Cai, H., Bai, C., Tai, Y.-W., and Tang, C.-K. Deep video generation, prediction and completion of human action sequences. In Proceedings of the European Conference on Computer Vision (ECCV) (2018), pp. 366–382. [19] Cao, X., Wei, Y., Wen, F., and Sun, J. Face alignment by explicit shape regression. International Journal of Computer Vision 107, 2 (2014), 177–190. [20] Chai,J.,andHodgins,J.K. Constraint-basedmotionoptimizationusing a statistical dynamic model. ACM Trans. Graph. 26, 3 (2007), 8. [21] Chai, M., Shao, T., Wu, H., Weng, Y., and Zhou, K. Autohair: Fully automatic hair modeling from a single image. ACM Transactions on Graphics (TOG) 35, 4 (2016), 116. [22] Chai, M., Wang, L., Weng, Y., Jin, X., and Zhou, K. Dynamic hair manipulation in images and videos. ACM Trans. Graph. 32, 4 (July 2013), 75:1–75:8. 119 [23] Chai, M., Wang, L., Weng, Y., Yu, Y., Guo, B., and Zhou, K. Single-view hair modeling for portrait manipulation. ACM Trans. Graph. 31, 4 (July 2012), 116:1–116:8. [24] Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al. Shapenet: An information-rich 3d model repository. 
arXiv preprint arXiv:1512.03012 (2015). [25] Chen, L., Zhang, H., Xiao, J., Nie, L., Shao, J., Liu, W., and Chua, T.-S. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 6298–6306. [26] Chen, Z., and Cao, F. The construction and approximation of neural networks operators with gaussian activation function. Mathematical Com- munications 18, 1 (2013), 185–207. [27] Choe, B., and Ko, H. A statistical wisp model and pseudophysical approaches for interactivehairstyle generation. IEEE Trans. Vis. Comput. Graph. 11, 2 (2005), 160–170. [28] Choy, C.B., Xu, D., Gwak, J., Chen, K., andSavarese, S. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. CoRR abs/1604.00449 (2016). [29] Clevert, D.-A., Unterthiner, T., and Hochreiter, S. Fast and ac- curatedeepnetworklearningbyexponentiallinearunits(elus). International Conference on Learning Representations (ICLR) (2016). [30] CMU. Cmu graphics lab motion capture database, 2010. Accessed: 2019- 05-19. [31] Csiszar, A., Eilers, J., andVerl, A. On solving the inverse kinematics problem using neural networks. In Mechatronics and Machine Vision in Practice (M2VIP), 2017 24th International Conference on (2017), IEEE, pp. 1–6. [32] Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., and Wei, Y. Deformable convolutional networks. 2017 IEEE International Conference on Computer Vision (ICCV) (2017), 764–773. [33] Davis,D.M. Embeddingsofrealprojectivespaces. Bol. Soc. Mat. Mexicana (3) 4 (1998), 115–122. 120 [34] Defferrard, M., Bresson, X., and Vandergheynst, P. Convolu- tional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems (2016), pp. 3844–3852. [35] Defferrard, M., Bresson, X., and Vandergheynst, P. Convo- lutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Asso- ciates, Inc., 2016, pp. 3844–3852. [36] Do, T.-T., Cai, M., Pham, T., and Reid, I. Deep-6dpose: Recovering 6d object pose from a single rgb image. arXiv preprint arXiv:1802.10367 (2018). [37] Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. Convolutional networksongraphsforlearningmolecularfingerprints. InAdvances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2224– 2232. [38] Echevarria, J. I., Bradley, D., Gutierrez, D., and Beeler, T. Capturing and stylizing hair for 3d fabrication. ACM Trans. Graph. 33, 4 (July 2014), 125:1–125:11. [39] Falorsi, L., de Haan, P., Davidson, T. R., De Cao, N., Weiler, M., Forré, P., and Cohen, T. S. Explorations in homeomorphic varia- tional auto-encoding. arXiv preprint arXiv:1807.04689 (2018). [40] Fan, H., Su, H., and Guibas, L. J. A point set generation network for 3d object reconstruction from a single image. CoRR abs/1612.00603 (2016). [41] Fey,M.,EricLenssen,J.,Weichert,F.,andMüller,H. Splinecnn: Fast geometric deep learning with continuous b-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 869–877. [42] Fey, M., andLenssen, J.E. Fast graph representation learning with Py- Torch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds (2019). 
[43] Fu, H., Wei, Y., Tai, C.-L., and Quan, L. Sketching hairstyles. In Proceedings of the 4th Eurographics Workshop on Sketch-based Interfaces and Modeling (New York, NY, USA, 2007), SBIM ’07, ACM, pp. 31–36. 121 [44] Gao, G., Lauri, M., Zhang, J., and Frintrop, S. Occlusion resis- tant object rotation regression from point cloud segments. arXiv preprint arXiv:1808.05498 (2018). [45] Girdhar, R., Fouhey, D. F., Rodriguez, M., and Gupta, A. Learn- ing a predictable and generative vector representation for objects. CoRR abs/1603.08637 (2016). [46] Grassia, F. S. Practical parameterization of rotations using the exponen- tial map. Journal of graphics tools 3, 3 (1998), 29–48. [47] Grochow, K., Martin, S. L., Hertzmann, A., and Popović, Z. Style-based inverse kinematics. In ACM transactions on graphics (TOG) (2004), vol. 23, ACM, pp. 522–531. [48] Guerrero, P., Kleiman, Y., Ovsjanikov, M., and Mitra, N. J. Pcpnet: Learning local shape properties from raw point clouds. Computer Graphics Forum (Eurographics) (2017). [49] Hadap, S., Cani, M.-P., Lin, M., Kim, T.-Y., Bertails, F., Marschner, S., Ward, K., and Kačić-Alesić, Z. Strands and hair: modeling, animation, and rendering. In ACM SIGGRAPH 2007 courses (2007), ACM, pp. 1–150. [50] Häne, C., Tulsiani, S., and Malik, J. Hierarchical surface prediction for 3d object reconstruction. CoRR abs/1704.00710 (2017). [51] Hanocka, R., Hertz, A., Fish, N., Giryes, R., Fleishman, S., and Cohen-Or, D. Meshcnn: a network with an edge. ACM Transactions on Graphics (TOG) 38, 4 (2019), 1–12. [52] Harvey, F. G., and Pal, C. Recurrent transition networks for character locomotion. In SIGGRAPH Asia 2018 Technical Briefs (2018), no. Article No. 4, ACM, ACM Press / ACM SIGGRAPH. [53] He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into recti- fiers: Surpassing human-level performance on imagenet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV) (Washington, DC, USA, 2015), ICCV ’15, IEEE Computer Society, pp. 1026–1034. [54] Hermosilla, P., Ritschel, T., Vázquez, P.-P., Vinacua, À., and Ropinski, T. Monte carlo convolution for learning on non-uniformly sam- pled point clouds. ACM Transactions on Graphics (TOG) 37, 6 (2018), 1–12. 122 [55] Herrera, T. L., Zinke, A., and Weber, A. Lighting hair from the inside: A thermal approach to hair reconstruction. ACM Trans. Graph. 31, 6 (Nov. 2012), 146:1–146:9. [56] Holden, D., Komura, T., and Saito, J. Phase-functioned neural net- works for character control. ACM Transactions on Graphics (TOG) 36, 4 (2017), 42. [57] Holden, D., Saito, J., and Komura, T. A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (TOG) 35, 4 (2016), 138. [58] Hong, S., Han, D., Cho, K., Shin, J. S., and Noh, J. Physics-based full-body soccer motion control for dribbling and shooting. ACM Transac- tions on Graphics (TOG) 38, 4 (2019). [59] Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural networks 4, 2 (1991), 251–257. [60] Hsu, H.-W., Wu, T.-Y., Wan, S., Wong, W. H., and Lee, C.-Y. Quatnet: Quaternion-based head pose estimation with multi-regression loss. IEEE Transactions on Multimedia (2018). [61] Hu,L.,Ma,C.,Luo,L.,andLi,H. Robust hair capture using simulated examples. ACM Transactions on Graphics (Proceedings SIGGRAPH 2014) 33, 4 (July 2014). [62] Hu, L., Ma, C., Luo, L., and Li, H. Single-view hair modeling using a hairstyle database. ACM Transactions on Graphics (TOG) 34, 4 (2015), 125. 
[63] Hu, L., Ma, C., Luo, L., Wei, L.-Y., and Li, H. Capturing braided hairstyles. ACM Transactions on Graphics (Proceedings SIGGRAPH Asia 2014) 33, 6 (December 2014). [64] Hu, L., Saito, S., Wei, L., Nagano, K., Seo, J., Fursund, J., Sadeghi, I., Sun, C., Chen, Y.-C., and Li, H. Avatar digitization from a single image for real-time rendering. ACM Transactions on Graphics (TOG) 36, 6 (2017), 195. [65] Igarashi, T., Moscovich, T., and Hughes, J. F. Spatial keyframing forperformance-drivenanimation. InACMSIGGRAPH 2007courses (2007), ACM, p. 25. 123 [66] Ioffe, S., and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015). [67] Jackson, A. S., Bulat, A., Argyriou, V., and Tzimiropoulos, G. Large pose 3d face reconstruction from a single image via direct volumetric CNN regression. International Conference on Computer Vision (2017). [68] Jakob,W.,Moon,J.T.,andMarschner,S. Capturinghairassemblies fiber by fiber. ACM Trans. Graph. 28, 5 (Dec. 2009), 164:1–164:9. [69] Jia, X., De Brabandere, B., Tuytelaars, T., and Gool, L. V. Dynamic filter networks. In Advances in Neural Information Processing Sys- tems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 667–675. [70] Jiang, Y., and Liu, C. K. Data-driven approach to simulating realistic human joint constraints. In 2018 IEEE International Conference on Robotics and Automation (ICRA) (2018), IEEE, pp. 1098–1103. [71] Jiang,Y.,VanWouwe,T.,DeGroote,F.,andLiu,C.K. Synthesis of biologically realistic human motion using joint torque actuation. arXiv preprint arXiv:1904.13041 (2019). [72] Kanazawa, A., Black, M. J., Jacobs, D. W., and Malik, J. End- to-end recovery of human shape and pose. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018). [73] Kendall, A., Cipolla, R., et al. Geometric loss functions for camera pose regression with deep learning. In Proc. CVPR (2017), vol. 3, p. 8. [74] Kendall, A., Grimes, M., and Cipolla, R. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer vision (2015), pp. 2938–2946. [75] Kim, T.-Y., and Neumann, U. Interactive multiresolution hair modeling and editing. ACM Trans. Graph. 21, 3 (July 2002), 620–629. [76] Kingma, D. P., and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR) (2015). [77] Kipf, T., and Welling, M. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representa- tions (ICLR) abs/1609.02907 (2017). 124 [78] Kipf, T. N., and Welling, M. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016). [79] Kosniowski, C. A first course in algebraic topology. CUP Archive, page 53, 1980. [80] Kovar, L., and Gleicher, M. Automated extraction and parameteriza- tion of motions in large data sets. In ACM Transactions on Graphics (ToG) (2004), vol. 23, ACM, pp. 559–568. [81] Kovar, L., Gleicher, M., and Pighin, F. Motion graphs. In ACM SIGGRAPH 2008 classes (2008), ACM, p. 51. [82] LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature 521, 7553 (2015), 436. [83] Lee, J., Chai, J., Reitsma, P. S., Hodgins, J. K., and Pollard, N. S. Interactive control of avatars animated with human motion data. In ACM Transactions on Graphics (ToG) (2002), vol. 21, ACM, pp. 491–500. [84] Lehrmann,A.M.,Gehler,P.V.,andNowozin,S. 
Efficientnonlinear markov models for human motion. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014 (2014), pp. 1314–1321. [85] Levine, S., and Koltun, V. Learning complex neural network policies with trajectory optimization. In Proceedings of the 31th International Con- ference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014 (2014), pp. 829–837. [86] Levine, S., Wang, J. M., Haraux, A., Popovic, Z., and Koltun, V. Continuous character control with low-dimensional embeddings. ACM Trans. Graph. 31, 4 (2012), 28:1–28:10. [87] Li,H.,Trutoiu,L.,Olszewski,K.,Wei,L.,Trutna,T.,Hsieh,P.- L., Nicholls, A., and Ma, C. Facial performance sensing head-mounted display. ACM Transactions on Graphics (TOG) 34, 4 (2015), 47. [88] Li, Z., Zhou, Y., Xiao, S., He, C., Huang, Z., and Li, H. Auto- conditioned recurrent networks for extended complex human motion synthe- sis. International Conference on Learning Representations (ICLR). [89] Likhachev, M., Gordon, G. J., and Thrun, S. Ara*: Anytime a* with provable bounds on sub-optimality. In Advances in neural information processing systems (2004), pp. 767–774. 125 [90] Liu, L., and Hodgins, J. Learning to schedule control fragments for physics-based characters using deep q-learning. ACM Transactions on Graphics (TOG) 36, 3 (2017), 29. [91] Liu, L., and Hodgins, J. Learning basketball dribbling skills using tra- jectory optimization and deep reinforcement learning. ACM Transactions on Graphics (TOG) 37, 4 (2018), 142. [92] Llanas, B., Lantarón, S., and Sáinz, F. J. Constructive approxima- tion of discontinuous functions by neural networks. Neural Processing Letters 27, 3 (2008), 209–226. [93] Luo, L., Li, H., and Rusinkiewicz, S. Structure-aware hair capture. ACM Transactions on Graphics (Proceedings SIGGRAPH 2013) 32, 4 (July 2013). [94] Mao, X., Li, Q., Xie, H., Lau, R. Y. K., Wang, Z., and Smolley, S. P. Least squares generative adversarial networks. 2017 IEEE Interna- tional Conference on Computer Vision (ICCV) (2017), 2813–2821. [95] Masci, J., Boscaini, D., Bronstein, M., and Vandergheynst, P. Geodesic convolutional neural networks on riemannian manifolds. In Pro- ceedings of the IEEE international conference on computer vision workshops (2015), pp. 37–45. [96] Min, J., Chen, Y., and Chai, J. Interactive generation of human ani- mation with deformable motion models. ACM Trans. Graph. 29, 1 (2009), 9:1–9:12. [97] Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., and Bronstein,M.M. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 5115–5124. [98] Mordatch, I., Lowrey, K., Andrew, G., Popovic, Z., and Todorov, E. Interactive control of diverse complex characters with neural networks. In Advances in Neural Information Processing Systems 28: An- nual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada (2015), pp. 3132–3140. [99] Mukai, T., and Kuriyama, S. Geostatistical motion interpolation. In ACM Transactions on Graphics (TOG)(2005), vol.24, ACM,pp.1062–1070. [100] Naderi, K., Rajamäki, J., and Hämäläinen, P. Discovering and syn- thesizing humanoid climbing movements. ACM Transactions on Graphics (TOG) 36, 4 (2017), 43. 126 [101] Olszewski,K.,Lim,J.J.,Saito,S.,andLi,H. High-fidelity facial and speech animation for vr hmds. 
ACM Transactions on Graphics (Proceedings SIGGRAPH Asia 2016) 35, 6 (December 2016). [102] Paris, S., Chang, W., Kozhushnyan, O. I., Jarosz, W., Matusik, W., Zwicker, M., and Durand, F. Hair photobooth: Geometric and photometric acquisition of real hairstyles. ACM Trans. Graph. 27, 3 (Aug. 2008), 30:1–30:9. [103] Park, J. J., Florence, P., Straub, J., Newcombe, R. A., and Lovegrove, S. Deepsdf: Learning continuous signed distance functions for shape representation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019), 165–174. [104] Park, S. I., Shin, H. J., and Shin, S. Y. On-line locomotion gen- eration based on motion blending. In Proceedings of the 2002 ACM SIG- GRAPH/Eurographics symposium on Computer animation (2002), ACM, pp. 105–111. [105] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., De- Vito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Auto- matic differentiation in pytorch. In NIPS-W (2017). [106] Pavllo, D., Feichtenhofer, C., Auli, M., and Grangier, D. Modeling human motion with quaternion-based neural networks. CoRR abs/1901.07677 (2019). [107] Pavllo, D., Feichtenhofer, C., Grangier, D., and Auli, M. 3d human pose estimation in video with temporal convolutions and semi- supervised training. In Conference on Computer Vision and Pattern Recog- nition (CVPR) (2019). [108] Peng, X. B., Abbeel, P., Levine, S., and van de Panne, M. Deep- mimic: Example-guided deep reinforcement learning of physics-based char- acter skills. ACM Transactions on Graphics (TOG) 37, 4 (2018), 143. [109] Peng,X.B.,Berseth,G.,Yin,K.,andVanDePanne,M. Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics (TOG) 36, 4 (2017), 41. [110] Qi, C. R., Su, H., Mo, K., and Guibas, L. J. Pointnet: Deep learning on point sets for 3d classification and segmentation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 77–85. 127 [111] Qi, C. R., Su, H., Mo, K., and Guibas, L. J. Pointnet: Deep learn- ing on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593 (2016). [112] Qi, C. R., Su, H., Mo, K., and Guibas, L. J. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE 1, 2 (2017), 4. [113] Qi, C. R., Yi, L., Su, H., and Guibas, L. J. Pointnet++: Deep hi- erarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413 (2017). [114] Ranjan, A., Bolkart, T., Sanyal, S., and Black, M. J. Generating 3dfacesusingconvolutionalmeshautoencoders. InThe European Conference on Computer Vision (ECCV) (September 2018). [115] Riegler,G.,Ulusoy,A.O.,andGeiger,A. Octnet: Learningdeep3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), vol. 3. [116] Rose, C., Cohen, M. F., and Bodenheimer, B. Verbs and adverbs: Multidimensional motion interpolation. IEEE Computer Graphics and Ap- plications 18, 5 (1998), 32–40. [117] Rose III, C. F., Sloan, P.-P. J., and Cohen, M. F. Artist-directed inverse-kinematics using radial basis function interpolation. In Computer Graphics Forum (2001), vol. 20, Wiley Online Library, pp. 239–250. [118] Roweis, S. T., and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. science 290, 5500 (2000), 2323–2326. [119] Safonova, A., and Hodgins, J. K. Construction and optimal search of interpolated motion graphs. 
ACM Transactions on Graphics (TOG) 26, 3 (2007), 106. [120] Saxena, A., Driemeyer, J., and Ng, A. Y. Learning 3-d object ori- entation from images. In Robotics and Automation, 2009. ICRA’09. IEEE International Conference on (2009), IEEE, pp. 794–800. [121] Sin, F. S., Schroeder, D., and Barbič, J. Vega: non-linear fem de- formable object simulator. In Computer Graphics Forum (2013), vol. 32, Wiley Online Library, pp. 36–48. [122] Su, H., Jampani, V., Sun, D., Gallo, O., Learned-Miller, E., and Kautz, J. Pixel-adaptive convolutional neural networks. In Proceedings of 128 the IEEE Conference on Computer Vision and Pattern Recognition (2019), pp. 11166–11175. [123] Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. Deepface: Closing the gap to human-level performance in face verification. In 2014 IEEE Conference on Computer Vision and Pattern Recognition (June 2014), pp. 1701–1708. [124] Tan, J., Gu, Y., Liu, C. K., and Turk, G. Learning bicycle stunts. ACM Trans. Graph. 33, 4 (2014), 50:1–50:12. [125] Tatarchenko, M., Dosovitskiy, A., and Brox, T. Octree gener- ating networks: Efficient convolutional architectures for high-resolution 3d outputs. CoRR, abs/1703.09438 (2017). [126] Tulsiani, S., Zhou, T., Efros, A. A., and Malik, J. Multi-view supervision for single-view reconstruction via differentiable ray consistency. CoRR abs/1704.06254 (2017). [127] Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., Doso- vitskiy, A., and Brox, T. Demon: Depth and motion network for learn- ing monocular stereo. In IEEE Conference on computer vision and pattern recognition (CVPR) (2017), vol. 5, p. 6. [128] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. International Conference on Learning Representations (ICLR) (2018). [129] Verma, N., Boyer, E., and Verbeek, J. Dynamic filters in graph convolutional networks. ArXiv abs/1706.05206 (2017). [130] Verma, N., Boyer, E., and Verbeek, J. Feastnet: Feature-steered graph convolutions for 3d shape analysis. In Proceedings of the IEEE con- ference on computer vision and pattern recognition (2018), pp. 2598–2606. [131] Villegas, R., Yang, J., Ceylan, D., and Lee, H. Neural kinematic networks for unsupervised motion retargetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 8639– 8648. [132] Wang, J., Chen,K.,Xu,R.,Liu,Z.,Loy,C.C.,andLin,D. Carafe: Content-aware reassembly of features. In Proceedings of International Con- ference on Computer Vision (2019). 129 [133] Wang, J. M., Fleet, D. J., and Hertzmann, A. Gaussian process dynamical models for human motion. IEEE Trans. Pattern Anal. Mach. Intell. 30, 2 (2008), 283–298. [134] Wang, L., Yu, Y., Zhou, K., and Guo, B. Example-based hair geom- etry synthesis. ACM Trans. Graph. 28, 3 (2009), 56:1–56:9. [135] Wang, P.-S., Liu, Y., Guo, Y.-X., Sun, C.-Y., and Tong, X. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Trans. Graph. 36, 4 (July 2017), 72:1–72:11. [136] Ward, K., Bertails, F., yong Kim, T., Marschner, S. R., paule Cani, M., and Lin, M. C. A survey on hair modeling: styling, simula- tion, and rendering. In IEEE TRANSACTION ON VISUALIZATION AND COMPUTER GRAPHICS (2006), pp. 213–234. [137] Weng, Y., Wang, L., Li, X., Chai, M., and Zhou, K. Hair Interpola- tion for Portrait Morphing. Computer Graphics Forum (2013). [138] Wikipedia contributors. Euler angles — Wikipedia, the free encyclopedia, 2019. https://en.wikipedia.org/w/index.php?title=Euler_angles&oldid=891315801. 
[139] Xiang, Y., Schmidt, T., Narayanan, V., and Fox, D. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. Robotics: Science and Systems (RSS) (2018). [140] Xu, Q., Sun, X., Wu, C.-y., Wang, P., and Neumann, U. Grid-gcn for fast and scalable point cloud learning. arXiv preprint arXiv:1912.02984 (2019). [141] Xu, Z., and Cao, F. The essential order of approximation for neural networks. Science in China Series F: Information Sciences 47, 1 (2004), 97–112. [142] Xu, Z., Wu, H.-T., Wang, L., Zheng, C., Tong, X., and Qi, Y. Dynamic hair capture using spacetime optimization. ACM Trans. Graph. 33, 6 (Nov. 2014), 224:1–224:11. [143] Xu, Z.-B., and Cao, F.-L. Simultaneous lp-approximation order for neu- ral networks. Neural Networks 18, 7 (2005), 914–923. [144] Yan, S., Li, Z., Xiong, Y., Yan, H., and Lin, D. Convolutional se- quence generation for skeleton-based action synthesis. 2019 IEEE Interna- tional Conference on Computer Vision (ICCV) (2019). 130 [145] Yan, X., Rastogi, A., Villegas, R., Sunkavalli, K., Shechtman, E., Hadap, S., Yumer, E., and Lee, H. Mt-vae: Learning motion transformations to generate multimodal human dynamics. In Proceedings of the European Conference on Computer Vision (ECCV) (2018), pp. 265–281. [146] Yu, W., Turk, G., and Liu, C. K. Learning symmetry and low-energy locomotion. ArXiv e-prints (2018). [147] Yu, Y. Modeling realistic virtual hairstyles. In Computer Graphics and Applications, 2001. Proceedings. Ninth Pacific Conference on (2001), IEEE, pp. 295–304. [148] Yuksel, C., Schaefer, S., and Keyser, J. Hair meshes. ACM Trans. Graph. 28, 5 (Dec. 2009), 166:1–166:7. [149] Zhang, H., Starke, S., Komura, T., and Saito, J. Mode-adaptive neuralnetworksforquadrupedmotioncontrol. ACM Transactions on Graph- ics (TOG) 37, 4 (2018), 145. [150] Zhang,M.,Chai,M.,Wu,H.,Yang,H.,andZhou,K. A data-driven approach to four-view image-based hair modeling. ACM Trans. Graph. 36, 4 (July 2017), 156:1–156:11. [151] Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. Pyramid scene parsing network. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017). [152] Zhirong Wu, Song, S., Khosla, A., Fisher Yu, Linguang Zhang, Xiaoou Tang, and Xiao, J. 3d shapenets: A deep representation for vol- umetric shapes. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2015), pp. 1912–1920. [153] Zhou, X., Sun, X., Zhang, W., Liang, S., and Wei, Y. Deep kine- matic pose regression. In European Conference on Computer Vision (2016), Springer, pp. 186–201. [154] Zhou, Y., Barnes, C., Lu, J., Yang, J., and Li, H. On the continu- ity of rotation representations in neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition abs/1812.07035 (2019). 131 Appendix A Mesh Convolution A.1 Method Details To make it clearer, in vcConv or vcTransConv the weight coefficients A i,j are different for each edge between y i andN (i), butB is shared globally in one layer. Similarly, in vcPool or vcUnpool, the density coefficients ρ i,j are also different for each edge. They are all training parameters. Since the topology is fixed, their indices remain the same and their values are shared across the entire dataset. A.1.1 Network Implementation In our implementation, we precompute the graph sampling process and record {N (i)} in a table of vertex connections. Each line i in the table contains the indices of the vertices in{N (i)} in the input graph. 
In practice, we store the table using an N×Max(E_i) integer tensor. The training parameters for the convolution coefficients are stored in a tensor of size N×Max(E_i)×M, and the basis is a tensor of size M×O×I. We use an N×Max(E_i) mask to mask out the non-existent entries. For training and testing, the whole network forward and backward passes are fully parallelized on the GPU and implemented in PyTorch. During testing, the kernels are pre-computed using Equation 2.2 to accelerate inference.

A.1.2 Graph Sampling Algorithm

To select a subset of sampled vertices from the original graph with stride s, for each connected component we start from a random vertex, mark it as selected, and mark its (s−1)-ring neighbors as removed; we then traverse the vertices on the s-ring until finding a vertex that is not marked and has no (s−1)-ring neighbors marked as selected. Next, we mark this vertex as selected and start a new round of searching. We repeat these steps recurrently using a queue structure until all vertices are marked. One can also manually assign certain vertices to be included in the sampling set by marking them before starting the search. After collecting the sampled vertices, we create the input and output graphs for the up-sampling process and reverse them for the down-sampling process, and then create edges between the input and output graphs according to the topology of the original graph under a certain kernel radius, as illustrated in Figure A.1. The edges that connect the output vertex y_i and the input vertices x_{i,j} have learnable weights A_{i,j} or ρ_{i,j}, and we denote {x_{i,j}} as N(i).

Figure A.1: Steps for graph up- and down-sampling.

Figure A.2: More up-sampling examples.
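To make the tensor layout described in Section A.1.1 concrete, the following PyTorch sketch shows how per-edge coefficients and a globally shared basis could be combined into per-edge kernels and applied through such a connection table. The variable names and the gather-based formulation are our own assumptions, not the released implementation.

```python
import torch

# Assumed shapes (illustrative):
# N = number of output vertices, E = Max(E_i), M = basis size,
# I = input channels, O = output channels.
N, E, M, I, O = 1000, 9, 17, 16, 32

nbr_idx = torch.randint(0, 1200, (N, E))   # connection table {N(i)}, padded
mask    = (torch.rand(N, E) > 0.1).float() # 0 for non-existent neighbors
A       = torch.randn(N, E, M)             # per-edge weight coefficients A_{i,j}
B       = torch.randn(M, O, I)             # globally shared basis B

def vc_conv(x):
    """x: (1200, I) input vertex features -> (N, O) output features."""
    # Pre-compute per-edge kernels: W[i, j] = sum_m A[i, j, m] * B[m]
    W = torch.einsum('nem,moi->neoi', A, B)          # (N, E, O, I)
    x_nbr = x[nbr_idx]                               # gather neighbor features, (N, E, I)
    y = torch.einsum('neoi,nei->no', W, x_nbr * mask.unsqueeze(-1))
    return y

out = vc_conv(torch.randn(1200, I))
```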
B.3 Detailed Network Architecture

Table B.1: Details of our network architecture. "in_ch" denotes the input channel size and "out_ch" the output channel size.

Encoder (input: 2 × 256 × 256, output: 512)
Conv2d(in_ch=3, out_ch=32, kernel_size=8, stride=2, padding=3), ReLU
Conv2d(in_ch=32, out_ch=64, kernel_size=8, stride=2, padding=3), ReLU
Conv2d(in_ch=64, out_ch=128, kernel_size=6, stride=2, padding=2), ReLU
Conv2d(in_ch=128, out_ch=256, kernel_size=4, stride=2, padding=1), ReLU
Conv2d(in_ch=256, out_ch=256, kernel_size=3, stride=1, padding=1), ReLU
Conv2d(in_ch=256, out_ch=512, kernel_size=4, stride=2, padding=1), ReLU
Conv2d(in_ch=512, out_ch=512, kernel_size=3, stride=1, padding=1), ReLU
MaxPool2d(kernel_size=8), Tanh

Decoder (input: 512, output: 512 × 32 × 32)
Linear(512, 1024), ReLU
Linear(1024, 4096), ReLU
Bilinear Upsample(scale_factor=2)
Conv2d(in_ch=256, out_ch=512, kernel_size=3, stride=1, padding=1), ReLU
Bilinear Upsample(scale_factor=2)
Conv2d(in_ch=512, out_ch=512, kernel_size=3, stride=1, padding=1), ReLU
Bilinear Upsample(scale_factor=2)
Conv2d(in_ch=512, out_ch=512, kernel_size=3, stride=1, padding=1), ReLU

Strand Curvature Decoder (input: 512 × 32 × 32, output: 100 × 32 × 32)
Conv2d(in_ch=512, out_ch=512, kernel_size=1, stride=1, padding=0), ReLU
Conv2d(in_ch=512, out_ch=512, kernel_size=1, stride=1, padding=0), Tanh
Conv2d(in_ch=512, out_ch=100, kernel_size=1, stride=1, padding=0)

Strand Position Decoder (input: 512 × 32 × 32, output: 300 × 32 × 32)
Conv2d(in_ch=512, out_ch=512, kernel_size=1, stride=1, padding=0), ReLU
Conv2d(in_ch=512, out_ch=512, kernel_size=1, stride=1, padding=0), Tanh
Conv2d(in_ch=512, out_ch=300, kernel_size=1, stride=1, padding=0)

B.4 Results Gallery

Figure B.3: Our single-view reconstruction results for various hairstyles.

Figure B.4: Our single-view reconstruction results for various hairstyles.

Figure B.5: Our single-view reconstruction results for various hairstyles.

Figure B.6: Our single-view reconstruction results for various hairstyles.

Appendix C

Rotation Representations

C.1 Overview of the Supplemental Document

In this supplemental material, we first give our 6D representation for the 3D rotations explicitly in Section C.2. We then prove formally in Section C.3 that the formula of the 5D representation, as defined in Case 4 of Section 4.3.2, satisfies all of the properties of a continuous representation. Next, we discuss quaternions in more depth. In Section C.4 we present a result that was elided from the main paper due to space limitations: the unit quaternions are also a discontinuous representation for the 3D rotations. Then, in Section C.5, we show how the continuous 5D and 6D representations can interact with common discontinuous angle representations such as quaternions. Next, in Section C.6, we visualize discontinuities that are present in some of the representations. Finally, we present some additional empirical results in Section C.7.

C.2 6D Representation for the 3D Rotations

The mapping from SO(3) to our 6D representation is:

\[ g_{GS}\!\left( \begin{bmatrix} a_1 & a_2 & a_3 \end{bmatrix} \right) = \begin{bmatrix} a_1 & a_2 \end{bmatrix} \tag{C.1} \]

where $a_i$ denotes the $i$th column of the rotation matrix. The mapping from our 6D representation to SO(3) is:

\[ f_{GS}\!\left( \begin{bmatrix} a_1 & a_2 \end{bmatrix} \right) = \begin{bmatrix} b_1 & b_2 & b_3 \end{bmatrix} \tag{C.2} \]

\[ b_i = \begin{cases} N(a_1) & \text{if } i = 1 \\ N\big(a_2 - (b_1 \cdot a_2)\, b_1\big) & \text{if } i = 2 \\ b_1 \times b_2 & \text{if } i = 3 \end{cases} \tag{C.3} \]

where $N(\cdot)$ denotes vector normalization and the $b_i$ are the columns of the output rotation matrix.
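As a concrete reading of Equations (C.1) through (C.3), the two mappings can be written in a few lines of Python. This is our own sketch (function names and array conventions are assumptions), not code from the thesis.

import numpy as np

def rotation_matrix_to_6d(R):
    # Equation (C.1): keep the first two columns a1, a2 of the rotation matrix.
    return R[:, :2].T.reshape(6)          # a1 and a2 concatenated into a 6-vector

def sixd_to_rotation_matrix(d6):
    # Equations (C.2)-(C.3): Gram-Schmidt on the two 3-vectors, then a cross product.
    a1, a2 = d6[:3], d6[3:]
    b1 = a1 / np.linalg.norm(a1)            # b1 = N(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)            # b2 = N(a2 - (b1 . a2) b1)
    b3 = np.cross(b1, b2)                   # b3 = b1 x b2
    return np.stack([b1, b2, b3], axis=1)   # matrix with columns b1, b2, b3

For a valid rotation matrix R, sixd_to_rotation_matrix(rotation_matrix_to_6d(R)) recovers R, which is the property proved for the 5D case in the next section.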
C.3 Proof that Case 4 Gives a Continuous Representation

Here we show that the functions $f_P, g_P$ presented in Case 4 of Section 4.3.2 form a continuous representation. We prove the properties needed for a continuous representation: that $g_P$ is defined on its domain and continuous, and that $f_P(g_P(M)) = M$ for all $M \in SO(n)$. In these proofs, we use $0$ to denote the zero vector in the appropriate Euclidean space. We also use the same slicing notation as in the main paper: if $u$ is a vector of length $m$, define $u_{i:j} = (u_i, u_{i+1}, \ldots, u_j)$ and $u_{i:} = u_{i:m}$.

Proof that $g_P$ is defined on SO(n). Suppose $M \in SO(n)$. As we did for Equation (4.10) in the main paper, define a vectorized representation $\gamma(M)$ by dropping the last column of $M$: $\gamma(M) = [M_{(1)}^T, \ldots, M_{(n-1)}^T]$, where $M_{(i)}$ denotes the $i$th column of $M$. Following Equation (4.8), which defines the normalized projection, and Equation (4.10), let $v = \gamma_{n^2-2n:} / \|\gamma_{n^2-2n:}\|$. The only way that $g_P(M)$ could fail to be defined is if the normalized projection $P(v)$ is not defined, which requires $v_1 = 1$. However, if $v_1 = 1$, then because $v$ is unit length, it follows that $\gamma_{n^2-2n+1:}$ has length zero. But $\gamma_{n^2-2n+1:}$ is a column vector of $M \in SO(n)$ and therefore has unit length. We conclude that $v_1 \neq 1$, so $g_P$ is defined on SO(n).

Proof that $g_P$ is continuous. This case is trivial because $g_P$ is a composition of functions that are continuous on their domains, and thus is continuous on its domain.

Lemma 1. We claim that if $u \in \mathbb{R}^m$ and $\|u_{2:}\| = 1$, then $Q(P(u)) = u$. We now prove this. We have $\|u\| = \sqrt{1 + u_1^2}$. Then we find by Equation (4.8) that

\[ \|P(u)\| = \frac{1}{\|u\| - u_1} = \frac{1}{\sqrt{1 + u_1^2} - u_1}. \tag{C.4} \]

Now let $b = Q(P(u))$. Components 2 through $m$ of $b$ are $P(u)/\|P(u)\|$, which is just $u_{2:}$. Next, consider $b_1$, the first component of $b$:

\begin{align}
b_1 &= \frac{1}{2}\left[ \|P(u)\| - \frac{1}{\|P(u)\|} \right] \tag{C.5} \\
    &= \frac{1}{2} \cdot \frac{1 - (1 + u_1^2) + 2\sqrt{1 + u_1^2}\, u_1 - u_1^2}{\sqrt{1 + u_1^2} - u_1} \tag{C.6} \\
    &= u_1. \tag{C.7}
\end{align}

We find that $b = u$, so $Q(P(u)) = u$.

Proof that $f_P(g_P(M)) = M$ for all $M \in SO(n)$. The term $f_{GS}(A)$ of Equation (4.11) is defined on $\mathbb{R}^{n(n-1)} \setminus D$. Here $A$ is the matrix argument to $f_{GS}$ in Equation (4.11), and $D$ is the set where the span of $A$ has dimension less than $n-1$. Let $M \in SO(n)$ and, as before, let $\gamma(M)$ be the vectorized representation of $M$, which drops the last column. By Lemma 1, $Q(P(\gamma_{n^2-2n:})) = \gamma_{n^2-2n:}$. Thus, by Equation (4.11), we have $f_P(g_P(M)) = f_{GS}(\gamma^{(n \times (n-1))}) = f_{GS}(g_{GS}(M)) = M$.

C.4 The Unit Quaternions are a Discontinuous Representation for the 3D Rotations

In Case 2 of Section 4.3.1, we showed that the quaternions are not a continuous representation for the 3D rotations. We intentionally used a simpler formulation of the quaternions, which is easier to understand and saves space in the paper, but whose quaternions are in general not unit length. An attentive reader might wonder what happens if we use the unit quaternions instead: is the discontinuity removable? We show here that the unit quaternions are also not a continuous representation for the 3D rotations.

We use a mapping $g_u$ to map SO(3) to the unit quaternions, which we consider as the Euclidean space $\mathbb{R}^4$. We use the formula by Baker et al.:

\[ g_u(M) = \begin{bmatrix} \operatorname{copysign}\!\left(\tfrac{1}{2}\sqrt{1 + M_{11} - M_{22} - M_{33}},\; M_{32} - M_{23}\right) \\ \operatorname{copysign}\!\left(\tfrac{1}{2}\sqrt{1 - M_{11} + M_{22} - M_{33}},\; M_{13} - M_{31}\right) \\ \operatorname{copysign}\!\left(\tfrac{1}{2}\sqrt{1 - M_{11} - M_{22} + M_{33}},\; M_{21} - M_{12}\right) \\ \tfrac{1}{2}\sqrt{1 + M_{11} + M_{22} + M_{33}} \end{bmatrix} \tag{C.8} \]

Here $\operatorname{copysign}(a, b) = \operatorname{sgn}(b)\,|a|$. Now consider the following matrix in SO(3), parameterized by $\theta$:

\[ B(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}. \tag{C.9} \]

By substitution, we can find the components of $g_u(B(\theta))$ as a function of $\theta$. For example, as $\theta \to \pi^-$, the third component is $\sqrt{(1 - \cos\theta)/2}$, which approaches 1. Meanwhile, as $\theta \to \pi^+$, the third component is $-\sqrt{(1 - \cos\theta)/2}$, which approaches $-1$. We conclude that the unit quaternions are not a continuous representation.
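This jump at $\theta = \pi$ is easy to check numerically. Below is a short Python sketch of Equations (C.8) and (C.9) (function names are ours) that evaluates $g_u(B(\theta))$ just below and just above $\pi$.

import numpy as np

def B(theta):
    # Equation (C.9): rotation by theta about the z-axis.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def g_u(M):
    # Equation (C.8): rotation matrix -> unit quaternion (x, y, z, w).
    return np.array([
        np.copysign(0.5 * np.sqrt(1 + M[0, 0] - M[1, 1] - M[2, 2]), M[2, 1] - M[1, 2]),
        np.copysign(0.5 * np.sqrt(1 - M[0, 0] + M[1, 1] - M[2, 2]), M[0, 2] - M[2, 0]),
        np.copysign(0.5 * np.sqrt(1 - M[0, 0] - M[1, 1] + M[2, 2]), M[1, 0] - M[0, 1]),
        0.5 * np.sqrt(1 + M[0, 0] + M[1, 1] + M[2, 2]),
    ])

eps = 1e-6
print(g_u(B(np.pi - eps)))   # third component is close to +1
print(g_u(B(np.pi + eps)))   # third component is close to -1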
A similar representation is the Cayley transformation, which uses a different scaling than the unit quaternion: $w = 1$ and the vector $(x, y, z)$ is the unit rotation axis scaled by $\tan(\theta/2)$. Its limit goes to infinity as the rotation angle approaches 180°, so it is not a representation for SO(3).

C.5 Interaction Between 5D and 6D Continuous Representations and Discontinuous Ones

In some cases, it may be convenient to use a common 3D or 4D angle representation, such as the quaternions or Euler angles. For example, the quaternions may be useful when interpolating between two rotations in SO(3), or when there is an existing neural network that already accepts quaternion inputs. However, as we showed in Case 2 of Section 4.3.1 of the main paper, all 3D and 4D representations for rotations are discontinuous. One solution to this conundrum is simply to convert as needed from the continuous 5D and 6D representations, which we presented in Cases 3 and 4 of Section 4.3.2, to the desired representation. For concreteness, suppose the desired representation is the quaternions. Assume that any conversions done in the network are only in the direction that maps to the quaternions. Then the associated mapping in the opposite direction (i.e., from quaternions to the 5D or 6D representation) is continuous. If losses are applied only at points in the network where the representation is continuous (e.g., on the 5D or 6D representations), then the learning should not suffer from discontinuity problems. One can convert from the 5D or 6D representation to quaternions by first applying Equation (4.5) or Equation (4.10) and then using Equation (4). Of course, one could also make a similar argument for other discontinuous but popular angle representations such as Euler angles.

C.6 Visualizing Discontinuities in 3D Rotations

Here we visualize the discontinuities that occur in the 3D rotation representations. We do this by forming three continuous curves in SO(3), which we call the "X, Y, and Z Rotations." We map each of these curves to each representation, and then map the representation curve to 2D by retaining the top two components from Principal Component Analysis (PCA). We call the first curve in SO(3) the "X Rotations": it is formed by taking the X axis (1, 0, 0) and constructing the curve consisting of all rotations around this axis, parameterized by angle. Likewise, we call the second and third curves in SO(3) the "Y Rotations" and "Z Rotations": these are formed by rotating around the Y and Z axes, respectively. We show the resulting 2D curves in Figure C.1. In the three columns, we show the three curves in SO(3): the X, Y, and Z Rotations, which consist of all rotations around the corresponding axis. We map each curve in SO(3) to each of the rotation representations in the different rows (plus the top row, which stays in the original space SO(3)), and then map to 2D using PCA. We use hue to visualize the rotation angle in SO(3) around each of the three canonical axes X, Y, and Z. If the representation is continuous, then the 2D curve should be homeomorphic to a circle and similar colors should be nearby spatially. We can clearly see that the topology is incorrect for the unit quaternion, axis-angle, and Euler angle representations.
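This visualization is straightforward to reproduce. The sketch below reflects our own implementation choices (SciPy for rotations, scikit-learn for PCA): it builds the curve of X-axis rotations, maps it to a few representations, and projects each curve to 2D. Note that SciPy's quaternion convention is not the mapping $g_u$ of Equation (C.8); an exact reproduction would substitute that mapping.

import numpy as np
from scipy.spatial.transform import Rotation
from sklearn.decomposition import PCA

# Sample the "X Rotations" curve: all rotations about the X axis.
angles = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
rotvecs = np.stack([angles, np.zeros_like(angles), np.zeros_like(angles)], axis=1)
rots = Rotation.from_rotvec(rotvecs)

curves = {
    "unit quaternions": rots.as_quat(),      # (360, 4)
    "axis-angle":       rots.as_rotvec(),    # (360, 3)
    "6D":               rots.as_matrix()[:, :, :2].transpose(0, 2, 1).reshape(-1, 6),
}
for name, curve in curves.items():
    xy = PCA(n_components=2).fit_transform(curve)   # 2D curve to plot, one point per angle
    print(name, xy.shape)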
Figure C.1: Visualization of discontinuities in 3D rotation representations. Rows (top to bottom): SO(3), 6D, 5D, unit quaternions, axis-angle, Euler angles; columns: X, Y, and Z Rotations.

C.7 Additional Empirical Results

In this section, we show some additional empirical results.

C.7.1 Visualization of the Inverse Kinematics Test Results

In Figure C.2, we visualize the two frames with the highest pose errors generated by the network trained on quaternions, along with the corresponding results from the network trained on the 6D representation. Likewise, we show the two frames with the highest pose errors generated by the network trained on our 6D representation, along with the corresponding results from the network trained on quaternions. On its worst-error frames, the quaternion representation produces qualitatively poor results, while the 6D representation still produces reasonable poses.

C.7.2 Additional Sanity Test

In Section 4.4.1 of the main paper, we show the sanity test result for the network trained with an L2 loss between the ground-truth and output rotation matrices. Another option is to train with the geodesic loss. Moreover, the networks in the main paper are trained and tested with rotations sampled uniformly in axis and angle, which is not a uniform sampling of SO(3). We present the sanity test results using the geodesic loss and the two sampling methods in Figure C.3. Both are similar to the result in the main paper.

Additional representations. In addition to common rotation representations such as Euler angles, axis-angle, and quaternions, we investigated a few other rotation representations used in recent work, including a 3D Rodriguez-vector representation and quaternions constrained to one hemisphere, as given by Kendall et al. The 3D Rodriguez vector is given by $R = \omega\theta$, where $\omega$ is a 3D unit vector and $\theta$ is the angle. We do not provide proofs of the discontinuity of these representations, but we show their empirical results in Figure C.3. We find that their errors are significantly worse than those of our 5D and 6D representations.

Figure C.2: At top, we show IK results for the two frames with the highest pose error from the test set for the network trained using quaternions, and the corresponding results on the same frames for the network trained on the 6D representation. At bottom, we show the two worst frames for the 6D representation network, and the corresponding results for the quaternion network. Each group shows the ground truth, the quaternion result, and the 6D result.

Sanity test, geodesic loss, uniform sampling of axis and angle. Errors at 500k iterations:

             Mean(°)   Max(°)    Std(°)
  6D         0.54      1.82      0.26
  5D         0.59      1.93      0.29
  Quat       2.4       176.9     6.81
  Quat-hemi  2.72      179.55    7.03
  AxisA      2.89      175.53    6.77
  Euler      9.12      179.93    24.01
  Rodriguez  4.74      179.45    17.84

Sanity test, geodesic loss, uniform sampling on SO(3). Errors at 500k iterations:

             Mean(°)   Max(°)    Std(°)
  6D         0.45      2.33      0.29
  5D         0.52      3.19      0.33
  Quat       2.02      180.0     6.08
  Quat-hemi  3.47      179.22    6.72
  AxisA      2.0       179.58    5.43
  Euler      7.45      179.87    21.87
  Rodriguez  4.17      180.0     17.78

Figure C.3: Additional sanity test results. (a, d) Mean errors during training; (b, e) percentiles of errors at 500k iterations; (c, f) errors at 500k iterations (tabulated above). "Quat" refers to quaternions, "Quat-hemi" to quaternions constrained to one hemisphere, "AxisA" to axis-angle, and "Rodriguez" to the 3D Rodriguez vector.
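Errors of this kind are typically measured as the geodesic (angular) distance between the predicted and ground-truth rotation matrices, reported in degrees. A minimal sketch of this metric (the function name is ours):

import numpy as np

def geodesic_error_deg(R_pred, R_gt):
    # Angle of the relative rotation R_pred @ R_gt^T, in degrees.
    cos_angle = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))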
Appendix D

Motion Synthesis

D.1 Training Parameters

For training the inbetweening network, we use a learning rate of 0.00001 for both the generator and the discriminator. The weight for the least-squares GAN loss is 1.0, the weight for the batch normalization loss is 5.0, the weight for the local joint error is 300, and the weight for each DNA loss is 50. The weight for the global path loss increases from 10.0 to 80.0, growing by 1.0 every 1000 iterations. Table D.1 gives the range and order of the relative rotations at each joint.

Table D.1: Range and order of the rotations at each joint defined in the CMU Motion Capture Dataset. Note that finger motions are not captured in this dataset. Ranges are given as "min, max" in degrees; a dash indicates that no range is listed for that joint.

Joint          x range      y range      z range      Order  |  Joint          x range      y range      z range      Order
hip            -360, 360    -360, 360    -360, 360    xzy    |  lCollar        -30, 30      -30, 30      -10, 10      xzy
abdomen        -45, 68      -30, 30      -45, 45      xzy    |  lShldr         -90, 135     -110, 30     -70, 60      xzy
chest          -45, 45      -30, 30      -45, 45      xzy    |  lForeArm       0, 0         0, 150       -30, 120     yxz
neck           -37, 22      -30, 30      -45, 45      xzy    |  lHand          -90, 90      -20, 30      0, 0         xzy
head           -37, 22      -30, 30      -45, 45      xzy    |  lThumb1        0, 0         0, 0         0, 0         xzy
leftEye        0, 0         0, 0         0, 0         xzy    |  lThumb2        0, 0         0, 0         0, 0         xzy
leftEye_Nub    -            -            -            xzy    |  lThumb_Nub     -            -            -            xzy
rightEye       0, 0         0, 0         0, 0         xzy    |  lIndex1        0, 0         0, 0         0, 0         xzy
rightEye_Nub   -            -            -            xzy    |  lIndex2        0, 0         0, 0         0, 0         xzy
rCollar        -30, 30      -30, 30      -10, 10      xzy    |  lIndex2_Nub    -            -            -            xzy
rShldr         -90, 135     -30, 110     -70, 60      xzy    |  lMid1          0, 0         0, 0         0, 0         xzy
rForeArm       0, 0         0, 150       -30, 120     yxz    |  lMid2          0, 0         0, 0         0, 0         yxz
rHand          -90, 90      -30, 20      0, 0         xzy    |  lMid_Nub       -            -            -            xzy
rThumb1        0, 0         0, 0         0, 0         xzy    |  lRing1         0, 0         0, 0         0, 0         xzy
rThumb2        0, 0         0, 0         0, 0         xzy    |  lRing2         0, 0         0, 0         0, 0         xzy
rThumb_Nub     -            -            -            xzy    |  lRing_Nub      -            -            -            xzy
rIndex1        0, 0         0, 0         0, 0         xzy    |  lPinky1        0, 0         0, 0         0, 0         xzy
rIndex2        0, 0         0, 0         0, 0         xzy    |  lPinky2        0, 0         0, 0         0, 0         xzy
rIndex2_Nub    -            -            -            xzy    |  lPinky_Nub     -            -            -            xzy
rMid1          0, 0         0, 0         0, 0         xzy    |  rButtock       -20, 20      -20, 20      -10, 10      xzy
rMid2          0, 0         0, 0         0, 0         xzy    |  rThigh         -180, 100    -180, 90     -60, 70      xzy
rMid_Nub       -            -            -            xzy    |  rShin          0, 170       0, 0         0, 0         xzy
rRing1         0, 0         0, 0         0, 0         xzy    |  rFoot          -31, 63      -30, 30      -20, 20      xzy
rRing2         0, 0         0, 0         0, 0         xzy    |  rFoot_Nub      -            -            -            xzy
rRing_Nub      -            -            -            xzy    |  lButtock       -20, 20      -20, 20      -10, 10      xzy
rPinky1        0, 0         0, 0         0, 0         xzy    |  lThigh         -180, 100    -180, 90     -60, 70      xzy
rPinky2        0, 0         0, 0         0, 0         xzy    |  lShin          0, 170       0, 0         0, 0         xzy
rPinky_Nub     -            -            -            xzy    |  lFoot          -31, 63      -30, 30      -20, 20      xzy
                                                             |  lFoot_Nub      -            -            -            xzy

D.2 Dataset Filtering

The original CMU Motion Capture dataset contains many noisy sequences. There are three types of noise: (1) motions underneath or cutting through the ground surface, (2) motions with poses outside the range of human body flexibility, and (3) jittering caused by limitations of the capture device. We remove the first type by calculating the average lowest position of each motion sequence over a short time window. For detecting the second and third types of noise, we trained a local motion auto-encoder with an architecture similar to the local motion generator but with fewer layers. The auto-encoder was trained on all the motion classes used in this paper. Since the auto-encoder contains the RC-FC layer, its output poses always satisfy the rotation range constraints of human joints. Moreover, since an auto-encoder filters out high-frequency signals, the output motions are usually smoother than the input motions. We apply the trained auto-encoder to all the motion sequences and compute the error between the input and output frames.
When the error is higher than a threshold, we delete the corresponding frame and split the motion clip at that frame.
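As a sketch of this filtering step (the threshold, names, and data layout are our own assumptions, not the thesis code), the per-frame reconstruction error of the auto-encoder can be used to cut a clip into clean segments:

import torch

def filter_clip(frames, autoencoder, threshold):
    # frames: (T, D) tensor of per-frame pose features for one motion clip.
    # Returns a list of sub-clips whose frames all reconstruct below `threshold`.
    with torch.no_grad():
        recon = autoencoder(frames)                 # (T, D) reconstructed frames
    err = (recon - frames).pow(2).mean(dim=1)       # per-frame reconstruction error
    keep = err < threshold                          # frames considered clean
    segments, start = [], None
    for t, ok in enumerate(keep.tolist()):
        if ok and start is None:
            start = t                               # open a new clean segment
        elif not ok and start is not None:
            segments.append(frames[start:t])        # close the segment at the noisy frame
            start = None
    if start is not None:
        segments.append(frames[start:])
    return segments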