Digitizing Human Performance with Robust Range Image Registration

by Ruizhe Wang

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2017

Copyright 2017 Ruizhe Wang

Acknowledgements

First, I would like to express my sincere gratitude to my advisor, Prof. Medioni, for his continuous support of my Ph.D. study and related research, and for his patience, motivation, and immense knowledge. His guidance helped me throughout the research and writing of this thesis. I could not have imagined having a better advisor and mentor for my Ph.D. study.

Second, I thank my fellow lab mates for the stimulating discussions, for the sleepless nights we worked together before deadlines, and for all the fun we have had over the last four years. In particular, I am grateful to Dr. Jongmoo Choi for giving me my first glimpse of research.

Last but not least, I would like to thank my family: my parents and my wife, for supporting me spiritually throughout the writing of this thesis and my life in general.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Motivation
  1.2 Issues
  1.3 Outline
2 Related Work
  2.1 Human Performance Digitization
  2.2 Range Image Registration
  2.3 Home Monitoring System
3 Contour Coherence for Rigid and Non-Rigid Registration
  3.1 Introduction
  3.2 Contour Coherence
  3.3 Multi-view Rigid Registration
  3.4 Multi-view Articulated Registration
  3.5 Implementation
  3.6 Experiments
    3.6.1 Synthetic Data
    3.6.2 Real Data
  3.7 Application: Scanning a 3D Body Model for Animation
  3.8 Conclusion
4 Visibility Error Metric for Global Registration
  4.1 Introduction
  4.2 System Overview
  4.3 Robust Rigid Registration
    4.3.1 Visibility Error Metric
    4.3.2 Finding the Transformation
    4.3.3 Multi-view Extension
  4.4 Global Registration Evaluation
  4.5 Dynamic Capture Results
  4.6 Conclusion
5 Home Monitoring Patients with Musculo-Skeletal Disorders
  5.1 Introduction
  5.2 Method
    5.2.1 Overview
    5.2.2 Data Acquisition
    5.2.3 Segmentation
    5.2.4 Temporal Alignment
    5.2.5 Spatial Summarization
    5.2.6 TASS Validation
  5.3 Experiments
  5.4 Conclusion
6 Conclusion and Future Work
Reference List

List of Tables

3.1 Errors of the recovered main rotation angle at different offsets (in degrees)
4.1 Average success percentage of global registration algorithms on two data sets. Average running time is measured using a single thread on an Intel Core i7-4710MQ CPU clocked at 2.5 GHz.
5.1 Recognition accuracy comparison on the MSR-Action 3D dataset
5.2 Results of PD and non-PD subjects on walking-based tests

List of Figures

1.1 Range image and its corresponding point cloud, obtained using the intrinsic parameters of the range imaging device.
1.2 The two common stages of a range-image-based 3D modeling system. We focus on the registration stage.
1.3 Three main types of registration problems.
3.1 (a) Two roughly aligned wide baseline 2.5D range scans of the Stanford Bunny with the observed and predicted apparent contours extracted; the two meshed point clouds are generated from the two 2.5D range scans respectively. (b) Registration result after maximizing the contour coherence.
3.2 General pipeline of our Robust Closest Contour (RCC) method.
3.3 (a) Front range scan (red) and side range scan (blue) of the Stanford Armadillo. (b) $C_{1\to2}$ (red) and $C_2$ (blue). (c) $C^p_{1\to2}$ (red) and $C^p_{2/1}$ (blue). (d) Bijective correspondences (black lines) found in 3D. (e) Bijective correspondences (black lines) found in 2D, with red rectangles indicating mismatches.
3.4 (a) Pipeline of our Multi-view Iterative Closest Contour (M-ICC) method. (b) Four range scans of a Newton head statue taken approximately 90° apart, with limited overlap and poor initialization. (c) The initial right and back range scans barely overlap and have a large rotation error. (d) Result of pairwise registration using the standard ICP algorithm. (e) Result of our M-ICC method.
3.5 (a) Pipeline of our Multi-view Articulated Iterative Closest Contour (MA-ICC) method. (b) Segmentation of the human body into rigid parts. (c) Registration result after our M-ICC method. (d) Registration result after our MA-ICC method.
3.6 (a) Initialization of two synthetic Stanford Bunny range scans with 45% overlap and a 54° offset in the main rotation angle. (b) Registration result of the trICP algorithm. (c) Registration result of our method.
3.7 Depth image quality of the Kinect at different distances, with a red rectangle indicating the common working range for scanning a human body without our proposed super-resolution method.
3.8 (a) Our modeling result of the Newton head statue from 4 range scans. (b) Modeling result of the KinectFusion algorithm from approximately 1200 range scans. (c) Our modeling result of the mannequin torso from 4 range scans. (d) Modeling result of the KinectFusion algorithm from approximately 1200 range scans. (e) Heatmap of our modeling result compared with a ground-truth laser scan, with errors ranging from 0 mm (blue) to 10 mm (red). (f) Heatmap of the KinectFusion modeling result compared with a ground-truth laser scan over the same range.
3.9 Example 3D human body models.
3.10 Jumping animation sequence applied to a rigged body model.
4.1 An overview of our textured dynamic surface capturing system.
4.2 Left: two partial scans $P_1$ (dotted) and $P_2$ (solid) of a 2D human. Middle: when viewed from $P_1$'s camera, points of $P_2$ are classified into O (blue), F (yellow), and B (red). Right: when viewed from $P_2$'s camera, points of $P_1$ are classified into O (blue), F (yellow), and B (red).
4.3 (a) Left: a pair of range images to be registered. Right: VEM evaluated on the entire rotation space. Each point within the unit ball represents the vector part of a unit quaternion; for each quaternion, we estimate its corresponding translation component and evaluate the VEM on the composite transformation. The red rectangles indicate areas with local minima, and the red cross is the global minimum. (b) Example particle locations and displacements at iterations 1 and k. Blue vectors indicate displacements of regular (non-guide) particles following a traditional particle swarm scheme. Red vectors are displacements of guide particles. Guide particles draw neighboring regular particles more efficiently towards local minima to search for the global minimum.
4.4 Translation estimation examples of our Hough transform method on range scans with limited overlap. The naive method, which simply aligns the corresponding centroids, fails to estimate the correct translation.
4.5 Success percentage of the global registration method employing different optimization schemes on the Stanford 3D Scanning Repository.
4.6 Success percentage of our global registration method compared with other methods. Left: comparison on the Stanford 3D Scanning Repository. Right: comparison on the Princeton Shape Benchmark.
4.7 Example registration results of range images with limited overlap. The first and second rows show examples from the Stanford 3D Scanning Repository and the Princeton Shape Benchmark respectively. Please see the supplementary material for more examples.
4.8 Our registration method compared with 4PCS on real data. The first two examples are captured by Kinect One sensors, while the last example is captured by Structure IO sensors.
4.9 From left to right: globally aligned partial scans from multiple depth sensors; the watertight mesh model after Poisson reconstruction [42]; denoised mesh after merging neighboring meshes using [45]; model after our dense-correspondence-based texture reconstruction; model after directly applying texture-stitcher [23].
4.10 Example capturing results. The sequence in the lower right corner is reconstructed from Structure IO sensors, while the other sequences are reconstructed from Kinect One sensors.
5.1 (a) Traditional patient-clinician evaluation mode. (b) New home-based monitoring and evaluation mode.
5.2 General pipeline of the proposed method.
5.3 Illustration of segmentation based on the periodicity of the feature space.
5.4 (a) Linear alignment path between two SAUs. (b) Non-linear alignment path between two SAUs.
5.5 Detection of pause/FOG (top to bottom: skeletal stream; temporal aligning matrix; estimated density).

Abstract

The rekindling of interest in Augmented Reality and Virtual Reality has created a need for digitizing objects with full geometry and texture, especially the human body and human performance. Commodity depth sensors (e.g., Kinect One and Occipital Structure IO) have quickly gained popularity in both the research community and industry, since they are cost-efficient and accessible while providing decent depth measurements at interactive frame rates. Registration, which aims to transform all range images into the same coordinate system, plays a key role in the 3D digitization process, as a small registration error leads to a large degradation of the final reconstruction quality. Registration is well addressed by existing techniques when the two views exhibit significant overlap and when a rough registration between them is given. Different methods have been proposed to relax these conditions; no satisfactory solution exists, however, that consistently addresses the combined problem. While all previous methods perform registration directly on the generated point clouds, we propose a novel framework that exploits the visibility information of the underlying range images. This allows us to efficiently handle the problem of insufficient overlap. In this dissertation, we adopt visibility information to solve the problems of rigid registration, non-rigid registration, and global registration respectively. This enables a 3D modeling system for rigid as well as non-rigid objects from as few as 4 range images, whereas other traditional 3D modeling methods normally require a few hundred range images. Experimental results on both synthetic and real data demonstrate the effectiveness and robustness of our registration methods, as well as of our human performance digitization systems.
Chapter 1

Introduction

1.1 Motivation

The rekindling of interest in immersive, 360-degree virtual environments, spurred on by the Oculus, HoloLens, and other breakthroughs in consumer AR and VR hardware, has created a need for digitizing objects with full geometry and texture from all views. Among the most important subjects to digitize in this way are moving, clothed humans, yet they are also among the most challenging: the human body can undergo large deformations over short time spans, has complex geometry with occluded regions that can only be seen from a small number of angles, and has regions, like the face, with important high-frequency features that must be faithfully preserved.

Commodity depth sensors (e.g., Kinect One and Occipital Structure IO) have quickly gained popularity in both the research community and industry, since they are cost-efficient and accessible while providing decent depth measurements at an interactive frame rate. Given a 2D range image streamed from any depth sensor and its intrinsic camera parameters, we can infer a 3D point $p \in \mathbb{R}^3$ for each pixel location $u \in \mathbb{R}^2$ (Figure 1.1). Hence, throughout this dissertation, the terms range image and point cloud are used interchangeably.

Figure 1.1: Range image and its corresponding point cloud, obtained using the intrinsic parameters of the range imaging device.
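To make the pixel-to-point relationship above concrete, the following minimal sketch back-projects every valid pixel of a range image into a 3D point cloud using pinhole intrinsics $K$. The NumPy-based implementation and the variable names are assumptions made for illustration; this is not the system's actual code.

```python
import numpy as np

def backproject(range_image, K):
    """Back-project a range image R (H x W, depth in mm, 0 = invalid)
    into an N x 3 point cloud using pinhole intrinsics K (3 x 3).

    Each pixel u = (col, row) with depth d maps to p = d * K^{-1} * [u, 1]^T.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    vs, us = np.where(range_image > 0)           # valid pixel coordinates
    d = range_image[vs, us].astype(np.float64)   # depth per valid pixel
    x = (us - cx) / fx * d                       # X = (u - cx) / fx * Z
    y = (vs - cy) / fy * d                       # Y = (v - cy) / fy * Z
    return np.stack([x, y, d], axis=1)           # N x 3 points in the camera frame

# Toy example: a synthetic 4 x 4 range image at a constant 1000 mm.
K = np.array([[525.0, 0.0, 2.0], [0.0, 525.0, 2.0], [0.0, 0.0, 1.0]])
cloud = backproject(np.full((4, 4), 1000.0), K)
print(cloud.shape)  # (16, 3)
```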
It has been demonstrated in many successful applications that depth sensors are powerful tools for 3D digitization. The digitization process usually involves two stages, i.e., registration and surface reconstruction (Figure 1.2). While each individual range image resides in its own local coordinate system, registration aims to align all range images into a single unified coordinate system. The surface reconstruction process then merges all registered range images into a single 3D representation. In this dissertation, we focus on the registration problem and use standard methods for surface reconstruction.

Figure 1.2: The two common stages of a range-image-based 3D modeling system. We focus on the registration stage.

Registration problems can be roughly classified into three major types based on their assumptions, namely rigid registration, non-rigid registration, and global registration (Figure 1.3).

Rigid registration. Rigid registration methods assume that the underlying motion among range images is rigid, i.e., 6 DoF with 3 for rotation and 3 for translation, and that an approximation of the motion is given. Rigid registration is, by its nature, a local optimization method. The problem of pairwise registration with sufficient overlap is solved by the Iterative Closest Point (ICP) algorithm and its variants.

Non-rigid registration. Non-rigid registration methods relax the assumption of rigid motion and aim to estimate a non-rigid deformation between range images given an initial approximation. These methods are further classified into articulated registration methods and general non-rigid registration methods, based on the adopted deformation model.

Global registration. Global registration methods relax the assumption of a given initial approximation and aim to recover a rigid motion between a pair of point clouds from any initialization. Global registration is essentially a global optimization method.

In this dissertation, we introduce a novel framework, which exploits the visibility information of the underlying range images, and propose robust methods to separately address rigid registration, non-rigid registration, and global registration.

Figure 1.3: Three main types of registration problems.

1.2 Issues

Current registration methods face the following challenges.

Insufficient overlap. Almost all registration methods assume point clouds as input and inherently fail in the presence of limited overlap, when the assumption that every point has a corresponding point is no longer valid. Detecting the overlapping area and registering the point clouds are mutually dependent, and the overlapping area cannot be reliably localized without knowing the accurate registration. Several attempts have been made towards robust registration in the presence of insufficient overlap. All of these methods, however, require fine parameter tuning, and their results remain application dependent and do not work consistently in all cases.

Efficiency. When processing many range images for 3D modeling, or when registration serves as a pre-processing step for other applications, the efficiency of the underlying registration method is crucial, especially for global registration, where the entire 6 DoF solution space must be searched.

Instead of directly registering point clouds, we exploit the visibility information of the underlying range images for registration. This allows us to naturally handle the problem of potentially insufficient overlap in an efficient way.

1.3 Outline

The rest of this dissertation is organized as follows. In Chapter 2, we first review related work on digitizing human performance, and then discuss different range image registration methods, including rigid registration, non-rigid registration, and global registration. In Chapter 3, we propose to use contour coherence, extracted from the range image, for robust rigid and non-rigid registration. We employ the proposed registration methods to scan a complete, textured 3D human body model, and further rig the captured static model for animation. In Chapter 4, we introduce a novel Visibility Error Metric (VEM) and address the global registration problem as an optimization problem that minimizes the VEM. We present an end-to-end system for reconstructing complete, watertight, and textured models of moving subjects such as clothed humans and animals, using only three or four hand-held depth sensors. In Chapter 5, on a separate but related topic, we use a single depth sensor to digitize and analyze the motion of patients with musculo-skeletal disorders, e.g., Parkinson's Disease. Our system enables convenient home monitoring and evaluation. We end with conclusions and future work in Chapter 6.

Chapter 2

Related Work

We focus our discussion on human performance digitization systems, different registration methods, and home monitoring systems for patients with musculo-skeletal disorders.

2.1 Human Performance Digitization

Digitizing realistic, moving characters has traditionally involved an intricate pipeline including modeling, rigging, and animation. This process has occasionally been assisted by 3D motion and geometry capture systems such as marker-based motion capture or markerless capture methods involving large arrays of sensors [29].
Both approaches supply artists with accurate reference geometry and motion, but they require specialized hardware and a controlled studio setting. Real-time 3D scanning and reconstruction systems requiring only a single sensor, like Kinect- Fusion [39], allow casual users to easily scan everyday objects; however, as with most simultane- ous localization and mapping (SLAM) techniques, the major assumption is that the scanned scene is rigid. This assumption is invalid for humans, even for humans attempting to maintain a sin- gle pose; several follow-up works have addressed this limitation by allowing near-rigid motion, 6 and using non-rigid partial scan alignment algorithms [48, 83]. While the recent DynamicFu- sion framework [62] and similar systems [30] show impressive results in capturing non-rigidly deforming scenes, our goal of capturing and tracking freely moving targets is fundamentally dif- ferent: we seek to reconstruct a complete model of the moving target at all times, which requires either extensive prior knowledge of the subject’s geometry, or the use of multiple sensors to pro- vide better coverage. Prior work has proposed various simplifying assumptions to make the problem of capturing entire shapes in motion tractable. Examples include assuming availability of a template, high- quality data, smooth motion, and a controlled capture environment. Template-based Tracking: The vast majority of related work on capturing dynamic motion focuses on specific human parts, such as faces [51] and hands [63, 65], for which specialized shapes and motion templates are available. In the case of tracking the full human body, param- eterized body models [13] have been used. However, such models work best on naked subjects or subjects wearing very tight clothing, and are difficult to adapt to moving people wearing more typical garments. Another category of methods first capture a template in a static pose and then track it across time. Vlasic et al [88] use a rigged template model, and De Aguiar et al [28] apply a skeleton-less shape deformation model to the template to track human performances from multi-view video data. Other methods [44,107] use a smoothed template to track motion from a capture sequence. The more recent work of Wu et al. [97] and Liu et al. [53] track both the surface and the skeleton of a template from stereo cameras and sparse set of depth sensors respectively. 7 All of these template-based approaches handle with ease the problem of tracking moving targets, since the entire geometry of the target is known. However, in addition to requiring con- structing or fitting said template, these methods share the common limitation that they cannot handle geometry or topology changes which are likely to happen during typical human motion (picking up an object; crossing arms; etc). Dynamic Shape Capture: Several works have proposed to reconstruct both shape and mo- tion from a dynamic motion sequence. Given a series of time-varying point clouds, Wand et al. [91] use a uniform deformation model to capture both geometry and motion. A follow-up work [90] proposes to separate the deformation models used for geometry and motion capture. Both methods make the strong assumption that the motion is smooth, and thus suffer from pop- ping artifacts in the case of large motions between time steps. S¨ ußmuth et al. [81] fit a 4D space-time surface to the given sequence but they assume that the complete shape is visible in the first frame. Finally, Tevs et al. 
[82] detect landmark correspondences which are then extended to dense correspondences. While this method can handle a considerable amount of topological change, it is sensitive to large acquisition holes, which are typical for commercial depth sensors. Another category of related work aims to reconstruct a deforming watertight mesh from a dynamic capture sequence by imposing either visual hull [89] or temporal coherency constraints [45]. Such constraints either limit the capture volume or are not sufficient to handle large holes. Furthermore, neither of these methods focuses on propagating texture to invisible areas; in contrast, we use dense correspondences to perform texture inpainting in non-visible regions. Bojsen-Hansen et al. [14] also use dense correspondences to track surfaces with evolving topologies. However, their method requires the input to be a closed manifold surface. Our goal, on the other hand, is to reconstruct such complete meshes from sparse partial scans.

The recent work of Collet et al. [27] uses multimodal input data from a stage setup to capture topologically-varying scenes. While this method produces impressive results, it requires a pre-calibrated, complex setup. In contrast, we use a significantly cheaper and more convenient setup composed of three to four commercial depth sensors.

2.2 Range Image Registration

We roughly classify range image registration methods into three categories: local methods, when a good initialization is provided; global methods, when no such initialization exists; and non-rigid methods, when rigidity no longer suffices to explain the transformation. We also briefly discuss previous attempts at using visibility information for registration.

Local Methods. In the presence of a good initialization and sufficient overlap, the problem is well addressed by the Iterative Closest Point (ICP) algorithm [10, 18, 103] and its variants [67]. Several attempts have been made to relax the requirement of sufficient overlap. Trucco et al. [86] propose to find robust correspondences by enhancing ICP with the Random Sample Consensus (RANSAC) algorithm [32] and a robust Least Median of Squares (LMedS) metric. Zinßer et al. [106] build robust correspondences by analyzing the statistics of distances between correspondences. Chetverikov et al. [20] introduce the Trimmed ICP algorithm, which runs the basic ICP algorithm repeatedly, each time assuming a different overlap ratio between the two range images. Li et al. [46] handle partial overlap by introducing and jointly optimizing a latent weight variable assigned to each correspondence. More recently, Wang et al. [94] present the notion of contour coherence and register two range images by iteratively minimizing the distance between correspondences on observed and predicted contours.

Global Methods. A common approach to global registration is to construct feature descriptors for a set of interest points, which are then correlated to estimate a rigid transformation. Spin images [41], integral volume descriptors [35], and point feature histograms (PFH, FPFH) [68, 69] are among the popular descriptors proposed by prior work. Makadia et al. [54] represent each range image as a translation-invariant extended Gaussian image (EGI) [38] using surface normals. They first compute the optimal rotation by correlating two EGIs and then estimate the corresponding translation using the Fourier transform.
For noisy data as coming from a com- mercial depth sensor, however, it is challenging to compute reliable feature descriptors. Another approach for global registration is to align either main axes extracted by principal component analysis (PCA) [24] or a sparse set of control points in a RANSAC loop [16]. Silva et al. [77] introduce a robust surface interpenetration measure (SIM) and search the 6 DoF parameter space with a genetic algorithm. More recently, Yang et al. [100] adopt a branch-and-bound strategy to extend the basic ICP algorithm in a global manner. 4PCS [6] and its latest variant Super-4PCS [56] register a pair of range images by extracting all coplanar 4-points sets. Such approaches, however, are likely to converge to wrong alignments in cases of very little overlap between the range images. Non-rigid Registration. Allen et al. [7] extend rigid registration to a template-based ar- ticulated registration algorithm for aligning several body scans. Their method utilizes a set of manually selected markers. Pekelney and Gotsman [64] achieve articulated registration by per- forming ICP on each segment and their method requires a manual segmentation of the first frame. Chang and Zwicker [15] further remove the assumption of known segmentation by solving a dis- crete labeling problem to detect the set of optimal correspondences and apply graph cuts for optimization. Li et al. [47] develop a registration framework that simultaneously solves for point 10 correspondences, surface deformation, and region of overlap within a single global optimization. More recently, several methods based on the Embedded Deformation Model [80] have been pro- posed for modeling of non-rigid objects, either by initializing a complete deformation model with the first frame [84] or incrementally updating it [102]. Visibility for Registration. Several prior works have adopted silhouette-based constraints for aligning multiple images [5,33,37,55,57,79,94,98]. While the idea is similar to our approach, our registration algorithm also takes advantage of depth information, and employs a particle-swarm optimization strategy that efficiently explores the space of alignments. 2.3 Home Monitoring System Many works have been proposed to monitor and evaluate patients with musculo-skeletal disor- ders. The most prevailing methodology suggests using one or a combination of intrusive sensors, such as accelerometer, gyroscope, magnetometer, and pedometer. The clinically meaningful indi- cators are further extracted by analyzing the pattern of time series data generated by these sensors. Zijlstra and Hof [105] attach a triaxial accelerometer to the pelvis of the subject with Parkinson’s Disease as he/she walks. They differentiate left and right steps and extract measures including the duration of subsequent stride cycles, step size as well as walking speed. Weiss et al. [96] use the same setup and analyze data in the frequency domain instead of the time domain which is robust to noise. They find the dominant frequency, amplitude and slope of PD people is lower than non-PD people indicating a larger gait variability. Others [34, 87] use a wrist-worn activity monitor to examine the mobility patterns in PD patients. Salarian et al. [71,72] examined activity patterns in PD patients (e.g. tremor and bradykinesia) by placing 2 gyroscopes on the forearms 11 and 3 inertial sensors on the shanks and trunk. These methods are cumbersome by asking the subject to put on and put off sensors. 
A few vision-based methods have also been proposed. Cho et al. [22] differentiate the PD people and the non-PD people by analyzing the corresponding gait videos. They extract the subject’s silhouette in scene and further reduce the dimensionality using Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). However their method requires spe- cific setup (e.g. pure color background for silhouette extraction) and is not feasible for the home setting. Also their method stays on the stage of differentiation instead of quantification. 12 Chapter 3 Contour Coherence for Rigid and Non-Rigid Registration 3.1 Introduction Registering 2 or more range scans is a fundamental problem with application to 3D modeling. It is well addressed in the presence of sufficient overlap and good initialization [7,15,17,47]. However, registering two wide baseline range scans presents a challenging task where two range scans barely overlap and the shape coherence no longer prevails. An example of two wide baseline range scans of the Stanford bunny with approximately 40% overlap is given in Figure 3.1(a). The traditional shape coherence based methods may fail as most closest-distance correspondences are incorrect. In computer vision dealing with intensity images, a large body of work have been devoted to study the apparent contour, or simply contour. An apparent contour is the projection of a contour generator, which is defined as the set of points on the surface where the tangent plane contains the line of sight from the camera. This contour has been shown to be a rich source of geometric information for motion estimation and 3D reconstruction [21, 25, 26, 37]. 13 1 2 1 2 (a) 1 2 1 2 (b) Figure 3.1: (a) Two roughly aligned wide baseline 2.5D range scans of the Stanford Bunny with the observed and predicted apparent contours extracted. The two meshed points cloud are gen- erated from the two 2.5D range scans respectively (b) Registration result after maximizing the contour coherence Inspired by their works, we propose the concept of contour coherence for wide baseline range scan registration. Contour coherence is defined as the agreement between the observed apparent contour and the predicted apparent contour. As shown in Figure 3.1(a), the observed contours extracted from the original 2.5D range scans, i.e. red lines in image 1 and blue lines in image 2, do not match the corresponding contours extracted from the projected 2.5D range scans, i.e. blue lines in image 1 and red lines in image 2. We maximize contour coherence by iteratively building robust correspondences among apparent contours and minimizing their distances. The registration result is shown in Figure 3.1(b) with the contour coherence maximized and two wide baseline range scans well aligned. The contour coherence is robust in the presence of wide 14 baseline in the sense that only the shape area close to the predicted contour generator is considered when building correspondences on the contour, thus avoiding the search for correspondences over the entire shape. The recently released low-cost structured light 3D sensors (e.g. Kinect, Primesense) enable handy complete scanning of objects in the home environment. While the state-of-the-art scanning methods [61, 84] generate excellent 3D reconstruction results of rigid or articulated objects, they assume small motion between consecutive views and as a result either the user has to move the 3D sensor carefully around the object or the subject must turn slowly in a controlled manner. 
Even with the best effort to reduce drifting error, the gap is sometimes still visible when closing the loop (Section 3.6.2).

In our work, we explicitly employ contour coherence under a multi-view framework and develop the Multi-view Iterative Closest Contour (M-ICC) algorithm, which rigidly aligns all range scans at the same time. We further extend M-ICC to handle small articulations and propose the Multi-view Articulated Iterative Closest Contour (MA-ICC) algorithm. Using our proposed registration methods, we successfully address the loop-closure problem from as few as 4 wide baseline views with limited overlap, and reconstruct accurate and complete rigid as well as articulated objects, hence greatly reducing the data acquisition time and the registration complexity while providing accurate results.

To the best of our knowledge, we are the first to introduce contour coherence for multi-view wide baseline range scan registration. Our main contributions are: 1) the concept of contour coherence for robust wide baseline range scan registration; 2) the contour coherence based multi-view rigid registration algorithm M-ICC, which allows 3D modeling from as few as 4 views; 3) the extension to the multi-view articulated registration algorithm MA-ICC and its application to 3D body modeling.

Section 3.2 describes how to extract robust correspondences from the observed and predicted contours. Section 3.3 employs contour coherence in a multi-view rigid registration framework, which is further extended to handle articulation in Section 3.4. Section 3.5 briefly covers the implementation. Section 3.6 presents the experimental evaluation, and Section 3.8 concludes.

3.2 Contour Coherence

We perform wide baseline range scan registration by maximizing contour coherence, i.e., the agreement between the observed and predicted apparent contours. From an implementation point of view, M-ICC and MA-ICC alternate between finding closest contour correspondences and minimizing their distances. However, an intuitive closest-matching scheme applied to all contour points fails, mainly due to the presence of self-occlusion, 2D ambiguity, and outliers, as described later. Hence we propose the Robust Closest Contour (RCC) algorithm for establishing robust contour correspondences on a pair of range scans (Figure 3.2).

Preliminaries. A 2.5D range scan $R_i$ of frame $i$ provides a depth value $R_i(u)$ at each image pixel $u = (u, v)^T \in \mathbb{R}^2$. We use a single constant camera calibration matrix $K$, which transforms points from the camera frame to the image plane. We denote by $V_i(u) = K^{-1} R_i(u)\,\tilde{u}$ the back-projection operator, which maps $u$ in frame $i$ to its 3D location, where $\tilde{u}$ denotes the homogeneous vector $\tilde{u} = [u^T \mid 1]^T$. Inversely, we denote the projection operator by $P(V_i(u)) = g(K V_i(u))$, where $g$ represents dehomogenization.

A meshed point cloud $P_i$ is generated for each frame $i$ by considering the connectivity on the 2.5D range scan $R_i$. We calculate the normalized 3D normal at each pixel, $N_i(u) \in \mathbb{R}^3$, following [61]. $N_i(u)$ is further projected back to the image to obtain the normalized 2D normal $n_i(u)$ of each image pixel. Projecting $P_j$ into the $i$-th image, given the current camera poses, yields a projected range scan $R_{j\to i}$. The inputs to our RCC method are the observed and predicted range scans, namely $R_i$ and $R_{j\to i}$, and the output is the set of robust contour correspondences $M_{i,j\to i}$ (Equation 3.5 and Equation 3.9).
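As a concrete illustration of the projected range scan $R_{j\to i}$, the following minimal sketch renders a point cloud into another frame's image plane with a simple z-buffer. The function name, the NumPy-based implementation, and the use of point splatting instead of rasterizing the meshed point cloud $P_j$ are simplifying assumptions made for illustration.

```python
import numpy as np

def project_range_scan(points_w, T_i_w, K, H, W):
    """Render a world-space point cloud (N x 3) into camera i's image plane,
    producing a predicted range scan R_{j->i} (H x W, np.inf where empty).

    T_i_w: 4 x 4 world-to-camera-i transform.  K: 3 x 3 intrinsics.
    A z-buffer keeps the closest depth per pixel (a point-splatting stand-in
    for rasterizing the meshed point cloud P_j described in the text).
    """
    R, t = T_i_w[:3, :3], T_i_w[:3, 3]
    p_cam = points_w @ R.T + t                   # points in camera i's frame
    p_cam = p_cam[p_cam[:, 2] > 0]               # keep points in front of the camera
    uv = p_cam @ K.T                             # homogeneous image coordinates
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    z = p_cam[:, 2]
    depth = np.full((H, W), np.inf)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    for ui, vi, zi in zip(u[inside], v[inside], z[inside]):
        if zi < depth[vi, ui]:                   # z-buffer: keep the nearest surface
            depth[vi, ui] = zi
    return depth
```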
Figure 3.2: General pipeline of our Robust Closest Contour (RCC) method.

Extracting contour points. Given the pixels belonging to the object in frame $i$ as $U_i$, we set $R_i(u) = \infty$ for $u \notin U_i$. The contour points $C_i$ are extracted by considering the depth discontinuity between a pixel and its 8 neighboring pixels,

$$C_i = \{\,u \in U_i \mid \exists\, v \in \mathcal{N}^8_u,\ R_i(v) - R_i(u) > \epsilon\,\}, \qquad (3.1)$$

where $\epsilon$ is the threshold used to detect depth discontinuities. We also extract a set of occlusion points,

$$O_i = \{\,u \in U_i \mid \exists\, v \in \mathcal{N}^8_u,\ R_i(u) - R_i(v) > \epsilon\,\}, \qquad (3.2)$$

which are the boundary points of surface holes created by self-occlusion. An example of $C_i$ and $C_{j\to i}$, extracted from $R_i$ and $R_{j\to i}$ respectively, is shown in Figure 3.3(b).

Pruning contour points. Both $C_i$ and $C_{j\to i}$ must be pruned before the matching stage to avoid incorrect correspondences. First, due to the self-occlusion of frame $j$, $C_{j\to i}$ contains false contour points which are actually generated by the meshes in $P_j$ connected with $C_j$ and $O_j$. We mark and remove them to generate the pruned contour points $C^p_{j\to i}$. Second, again due to the self-occlusion of frame $j$, some contour points in $C_i$ should not be matched with any contour point in $C^p_{j\to i}$; e.g., the contour points in frame 2 belonging to the back of the Armadillo are not visible in view 1 (Figure 3.3(b)). Hence we prune $C_i$ based on the visibility of the corresponding contour generator in view $j$,

$$C^p_{i/j} = \{\,u \in C_i \mid N_i(u)^T (o_{j\to i} - V_i(u)) > 0\,\}, \qquad (3.3)$$

where $o_{j\to i}$ is the camera location of frame $j$ in camera $i$. An example of pruned contour points is shown in Figure 3.3(c).

Bijective closest matching in 3D. After pruning, a one-way closest-matching algorithm between $C^p_{i/j}$ and $C^p_{j\to i}$ still fails, as contour points are sensitive to minor changes in viewing direction; e.g., camera 1 observes only one leg, while the contour points of two legs are extracted from the projected range scan (Figure 3.3(c)). Hence we follow a bijective matching scheme [102] when establishing robust correspondences (Equation 3.5 and Equation 3.9). Matching directly in the 2D image space leads to many wrong corresponding pairs. An example is shown in Figure 3.3(e), where the contour points of the right leg in frame 1 are wrongly matched with the contour points of the left leg in frame 2. The ambiguity imposed by the 2D nature of the images is resolved by relaxing the search to 3D space (Figure 3.3(d)), as we have the 3D point location $V_i(u)$ for each contour point. It is worth mentioning that while we build correspondences in 3D, we minimize the distances between contour correspondences in 2D, as the real data produced by most structured-light 3D sensors is extremely noisy along the rays of the apparent contour.

Figure 3.3: (a) Front range scan (red) and side range scan (blue) of the Stanford Armadillo. (b) $C_{1\to2}$ (red) and $C_2$ (blue). (c) $C^p_{1\to2}$ (red) and $C^p_{2/1}$ (blue). (d) Bijective correspondences (black lines) found in 3D. (e) Bijective correspondences (black lines) found in 2D, with red rectangles indicating mismatches.
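Below is a minimal sketch of the depth-discontinuity tests in Equations 3.1 and 3.2, assuming the range scan is stored as a NumPy array with an accompanying object mask; the function and variable names are illustrative, and the pruning and bijective-matching steps are not shown.

```python
import numpy as np

def extract_contour_and_occlusion(range_image, mask, eps=50.0):
    """Boolean maps of contour points C_i (Eq. 3.1) and occlusion points O_i (Eq. 3.2).

    range_image: H x W depth values (e.g. in mm); mask: H x W bool, True on the object.
    Pixels outside the object are treated as infinitely deep, so silhouette
    boundaries against the background are detected as contour points as well.
    """
    H, W = range_image.shape
    R = np.where(mask, range_image, np.inf)
    contour = np.zeros((H, W), dtype=bool)
    occlusion = np.zeros((H, W), dtype=bool)
    # Visit all 8 neighbors via shifted copies of the depth map.
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = np.full((H, W), np.inf)
            ys_dst = slice(max(0, -dy), H - max(0, dy))
            xs_dst = slice(max(0, -dx), W - max(0, dx))
            ys_src = slice(max(0, dy), H + min(0, dy))
            xs_src = slice(max(0, dx), W + min(0, dx))
            neighbor[ys_dst, xs_dst] = R[ys_src, xs_src]
            contour |= mask & (neighbor > R + eps)    # a much deeper neighbor exists
            occlusion |= mask & (R > neighbor + eps)  # this pixel lies behind a neighbor
    return contour, occlusion
```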
3.3 Multi-view Rigid Registration

The general pipeline of our Multi-view Iterative Closest Contour (M-ICC) method is shown in Figure 3.4(a). Given $N$ roughly initialized range scans (Figure 3.4(b)), we alternate between updating the view graph, establishing robust contour correspondences from pairs of range scans in the view graph, and minimizing the distances of all correspondences. While the standard pairwise ICP algorithm fails in the presence of a wide baseline (Figure 3.4(d)), our M-ICC method jointly recovers accurate camera poses (Figure 3.4(e)).

Figure 3.4: (a) Pipeline of our Multi-view Iterative Closest Contour (M-ICC) method. (b) Four range scans of a Newton head statue taken approximately 90° apart, with limited overlap and poor initialization. (c) The initial right and back range scans barely overlap and have a large rotation error. (d) Result of pairwise registration using the standard ICP algorithm. (e) Result of our M-ICC method.

Preliminaries. Frame $i$ is associated with a 6 DoF rigid transformation matrix
$${}^wT_i = \begin{bmatrix} R_i & t_i \\ 0^T & 1 \end{bmatrix},$$
where $R_i$ is parameterized by a 3 DoF unit quaternion $q_i = [q^w_i, q^x_i, q^y_i, q^z_i]$ with $\|q_i\|_2 = 1$, and $t_i$ is the translation vector. The operator ${}^w\phi_i(u) = {}^wT_i\,\tilde{V}_i(u)$ transforms pixel $u$ to its corresponding homogeneous back-projected 3D point in the world coordinate system, where $\tilde{V}_i(u)$ is the homogeneous back-projected 3D point in the camera coordinate system of frame $i$. Inversely, we have the operator ${}^i\phi_w$ such that $u = {}^i\phi_w({}^w\phi_i(u)) = P(g({}^iT_w\,{}^w\phi_i(u)))$. Given $N$ frames, we have a total of $6N$ parameters, stored in a vector $\theta$.

Unlike [84, 85], where pairwise registration is performed before a final global error-diffusion step, we do not require pairwise registration and explicitly employ contour coherence under a multi-view framework. We achieve this by associating two camera poses with a single contour correspondence. Assuming $u$ and $v$ are a corresponding pair belonging to frame $i$ and frame $j$ respectively, their distance is modeled as $\|v - {}^j\phi_w({}^w\phi_i(u))\|_2$. Minimizing this distance updates both camera poses at the same time, which allows us to globally align all frames together. It is worth mentioning that pairwise registration is a special case of the multi-view scenario, in the sense that the pairwise registration ${}^2T_1$ is obtained as ${}^2T_1 = {}^2T_w\,{}^wT_1$.

View graph. The view graph $\mathcal{L}$ is a set of pairing relationships among all frames. $(i,j) \in \mathcal{L}$ indicates that frame $j$ is viewable in frame $i$, and hence robust contour correspondences should be established between $R_i$ and $R_{j\to i}$. Each frame's viewing direction in the world coordinate system is $R_i(0,0,1)^T$, and frame $j$ is viewable in frame $i$ only if their viewing directions are within a certain angle $\alpha$, i.e.,

$$\mathcal{L} = \{\,(i,j) \mid \arccos\big((0,0,1)\,R_i^T R_j\,(0,0,1)^T\big) < \alpha\,\}. \qquad (3.4)$$

It is worth mentioning that $(i,j) \neq (j,i)$, and we establish two sets of correspondences between frame $i$ and frame $j$, namely between $C^p_{j\to i}$ and $C^p_{i/j}$, and between $C^p_{i\to j}$ and $C^p_{j/i}$. Another issue worth raising is that loop closure is automatically detected and achieved if all $N$ views form a loop. An example is shown in Figure 3.4(c), where $\mathcal{L}$ is computed as $\{(1,2), (2,1), (2,3), (3,2), (1,4), (4,1)\}$ from the initial $\theta$, i.e., the gap between frame 3 and frame 4 is large and the loop is not closed at the beginning. As we iterate and update the camera poses, the links $\{(3,4), (4,3)\}$ are added to $\mathcal{L}$ and we automatically close the loop.
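A small sketch of the view-graph construction in Equation 3.4: frames are paired whenever the angle between their viewing directions, computed from the current rotation estimates, is below the threshold α (120° in Section 3.5). The helper names and the idealized four-camera example are assumptions made for illustration.

```python
import numpy as np

def build_view_graph(rotations, alpha_deg=120.0):
    """Return the directed view graph L = {(i, j)} of Eq. 3.4.

    rotations: list of 3 x 3 world-from-camera rotation matrices R_i.
    Frame j is viewable in frame i if the angle between the viewing
    directions R_i (0,0,1)^T and R_j (0,0,1)^T is below alpha_deg.
    """
    z = np.array([0.0, 0.0, 1.0])
    dirs = [R @ z for R in rotations]            # per-frame viewing directions
    alpha = np.deg2rad(alpha_deg)
    L = []
    for i, di in enumerate(dirs):
        for j, dj in enumerate(dirs):
            if i == j:
                continue
            cos_angle = np.clip(di @ dj, -1.0, 1.0)
            if np.arccos(cos_angle) < alpha:
                L.append((i, j))
    return L

# Four cameras roughly 90 degrees apart around the vertical axis: with the
# default alpha of 120 degrees, only adjacent views are paired at first.
def yaw(deg):
    c, s = np.cos(np.deg2rad(deg)), np.sin(np.deg2rad(deg))
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

print(build_view_graph([yaw(a) for a in (0.0, 90.0, 180.0, 270.0)]))
```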
For each viewable pair (i;j)2L, we extract robust contour correspondencesM i;j!i betweenC p j!i andC p i=j using RCC algorithm as M i;j!i =f(u; j w ( w i (v)))jv = arg min m2C p j!i d(V i (u);V j!i (m)); u = arg min n2C p i=j d(V j!i (v);V i (n))g (3.5) whered(x;y) =kxyk 2 is the Euclidean distance operator. Pixelv is the closest (i.e. dis- tance in the back-projected 3D space) point on the pruned predicted contour to pixel u on the pruned observed contour, while at the same time pixel u is also the closest to pixel v, i.e. the bijectivity in 3D is imposed. We minimize the sum of point-to-plane [17] distances of all contour correspondences as E R = X (i;j)2L X (u;v)2M i;j!i j(u i w ( w j (v))) T n i (u)j: (3.6) In practice, we find that the point-to-plane error metric allows two contours sliding along each other and reaching better local optimum than the point-to-point error metric. 3.4 Multi-view Articulated Registration To handle small articulations, we further extend M-ICC to Multi-view Articulated Iterative Clos- est Contour (MA-ICC) algorithm (Figure 3.5(a)). Given N range scans, articulation structure 22 as well as known segmentationW 1 of all rigid parts in the first frame (Figure 3.5(b)), we first regard all range scans as rigid and apply the M-ICC method, after which all range scans are roughly aligned (Figure 3.5(c)). We then iteratively segment other frames, update the view graph, establish robust contour correspondences and minimize until convergence (Figure 3.5(d)). initial 12 , ,..., N R R R 1 W M-ICC Segmentation of other frames RCC Minimization Converge? 12 , ,..., N W W W result Multi-view Articulated Iterative Closest Contour (MA-ICC) Update view graph (a) (b) (c) (d) Figure 3.5: (a) Pipeline of our Multi-view Articulated Iterative Closest Contour (MA-ICC) method. (b) Segmentation of human body into rigid parts. (c) Registration result after our M-ICC method. (d) Registration result after our MA-ICC method. Preliminaries. We employ a standard hierarchical structure, where each rigid segment k of frame i has an attached local coordinate system related to the world coordinate system via transform w T i k . This transformation is defined hierarchically by recurrence w T i k = w T i kp kp T i k where k p is the parent node of k. For the root node, we have w T i root = w T i where w T i can be regarded as camera pose of frame i. kp T i k has a parameterized rotation component and a 23 translation component completely dependent on the rotation component. As such, for a total of N range scans where the complete articulated structure contains M rigid segments, there is a total number ofN (M 3 + 3) parameters stored in the vector. We employ the Linear Blend Skinning (LBS) scheme where each pixelu in framei is given a weight vectorW i (u)2R M with P j=1:::M W i (u) j = 1, indicating its support from all rigid seg- ments. As such, operator w i in the rigid case is rewritten as w A i (u) = P j=1:::M w T i j ~ V i (u)W i (u) j in the articulated case, which is a weighted transformation of all rigid segments attached to u. Similarly we have operator i A w as the inverse process such thatu = i A w ( w A i (u)). Segmentation of other frames. Given the segmentationW 1 of the first frame and predicted pose, we segment pixelu2U i of framei as W i (u) =W 1 (arg min v2U 1 d(v; 1 A w ( w A i (u)))); (3.7) i.e. the same weight as the closest pixel in the first frame. 
To simplify the following discussion, we define $F(S,k) = \{u \in S \mid W_S(u)_k = 1\}$, which denotes the subset of $S$ whose pixels exclusively belong to the $k$-th rigid part.

View graph. In the presence of articulation, we only build contour correspondences on corresponding rigid body parts; $(i,j,k) \in \mathcal{L}_A$ indicates that rigid segment $k$ of frame $j$ is viewable in frame $i$, and we should build robust contour correspondences between $F(C^p_{i/j},k)$ and $F(C^p_{j\to i},k)$. Besides considering the viewing directions of the cameras, we account for self-occlusion and build contour correspondences only when there are enough contour points (i.e., more than $\tau$) belonging to rigid segment $k$ in both views,

$$\mathcal{L}_A = \{\,(i,j,k) \mid \arccos\big((0,0,1)\,R_i^T R_j\,(0,0,1)^T\big) < \alpha,\ \#\big(F(C^p_{i/j},k)\big) > \tau,\ \#\big(F(C^p_{j\to i},k)\big) > \tau\,\}. \qquad (3.8)$$

Robust closest contour and minimization. For each viewable triple $(i,j,k) \in \mathcal{L}_A$, the set of bijective contour correspondences $M^A_{i,j\to i,k}$ between $F(C^p_{i/j},k)$ and $F(C^p_{j\to i},k)$ is extracted by RCC as

$$M^A_{i,j\to i,k} = \Big\{\big(u,\ {}^j\phi^A_w({}^w\phi^A_i(v))\big)\ \Big|\ v = \arg\min_{m \in F(C^p_{j\to i},k)} d\big(V_i(u), V_{j\to i}(m)\big),\ u = \arg\min_{n \in F(C^p_{i/j},k)} d\big(V_i(n), V_{j\to i}(v)\big)\Big\}. \qquad (3.9)$$

We minimize the sum of point-to-plane distances between all contour correspondences,

$$E_A = \sum_{(i,j,k)\in\mathcal{L}_A}\ \sum_{(u,v)\in M^A_{i,j\to i,k}} \big|\big(u - {}^i\phi^A_w({}^w\phi^A_j(v))\big)^T n_i(u)\big| + \lambda\,\theta^T\theta, \qquad (3.10)$$

where $\lambda\,\theta^T\theta$ is a regularization term favoring the small articulation assumption.

3.5 Implementation

We use a standard stopping condition for our iterative process: (1) the maximum number of iterations has been reached, or (2) the distance per contour correspondence is sufficiently small, or (3) the decrease in distance per contour correspondence is sufficiently small. In each iteration, Equation 3.6 and Equation 3.10 are non-linear in the parameters, so we employ the Levenberg-Marquardt algorithm [59] as our solver. The Jacobian matrix for the Levenberg-Marquardt algorithm is calculated by the chain rule.

In all our experiments, we set the depth discontinuity threshold to $\epsilon = 50$ mm (Equations 3.1, 3.2). The viewable angle threshold is $\alpha = 120°$ (Equations 3.4, 3.8), while the minimum number of contour points per rigid segment is $\tau = 500$ (Equation 3.8). The weight of the regularizer is set to $\lambda = 100$ (Equation 3.10). For scanning real rigid and articulated objects, we use the Kinect sensor. It is worth mentioning that, for a specific range scanning device, the parameters work over a large range and do not require specific tuning.

In practice, our M-ICC and MA-ICC methods converge within 10 iterations. Specifically, for pairwise rigid registration, within each iteration we perform two projections, extract two sets of robust contour correspondences, and minimize the cost function with the Levenberg-Marquardt algorithm. Since the projection is easily parallelized and the closest-point matching searches over a limited number of contour points in 2D space, our algorithm can easily run in real time on a GPU.
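The overall M-ICC iteration described in Sections 3.3 and 3.5 can be summarized by the following skeleton. The three callables stand in for the components described above and are placeholders rather than real library functions; the stopping thresholds are likewise illustrative.

```python
def m_icc(range_scans, poses, build_view_graph, robust_closest_contour,
          minimize_point_to_plane, max_iters=10, tol=1e-3):
    """Skeleton of the M-ICC loop: alternate between updating the view graph,
    extracting robust contour correspondences (RCC), and Levenberg-Marquardt
    minimization of the point-to-plane contour energy, until convergence.
    The three callables are stand-ins for the components of Sections 3.2-3.5.
    """
    prev_error = float('inf')
    for _ in range(max_iters):
        view_graph = build_view_graph(poses)                        # Eq. 3.4
        correspondences = []
        for (i, j) in view_graph:
            correspondences += robust_closest_contour(
                range_scans[i], range_scans[j], poses[i], poses[j])  # Eq. 3.5
        poses, error = minimize_point_to_plane(poses, correspondences)  # Eq. 3.6
        per_corr = error / max(len(correspondences), 1)
        # Stop when the per-correspondence distance, or its decrease, is small.
        if per_corr < tol or prev_error - per_corr < tol:
            break
        prev_error = per_corr
    return poses
```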
3.6 Experiments

We evaluate our contour coherence based multi-view registration algorithms with two sets of experiments. First, we evaluate our method on pairwise registration of synthetic wide baseline range scans. Then, we show reconstruction results of rigid and articulated objects using a low-cost Kinect device, and compare with other state-of-the-art scanning algorithms.

3.6.1 Synthetic Data

Figure 3.6: (a) Initialization of two synthetic Stanford Bunny range scans with 45% overlap and a 54° offset in the main rotation angle. (b) Registration result of the trICP algorithm. (c) Registration result of our method.

We compare our contour coherence method with the Trimmed ICP (trICP) algorithm [19], which is the most robust variant of ICP in the presence of a wide baseline. The basic idea of trICP is that, since the exact percentage of overlap is unknown, we can manually select a set of overlap percentages in $[0, 1]$ and run the basic ICP algorithm at each overlap level, forcing ICP to use only the predefined percentage of closest correspondences. A robust measure of overlap, $\psi(\xi) = e(\xi)/\xi^{1+\lambda}$, is introduced for the selection of the optimal overlap percentage, where $\xi$ is the amount of overlap, $e(\xi)$ is the final RMSE at the predefined overlap level, and $\lambda$ is a preset parameter. For our experiment, we use the standard point-to-plane ICP, test a range of overlap percentages from 10% to 100% in increments of 10%, and set $\lambda = 2$ as suggested in the original paper.

Table 3.1: Errors of the recovered main rotation angle at different offsets (in degrees)

Offset (°)   12    24    30    36     42     48     54     60
trICP        <1    <1    <1    44.1   44.2   44     44.2   44.1
Ours         <1    <1    <1    <1     <1     <1     <1     34

We generate pairs of synthetic wide baseline range scans of different 3D objects by moving a virtual camera around them. When aligning two synthetic range scans, an increasing offset is added to the main rotation angle, while small random perturbations (0-10°) are added to the other rotation angles. This setup simulates the 3D reconstruction scenario where we are trying to align two wide baseline range scans yet only have a rough approximation of the main rotation angle between them. We initialize the two range scans by aligning their centers (Figure 3.6(a)). We register the two range scans, estimate the preset offset in the main rotation angle, and compare with the ground truth to obtain the estimation error.

Table 3.1 summarizes the alignment results for two Stanford Bunny range scans with 45% overlap on the red range scan. A small error indicates successful registration, and we denote it as error < 1. The trICP algorithm fails to align these two range scans beyond a 30° main rotation offset and stops at a local minimum (Figure 3.6(b)), while our method successfully recovers offsets of up to 54°, as shown in Figure 3.6(c). Experiments on wide baseline range scans of other 3D objects, including the Stanford Armadillo and the Stanford Dragon, produce similar results and are not listed here.

3.6.2 Real Data

When scanning rigid objects, we capture four depth images at approximately 90° apart using a single Kinect. The background depth pixels are removed from the range scans by thresholding depth and then detecting and removing planar pixels using RANSAC [32], assuming the object is the only thing on the ground or table. The four wide baseline range scans are initialized by aligning their centers and assuming a pairwise 90° rotation angle (Figure 3.4(b)). We register the four frames using our proposed M-ICC method and use the Poisson Surface Reconstruction (PSR) algorithm [42] to generate the final watertight model (Figure 3.8(a)). We
As a result, we put the object on a turntable, rotate it for approximately 2 minutes and scan it with a fixed Kinect. As KinectFusion incrementally updates the model, the drifting error accumulates and creates artifacts when closing the loop (Figure 3.8(b) 3.8(d)), while our method successfully reconstruct smooth 3D objects (Figure 3.8(a) 3.8(c)) from only four views. We further compare our recon- structed model with a laser scanned ground truth model and achieve a median error of 2:12mm (Figure 3.8(e)), while the KinectFusion achieves a median error of 5:17mm (Figure 3.8(f)). The heatmaps clearly show that while our model accurately captures the global shape, KinectFusion suffers from an accumulated drifting error. 3.7 Application: Scanning a 3D Body Model for Animation Among all articulated objects, human body is of most interest in 3D modeling [49, 84, 93, 102] for its potential applications in 3D printing, animation and apparel design. The subject is scanned by turning in front of a fixed Kinect sensor while showing 4 key poses, i.e. front, left, back and right in order. Overview. Using the proposed contour coherence based registration methods, we introduce a system to scan a complete, fully textured 3D model of subjects using only one fixed commodity depth sensor, and without requiring any additional operator. Upon requiring a static model, we further rig it and apply animation. The overview of our system is summarized as follow: 29 1. Data Capture: The subject is scanned by turning in front of a fixed Kinect sensor while remaining static at 4 key poses, i.e. front, left, back and right in order. 2. Global Registration: As key module of our system, we first apply multi-view rigid reg- istration on the obtained four key frames, then use multi-view articulated registration to reduce error caused by articulation during turning. 3. Textured Surface Reconstruction: We first use Poisson Surface Reconstruction [42] to reconstruct a complete water-tight 3D model from 4 registered range scans. Then we use Texture- Stitcher [23] to apply texture on visible areas as well as to interpolate texture on non-visible areas. 4. Rigging and Animation: First we use [9] to rig the captured static model, then we apply different animation. Please refer to [73] for more details. Depth Super-resolution. When scanning human body, due to Kinect’s field of view limit, the subject must stand at approximately 2 meters away which leads to a large degradation in the input data quality (Figure 3.7). As such, inspired by [50], we ask the subject to come closer and stay rigid for 5 to 10 seconds at each key pose while the Kinect, controlled by a built-in motor, swipes up and down to scan the subject using KinectFusion. The reconstructed partial 3D scene is further projected back to generate a super-resolution range scan. We remove background objects by thresholding depth value and remove remaining dominant planar segments in a RANSAC fashion. Results. Figure 3.9 shows a few 3D reconstruction results of our system. Figure demonstrates some animation results. Using our system, it takes around 1 minute for data capture, 1 minute to reconstruct 3D model, and 3 minutes for rigging and animation, measured with a single thread Intel Core i7-4710MQ CPU clocked at 2.5 GHz. 30 Figure 3.7: Depth image quality of Kinect at different distances with red rectangle indicating the common working range to scan a human body without using our proposed super-resolution method. 
As opposed to most existing 3D body scanning methods, our methods neither requires setting up a booth full of sensors around the subject of interest, nor asks another operator to hold the scanner and walk around while the subject remains perfectly still. Our method easily turns a single commodity depth sensor into a powerful scanning station, which allows daily users to apply their body scans into different applications, including virtual try-on, gaming, and other social applications for Virtual Reality. 3.8 Conclusion We propose the concept of contour coherence for solving the problem of wide baseline range scan registration. Our M-ICC and MA-ICC methods allow complete reconstruction of rigid and articulated objects from as few as 4 frames. In the future, we plan to apply contour coherence in other interesting fields, e.g. object recognition and pose estimation from range scans. 31 (a) (b) (c) (d) (e) (f) Figure 3.8: (a) Our modeling result of the Newton head statue from 4 range scans. (b) Modeling result of the KinectFusion algorithm from approximately 1200 range scans. (c) Our modeling result of the mannequin torso from 4 range scans. (d) Modeling result of the KinectFusion algo- rithm from approximately 1200 range scans. (e) Heatmap of our modeling result compared with a ground truth laser scan with error range from 0mm (blue) to 10mm (red). (f) Heatmap of the KinectFusion modeling result compared with a ground truth laser scan with the same range. 32 Figure 3.9: Example 3D human body models. Figure 3.10: Jumping animation sequence applied on a rigged body model. 33 Chapter 4 Visibility Error Metric for Global Registration 4.1 Introduction The rekindling of interest in immersive, 360-degree virtual environments, spurred on by the Ocu- lus, Hololens, and other breakthroughs in consumer AR and VR hardware, has birthed a need for digitizing objects with full geometry and texture from all views. One of the most important objects to digitize in this way are moving, clothed humans, yet they are also among the most challenging: the human body can undergo large deformations over short time spans, has complex geometry with occluded regions that can only be seen from a small number of angles, and has regions like the face with important high-frequency features that must be faithfully preserved. Most techniques for capturing high-quality digital humans rely on a large array of sensors mounted around a fixed capture volume. The recent work of Collet et al. [27] uses such a setup to capture live performances and compresses them to enable streaming of free-viewpoint videos. Unfortunately, these techniques are severely restrictive: first, to ensure high-quality reconstruc- tion and sufficient coverage, a large number of expensive sensors must be used, leaving human 34 capture out of reach of consumers without the resources of a professional studio. Second, the sub- ject must remain within the small working volume enclosed by the sensors, ruling out subjects interacting with large, open environments or undergoing large motions. Using free-viewpoint sensors is an attractive alternative, since it does not constrain the capture volume and allows ordinary consumers, with access to only portable, low-cost devices, to capture human motion. The typical challenge with using hand-held active sensors is that, obviously, multiple sensors must be used simultaneously from different angles to achieve adequate coverage of the subject. 
In overlapping regions, signal interference causes significant deterioration in the quality of the captured geometry. This problem can be avoided by minimizing the amount of overlap between sensors, but on the other hand, existing registration algorithms for aligning the captured partial scans only work reliably if the partial scans significantly overlap. Template-based methods like the work of Ye et al [101] circumvent these difficulties by warping a full geometric template to track the moving sparse partial scans, but templates are only readily available for naked humans [8]; for clothed humans a template must be precomputed on a case-by-case basis. We thus introduce a new shape registration method that can reliably register partial scans even with almost no overlap, sidestepping the need for shape templates or sensor arrays. This method is based on a visibility error metric which encodes the intuition that if a set of partial scans are properly registered, each partial scan, when viewed from the same angle at which it was captured, should occlude all other partial scans. We solve the global registration problem by minimizing this error metric using a particle swarm strategy, to ensure sufficient coverage of the solution space to avoid local minima. This registration method significantly outperforms state of the art global registration techniques like 4PCS [6] for challenging cases of small overlap. 35 Contributions. We present the first end-to-end free-viewpoint reconstruction framework that produces watertight, fully-textured surfaces of moving, clothed humans using only three to four handheld depth sensors, without the need of shape templates or extensive calibration. The most significant technical component of this system is a robust pairwise global registration algorithm, based on minimizing a visibility error metric, that can align depth maps even in the presence of very little (15%) overlap. 4.2 System Overview Our pipeline for reconstructing fully-textured, watertight meshes from three to four depth sensors can be decomposed into four major steps. See Figure 4.1 for an overview. 1. Data Capture: We capture the subject (who is free to move arbitrarily) using uncali- brated hand-held real-time RGBD sensors. We experimented with both Kinect One time-of-flight cameras mounted on laptops, and Occipital Structure IO sensors mounted on iPad Air 2 tablets (section 4.5). 2. Global Rigid Registration: The relative positions of the depth sensors constantly change over time, and the captured depth maps often have little overlap (10%-30%). For each frame, we globally register sparse depth images from all views (section 4.3). This step produces registered, but incomplete, textured partial scans of the subject. Figure 4.1: An overview of our textured dynamic surface capturing system. 36 3. Surface Reconstruction: To reduce flickering artifacts, we adopt the shape completion pipeline of Li et al [45] to warp partial scans from temporally-proximate frames to the current frame geometry. A weighted Poisson reconstruction step then extracts a single watertight surface. There is no guarantee, however, that the resulted fused surface has complete texture coverage (and indeed typically texture will be missing at partial scan seams and in occluded regions.) 4. Dense Correspondences for Texture Reconstruction: We complete regions of missing or unreliable texture on one frame by propagating data from other (perhaps very temporally- distant) frames with reliable texture in that region. 
We adopt a recently-proposed correspondence computation framework [95] based on a deep neural network to build dense correspondences between any two frames, even if the subject has undergone large relative deformations. Upon building dense correspondences, we transfer texture from reliable regions to less reliable ones. We next mainly describe the details of the global registration method. Please refer to the supplementary material for more details of the other components. 4.3 Robust Rigid Registration The key technical challenge in our pipeline is registering a set of depth images accurately without assuming any initialization, even when the geometry visible in each depth image has very little overlap with any other depth image. We attack this problem by developing a robust pairwise global registration method: let P 1 and P 2 be partial meshes generated from two depth images captured simultaneously. We seek a global Euclidean transformationT 12 which alignsP 2 toP 1 . Traditional pairwise registration based on finding corresponding points onP 1 andP 2 , and min- imizing the distance between them, has notorious difficulty in this setting. As such we propose 37 a novel visibility error metric (VEM) (Section 4.3.1), and we minimize the VEM to find T 12 (Section 4.3.2). We further extend this pairwise method to handle multi-view global registration (Section 4.3.3). 4.3.1 Visibility Error Metric Figure 4.2: Left: two partial scans P 1 (dotted) and P 2 (solid) of a 2D human. Middle: when viewed from P 1 ’s camera, points of P 2 are classified intoO (blue),F (yellow), andB (red). Right: when viewed fromP 2 ’s camera, points ofP 1 are classified intoO (blue),F (yellow), and B (red). SupposeP 1 andP 2 are correctly aligned, and consider looking at the pair of scans through a camera whose position and orientation matches that of the sensor used to captureP 1 . The only parts ofP 2 that should be visible from this view are those that overlap withP 1 : parts ofP 2 that do not overlap should be completely occluded byP 1 (otherwise they would have been detected and included inP 1 ). Similarly, when looking at the scene through the camera that capturedP 2 , only parts ofP 1 that overlap withP 2 should be visible. Visibility-Based Alignment Error We now formalize the above idea. LetP 1 ;P 2 be two partial scans, with P 1 captured using a sensor at position c p and view direction c v . For every point x2 P 2 , letI(x) be the first intersection point ofP 1 and the ray ! c p x. We can partitionP 2 into 38 three regions, and associate to each region an energy density d(x;P 1 ) measuring the extent to which pointsx in that region violate the above visibility criteria: pointsx2O that are occluded byP 1 :kxc p kkI(x)c p k: To points in this region we associate no energy: d O (x;P 1 ) = 0: pointsx2F that are in front ofP 1 :kxc p k <kI(x)c p k: Such points might exist even whenP 1 andP 2 are well-aligned, due to surface noise and roughness, etc. However, we penalize large violations using: d F (x;P 1 ) =kxI(x)k 2 : pointsx2B for whichI(x) does not exist. Such points also violate the visibility criteria. It is tempting to penalize such points proportionally to the distance betweenx and its closest point onP 1 , but a small misalignment could create a point inB that is very distant fromP 1 in Euclidean space, despite being very close toP 1 on the camera image plane. 
We therefore penalizex using squared distance on the image plane, d B (x;P 1 ) = min y2S 1 kP cv xP cv yk 2 ; whereP cv is the projectionIc v c T v onto the plane orthogonal toc v . Figure 4.2 illustrates these regions on a didactic 2D example. Alignment ofP 1 andP 2 from the point of view ofP 1 is then measured by the aggregate energyd(P 2 ;P 1 ) = P x2P 2 d(x;P 1 ). 39 Finally, every Euclidean transformationT 12 that produces a possible alignment betweenP 1 and P 2 can be associatedd(P 2 ;P 1 ) = P x2P 2 d(x;P 1 ) with an energy to define our visibility error metric onSE(3), E(T 12 ) =d T 1 12 P 1 ;P 2 +d (T 12 P 2 ;P 1 ): (4.1) 4.3.2 Finding the Transformation Low High (a) Iteration 1 Iteration k (b) Figure 4.3: (a) Left: a pair of range images to be registered. Right: VEM evaluated on the entire rotation space. Each point within the unit ball represents the vector part of a unit quaternion; for each quaternion, we estimate its corresponding translation component and evaluate the VEM on the composite transformation. The red rectangles indicate areas with local minima, and the red cross is the global minimum. (b) Example particle locations and displacements at iteration 1 andk. Blue vectors indicate displacement of regular (non-guide) particles following a tradi- tional particle swarm scheme. Red vectors are displacements of guide particles. Guide particles draw neighboring regular particles more efficiently towards local minima to search for the global minimum. Minimizing the error metric (4.1) consists of solving a nonlinear least squares problem and so in principle can be optimized using e.g. the Gauss-Newton method. However, it is non-convex, and prone to local minima (Figure 4.3(a)). Absent a straightforward heuristic for picking a good initial guess, we instead adopt a Particle Swarm Optimization (PSO) [43] method to efficiently minimize (4.1), where “particles” are candidate rigid transformations that move towards smaller energy landscapes in SE(3). We could independently minimize E starting from each particle 40 as an initial guess, but this strategy is not computationally tractable. So we iteratively update all particle positions in lockstep: a small set of the most promising guide particles, that are most likely to be close to the global minimum, are updated using an iteration of Levenberg-Marquardt. The rest of the particles receive PSO-style weighted random perturbations. This procedure is summarized in Algorithm 1, and each step is described in more detail below. Algorithm 1 Modified Particle Swarm Optimization 1: Input: A set of initial “particles” (orientations)fT 0 1 ;:::;T 0 N g2SE(3) N 2: evaluate VEM on initial particles 3: for each iteration do 4: select guide particles 5: for each guide particle do 6: update guide particle using Levenberg-Marquardt 7: end for 8: for each regular particle do 9: update particle using weighted random displacement 10: end for 11: recalculate VEM at new locations 12: end for 13: Output: The best particleT b Initial Particle Sampling We begin by samplingN particles (we useN = 1600), where each particle represents a rigid motionm i 2 SE(3). SinceSE(3) is not compact, it is not straight- forward to directly sample the initial particles. We instead uniformly sample only the rotational componentR i of each particle [75], and solve for the best translation using the following Hough- transform-like procedure. 
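A minimal sketch of this translation-voting step for one sampled rotation is given below, with the prose description following; the array layout and helper name are illustrative, the sign convention assumes the rotated P2 is moved onto P1, and in practice both scans would be subsampled for speed.

import numpy as np

def initial_translation(P1, N1, P2_rot, N2_rot, angle_thresh_deg=20.0, bin_size=0.01):
    # Hough-style translation estimate for one sampled rotation R_i.
    # P1, P2_rot: (N,3) and (M,3) points (P2 already rotated by R_i); N1, N2_rot: matching unit normals.
    votes = {}
    cos_thresh = np.cos(np.radians(angle_thresh_deg))
    for x, nx in zip(P1, N1):
        compatible = (N2_rot @ nx) > cos_thresh                 # pairs with similar normals may vote
        for y in P2_rot[compatible]:
            t = x - y                                           # translation moving the rotated P2 point onto x
            key = tuple(np.floor(t / bin_size).astype(int))     # 10mm x 10mm x 10mm bins
            votes[key] = votes.get(key, 0) + 1
    best_bin = max(votes, key=votes.get)
    return (np.array(best_bin) + 0.5) * bin_size                # center of the most-voted bin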
For everyx2P 1 andy2R i P 2 , we measure the angle between their respective normals, and if it is less than 20 , the pair (x;y) votes for a translation ofyx. These translations are binned (we use 10mm 10mm 10mm bins) and the best translationt 0 i is ex- tracted from the bin with the most votes. The translation estimation procedure is robust even in the presence of limited overlap amount (Figure 4.4). 41 The above procedure yields a setT 0 =fT 0 i g =f(R 0 i ;t 0 i )g ofN initial particles. We next describe how to step the orientation particles from their valuesT k at iteration k toT k+1 at iterationk + 1. Hough Transform Naive Method Figure 4.4: Translation estimation examples of our Hough Transform method on range scans with limited overlap. The na¨ ıve method, which simply aligns the corresponding centroids, fails to estimate the correct translation. Identifying Guide Particles We want to select as guide particles those particles with lowest visibility error metric; however we don’t want many clustered redundant guide particles. There- fore we first promote the particleT k i with lowest error metric to guide particle, then remove from consideration all nearby particles, e.g. those that satisfy d (R k j ;R k i ) r ; whered (R k i ;R k j ) = log h R k j i 1 R k i is the bi-invariant metric onSO(3), e.g. the least angle of all rotationsR withR k i =RR k j : We use = 30 . We then repeat this process (promoting the remaining particle with lowest VEM, removing nearby particles, etc) until no candidates remain. 42 Guide Particle Update We update each guide particle T k i to decrease its VEM. We pa- rameterize the tangent space of SE(3) at T k i by two vectors u;v 2 R 3 with exp(u;v) = exp([u] )R k i ;t k i +v , where [u] is the cross-product matrix. We then use the Levenberg- Marquardt method to find an energy-decreasing direction (u;v), and set T k+1 i = exp(u;v). Please see the supplementary material for more details. Other Particle Update Performing a Levenberg-Marquardt iteration on all particles is too expensive, so we move the remaining non-guide particles by applying a randomly weighted sum- mation of each particle’s displacement during the previous iteration, the displacement towards its best past position, and the displacement towards the local best particle within radius r (mea- sured usingd ) with lowest energy, as in standard PSO [43]. While the guide particles rapidly descend to local minima, they are also local best particles and drag neighboring regular particles with them for a more efficient search of all local minima, from which the global one is extracted (Figure 4.3(b)). Please refer to the supplementary material for more details. Termination Since the VEM of each guide particle is guaranteed to decrease during every iteration, the particle with lowest energy is always selected as a guide particle, and the local minima ofE must lie in a bounded subset ofSE(3). In the above procedure the particle with lowest energy is guaranteed to converge to a local minimum ofE. We terminate the optimization when min i jE(T k i )E(T k+1 i )j 10 4 : In practice this occurs within 5–10 iterations. 43 4.3.3 Multi-view Extension We extend our VEM-based pairwise registration method to globally align a total of M partial scans P 1 ;:::;P M by estimating the optimum transformation set T 12 ;:::;T 1M . 
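Since the multi-view energy below is built from the same one-sided terms d(., .), a minimal sketch of evaluating the visibility error metric of Section 4.3.1 for one candidate transformation may help fix ideas. It is illustrative only: the depth-map handling is simplified, the in-front term uses the depth gap along the camera ray as a stand-in for the squared distance to I(x), and the B-region term uses pixel distances rather than distances on the plane orthogonal to the view direction.

import numpy as np

def vem_one_direction(P2_world, depth1, K, world_to_cam1):
    # One-sided energy d(P2, P1): P2_world is an (N,3) array of scan-2 points in world coordinates,
    # depth1 an (H,W) depth map of scan 1 in meters (0 = no return), K the 3x3 intrinsics,
    # world_to_cam1 the 4x4 world-to-camera pose of the sensor that captured scan 1.
    H, W = depth1.shape
    valid_v, valid_u = np.nonzero(depth1 > 0)                   # pixels where P1 was observed
    cam = (world_to_cam1[:3, :3] @ P2_world.T).T + world_to_cam1[:3, 3]
    cam = cam[cam[:, 2] > 1e-6]                                 # ignore points behind the camera
    u = np.round(K[0, 0] * cam[:, 0] / cam[:, 2] + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * cam[:, 1] / cam[:, 2] + K[1, 2]).astype(int)
    energy = 0.0
    for ui, vi, zi in zip(u, v, cam[:, 2]):
        if 0 <= ui < W and 0 <= vi < H and depth1[vi, ui] > 0:
            z_ref = depth1[vi, ui]
            if zi >= z_ref:
                continue                                        # region O: occluded by P1, no cost
            energy += (z_ref - zi) ** 2                         # region F: in front of P1
        else:
            # region B: the ray misses P1; squared pixel distance to the nearest observed
            # P1 pixel serves as a proxy for the image-plane term d_B
            energy += float(np.min((valid_u - ui) ** 2 + (valid_v - vi) ** 2))
    return energy

The symmetric metric E(T12) of Eq. 4.1 evaluates this term in both directions, testing T12^-1 P1 against P2's depth map and T12 P2 against P1's depth map.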
First we per- form pairwise registration between all pairs to build a registration graph, where each vertex repre- sents a partial scan and each pair of vertices are linked by an edge of the estimated transformation. We then extract all spanning trees from the graph, and for each spanning tree we calculate its cor- responding transformation set T 12 ;:::;T 1M and estimate the overall VEM as, E M = X i6=j d A T 1 1j A T 1i P i ;P j +d A T 1 1i A T 1j P j ;P i : (4.2) We select the transformation set with the minimum overall VEM. We perform several iterations of Levenberg-Marquardt algorithm to minimize Equation 4.2 to further jointly refine the trans- formation set. We enforce temporal coherence into the global registration framework by adding the final estimated transformation set of the previous frame to the pool of transformation sets of the current frame before selecting the best one. 4.4 Global Registration Evaluation Data Sets. We evaluate our registration algorithm on the Stanford 3D Scanning Repository and the Princeton Shape Benchmark [74]. We use 4 models from the Stanford 3D Scanning Repository (the Bunny, the Happy Buddha, the Dragon, and the Amardillo), and use all 1814 models from the Princeton Shape Benchmark. We believe these two data sets, especially the latter, are general enough to cover shape variation of real world objects. For each data set, we 44 generated 1000 pairs of synthetic depth images with uniformly varying degrees of overlap; these range maps were synthesized using randomly-selected 3D models and randomly-selected camera angles. Each pair is then initialized with a random initial relative transformation. As such, for each pair of range images, we have the ground truth transformation as well as their overlap ratio. Evaluation Metric. The extracted transformation, if not correctly estimated, can be at any distance from the ground truth transformation, depending on the specific shape of the underlying surfaces and the local minima distribution of the solution space. Thus, it is not very informative to directly use the RMSE of rotation and translation estimation. It is rather straightforward to use success percentage as the evaluation metric. We claim the global registration to be successful if the errord (R est ;R gt ) of the estimated rotationR est is smaller than a small angle 10 . We do not enforce the translation to be close since it is scale-dependent and the translation component is easily recovered by a robust local registration method if the rotation component is close enough (e.g., by using surface normals to prune incorrect correspondences [67]). Effectiveness of the PSO Strategy. To demonstrate the advantage of the particle-swarm opti- mization strategy, we compare our full algorithm to three alternatives on the Stanford 3D Scan- ning Repository: 1) a baseline method that simply reports the minimum particles from all initially- sampled particles, with no attempt at optimization; 2) using only a traditional PSO formulation, without guide particles; and 3) updating only the guide particles, and applying no displacement to ordinary particles. Figure 4.5 compares the performance of the four alternatives. While updating guide particles alone achieves good registration results, incorporating the swarm intelligence further improves the performance, especially when overlap ratios drop below 30%. 45 Figure 4.5: Success percentage of the global registration method employing different optimiza- tion schemes on the Stanford 3D Scanning Repository. Comparisons. 
To demonstrate the effectiveness of the proposed registration method, we com- pare it against four other alternatives: 1) a baseline method that aligns principal axes extracted with weighted PCA [24], where the weight of each vertex is proportional to its local surface area; 2) Go-ICP [100], which combines local ICP with a branch-and-bound search to find the global minima; 3) FPFH [68, 70], which matches FPFH descriptors; 4) 4PCS, a state-of-the-art method that performs global registration by constructing a congruent set of 4 points between range images [6]. We do not compare with its latest variant SUPER-4PCS [56] as only efficiency is improved for the latter. For Go-ICP, FPFH and 4PCS, we use the authors’ original implementation and tune parameters to achieve optimum performance. Figure 4.6 compares the performance of the five methods on the two data sets respectively. The overall performance on the Princeton Shape Benchmark is lower as this data set is more challenging with many symmetric objects. As expected the baseline PCA method only works well when there is sufficient overlap. All previous methods experience a dramatic fall in accuracy once the overlap amount drops below 40%; 4PCS performs the best out of these, but because 4PCS is essentially searching for the most consistent area shared by two shapes, for small overlap 46 Figure 4.6: Success percentage of our global registration method compared with other methods. Left: Comparison on the Stanford 3D Scanning Repository. Right: Comparison on the Princeton Shape Benchmark. ratio, it can converge to false alignments (Figure 4.7). Our method outperforms all previous approaches, and doesn’t experience degraded performance until overlap falls below 15%. The average performance of different algorithms is summarized in Table 4.1. Table 4.1: Average success percentage of global registration algorithms on two data sets. Average running time is measured using a single thread on an Intel Core i7-4710MQ CPU clocked at 2.5 GHz. PCA GO-ICP FPFH 4PCS Our Method Stanford (%) 19.5 34.1 49.3 73.0 93.6 Princeton (%) 18.5 22.0 33.0 73.2 81.5 Runtime (sec) 0.01 25 3 10 0.5 Performance on Real Data. We further compare the performance of our registration method with 4PCS on pairs of depth maps captured from Kinect One and Structure IO sensors. The hardware setup used to obtain this data is described in detail in the next section. These depth maps share only 10%-30% overlap and 4PCS often fails to compute the correct alignment as shown in Figure 4.8. 47 Input Range Images PCA GO-ICP FPFH 4PCS Our Method Figure 4.7: Example registration results of range images with limited overlap. First and second row show examples from the Stanford 3D Scanning Repository and the Princeton Shape Bench- mark respectively. Please see the supplementary material for more examples. Limitations. Our global registration method works best when there is sufficient visibility infor- mation in the underlying range images, i.e. , when the depth sensor’s field of view contains the entire object and the background is removed. It tends to fail when the visibility information does not prevail, e.g. , range scans of indoor scenes depicting large planar surfaces. We plan to extend our method to handle those challenging cases in future work. 48 4.5 Dynamic Capture Results Hardware. We experiment with two popular depth sensors, namely the Kinect One (V2) sensor and the Structure IO sensor. We mount the former on laptops and extend the capture range with long power extension cables. 
For the latter, we attach it to iPad Air 2 tablets and stream data to laptops through wireless network. Kinect One sensors stream high-fidelity 512x424 depth images and 1920x1080 color images at 30 fps. We use it to cover the entire human body from 3 or 4 views at approximately 2 meters away. Structure IO sensors stream 640x480 for both depth and color (iPad RGB camera after compression) images at 30 fps. Per pixel depth accuracy of the Structure IO sensor is relatively low and unreliable, especially when used outdoor beyond 2 meters. Thus, we use it to capture small objects, e.g. , dogs and children, at approximately 1 meter away. Our mobile capture setting allows the subject to move freely in space in stead of being restricted to a specific capture volume. Pre-processing. For depth images, first we remove background by thresholding depth value and removing dominant planar segments in a RANSAC fashion. For temporal synchronization across sensors, we use visual cues, i.e. , jumping, to manually initialize the starting frame. Then we automatically synchronize all remaining frames by using the system time stamps, which are accurate up to milliseconds. Input Range Images 4PCS [Aiger et al. 08] Our Method Input Range Images 4PCS [Aiger et al. 08] Our Method Input Range Images 4PCS [Aiger et al. 08] Our Method Figure 4.8: Our registration method compared with 4PCS on real data. First two examples are captured by Kinect One sensors while the last example is captured by Structure IO sensors. 49 aligned scans Poisson reconstruction Denoised mesh texture reconstruction Poisson blending [Chuang et al. 09] Figure 4.9: From left to right: Globally aligned partial scans from multiple depth sensors; The water-tight mesh model after Poisson reconstruction [42]; Denoised mesh after merging neighbor- ing meshes by using [45]; Model after our dense correspondences based texture reconstruction; Model after directly applying texture-stitcher [23]. Performance. We process data using a single thread Intel Core i7-4710MQ CPU clocked at 2.5 GHz. It takes on average 15 seconds to globally align all the views for each frame, 5 minutes for surface denoising and reconstruction, and 3 minutes for building dense correspondences and texture reconstruction. Results. We capture a variety of motions and objects, including walking, jumping, playing Tai Chi and dog training (see the supplementary material for a complete list). For all captures, the performer(s) are able to move freely in space while 3 or 4 people follow them with depth sensors. As shown in Figure 4.9, our geometry reconstruction method reduces flickering artifacts of the original Poisson reconstruction, and our texture reconstruction method recovers reliable texture on occluded areas. Figure 4.10 provides several examples that demonstrate the effectiveness and flexibility of our capture system. Our global registration method plays a key role as most range images share only 10% to 30% overlap. While we demonstrate successful sequences with 3 depth sensors, an additional sensor typically improves the reconstruction quality since it provides higher overlap between neighboring views leading to a more robust registration. 50 As opposed to most existing free-form surface reconstruction techniques, our method can handle performances of subjects that move through a long trajectory instead of being constrained to a capture volume. 
Since our method does not require a template, it is not restricted to human performances and can successfully capture animals for which obtaining a static template would be challenging. The global registration method employed for each frame effectively reduces drift for long capture sequences. We can recover plausible textures even in occluded regions. Figure 4.10: Example capturing results. The sequence in the lower right corner is reconstructed from Structure IO sensors, while other sequences are reconstructed from Kinect One Sensors. 4.6 Conclusion We have demonstrated that it is possible, using only a small number of synchronized consumer- grade handheld sensors, to reconstruct fully-textured moving humans, and without restricting the subject to the constrained environment required by stage setups with calibrated sensor arrays. Our system does not require a template geometry in advance and thus can generalize well to a variety of subjects including animals and small children. Since our system is based on low-cost devices 51 and works in fully unconstrained environments, we believe our system is an important step toward accessible creation of VR and AR content for consumers. Our results depend critically on our new alignment algorithm based on the visibility error metric, which can reliably align partial scans with much less overlap than is required by current state-of-the-art registration algorithms. Without this alignment algorithm, we would need to use many more sensors, and solve the sensor interference problem that would arise. We believe this algorithm is an important contribution on its own, as a significant step forward in global registration. 52 Chapter 5 Home Monitoring Patients with Musculo-Skeletal Disorders 5.1 Introduction The population of patients with musculo-skeletal disorders (e.g. Parkinson’s Disease, Stroke) has been continuously increasing worldwide over these years, with approximately 50,000 new cases of Parkinson’s Disease each year. The musculo-skeletal disorder evaluation plays a key role in determining the patient’s medication prescription as well as the rehabilitation plan and it is usu- ally carried out by asking the subject to perform several standardized tests (e.g. walk back and forth, sit and stand that are components of the United Parkinson’s Disease Rating Scale (UPDRS) [31]) while the clinicians observe the activity for stability, smoothness and coordination. Al- though currently the patient-clinician interactive evaluation mode shown in Fig 5.1(a) dominates, there is clearly room for better, more effective and efficient approaches due to the following six reasons. First, the clinician’s evaluation is mostly subjective instead of quantitative. Though there are clinical scales such as the UPDRS, these tools suffer from low resolution and the need for sig- nificant training before one can obtain valid and reliable metrics. Second, some of the evaluation process is simple and is often repeated providing strong support for automatization. Third, the 53 (a) (b) Figure 5.1: (a) Traditional patient-clinician evaluation mode (b) New home-based monitoring and evaluation mode entire process can be time consuming considering the patient needs to travel to the appointment, prepare and even take medication in advance. Fourth, the patients usually prefer to stay at home instead of travelling to the clinician’s office where the risk of injury could be higher. Five, the patients may behave differently when examined in the outpatient clinics [11]. 
Six, there are not enough medical resources to satisfy all patients’ day-to-day requirements. A reliable and accurate home monitoring and evaluation system is an attractive alternative (Fig 5.1(b)). The system acquires the subject’s data and processes it to enable meaningful inter- pretations (e.g. objective measurements). The processed data is sent to the clinician via network 54 for further clinical recommendations. The home monitoring system then provides feedback to the user combining the clinician’s and the system’s recommendations. Since the activity assessment is carried out in the patient’s familiar environment, the data are more ecologically valid and the inconvenience and potential risks associated with a clinic visit are reduced. More importantly, the recommendations are now based on both the clinician’s experience and the quantitative analysis. This enables several types of applications. Rapid adjustment for more precise medication level. Patients with musculo-skeletal disorders may need to take medication everyday to maintain the mobility level. For patients with PD, excessive anti-Parkinsonian medication can cause overactivity while insufficient dosage could lead to Freezing of Gait (FOG). The system recommends a more precise medication level based on the extracted clinical measurements compared with the baseline. Long term evaluation and prediction. The subject’s functionalities fluctuate over time and inevitably decreases with disease progression. The system evaluates and predicts the trend long term by acquiring and analyzing the patients’ data at certain intervals (e.g. once a week). In other words, the system builds a long term motion profile of the subject. Rehabiliation. The patient’s long term motion profile provides the clinician with rich infor- mation to adjust and manage the rehabilitation plan in a customized manner. The system also helps evaluate the patient’s progress, such as a response to a specific rehabilitation program. Several attempts have been made at automating and quantifying home-based musculo-skeletal disorders evaluation. The Objective Parkinson’s Disease Measurement (OPDM) System [4] can extract the motor score of a patient by putting a combination of accelerometers, gyroscopes and magnetometers on the subject’s sternum, wrists, ankles, and sacrum. After the subject performs 55 several standardized tests, all motion data are processed to calculate a single motion score indi- cating the subject’s mobility level. Several similar systems like Kinesia HomeView [1], Motus Movement Monitor [3] also exist. While these existing systems have been clinically validated and partially solve the home mon- itoring challenge, they are quite intrusive by asking the subject to put on several sensors each time before use. Also, these systems provide single measurements instead of detailed analysis, which is a much richer representation. These incomplete solutions present an opportunity for develop- ing better tools using computer vision. While traditional Motion Capture (MOCAP) systems can accurately track a subject, their cost and cumbersome set-up prevent wide applications outside the laboratory environment. The recent release of low-cost 3D sensors which generate depth streams provides a possible solution. We acquire the sequential skeletal data of the subject using a single 3D sensor. The skele- tal stream is further processed to decouple the complex spatio-temporal information. 
Finally we generate a representative skeletal sequence which exhibits the subject’s most consistent motion pattern. Based on this representation, we extract detailed spatio-temporal objective measure- ments. Our main contributions are: 1) A non-invasive home monitoring and evaluation system for patients with musculo-skeletal disorders using remote sensor technologies; 2) A methodology for skeletal data processing to decouple the spatio-temporal information; 3) An accurate and reliable skeletal sequence representation based on which multiple objective measurements are extracted; 4) Experimental evaluation results of PD people and non-PD people. 56 Figure 5.2: General pipeline of the proposed method Sec 5.2 describes our proposed method in details including overview, data acquisition, seg- mentation, temporal alignment, spatial summarization and TASS validation. Sec 5.3 demon- strates the effectiveness of our method by several experiments on the PD and non-PD subjects. Sec 5.4 ends with the conclusion and future work. 57 5.2 Method 5.2.1 Overview The general pipeline of our method is shown in Fig 5.2. While the outer loop displays the flowchart, the inner square exhibits the data. The corresponding symbols (i.e. brackets, arrows) visualize our process and intermediate results. We use the 3D sensor to capture a depth stream of the subject performing a standardized clinical assessment and further extract a skeletal stream (Sec 5.2.2). While the complete skeletal stream is cyclical in nature, we segment it into multiple repeated Skeletal Action Units (SAU) by detecting the periodicity in a projected feature space (Sec 5.2.3). Although all SAUs are motion sequences of the same subject in a short time period, they differ in both temporal domain and spatial domain. We propose the key Temporal Align- ment Spatial Summarization (TASS) method to decouple the complex spatial-temporal informa- tion and generate a single robust representation. First, a SAU is selected as the reference and all other SAUs are temporally aligned with the referece (Sec 5.2.4) to build the temporal correspon- dences. Then all SAUs are summarized spatially to generate the Representative Skeletal Action Unit (RSAU) (Sec 5.2.5) which encodes the most consistent motion pattern observed among all SAUs. We validate the choice of TASS method on the MSR-Action 3D Dataset (Sec 5.2.6). 5.2.2 Data Acquisition The system set-up is easy and convenient. A 3D sensor is horizontally fixed (e.g. on table, top of TV) and a skeletal stream is extracted while the subject performs standardized tests in front of it. 58 3D sensor and skeleton extraction algorithm. We use the Kinect sensor which can pro- vide 640480 depth images at 30 fps. Skeletons are extracted in real time using [76] which is implemented in the Microsoft Kinect SDK [2]. Standardized tests. A number of standardized instruments for patients with musculo-skeletal disorders have been proposed and clinically validated. We list the ones we use for our experi- ments. Walking Test. The subject is asked to walk back and forth multiple times along a line for a one way distance of four meters. The test is adapted from a standardized six meters walking test considering the working range of the 3D sensor. Walking-with-counting Test. The subject is asked to count from 1 to 100 while walking. A constant number is added each time, e.g. 1, 5, 9,...,97. The added cognitive load of counting is an indicator of automatic control of the primary walking task. Sit-to-stand Test. 
The subject is asked to sit down and stand up as fast as possible multiple times. Each test usually consists of five repetitions and can be carried out by a subject in several minutes. All tests are cyclical in nature and consist of multiple simple action units. For the walking- based tests, we define an action unit to be two steps. For the Sit-to-stand Test, we define an action unit to be sitting and standing one time. Although the test is performed by a subject in a short period of time, we often observe variations between repeated action units. We analyze and summarize them into two categories. Large motion variation. For a person with musculo-skeletal disorders, some action units exhibit extreme pose. For example, in the Walking Test while the patient is doing fine with most 59 steps, he/she shows extremely unbalanced pose during a few steps. On the other hand, for a person without musculo-skeletal disorders, large variation is rarely observed and the motion is more consistent. Small motion variation. No one performs the same task exactly the same every repetition. For example, each step actually differs slightly in length while people walk. This type of variation is generally small and observed for all people. The small motion variation is random, and we regard as noise. Our purpose is to eliminate it by averaging multiple repeated action units (Sec 5.2.5). In other words, we enhance the consistent motion pattern by removing this noise. On the other hand, we use a robust method to detect the few action units with large motion variation. The action units are regarded as outliers and should not be taken into account for averaging. In order to average or reject an action unit, we must decouple the complex spatio-temporal information so that frame-to-frame correspondenes are built and comparing two skeletons from two different SAUs in Euclidean Space is meaningful. 5.2.3 Segmentation A key observation is that the skeletal stream consists of multiple repeated Skeletal Action Units (SAU). For example, walking two steps is defined as a SAU for walking-based tests while sitting down and standing up one time is defined as a SAU for the Sit-to-stand Test. We formulate the segmentation problem of the skeletal stream as follows. A skeletal stream is represented as X = [x 1 ;x 2 ;:::;x N ]2R 3MN whereN is the total number of frames,M is the number of 3D 60 joints per frame, andx i = [p t i;1 ;p t i;2 ;:::;p t i;M ] T 2R 3M1 withp i;j = (x i;j ;y i;j ;z i;j ) t 2R 31 . We are trying to find the Segmentation Vector S = [b 1 ;e 1 ;b 2 ;e 2 ;:::;b K ;e K ] t 2R 2K1 (5.1) such that the K segmented SAUs can be represented as X S = fX b i :e i 1 ;i = 1; 2;:::;Kg. In other words, (b i ;e i ) are the indices of the start frame and the end sentinel frame of the ith segmented SAU and they must satisfy 1b 1 <e 1 b 2 <e 2 :::<t K N. We define a feature function :R 3M1 !R onX such that(X) = [(x 1 );(x 2 );:::;(x N )] = [ 1 ;:::; N ]2R 1N . While the periodicity of motion is observed in theR 3MN space, it is pro- jected to theR 1N subspace by a well designed function. To better deal with noise, the feature subspace is further convoluted with a Gaussian filter such that ~ i = P i+h j=ih e (ji) 2 2 2 p 2 j andS is finally determined by detecting the local maximas of [ ~ 1 ; ~ 2 ;:::; ~ N ]. For the sit-to-stand test, is defined to be the height of the head joint and an example is shown in Fig 5.3. For the walking- based tests, is defined to be the signed distance between two feet. 
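A minimal sketch of this segmentation step, assuming the skeletal stream is stored as an (N, M, 3) array of joint positions, is given below. The joint indices, Gaussian width, and minimum peak spacing are illustrative choices rather than the values used in our system.

import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import find_peaks

def segment_saus(joints, test="walk", sigma=3.0, min_gap=15):
    # joints: (N, M, 3) array of 3D joint positions over N frames.
    # Returns (b_i, e_i) index pairs delimiting the Skeletal Action Units.
    if test == "sit_to_stand":
        HEAD = 0                                            # illustrative joint index
        phi = joints[:, HEAD, 1]                            # feature: height of the head joint
    else:
        L_FOOT, R_FOOT = 15, 19                             # illustrative joint indices
        phi = joints[:, L_FOOT, 2] - joints[:, R_FOOT, 2]   # feature: signed foot separation
    phi_smooth = gaussian_filter1d(phi, sigma)              # Gaussian smoothing of the 1D feature signal
    peaks, _ = find_peaks(phi_smooth, distance=min_gap)     # local maxima of the smoothed signal
    return [(int(b), int(e)) for b, e in zip(peaks[:-1], peaks[1:])]   # consecutive maxima bound one SAU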
5.2.4 Temporal Alignment Temporal alignment methods. Many temporal alignment algorithms have been initially pro- posed to solve the video synchronization problem [66], using Dynamic Time Warping (DTW) or its variant. Recently, Canonical Component Analysis (CCA), which is proposed for learning the shared subspace between two high dimensional features, has been extended as Canonical Time Warping (CTW) [104] to address the spatio-temporal alignment between two human motion se- quences. Another work proposed by Gong and Medioni [36] address the problem using Dynamic 61 Manifold Warping (DMW). They extend previous works on spatio-temporal alignment by incor- porating manifold learning and employing a novel robust similarity metric. In this paper, we use their method and give an intuition and brief introduction. We strongly encourage the interested readers to read the original paper. Problem formulation. Given multiple segmented SAUsX S , we pick one SAU as the refer- ence (as explained later) and all other SAUs are aligned with the reference in a pairwise manner. Two simplify the notations, we formulate the problem of temporal alignment between two SAUs. Given two SAUsX 1:Lx 2 R 3MLx andY 1:Ly 2 R 3MLy , we try to find the Optimal Align- ment PathQ = [q 1 ;q 2 ;:::;q Ly ]2R 1Ly that alignsY i withX q i . We do that by minimizing the following loss function (kk F is the Frobenius norm operator), Ł DMTW (F();F();W) =kF(X 1:Lx )F(Y 1:Ly )W T k 2 F (5.2) whereF : R 3M1 ! R L1 mapsX 1:Lx andY 1:Ly toF(X 1:Lx )2 R LLx andF(Y 1:Ly )2 R LLy in aL dimensional subspace. W2f0; 1g LxLy encodesQ such that whenq i = k we haveW i;k = 1;i = 1; 2;:::;L x . Completion score. Although eachx i inX 1:Lx lies in a high dimensional space, the natural property of human pose suggests thatx i has lower intrinsic number of degrees of freedom. In other words, X 1:Lx can be regarded as traversing along a pathM p on a spatial-manifoldM. The geodesic distance d Geo (x i ;x i+1 ;M p )inM can be further estimated using Tensor V oting [58] which is a non-parametric framework proposed to estimate the geometric information of 62 manifolds. Knowing the geodesic distance between consecutive frame, we assign a completion score i to framex i as i = P i1 s=1 d Geo (x s ;x s+1 ;M p ) P Lx1 s=1 d Geo (x s ;x s+1 ;M p ) : (5.3) By definingF() asF(x i ) = x i we can rewrite the loss term in (2) as Ł DMTW (W) =k x y Wk 2 F where x = [ x 1 ; x 2 ;:::; x Lx ] and y = [ y 1 ; y 2 ;:::; y Ly ]. Algorithm. The key for solving Eq 5.2 lies in the Temporal Aligning MatrixA =fa i;j g LxLy wherea i;j =kF(x i )F(y j )k 2 . And in our casea i;j = ( x i y j ) 2 . Two examples are shown in Fig 5.4. The Optimal Alignment Path is found by looking for the shortest path traversing from A 1;1 toA Lx;Ly (white arrows in Fig 5.4). Dynamic programming provides an efficient solution in O(L x L y ) to obtain Q after calculatingA following the traditional Dynamic Time Warping (DTW) algorithm. Temporal Aligning Score . The Optimal Alignment Path indicates how two time series match each other temporally. Fig 5.4(a) displays a linear path indicating that while one motion sequence is slower than another the delay is equally distributed. On the contrary, Fig 5.4(b) exhibits non-linear correspondences between two time series. From a medical perspective, this indicates possible pauses, or even worsely, Freezing of Gait (FOG) (Sec 5.3) of the subject during specific repetition. 
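For concreteness, the alignment just described can be sketched as follows, with the plain Euclidean frame-to-frame distance standing in for the Tensor-Voting geodesic distance d_Geo used in the actual method; function names and array shapes are illustrative.

import numpy as np

def completion_scores(X):
    # X: (L, D) sequence of skeleton frames, each flattened to D = 3M values.
    # Cumulative normalized arc length (Eq. 5.3), with Euclidean steps as a stand-in for d_Geo.
    step = np.linalg.norm(np.diff(X, axis=0), axis=1)
    rho = np.concatenate([[0.0], np.cumsum(step)])
    return rho / rho[-1]

def align(X, Y):
    # Optimal Alignment Path q (length Ly) matching Y[j] to X[q[j]], found by dynamic
    # programming on the Temporal Aligning Matrix A[i, j] = (rho_x[i] - rho_y[j])**2.
    rx, ry = completion_scores(X), completion_scores(Y)
    Lx, Ly = len(rx), len(ry)
    A = (rx[:, None] - ry[None, :]) ** 2
    acc = np.full((Lx, Ly), np.inf)
    acc[0, 0] = A[0, 0]
    for i in range(Lx):
        for j in range(Ly):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            acc[i, j] = A[i, j] + prev
    # backtrack from (Lx-1, Ly-1) to (0, 0), keeping one X index per Y frame
    i, j, path = Lx - 1, Ly - 1, {Ly - 1: Lx - 1}
    while i > 0 or j > 0:
        moves = [(acc[i - 1, j - 1], i - 1, j - 1) if i > 0 and j > 0 else (np.inf, i, j),
                 (acc[i - 1, j], i - 1, j) if i > 0 else (np.inf, i, j),
                 (acc[i, j - 1], i, j - 1) if j > 0 else (np.inf, i, j)]
        _, i, j = min(moves)
        path[j] = min(path.get(j, i), i)
    return np.array([path[j] for j in range(Ly)])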
While not covered in the original paper, we introduce to quantify the non- linearity of the Optimal Alignment Path. GivenX 1:Lx , Y 1:Ly and the optimal alignment path Q2R 1Ly , we define the the Temporal Aligning Score as / P Ly i=1 j(L x 1)(i 1) (L y 1)(q i 1)j L y ((L x 1) 2 + (L y 1) 2 ) : (5.4) 63 is a normalized scale-invariant score quantifying how much the alignment path Q deviates from the liney = Lx1 Ly1 (x 1) + 1. The larger the score, the bigger the non-linearity among the temporal correspondences between two SAUs. For Fig 5.4(a), = 0:0093 while for Fig 5.4(b) = 0:0355. Reference selection. The reference SAU must match all other SAUs as much as possible in the temporal domain. So we pick one SAU at a time, align it with all remaining SAUs and calculate the sum of the Temporal Aligning Scores. The sum of all Temporal Aligning Scores indicates how worse the current SAUs conform to other SAUs and we pick the one with the minimum score. In other words, the reference SAU is arg mini X j6=i i;j (5.5) when defining i;j as the Temporal Aligning Score between theith and thejth SAUs. 5.2.5 Spatial Summarization After segmentation and temporal alignment, we have segmented SAUs X S =fX b i :e i 1 ;i = 1; 2;:::;Kg as well as the Optimal Alignment Path for each SAUfQ i 2R 1 ~ L ;i = 1; 2;:::;Kg whereQ i = [q i 1 ;q i 2 ;:::;q i ~ L ] and ~ L is the reference SAU’s total number of frames. In other words, we have a set of temporally aligned SAUsfX Q i;i = 1; 2;:::;Kg whereX Q i = [x q i 1 ;x q i 2 ;:::;x q i ~ L ]. As mentioned before (Sec 5.2.2), these SAUs exhibit small and large motion variations. We re- gard the small motion variation as noise and the large motion variation as outlier. We try to cancel out small variations by averaging over SAUs of consistent motion pattern while detecting the SAUs with large motion variation. 64 Problem Formulation. The RSAU is represented asX R = [x R 1 ;x R 2 ;:::;x R ~ L ]2R 3M ~ L . We find the optimal RSAU by minimizing K X i=1 ~ L X j=1 D(x R j ;x q i j ) + ~ L1 X j=1 D(x R j ;x R j+1 ); (5.6) whereD : R 3M1 R 3M1 ! R is interpreted as a distance measure between two skeletal frames. In all our experiments, we defineD(;) to beD(x i ;x j ) = P M k=1 kp i;k p j;k k 2 , i.e. the Euclidean Distance betweenx i andx j . Eq 5.6 has a data term and a smoothness term and is a parameter used to tune the ratio between them. Data Term. P K i=1 P ~ L j=1 D(x R j ;x q i j ) minimizes the distance between the RSAU with all SAUs. This term forces RSAU to behave like an average result of multiple SAUs and is used to remove small motion variations. Smoothness Term. P ~ L1 j=1 D(x R j ;x R j+1 ) minimizes the distance between two consecutive frames of the RSAU. While the 3D sensor extracts skeletons at 30fps, the relative motion between two consecutive frames is small. Hence this term is useful for rejecting large measurement noise (e.g. Some joints of the skeleton jumping between consecutive frames). tunes the ratio between these two terms. When the input skeleton is very noisy, is set larger to enforce smoothness. In all our experiments, we set = 0:1 for the Microsoft Kinect SDK skeletal data (actually this data has already been smoothed). Algorithm. Eq 5.6 is rewritten as a Linear Least Square problem which is efficiently solved by Gaussian Elimination. In practice, we combine RANSAC with Eq 5.6 to help detect the SAUs with large motion variation. At each iteration, we select a subset of SAUs and calculate the 65 corresponding RSAU. 
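The inner solve just mentioned, computing the RSAU from a selected subset of temporally aligned SAUs, reduces to a small linear system. The sketch below treats D as the squared Euclidean distance so that Eq. 5.6 becomes linear least squares, as noted above; array shapes and helper names are illustrative.

import numpy as np

def compute_rsau(aligned_saus, lam=0.1):
    # aligned_saus: list of K arrays, each (L, M, 3), already warped to the reference SAU's L frames.
    # Minimizes Eq. 5.6: a data term pulling each RSAU frame toward all SAUs plus a smoothness
    # term between consecutive RSAU frames; the normal equations are (K*I + lam*Lap) R = sum_i S_i.
    S = np.stack(aligned_saus)                     # (K, L, M, 3)
    K, L = S.shape[0], S.shape[1]
    Lap = np.zeros((L, L))                         # 1D chain Laplacian for the smoothness term
    for j in range(L - 1):
        Lap[j, j] += 1.0; Lap[j + 1, j + 1] += 1.0
        Lap[j, j + 1] -= 1.0; Lap[j + 1, j] -= 1.0
    A = K * np.eye(L) + lam * Lap
    B = S.sum(axis=0).reshape(L, -1)               # (L, 3M) right-hand side
    R = np.linalg.solve(A, B)                      # solved independently per coordinate
    return R.reshape(L, -1, 3)                     # the Representative SAU

def sau_distance(rsau, sau):
    # Sum over frames of the per-frame distance D, used for the inlier test described next.
    return float(np.linalg.norm(rsau - sau, axis=2).sum())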
Then we compare the distance between the RSAU with all SAUs to find the inliers. The RSAU with the most number of inliers is output as our final result. 5.2.6 TASS Validation TASS is used to capture the most consistent motion pattern among multiple SAUs and incorporate such information into a single RSAU. RSAU can be regarded as a rich and robust representation from which many clinically relevant parameters can be extracted. To validate our choice of the TASS approach, we apply it on the 3D activity recognition task and compare its performance on the MSR-Action 3D Dataset [52] with several state-of-the-art 3D activity recognition algorithms. MSR-Action 3D Dataset contains 20 actions performed 3 times by each of 10 subjects. We removed some extreme outliers and used 547 out of the original 567 SAUs. These SAUs cover various movement of arms, legs and torso. First we normalized all skeletons to the same size. Then for each action type we trained a single RSAU using the TASS method. In this case, RSAU captures the most consistent motion pattern of a specific action type crossing different subjects. At the testing stage, given a SAU, we temporally aligned all 20 RSAUs with it and calculate the Euclidean Distance. The testing SAU is associated with the action with the minimum Euclidean Distance. We report accuracy on the cross-subject test setting [52] and compare our method with other state-of-the-art methods (Tab 5.1). If we randomly select a SAU to represent an action type, we achieve an accuracy of 0.606. After we enhance the representation using the TASS method, we boost the accuracy up to 0.819. Our method reaches competitive performance even comparing with most fine-tuned 3D activity 66 Table 5.1: Recognition Accuracy Comparison on MSR-Action 3D Dataset Method Accuracy Dynamic Temporal Warping [60] 0.540 Random Sampled SAU 0.606 Action Graph on Bag of 3D Points [52] 0.747 View-invariant Histogram of 3D Joints [99] 0.789 RSAU 0.819 Actionlet Ensemble [92] 0.882 recognition algorithms. The result demonstrates the effectiveness of the TASS method and shows that RSAU is a much richer representation than single SAU. 5.3 Experiments Parkinson’s Disease is a degenerative disorder of the central nervous system [40]. The most ob- vious symptoms of Parkinson’s Disease are motion related which include slowness of movement (i.e. Bradykinesia), resting tremor, rigidity and postural instability. While some symptoms are consistent, others occur on an episodic basis (e.g. Festination and Freezing of Gait) [12]. De- tecting and quantifying these symptoms are crucial for deciding exact medicine level, designing rehabilitation plan and preventing falls which are most risky for PD patients [11]. Two experi- ments were conducted and the data was processed using our proposed TASS method. The RSAU was able to quantify a series of spatio-temporal clinical measurements which are crucial for eval- uation of the subject’s mobility level. Walking-based experiment. This experiment was performed by a PD subject and a non-PD subject. These two subjects were similar in size. The non-PD subject performed the Walking Test while the PD subject carried out both the Walking Test and the Walking-with-counting Test. 67 Three RSAUs were generated from our data. A series of spatio-temporal indicators were then automatically extracted. Step size. We project the joints of two feet to the ground. 
The step size is calculated based on the difference between the start position of the first frame and the end position of the last frame. Postural Swing Level. Postural Swing Level (PSL) quantifies how stable the pose is while a person walks. We extract it by projecting the joint of torso (i.e. Center of Mass) to the ground. Then we find its Maximum Absolute Deviation along the direction perpendicular to the subject’s moving direction. The larger the value, the more unstable the pose is. Arm Swing Level. Arm Swing Level (ASL) indicates the degree of arm motion. We extract it by projecting the hand joint and the torso joint to the ground and calculating the signed distance between them for each frame. Again we use the Maximum Absolute Deviation. The smaller the ASL, the stiffer the specific arm. Stepping time. This indicates the most consistent time of the subject walking two steps. Table 5.2: Results of PD and non-PD on walking-based tests Indicator non-PD W PD W PD W&C L step size 95.2 96.1 56.7 R step size 91.3 90.2 57.3 PSL 3.05 5.13 8.35 L ASL 12.29 5.49 3.76 R ASL 10.68 10.66 3.28 Stepping time 0.89 1.61 2.14 The experimental results are displayed in Tab 5.2 with spatial indicators measured in cen- timeters and temporal indicators measured in seconds. We have the following observations. —While the non-PD subject and the PD subject show similar step sizes during the Walking Test, the PD subject’s step size decreases dramatically as soon as the dual counting task is added. —The PD subject exhibits larger Postural Swing Level than the non-PD subject and the situation 68 gets worse after adding the dual task. The result quantifies the PD subject’s Postural Instability. —The PD subject has a stiff left arm during the walking test and both arms become stiff with the dual task added. The result quantifies the Rigidity of the PD subject’s arms. —It takes more time for the PD subject to walk two steps and even more time to walk two steps while counting. The result quantifies the Bradykinesia of the PD subject. Sit-to-stand experiment. The same PD subject and non-PD subject carried out the Sit-to- stand Test. Similar spatio-temporal measurements were extracted from the respective RSAU which quantified the motion differences between the PD subject and the non-PD subject. They are not displayed considering redundancy. The PD subject exhibited pauses during a specific repetition, and we temporally aligned the corresponding SAU with the RSAU. Fig 5.5 displays the corresponding Temporal Aligning Matrix. During pauses, the Optimal Alignment Path shows sparser points and this is well captured by the Kernel Density Estimation [78] method as local minimas. In Fig 5.5 we picked up two pauses with one representing the subject leaning against the chair and the other representing the subject having difficulty standing up. Detection of pause and/or FOG is important for the clinician to evaluate the subject’s postural instability as well as risk of fall. Applications. Quantifying the PD patients’ musculo-skeletal disorders has many potential applications. —Falls prevention and warning. Abnormal pause, occurence of Freezing of Gait, and extreme postural instability are often strong indicators of falls which are very dangerous for the PD pa- tients. Hence quantification of them is crucial for prevention or alerting the patient/clinician of possible falls. —More precise medication level. The PD patient usually takes medicine everyday to maintain 69 their ’on’ state, i.e. smooth skeletal muscle movements. 
As the medicine effect ’wears off’, the person becomes very stiff, slow and may even be unable to move in a few minutes. In some cases of the PD patients, the ’on-off’ fluctuations are unpredictable. Our system can recommend them to take medicine at the exact time with the exact amount by analyzing several objective measure- ments, e.g. step size, stepping time. —Long term monitoring to observe disease progression and/or treatment effectiveness. The PD patient gradually lose the motor ability as an inevitable effect of the progressive deterioration of the nervous system. Building a long term quantified profile for each specific patient is important for deterioration evaluation and prediction, rahabilitation plan design and even help the clinicians better understand the underlying mechanism and come up with a more effective solution. Discussion. The key idea here is not what medical measurements we can extract from this specific experiment, but rather we provide a robust methodology to decouple the complex spatio- temporal information. Instead of providing single scores, we provide the source (i.e. RSAU) from which many well-defined medical indicators can be extracted and used in many related applications. It is worth noting that, however, our main focus in this paper is not clinical validation of these spatio-temporal gait parameters. This process asks for more experiments on the PD subjects and close collaboration with experienced clinicians. 5.4 Conclusion To summarize, we address the following problems: A non-invasive home monitoring and evalu- ation system for patients with musculo-skeletal disorders using a 3D sensor; A methodology to decouple the spatio-temporal information between multiple skeletal sequences; A compact and 70 robust skeletal sequence representation based on which multiple medical indicators are extracted; Experimental evaluation results on the PD and non-PD subjects. 71 Figure 5.3: Illustration of segmentation based on periodicity of feature space(X) (a) (b) Figure 5.4: (a) Linear alignment path between two SAUs (b) Non-linear alignment path between two SAUs Figure 5.5: Detection of Pause/FOG (Top to bottom: Skeletal stream; Temporal Aligning Matrix; Estimated density) 72 Chapter 6 Conclusion and Future Work We have presented a novel framework to exploit the visibility information for robust range image registration, which allows us to consistently register pairs of range images without initialization, even in the presence of 15% overlap. We further developed extensions to handle multiple views as well as articulation. With the proposed multi-view rigid and articulated registration methods, we demonstrated that it is possible to turn a single commodity depth sensor into a 3D body scanning station. With the system, daily users can easily generate their 3D representation and plug into any suitable application. With the proposed global registration method, we have demonstrated that it is possible, using only a small number of synchronized consumer-grade handheld sensors, to reconstruct fully- textured moving humans, and without restricting the subject to the constrained environment re- quired by stage setups with calibrated sensor arrays. Our system does not require a template geometry in advance and thus can generalize well to a variety of subjects including animals and small children. 73 We believe that our human performance digitization systems are an important step toward accessible creation of VR and AR content for consumers. 
Our results depend critically on our new registration algorithm based on visibility information, which can reliably align partial scans with much less overlap than is required by current state-of-the-art registration algorithms. Without these alignment algorithms, we would need either to use more sensors or to demand longer capture sessions from users in order to obtain more scans.

In the future, we plan to further extend the registration method so that it can easily handle range images with dominant planar geometry, e.g., range scans of indoor scenes. At the application level, we wish to develop an automatic 3D indoor scene scanning method, which could potentially be carried out by a drone.

Reference List

[1] Kinesia homeview. http://glneurotech.com/kinesia/homeview/.
[2] Microsoft kinect sdk. http://www.microsoft.com/en-us/kinectforwindows/.
[3] Motus motion monitor. http://www.motusbioengineering.com/index.htm.
[4] Objective parkinson's disease measurement. http://kineticsfoundation.org/.
[5] N. Ahmed, C. Theobalt, P. Dobrev, H.-P. Seidel, and S. Thrun. Robust fusion of dynamic shape and normal capture for high-quality reconstruction of time-varying geometry. In IEEE CVPR, pages 1–8, June 2008.
[6] Dror Aiger, Niloy J Mitra, and Daniel Cohen-Or. 4-points congruent sets for robust pairwise surface registration. In ACM Transactions on Graphics (TOG), volume 27, page 85. ACM, 2008.
[7] Brett Allen, Brian Curless, and Zoran Popović. Articulated body deformation from range scan data. In TOG, volume 21, pages 612–619. ACM, 2002.
[8] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. Scape: Shape completion and animation of people. ACM Trans. Graph., 24(3):408–416, July 2005.
[9] Ilya Baran and Jovan Popović. Automatic rigging and animation of 3d characters. In ACM Transactions on Graphics (TOG), volume 26, page 72. ACM, 2007.
[10] Paul J Besl and Neil D McKay. Method for registration of 3-d shapes. In Robotics-DL tentative, pages 586–606. International Society for Optics and Photonics, 1992.
[11] Bastiaan R Bloem, Yvette AM Grimbergen, Monique Cramer, Mirjam Willemsen, and Aeilko H Zwinderman. Prospective assessment of falls in parkinson's disease. Journal of neurology, 248(11):950–958, 2001.
[12] Bastiaan R Bloem, Jeffrey M Hausdorff, Jasper E Visser, and Nir Giladi. Falls and freezing of gait in parkinson's disease: a review of two interconnected, episodic phenomena. Movement Disorders, 19(8):871–884, 2004.
[13] Federica Bogo, Michael J. Black, Matthew Loper, and Javier Romero. Detailed full-body reconstructions of moving people from monocular RGB-D sequences. pages 2300–2308, December 2015.
[14] Morten Bojsen-Hansen, Hao Li, and Chris Wojtan. Tracking surfaces with evolving topology. ACM Transactions on Graphics (SIGGRAPH 2012), 31(4):53:1–53:10, 2012.
[15] Will Chang and Matthias Zwicker. Automatic registration for articulated shapes. In Computer Graphics Forum, volume 27, pages 1459–1468, 2008.
[16] Chu-Song Chen, Yi-Ping Hung, and Jen-Bo Cheng. Ransac-based darces: A new approach to fast automatic registration of partially overlapping range images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 21(11):1229–1234, 1999.
[17] Yang Chen and Gérard Medioni. Object modeling by registration of multiple range images. Image and vision computing, 10(3):145–155, 1992.
[18] Yang Chen and Gérard Medioni. Object modeling by registration of multiple range images. In ICRA, pages 2724–2729. IEEE, 1991.
[19] Dmitry Chetverikov, Dmitry Stepanov, and Pavel Krsek. Robust euclidean alignment of 3d point sets: the trimmed iterative closest point algorithm. Image and Vision Computing, 23(3):299–309, 2005.
[20] Dmitry Chetverikov, Dmitry Svirko, Dmitry Stepanov, and Pavel Krsek. The trimmed iterative closest point algorithm. In Pattern Recognition, 2002. Proceedings. 16th International Conference on, volume 3, pages 545–548. IEEE, 2002.
[21] German KM Cheung, Simon Baker, and Takeo Kanade. Visual hull alignment and refinement across time: A 3d reconstruction algorithm combining shape-from-silhouette with stereo. In CVPR, volume 2, pages II–375, 2003.
[22] Chien-Wen Cho, Wen-Hung Chao, Sheng-Huang Lin, and You-Yin Chen. A vision-based analysis system for gait recognition in patients with parkinson's disease. Expert Systems with Applications, 36(3):7033–7039, 2009.
[23] Ming Chuang, Linjie Luo, Benedict J Brown, Szymon Rusinkiewicz, and Michael Kazhdan. Estimating the laplace-beltrami operator by restricting 3d functions. In Computer Graphics Forum, volume 28, pages 1475–1484. Wiley Online Library, 2009.
[24] Do Hyun Chung, Il Dong Yun, and Sang Uk Lee. Registration of multiple-range views using the reverse-calibration technique. Pattern Recognition, 31(4):457–464, 1998.
[25] Roberto Cipolla, Kalle E Astrom, and Peter J Giblin. Motion from the frontier of curved surfaces. In ICCV, pages 269–275, 1995.
[26] Roberto Cipolla and Andrew Blake. Surface shape from the deformation of apparent contours. IJCV, 9(2):83–112, 1992.
[27] Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-quality streamable free-viewpoint video. In ACM SIGGRAPH, volume 34, pages 69:1–69:13. ACM, July 2015.
[28] Edilson de Aguiar, Carsten Stoll, Christian Theobalt, Naveed Ahmed, Hans-Peter Seidel, and Sebastian Thrun. Performance capture from sparse multi-view video. In ACM SIGGRAPH, pages 98:1–98:10, New York, NY, USA, 2008. ACM.
[29] Paul Debevec. The Light Stages and Their Applications to Photoreal Digital Actors. In SIGGRAPH Asia, Singapore, November 2012.
[30] Mingsong Dou, J. Taylor, H. Fuchs, A. Fitzgibbon, and S. Izadi. 3d scanning deformable objects with a single rgbd sensor. In IEEE CVPR, pages 493–501, June 2015.
[31] SRLE Fahn, RL Elton, UPDRS Development Committee, et al. Unified parkinson's disease rating scale. Recent developments in Parkinson's disease, 2:153–163, 1987.
[32] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[33] J. Franco, M. Lapierre, and E. Boyer. Visual shapes of silhouette sets. In 3D Data Processing, Visualization, and Transmission, Third International Symposium on, pages 397–404, June 2006.
[34] Pedro J Garcia Ruiz and Vicenta Sanchez Bernardos. Evaluation of ActiTrac® (ambulatory activity monitor) in parkinson's disease. Journal of the Neurological Sciences, 270(1):67–69, 2008.
[35] Natasha Gelfand, Niloy J Mitra, Leonidas J Guibas, and Helmut Pottmann. Robust global registration. In Symposium on Geometry Processing, volume 2, page 5, 2005.
[36] Dian Gong and Gérard Medioni. Dynamic manifold warping for view invariant action recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 571–578. IEEE, 2011.
[37] Carlos Hernandez, Francis Schmitt, and Roberto Cipolla.
Silhouette coherence for camera calibration under circular motion. TPAMI, 29(2):343–349, 2007.
[38] Berthold KP Horn. Extended gaussian images. Proceedings of the IEEE, 72(12):1671–1686, 1984.
[39] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, and Andrew Fitzgibbon. Kinectfusion: Real-time 3d reconstruction and interaction using a moving depth camera. In UIST, pages 559–568, New York, NY, USA, 2011. ACM.
[40] J Jankovic. Parkinson's disease: clinical features and diagnosis. Journal of Neurology, Neurosurgery & Psychiatry, 79(4):368–376, 2008.
[41] Andrew E Johnson and Martial Hebert. Using spin images for efficient object recognition in cluttered 3d scenes. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 21(5):433–449, 1999.
[42] Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction. In Proceedings of the fourth Eurographics symposium on Geometry processing, 2006.
[43] James Kennedy. Particle swarm optimization. In Encyclopedia of Machine Learning, pages 760–766. Springer, 2010.
[44] Hao Li, Bart Adams, Leonidas J. Guibas, and Mark Pauly. Robust single-view geometry and motion reconstruction. In ACM SIGGRAPH Asia, SIGGRAPH Asia '09, pages 175:1–175:10, New York, NY, USA, 2009. ACM.
[45] Hao Li, Linjie Luo, Daniel Vlasic, Pieter Peers, Jovan Popović, Mark Pauly, and Szymon Rusinkiewicz. Temporally coherent completion of dynamic shapes. ACM TOG, 31(1):2:1–2:11, February 2012.
[46] Hao Li, Robert W. Sumner, and Mark Pauly. Global correspondence optimization for non-rigid registration of depth scans. Computer Graphics Forum (Proc. SGP'08), 27(5), July 2008.
[47] Hao Li, Robert W Sumner, and Mark Pauly. Global correspondence optimization for non-rigid registration of depth scans. In Computer Graphics Forum, volume 27, pages 1421–1430, 2008.
[48] Hao Li, Etienne Vouga, Anton Gudym, Linjie Luo, Jonathan T. Barron, and Gleb Gusev. 3d self-portraits. In ACM SIGGRAPH Asia, volume 32, pages 187:1–187:9. ACM, November 2013.
[49] Hao Li, Etienne Vouga, Anton Gudym, Linjie Luo, Jonathan T. Barron, and Gleb Gusev. 3d self-portraits. ACM Transactions on Graphics (Proceedings SIGGRAPH Asia 2013), 32(6), November 2013.
[50] Hao Li, Etienne Vouga, Anton Gudym, Linjie Luo, Jonathan T Barron, and Gleb Gusev. 3d self-portraits. ACM Transactions on Graphics (TOG), 32(6):187, 2013.
[51] Hao Li, Jihun Yu, Yuting Ye, and Chris Bregler. Realtime facial animation with on-the-fly correctives. In ACM SIGGRAPH, volume 32, pages 42:1–42:10. ACM, July 2013.
[52] Wanqing Li, Zhengyou Zhang, and Zicheng Liu. Action recognition based on a bag of 3d points. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 9–14. IEEE, 2010.
[53] Yebin Liu, Genzhi Ye, Yangang Wang, Qionghai Dai, and Christian Theobalt. Computer Vision and Machine Learning with RGB-D Sensors, chapter Human Performance Capture Using Multiple Handheld Kinects, pages 91–108. Springer International Publishing, Cham, 2014.
[54] Ameesh Makadia, Alexander Patterson, and Kostas Daniilidis. Fully automatic registration of 3d point clouds. In CVPR, 2006 IEEE Conference on, volume 1, pages 1297–1304. IEEE, 2006.
[55] Wojciech Matusik, Chris Buehler, Ramesh Raskar, Steven J. Gortler, and Leonard McMillan. Image-based visual hulls.
In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '00, pages 369–374, New York, NY, USA, 2000. ACM Press/Addison-Wesley Publishing Co.
[56] Nicolas Mellado, Dror Aiger, and Niloy J Mitra. Super 4pcs fast global pointcloud registration via smart indexing. In Computer Graphics Forum, volume 33, pages 205–215. Wiley Online Library, 2014.
[57] S. Moezzi, Li-Cheng Tai, and P. Gerard. Virtual view generation for 3d digital video. MultiMedia, IEEE, 4(1):18–26, Jan 1997.
[58] Philippos Mordohai and Gérard Medioni. Dimensionality estimation, manifold learning and function approximation using tensor voting. The Journal of Machine Learning Research, 11:411–450, 2010.
[59] Jorge J Moré. The levenberg-marquardt algorithm: implementation and theory. In Numerical analysis, pages 105–116. Springer, 1978.
[60] Meinard Müller and Tido Röder. Motion templates for automatic classification and retrieval of motion capture data. In Proceedings of the 2006 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 137–146. Eurographics Association, 2006.
[61] Richard A Newcombe, Andrew J Davison, Shahram Izadi, Pushmeet Kohli, Otmar Hilliges, Jamie Shotton, David Molyneaux, Steve Hodges, David Kim, and Andrew Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In ISMAR, pages 127–136. IEEE, 2011.
[62] Richard A. Newcombe, Dieter Fox, and Steven M. Seitz. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In IEEE CVPR, June 2015.
[63] Iasonas Oikonomidis, Nikolaos Kyriazis, and Antonis A Argyros. Tracking the articulated motion of two strongly interacting hands. In CVPR, 2012 IEEE Conference on, pages 1862–1869. IEEE, 2012.
[64] Yuri Pekelny and Craig Gotsman. Articulated object reconstruction and markerless motion capture from depth video. In Computer Graphics Forum, volume 27, pages 399–408, 2008.
[65] Chen Qian, Xiao Sun, Yichen Wei, Xiaoou Tang, and Jian Sun. Realtime and robust hand tracking from depth. In CVPR, 2014 IEEE Conference on, pages 1106–1113. IEEE, 2014.
[66] Cen Rao, Alexei Gritai, Mubarak Shah, and Tanveer Syeda-Mahmood. View-invariant alignment and matching of video sequences. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pages 939–945. IEEE, 2003.
[67] Szymon Rusinkiewicz and Marc Levoy. Efficient variants of the icp algorithm. In 3-D Digital Imaging and Modeling, 2001. Proceedings. Third International Conference on, pages 145–152. IEEE, 2001.
[68] Radu Bogdan Rusu, Nico Blodow, and Michael Beetz. Fast point feature histograms (fpfh) for 3d registration. In Robotics and Automation, 2009 IEEE International Conference on, pages 3212–3217. IEEE, 2009.
[69] Radu Bogdan Rusu, Nico Blodow, Zoltan Csaba Marton, and Michael Beetz. Aligning point cloud views using persistent feature histograms. In Intelligent Robots and Systems, 2008 IEEE/RSJ International Conference on, pages 3384–3391. IEEE, 2008.
[70] Radu Bogdan Rusu and Steve Cousins. 3d is here: Point cloud library (pcl). In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 1–4. IEEE, 2011.
[71] Arash Salarian, Heike Russmann, François JG Vingerhoets, Pierre R Burkhard, and Kamiar Aminian. Ambulatory monitoring of physical activities in patients with parkinson's disease. Biomedical Engineering, IEEE Transactions on, 54(12):2296–2299, 2007.
[72] Arash Salarian, Heike Russmann, Christian Wider, Pierre R Burkhard, François JG Vingerhoets, and Kamiar Aminian.
Quantification of tremor and bradykinesia in parkinson's disease using a novel ambulatory monitoring system. Biomedical Engineering, IEEE Transactions on, 54(2):313–322, 2007.
[73] Ari Shapiro, Andrew Feng, Ruizhe Wang, Hao Li, Mark Bolas, Gérard Medioni, and Evan Suma. Rapid avatar capture and simulation from commodity depth sensors. In Computer Animation and Social Agents (CASA). IEEE, 2014.
[74] Philip Shilane, Patrick Min, Michael Kazhdan, and Thomas Funkhouser. The princeton shape benchmark. In Shape modeling applications, 2004. Proceedings, pages 167–178. IEEE, 2004.
[75] Ken Shoemake. Uniform random rotations. In Graphics Gems III, pages 124–132. Academic Press Professional, Inc., 1992.
[76] Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, and Andrew Blake. Real-time human pose recognition in parts from single depth images. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1297–1304. IEEE, 2011.
[77] Luciano Silva, Olga Regina Pereira Bellon, and Kim L Boyer. Precision range image registration using a robust surface interpenetration measure and enhanced genetic algorithms. TPAMI, 27(5):762–776, 2005.
[78] Bernard W Silverman. Density estimation for statistics and data analysis, volume 26. Chapman & Hall/CRC, 1986.
[79] Jonathan Starck and Adrian Hilton. Surface capture for performance-based animation. IEEE Comput. Graph. Appl., 27(3):21–31, May 2007.
[80] Robert W Sumner, Johannes Schmid, and Mark Pauly. Embedded deformation for shape manipulation. TOG, 26(3):80, 2007.
[81] Jochen Süßmuth, Marco Winter, and Günther Greiner. Reconstructing animated meshes from time-varying point clouds. In SGP, SGP '08, pages 1469–1476, 2008.
[82] Art Tevs, Alexander Berner, Michael Wand, Ivo Ihrke, Martin Bokeloh, Jens Kerber, and Hans-Peter Seidel. Animation cartography—intrinsic reconstruction of shape and motion. ACM TOG, 31(2):12:1–12:15, April 2012.
[83] Jing Tong, Jin Zhou, Ligang Liu, Zhigeng Pan, and Hao Yan. Scanning 3d full human bodies using kinects. IEEE TVCG, 18(4):643–650, April 2012.
[84] Jing Tong, Jin Zhou, Ligang Liu, Zhigeng Pan, and Hao Yan. Scanning 3d full human bodies using kinects. Visualization and Computer Graphics, IEEE Transactions on, 18(4):643–650, 2012.
[85] Andrea Torsello, Emanuele Rodola, and Andrea Albarelli. Multiview registration via graph diffusion of dual quaternions. In CVPR, pages 2441–2448. IEEE, 2011.
[86] Emanuele Trucco, Andrea Fusiello, and Vito Roberto. Robust motion and correspondence of noisy 3-d point sets with missing data. Pattern Recognition Letters, 20(9):889–898, 1999.
[87] Bob van Hilten, Jorrit I Hoff, Huub AM Middelkoop, Edo A van der Velde, Gerard A Kerkhof, Albert Wauquier, Hilbert AC Kamphuisen, and Raymund AC Roos. Sleep disruption in parkinson's disease: assessment by continuous activity monitoring. Archives of Neurology, 51(9):922, 1994.
[88] Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan Popović. Articulated mesh animation from multi-view silhouettes. In ACM SIGGRAPH, SIGGRAPH '08, pages 97:1–97:9, New York, NY, USA, 2008. ACM.
[89] Daniel Vlasic, Pieter Peers, Ilya Baran, Paul Debevec, Jovan Popović, Szymon Rusinkiewicz, and Wojciech Matusik. Dynamic shape capture using multi-view photometric stereo. In ACM SIGGRAPH Asia, SIGGRAPH Asia '09, pages 174:1–174:11, 2009.
[90] Michael Wand, Bart Adams, Maksim Ovsjanikov, Alexander Berner, Martin Bokeloh, Philipp Jenke, Leonidas Guibas, Hans-Peter Seidel, and Andreas Schilling. Efficient reconstruction of nonrigid shape and motion from real-time 3d scanner data. ACM TOG, 28(2):15:1–15:15, May 2009.
[91] Michael Wand, Philipp Jenke, Qixing Huang, Martin Bokeloh, Leonidas Guibas, and Andreas Schilling. Reconstruction of deforming geometry from time-varying point clouds. In SGP, SGP '07, pages 49–58, 2007.
[92] Jiang Wang, Zicheng Liu, Ying Wu, and Junsong Yuan. Mining actionlet ensemble for action recognition with depth cameras. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1290–1297. IEEE, 2012.
[93] Ruizhe Wang, Jongmoo Choi, and Gérard Medioni. Accurate full body scanning from a single fixed 3d camera. In 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), pages 432–439. IEEE, 2012.
[94] Ruizhe Wang, Jongmoo Choi, and Gérard Medioni. 3d modeling from wide baseline range scans using contour coherence. In CVPR, 2014 IEEE Conference on, pages 4018–4025, 2014.
[95] Lingyu Wei, Qixing Huang, Duygu Ceylan, Etienne Vouga, and Hao Li. Dense human body correspondences using convolutional networks. In IEEE CVPR. IEEE, 2016.
[96] Aner Weiss, Sarvi Sharifi, Meir Plotnik, Jeroen PP van Vugt, Nir Giladi, and Jeffrey M Hausdorff. Toward automated, at-home assessment of mobility among patients with parkinson disease, using a body-worn accelerometer. Neurorehabilitation and Neural Repair, 25(9):810–818, 2011.
[97] Chenglei Wu, Carsten Stoll, Levi Valgaerts, and Christian Theobalt. On-set performance capture of multiple actors with a stereo camera. ACM Trans. Graph., 32(6):161:1–161:11, November 2013.
[98] Chenglei Wu, K. Varanasi, Yebin Liu, H. P. Seidel, and C. Theobalt. Shading-based dynamic shape refinement from multi-view video under general illumination. pages 1108–1115. IEEE, November 2011.
[99] Lu Xia, Chia-Chih Chen, and JK Aggarwal. View invariant human action recognition using histograms of 3d joints. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pages 20–27. IEEE, 2012.
[100] Jiaolong Yang, Hongdong Li, and Yunde Jia. Go-icp: Solving 3d registration efficiently and globally optimally. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 1457–1464. IEEE, 2013.
[101] G. Ye, Y. Deng, N. Hasler, X. Ji, Q. Dai, and C. Theobalt. Free-viewpoint video of human actors using multiple handheld kinects. IEEE Transactions on Cybernetics, 43(5):1370–1382, 2013.
[102] Ming Zeng, Jiaxiang Zheng, Xuan Cheng, and Xinguo Liu. Templateless quasi-rigid shape modeling with implicit loop-closure. In CVPR, pages 145–152, 2013.
[103] Zhengyou Zhang. Iterative point matching for registration of free-form curves and surfaces. International Journal of Computer Vision, 13(2):119–152, 1994.
[104] Feng Zhou and Fernando De la Torre. Canonical time warping for alignment of human behavior. Advances in Neural Information Processing Systems (NIPS), pages 1–9, 2009.
[105] Wiebren Zijlstra and At L Hof. Assessment of spatio-temporal gait parameters from trunk accelerations during human walking. Gait & Posture, 18(2):1–10, 2003.
[106] Timo Zinßer, Jochen Schmidt, and Heinrich Niemann. A refined icp algorithm for robust 3-d correspondence estimation. In Image Processing, 2003. Proceedings. International Conference on, volume 2, pages II–695. IEEE, 2003.
[107] Michael Zollhöfer, Matthias Nießner, Shahram Izadi, Christoph Rehmann, Christopher Zach, Matthew Fisher, Chenglei Wu, Andrew Fitzgibbon, Charles Loop, Christian Theobalt, and Marc Stamminger. Real-time non-rigid reconstruction using an rgb-d camera. In ACM SIGGRAPH, volume 33, pages 156:1–156:12, New York, NY, USA, July 2014. ACM.
Abstract
The rekindling of interest in Augmented Reality and Virtual Reality has created a need for digitizing objects with full geometry and texture, especially the human body and human performance. Commodity depth sensors (e.g., the Kinect One and the Occipital IO) have recently gained popularity in both the research community and industry, since they are cost-efficient and accessible while providing decent depth measurements at interactive frame rates. Registration, which aims to transform all range images into the same coordinate system, plays a key role in the 3D digitization process, as a small error in registration leads to a large degradation of the final reconstruction quality. Registration is well addressed by existing techniques when the two views exhibit significant overlap and a rough initial registration between them is given. While different methods have been proposed to relax these conditions, no satisfactory solution exists that consistently addresses the combined problem. Whereas all previous methods perform registration directly on the generated point clouds, we propose here a novel framework that exploits the visibility information of the underlying range images, which allows us to efficiently handle the problem of insufficient overlap. In this dissertation, we adopt the visibility information to solve the problems of rigid registration, non-rigid registration, and global registration, respectively. This enables a 3D modeling system for rigid as well as non-rigid objects from as few as four range images, whereas traditional 3D modeling methods normally require a few hundred range images. Experimental results on both synthetic and real data demonstrate the effectiveness and robustness of our registration methods, as well as of our human performance digitization systems.