Multi-scale Dynamic Capture for High Quality Digital Humans
by
Koki Nagano
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)
August 2017
Copyright 2017 Koki Nagano
Acknowledgements
This dissertation would not have been possible without the mentorship, support, and encouragement of many people. I would like to express my sincerest gratitude to those who helped make this possible.
First, I would like to thank my advisor Dr. Paul Debevec for mentoring and enlightening me in pursuing all of this exciting research. My time as a PhD student has been among the most fruitful and successful years of my career, and I cannot imagine the accomplishments I have made during my PhD without your support and guidance. Thank you for teaching me the joy of applying academic research to exciting visual effects, and for providing me with such opportunities. Thank you very much to my dissertation committee members, Dr. Hao Li, Dr. Jernej Barbič, Dr. Aiichiro Nakano, and Dr. Michelle Povinelli, for their insightful suggestions and all their efforts. I would like to thank Dr. Abhijeet Ghosh for sharing his expertise in digital human capture and for his kind support. I would also like to thank USC’s CS department PhD student adviser, Lizsl DeLeon, for keeping me on track. I would like to extend my thanks to my undergraduate supervisors and mentors, Dr. Ushio Saito, Dr. Masayuki Nakajima, Dr. Akihiko Shirai, and Dr. Takayuki Aoki, who encouraged me to study in the US.
The majority of the work in this thesis was built on the strong support and encouragement of the USC ICT Vision and Graphics Lab members. Thank you Hao Li for guiding me in many ways and for precious advice, Jay Busch for her cheerful support, Oleg Alexander for sharing his artistic insights, and Xueming Yu and Shanhe Wang for supporting me with the beautiful hardware. Thank you Andrew Jones for countless overnighters to make the awesome projector arrays work. None of the facial performance capture work would have been possible without significant contributions from Graham Fyffe. Thank you! Thank you Loc Huynh, Shunsuke Saito, Chloe LeGendre, Marcel Ramos, Adair Liu, Pratusha Prasad, and the other members. Thank you Ryosuke Ichikari for helping me both inside and outside the lab in starting my career abroad. I would like to thank Kathleen Haase, Valerie Dauphin, Christina Trejo, and Michael Trejo for their constant support in performing high quality research at the lab. Thank you Michael for helping me improve the thesis. I also had the pleasure of sharing exciting times at ICT working with Dr. Jonas Unger, Dr. Sumanta Pattanaik, Hsuan-Yueh Peng, Saghi Hajisharif, Jing Liu, Joey Barreto, and Megan Iafrati. Special thanks to Paul Graham and Borom Tunwattanapong for their senior advice in pursuing my PhD, and to Matt Chiang, Yu Takahashi, Saneyuki Ohno, Shunsuke Saito, Thomas Collins, Chi-An Chen, and all my PhD friends for sharing the ride during my journey. I would like to extend my special thanks to Nobuyuki Umetani for sharing his expertise in biomechanical simulation.
I was very fortunate to work with the excellent talents in the Digital Human League, who made all the amazing rendering work in the “Digital Emily 2.0” and “Skin Stretch” demos possible. Thank you Chris Nichols for his leadership, Jason Huang, Vladimir Koylazov, Rusko Ruskov, Mathieu Aerni, Danny Young, and Mike Seymour for sharing his immense knowledge in VFX. Thank you very much to Emily O’Brien for being an amazing subject, and to Todd Richmond for the awesome narration. I would also like to thank Javier von der Pahlen and Jorge Jimenez for inspiring technical discussions, and Etienne Danvoye and Chris Ellis for their awesome machine vision cameras and software.
I would like to thank Masuo Suzuki, André Mazzone, Alex Ma, Kurt Ma, JP Lewis, and Yeongho Seol for enlightening discussions drawing on their in-depth knowledge of VFX. A very special thanks to Weta Digital: Joe Letteri for sharing precious technical insights in VFX, Martin Hill, Alasdair Coull, and Antoine Bouthors for making my dream come true, Mariko Tosti, Kevin Whitfield, Daniel Lond, Pieterjan Bartels, Emilie Guy, Masaya Suzuki, Yoshihiro Harimoto, Rémi Fontan, and Eric Vezinet. The summer (winter there) at Weta was one of the best times during my PhD. Thank you very much to Oculus Research Pittsburgh: Takaaki Shiratori for his very kind support and many overnight discussions, Jason Saragih and Chenglei Wu for a lot of technical inspiration, Yaser Sheikh for giving me a great opportunity and allowing me to work on an exciting project, and Shoou-I Yu, Sahana Vijai, Mary Green, and all of ORP for an amazing time in Pittsburgh. It was one of the most intense times during my study.
I would like to give my sincere gratitude to Tetsuro Funai, Takashi Masuda, Akira Funai, Keiko Saito, and the Funai Foundation for Information Technologies for sponsoring my PhD and providing us with precious opportunities to interact with fellows at the foundation. Thank you to all the Funai fellows. I would like to thank Avneesh Sud for his mentorship and encouragement, and the Google PhD Fellowship for sponsoring my study.
Thank you very much Gary Vierheller for mentoring me and supporting my journey. You saved me many times. I would like to thank my family Shigeru, Miyuki, Seigo, Yasutaka, Setsu, Hisako, and my grandpa Shigeo for their enduring support and encouragement. My parents have always provided solid appreciation and full support as I pursued my interests and career.
Finally, I thank my wonderful wife Atsuko for her unwavering love, friendship, patience, and enthusiasm. Understanding me best as a scientist and researcher herself, she has supported and pushed me to realize my dreams. I couldn’t have made it this far without her – thank you.
This work was sponsored by the University of Southern California Office of the Provost, U.S. Air Force DURIP, U.S. Army Research, Development, and Engineering Command (RDECOM), the Office of Naval Research, U.S. Navy, the Office of the Director of National Intelligence, and the Intelligence Advanced Research Projects Activity. The content of this thesis does not necessarily reflect the position or the policy of the US Government, and no official endorsement should be inferred.
Table of Contents

Abstract
Acknowledgements
I Introduction
    I.1 Contributions
II Background and Related Work
    II.1 High Fidelity Face Capture
    II.2 Types of Priors for Performance Capture
    II.3 Capturing and Modeling of Dynamic Skin Appearance
III Multi-view Dynamic Facial Capture and Correspondence
    III.1 Introduction
    III.2 Shared Template Mesh
    III.3 Method Overview
        III.3.1 Landmark-Based Initialization
        III.3.2 Coarse-Scale Template Warping
        III.3.3 Pose Estimation, Denoising, and Template Personalization
        III.3.4 Fine-Scale Template Warping
        III.3.5 Final Pose Estimation and Denoising
        III.3.6 Detail Enhancement
    III.4 Appearance-Driven Mesh Deformation
        III.4.1 Image Warping
        III.4.2 Optical Flow Based Update
        III.4.3 Dense Mesh Representation
        III.4.4 Laplacian Regularization
        III.4.5 Updating Eyeballs and Eye Socket Interiors
    III.5 PCA-Based Pose Estimation and Denoising
        III.5.1 Rotation Alignment
        III.5.2 Translation Alignment
        III.5.3 Dimension Reduction
    III.6 Results
    III.7 Limitations
    III.8 Discussion
    III.9 Applications
IV Estimating 3D Motion from Multi-view Human Performance
    IV.1 Introduction
    IV.2 System Setup and Preprocessing
    IV.3 Estimating 3D Scene Flow
        IV.3.1 Robust Descriptor
        IV.3.2 Outlier Pruning
        IV.3.3 Data-driven Multi-scale Propagation
    IV.4 Results
    IV.5 Discussions and Conclusion
V Synthesizing Dynamic Skin Details for Facial Animation
    V.1 Introduction
    V.2 Basic Approach
    V.3 Measurement
    V.4 Microstructure Analysis and Synthesis
    V.5 Results
    V.6 Discussion
    V.7 Future Work
    V.8 Conclusion
    V.9 Integrating Skin Microstructure Deformation for Animated Character
    V.10 Appendix: Driving Skin BRDF Roughness from Deformation
VI Conclusion and Future Work
    VI.1 Conclusion
    VI.2 Future Work
VII Appendix: Automultiscopic Projector Arrays for Interactive Digital Humans
    VII.1 Introduction
    VII.2 Life-size Facial Display
        VII.2.1 Calibration
        VII.2.2 Viewer Interpolation
        VII.2.3 Convex Screen Projection
        VII.2.4 Results
    VII.3 Life-size Full Body Display
        VII.3.1 Display Hardware
        VII.3.2 Content Capture and Light Field Rendering
        VII.3.3 Result
BIBLIOGRAPHY
List of Figures

I.1 Uncanny Valley [103].
I.2 Comparison rendering of realistic and stylized characters from [162]. (a) Rendering of a male character with realistic material and shape. (b) It shares the same shape as (a), but the rendering is stylized. (c) The rendering of the material is as realistic as (a), but the shape is stylized by exaggerating the proportion of facial features (e.g. large eyes).
I.3 Comparison of a CG rendering (left) and a reference photograph (right)
I.4 An example of camera arrays for full body performance consisting of 50 HD video cameras [79].
I.5 First row: puffing the cheek exerts surface tensions on the cheek and changes the skin microstructure, resulting in much smoother surface reflection on the right than the neutral on the left. Second row: compressing the crow's feet region exhibits very evident anisotropic skin textures on the right, which produces significantly different appearance to the neutral on the left.
I.6 (Left) high quality mesh deformed using our technique; (Middle) a reconstructed smile expression with a consistent parameterization, and (Right) high resolution displacement details. Half-face overlay visualization indicates good agreement and topology flow around geometric details within the facial expression.
I.7 Automatically corresponded static and dynamic frames from a single template.
I.8 Our dynamic surface reconstruction takes (a) multi-view images (left) and reconstructed mesh (right) as input, and computes (b) scene flow for each vertex. Color code in the mesh (a) represents surface normal orientations, and that in (b) represents scene flow direction overlaid on input images.
I.9 Strain field computed from corresponded facial expressions.
I.10 (Top) Two real-time renderings from a blendshape animation. (Left) Neutral face rendered with initial microstructure. (Right) Wince expression rendered with deformed microstructure. Here, the local strain drives the microstructure convolution. (Bottom) Corresponding expressions on the top row show specular-only frames from real-time facial animation, showing the gradual development of horizontal microstructure near the eye as the face winces. The bottom frame includes an inset of a real photo taken under similar lighting and expression, showing similarly deformed anisotropic skin textures.
II.1 Facial features at different scales.
II.2 Comparison of our result (a) to PMVS2 [45] (b). Overlaying the meshes in (c) indicates a good geometric match.
II.3 A normal distribution changing under deformation
II.4 Scales of skin microstructures. (Left) A forehead region marked roughly corresponds to a 2 by 1 centimeter rectangle (middle) seen in a machine vision camera. The red square in the middle figure corresponds to a forehead region of 0.5 to 1.0 mm, which may contain tens of thousands of points on one surface layer as shown here on the right.
II.5 Neutral skin microstructure under polarized spherical gradient illumination. Gradients with (first row) a parallel condition, (second row) a cross polarized condition, (third row) the difference of the first two showing the surface reflection on the skin patch, and (fourth row) output specular albedo, diffuse albedo, specular normal, and displacement from left to right in this order.
III.1 Our pipeline proceeds in six phases, illustrated as numbered circles. 1) A common template is fitted to multi-view imagery of a subject using landmark-based fitting (Section III.3.1). 2) The mesh is refined for every frame using optical flow for coarse-scale consistency and stereo (Section III.3.2). 3) The meshes of all frames are aligned and denoised using a PCA scheme (Section III.3.3). 4) A personalized template is extracted and employed to refine the meshes for fine-scale consistency (Section III.3.4). 5) Final pose estimation and denoising reduces “sizzling” (Section III.3.5). 6) Details are estimated from the imagery (Section III.3.6).
III.2 Production-quality mesh template and the cross-section of the volumetric template constructed from the surface.
III.3 (a) Facial landmarks detected on the source subject and (b) the corresponding template; (c) The template deformed based on detected landmarks on the template and subject photographs; (d) Detailed template fitting based on optical flow between the template and subject, and between views.
III.4 Facial expressions reconstructed without temporal flow.
III.5 (a) Dense base mesh; (b) Proposed detail enhancement; (c) “Dark is deep” detail enhancement.
III.6 Optical flow between photographs of different subjects (a) and (e) performs poorly, producing the warped images (b) and (f). Using 3D mesh estimates (e.g. a template deformed based on facial landmarks), we compute a smooth vector field to produce the warped images (c) and (g). Optical flow between the original images (a, e) and the warped images (c, g) produces the relatively successful final warped images (d) and (h).
III.7 (a) A stereo cue. An estimated point x is projected to the 2D point p_k in view k. A flow field F_k^l transfers the 2D point to p_l in a second view l. The point x is updated by triangulating the rays through p_k and p_l. (b) A reference cue. An estimated point y is projected to the 2D point q_j in view j. A flow field G_j^k transfers the 2D point to p_k in view k of a different subject or different time. A second flow field F_k^l transfers the 2D point to p_l in view l, and then point x is estimated by triangulating the rays through p_k and p_l.
III.8 Laplacian regularization results. Left: surface regularization only. Right: surface and volumetric regularization.
III.9 Comparison of rigid mesh alignment techniques on a sequence with significant head motion and extreme expressions. Top: center view. Middle: Procrustes alignment. Bottom: our proposed method. The green horizontal strike-through lines indicate the vertical position of the globally consistent eyeball pivots.
III.10 12 views of one frame of a performance sequence. Using flat blue lighting provides sharp imagery.
III.11 Zoomed renderings of different facial regions of three subjects. Top: results after coarse-scale fitting using the shared template (Section III.3.2). The landmarks incorrectly located the eyes or mouth in some frames. Middle: results after pose estimation and denoising (III.3.3). Bottom: results after fine-scale consistent mesh propagation (III.3.4) showing the recovery of correct shapes.
III.12 Comparison using a morphable model of [134] as an initial template. (a) Front face region captured by the previous technique, (b) stitched on our full head topology. (c) Resulting geometry from III.3.2 deformed using our method with (b) as a template, compared to (d) the result of using the Digital Emily template. The linear morphable model misses details in the nasolabial fold.
III.13 (a) Synthetic rendering of the morphable model from Figure III.12(b). (b) Result using our image warping method to warp (a) to match real photograph (e). Similarly the common template image (c) is warped to match (e), producing plausible coarse-scale facial feature matching in (d).
III.14 Our method automatically reconstructs dynamic facial models from multi-view stereo with consistent parameterization. (a) and (b) Facial reconstruction with artist-quality mesh topology overlaid on the input video. (c) Reconstructed face model with a displacement map estimated from details in the images. (d) Close-up of fine details, such as pores and dynamic wrinkles from (c).
III.15 Dynamic face reconstruction from a multi-view dataset of a male subject shown from one of the calibrated cameras (top). Wireframe rendering (second) and per frame texture rendering (third) from the same camera. Enhanced details captured with our technique (bottom) show high quality agreement with the fine-scale details in the photograph.
III.16 Reconstructed mesh (a), enhanced displacement details with our technique (b), and comparison to previous work (c). Our method automatically captures whole head topology including nostrils, back of the head, mouth interior, and eyes, as well as skin details.
III.17 Since our method reconstructs the face on a common head topology with coarse-scale feature consistency across subjects, blending between different facial performances is easy. Here we transition between facial performances from three different subjects.
III.18 Our system struggles to reconstruct features that are not represented by the template. For example, visible facial hair or tongue (a) may cause misplacement of the landmarks employed in III.3.2 (b), which the denoising in III.3.3 may not be able to recover (c), and remain as artifacts after fine-scale warping in III.3.4 (d).
III.19 From left to right: unconstrained input image, inferred complete albedo, rendering with a commercially available rendering package, and its close up showing the medium frequency skin pigmentations captured with the technique.
IV.1 An example of a 40-camera array setup; each of four rows includes 10 cameras almost uniformly distributed in azimuth angle, covering ear to ear in the case of human face capture. Four lower cameras are zoomed in to improve the reconstruction around the mouth cavity.
IV.2 Robust descriptors. Top and middle: input image and our 8-dimensional robust descriptors. The arrows indicate directional vectors in D. Bottom: comparisons between our robust descriptor (one of the eight directions is shown) and ‖∇I_x‖ at the middle and coarsest levels. Note that ‖∇I_x‖ is used for visualization purposes, while ∇I is used in typical approaches.
IV.3 An illustration of the rigidity weight. The rigidity weight penalizes the large depth discrepancy due to non-rigid deformation within the patch. We derive the rigidity weight (c) from the depth similarities between the source (a) and the target (b) patches as in Equation (IV.6). Here, a higher rigidity weight is assigned to the upper lip area while it is lower in the lower lip.
IV.4 An illustration of the support weight. The support weight prefers the pixel neighbor which has a similar feature to the center pixel in order to disambiguate pixel neighbors corresponding to a semantically/geometrically different point than the center (e.g. the patch in the left image contains points corresponding to the nose and the eyes, which may deform differently). We employ the depth of the patch pixel as the similarity measure, and derive the support weight shown in the lower right using Equation (IV.7).
IV.5 Result of tracking Subject 2's mouth corner pulling. Top: tracked point clouds, and bottom: point trajectories overlaid on images. Result of tracking Subject 1's cheek puffing. Top: tracked point clouds, and bottom: point trajectories overlaid on images.
IV.6 Result of tracking skin deformation of an arm. Top: tracked point clouds. Bottom: point trajectories overlaid on images.
IV.7 Result of tracking Subject 2's eye blinking. (a) tracked point clouds, and (b) point trajectories overlaid on images.
IV.8 Dynamic surface mesh of clothing (top) and an arm example (bottom) obtained with our method. This mesh based visualization shows the 3D template mesh obtained with Section IV.4 (first column), and the mesh sequence obtained by applying our 3D scene flow in Section IV.3 on the template (second column and later). The vertices that are no longer tracked (i.e. due to occlusion) are not shown here. Though our method estimates the 3D scene flow independently on each vertex, it can produce a clean 3D surface for a large portion of the mesh.
IV.9 (a) An input frame from a facial performance sequence, the tracking result (b) without and (c) with robust descriptors (ours), and (d) close views around the nose. Zoomed-in views in (d) show that the tracking suffers from artifacts due to the shading in the nasal fold if naive RGB intensity is used (top) while ours does not (bottom).
IV.10 (a) Result obtained using only robust descriptors in Section IV.3.1, (b) (a) with data-driven multi-scale propagation in Section IV.3.3, and (c) (b) with all weights in Section IV.3.2 added (ours).
IV.11 Alignment error comparison with a global optimization with Laplacian regularization. The graph shows the evolution of the error in temporal tracking, with the horizontal axis being the frame count and the vertical axis being the accumulated error.
V.1 Three real forehead expressions (surprised, neutral, and perplexed) made by the same subject showing anisotropic deformations in microstructure.
V.2 Tension on the balloon surface changes its surface appearance.
V.3 Stretching and compressing a measured OCT skin profile, with and without convolution filters to maintain surface length.
V.4 Deforming sphere with dynamic microgeometry. (Top) The microstructure becomes rougher through displacement map sharpening when shrunk, and smoother through blurring when expanded. The insets show details of the specular highlights. (b) Anisotropic compression and stretching yields anisotropic microstructure.
V.5 Microstructure acquisition in a polarized LED sphere with macro camera and articulated skin deformer.
V.6 Texture-aligned surface normal (top) and displacement (bottom) maps of a skin patch under vertical compression and stretching. (a) full compression, (b) medium compression, (c) neutral, (d) medium stretching, and (e) full stretching.
V.7 Dynamic microgeometry rendering of a forehead skin patch under a point light under stretching (left) and compression (right). The diffuse albedo is artistically colorized for visualization purposes.
V.8 Each column shows measured 8 mm wide facial skin patches under different amounts of stretching and compression, with a histogram of the corresponding surface normal distributions shown to the right of each sample.
V.9 Surface normal distributions plotted against the amount of strain for several skin patches.
V.10 Displacement map pixels are sampled along the principal directions of strain with a separable filter for convolution. D_s is computed from D, then D′ is computed from D_s.
V.11 Fitted kernel parameters for a patch of forehead skin undergoing a range of stretching and compression. r is the stretch ratio, with r > 1 stretching and r < 1 compressing.
V.12 Fitted kernel parameters plotted against the stretch ratio r. Dots represent parameters fitted to sample patches, and lines represent the piecewise linear fits in (V.8), (V.9).
V.13 Strain field visualization for a smile expression (top row) and a sad expression (bottom row) with the first stress eigenvalue (a), (e), and the second eigenvalue (b), (f), and strain direction visualization (c), (d), (g), and (h).
V.14 A sampled skin patch is deformed with FEM which drives microstructure convolution, rendered with path tracing.
V.15 A rendered facial expression with (a) mesostructure only, (b) static microstructure from a neutral expression, and (c) dynamic microstructure from convolving the neutral microstructure according to local surface strain, compared to a reference photograph of a similar expression. The insets show detail from the lower-left area.
V.16 A rendered facial expression with (a) mesostructure only, (b) static microstructure from a neutral expression, and (c) dynamic microstructure from convolving the neutral microstructure according to local surface strain, compared to (d) a reference photograph. The insets show detail from the upper-left area.
V.17 Real-time rendering of the cheek region from a facial performance animation with enhanced dynamic surface details (right) compared to the static microstructure rendering (left). The dynamic microstructure provides an additional indication of the deformation on the cheek when the subject makes a smile expression.
V.18 Specular-only real-time renderings of a nose from a blendshape animation, showing anisotropic dynamic microstructure at different orientations in the expression (right) compared to the neutral (left). The dynamic microstructure breaks up the smooth specular highlight on the bridge of the nose.
V.19 Real-time comparison renderings from a blendshape animation (middle) compared to the reference photographs of a similar expression (right). Specular-only real-time renderings show anisotropic dynamic microstructure at different orientations in the expressions (left). Top two rows: young male subject's crow's feet region with the eyes shut tightly, and the stretched cheek region when the mouth is pulled left. Bottom two rows: young female subject's crow's feet region, and the nose under the squint expression.
V.20 Real-time rendering of an older subject's mouth with a smile expression (a), the forehead raised down (b), and the stretched cheek and mouth regions (c).
V.21 Recorded 3D geometry and reflectance maps.
V.22 Shader network for displacement.
V.23 Shader network for reflectance.
V.24 Visualization of anisotropic stress field (bottom) and corresponding surface reflection showing spatially varying stretching and compression effects on the face.
V.25 Demonstration of a fully integrated digital character with realistic eyes and hair, exhibiting the dynamic change in both facial textures and reflections. The face was rendered under an HDR sunset environment.
V.26 Blendshape animation of the forehead raising up (top) and down (bottom) the eyebrows.
V.27 Blendshape animation of a cheek region showing a neutral state (top) and a puffed cheek (bottom).
V.28 Simulating dynamic skin microstructure with captured facial performance (top) increases the visual realism in the dynamic surface reflection (bottom). A stronger sense of surface tension on the cheek provides a more sincere smile expression. The specular reflection image on the bottom is brightened for visualization purposes.
V.29 (a) Major strain magnitude, (b) surface area change, and (c) strain anisotropy.
V.30 Rendered surface reflectance of a squashing sphere with dynamic (a) and static (b) microfacet distributions.
V.31 Rendered surface reflectance of the crow's feet on a smile expression with dynamic (a) and static (b) microfacet distributions.
VII.1 3D stereo photographs of a human face on the autostereoscopic projector array
VII.2 The anisotropic screen forms a series of vertical lines, each corresponding to a projector lens
VII.3 Photograph of projector array calibration setup
VII.4 Diagrams explaining per-vertex interpolation of multiple viewer positions
VII.5 Warped MCOP frames sent to the projectors for flat and convex screens
VII.6 Diagrams showing resolution tradeoff and iterative projector refinement for convex mirror screens
VII.7 Comparison of different viewer interpolation functions. We show three objects photographed by an untracked center camera and two tracked left and right cameras.
VII.8 Comparison of different viewer interpolation functions for a convex mirror
VII.9 Facial features at different scales.
VII.10 (Left) Photograph showing the 6 computers, 72 video splitters, and 216 video projectors used to display the subject. (Right) The anisotropic screen scatters light from each projector into a vertical stripe. The individual stripes can be seen if we reduce the angular density of projectors. Each vertical stripe contains pixels from a different projector.
VII.11 Stereo photograph of subjects on the display, left-right reversed for cross-fused stereo viewing. Each subject is shown from three positions.
Abstract
Digitally creating a virtual human indistinguishable from a real human has been one of the central
goals of Computer Graphics, Human-Computer Interaction, and Artificial Intelligence. Such dig-
ital characters are not only the primary creative vessel for immersive storytellers and filmmakers,
but also a key technology to understand the process of how humans think, see, and communicate
in the social environment. In order for digital character creation techniques to be valuable in sim-
ulating and understanding humans, the hardest challenge is for them to appear believably realistic
from any point of view in any environment, and to behave and interact in a convincing manner.
Creating a photorealistic rendering of a digital avatar is increasingly accessible due to rapid
advancement in sensing technologies and rendering techniques. However, generating realistic
movement and dynamic details that are compatible with such a photorealistic appearance still
relies on manual work from experts, which hinders the potential impact of digital avatar tech-
nologies in real world applications. Generating dynamic details is especially important for facial
animation, as humans are extremely tuned to sense people’s intentions from facial expressions.
This dissertation proposes systems and approaches for capturing the appearance and motion to re-
produce high fidelity digital avatars that are rich in subtle motion and appearance details. We aim
for a framework which can generate consistent dynamic detail and motion at the resolution of skin
pores and fine wrinkles, and can provide extremely high resolution microstructure deformation
for use in cinematic storytelling or immersive virtual reality environments.
This thesis presents three principal techniques for achieving multi-scale dynamic capture for
digital humans. The first is a multi-view capture system and a stereo reconstruction technique
which directly produces a complete high fidelity head model with consistent facial mesh topol-
ogy. Our method jointly solves for stereo constraints and consistent mesh parameterization from
static scans or a dynamic performance, producing dense correspondences on an artist-quality tem-
plate. Additionally, we propose a technique to add dynamic per-frame high and middle frequency
details from the flat-lit performance video. Second, we propose a technique to estimate high fi-
delity 3D scene flow from multi-view video. The motion estimation fully respects high quality
data from multi-view input, and can be incorporated into any facial performance capture pipeline
to improve the fidelity of the facial motion. Since the motion can be estimated without relying
on any domain-specific priors or regularization, our method scales well to modern systems with
many high resolution cameras. Third, we present a technique to synthesize dynamic skin mi-
crostructure details to produce convincing facial animation. We measure and quantify how skin
microstructure deformation contributes to dynamic skin appearance, and present an efficient way
to simulate dynamic skin microstructure. When combined with state-of-the-art performance
capture and face scanning techniques, it can significantly improve the realism of animated faces
for virtual reality, video games, and visual effects.
Chapter I
Introduction
Creating a believable photorealistic digital human has been a long standing goal in Computer
Graphics. As there are increasing demands for high quality digital avatars in digital storytelling
and entertainment, it is increasingly important to investigate methods to create the digital avatars
in a repeatable manner. In addition, the recent surge in Virtual Reality/Augmented Reality neces-
sitates automated ways to create high quality digital avatars to provide more digital content. As
digital avatars become more prevalent in daily life, these techniques have a growing impact on
applications ranging from virtual assistants to fashion to health care.
A major concern in creating a photorealistic digital character is avoiding the phenomenon called the
“Uncanny Valley” (Figure I.1). Masahiro Mori originally observed in the robotics field in 1970 [108]
that as an object starts to appear more humanlike, observers react and respond to the object in an
empathic and positive manner. When the object appears close to a human but fails to provide a
humanlike appearance, it creates a negative “creepy” feeling in human observers, producing the sharp
drop-off of familiarity seen in Figure I.1. The concept was later applied to Computer Generated (CG)
characters, and is considered one of the most important metrics for measuring the success of digital
characters in digital filmmaking and games.
Figure I.1: Uncanny Valley [103].
In order to avoid falling into the valley, Mori suggested that the robot designers should aim for
the first peak of the plot instead of the second by deliberately designing a character in a non-
humanlike way [108]. In CG character design, a common practice to escape from the valley is
stylization [162], that is, stylizing the character design, as opposed to photorealism, which is to repro-
duce the object as realistically as possible such that it is indistinguishable from its photograph.
While photorealism requires all the aspects in animation, rendering, and geometry to be photo-
realistic, successful stylization allows some or all of the components to be non-humanlike, from
which we can easily tell the character is virtual. Examples of stylization include non-photorealistic
rendering (Figure I.2 (b)), exaggerated facial features (Figure I.2 (c)), and cartoonish motion. Figure I.2
compares a realistic rendering of a male character (a) with partially stylized versions (b) and (c), as they
appeared in [162]. Stylization helps us stop focusing on what is missing from the real human, and thus
helps avoid falling into the valley, although it is not an option when making photorealistic digital
characters.
Figure I.2: Comparison rendering of realistic and stylized characters from [162]. (a) Rendering
of a male character with realistic material and shape. (b) It shares the same shape as (a), but
the rendering is stylized. (c) The rendering of the material is as realistic as (a), but the shape is
stylized by exaggerating the proportion of facial features (e.g. large eyes).
If the object appears more humanlike, it elicits increased expectations in the viewer, and if it fails
to meet the expectations of being a “human”, we are alerted to a strangeness or creepiness [136].
Thus, photorealism is an inherently challenging and costly process in that a character designed with
it needs to meet all the expectations that make the character believable. In order to identify the major
challenges, it is important to note the quality required of photorealistic digital humans. Some major
expectations for a believable digital character include:
- It needs to appear realistic from any 3D point of view. This requires accurate 3D geometry that models reflections and occlusions properly from large to fine scales. The presented 3D object should also exhibit view-dependent shapes that appear natural and consistent from any perspective.
- It needs to appear realistic under any lighting and physical environment. The object needs to exhibit realistic surface and subsurface light reflection under any illumination, including global illumination effects. Also, the motion and interaction with the physical world need to faithfully follow physical laws.
- It needs to behave and interact in a believable manner. The interaction and response need to be congruent with verbal and non-verbal stimuli.
Figure I.3: Comparison of a CG rendering (left) and a reference photograph (right)
In particular, if the perceived realism of the character does not match its behavior, it exaggerates
the uncanny feeling in the observer [136]. Thus, it is important to maintain consistent levels of
behavioral fidelity with visual realism [143]. Since the visual realism in character appearance has
been substantially improved (Figure I.3) thanks to recent advancements in capture hardware and
3D scanning algorithms, it is important to be able to generate visually satisfying character motion
that matches the level of the character appearance. To achieve this goal, this thesis focuses on
techniques to capture and synthesize dynamic appearance for realistic digital avatars. The goal
of this dissertation is to develop a principled (and ideally automatic) technique for synthesizing
dynamic details. In this thesis, we present three different techniques for high quality dynamic
digital characters which exhibit consistent animated details at multiple scales.
Our first work focuses on multi-view dynamic facial capture. Currently high-end digital avatar
creation relies on high resolution scanning, which may employ active photometric scans [4, 6].
The state-of-the-art photometric scanning technique [55] provides skin details down to the submillimeter
level, including layers of reflectance information, which allows photorealistic rendering of faces. One
way to obtain dynamic per-frame details and reflectance is to perform an active photometric stereo
process at high frame rate, as done in previous work [102, 153, 48, 58]. However, this requires expensive
high speed cameras and lighting to capture multiple lighting conditions every frame. In addition,
there are inherent problems regarding the camera noise and light levels, which limit the qual-
ity of the data. Even if per-frame photometric scanning is successfully employed, it still lacks
the correspondence of points across the frames, from which the full motion of the point can be
recovered. The per-frame dynamic reflectance and details still need to be inferred from some-
where to generate a photorealistic rendering. The current high-end production pipeline generally
tackles this problem by decoupling the avatar creation into the model building (including high
resolution scanning and the rig creation) and temporally consistent performance tracking. To
employ the high resolution model in the tracking, the dense correspondence needs to be solved
between the scans, and the frames in the performance. While there are automated solutions to
correspond high resolution shapes to performance frames [14, 87, 49], currently there is no end-
to-end solution to solve for the high resolution shapes and correspondence simultaneously. More
importantly, these existing techniques rely on expensive temporal tracking, which can only be
performed sequentially. This can be prohibitive if an artist wants to edit the model and iterate
the tracking. In solving this problem, we present a multi-view dynamic reconstruction technique
to produce dense and consistent parameterization from either static scans or performance frames.
Our process starts with acquiring dynamic facial details in a static LED dome using multiple high
resolution cameras. Unlike previous work, our technique simultaneously solves for shapes and
dense correspondence from a single template. Thus, our method does not require any sequential
tracking, and does not suffer from drifting. Since there is no sequential tracking employed, our
method is parallelizable and does not distinguish between static scans and dynamic performance frames;
they can all be processed together on a consistent parameterization. The resulting model in-
cludes high quality edge flows, mouth interiors, eyeballs, and dynamic mesoscopic details, all of
which are suitable for high quality facial animation.
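To make this structure concrete, the following is a minimal, illustrative sketch (not the actual implementation) of the frame-parallel organization of the pipeline. Here fit_template_to_frame is a hypothetical stand-in for the per-frame landmark, optical-flow, and Laplacian solve described in Chapter III; in this sketch it simply copies the template vertices so the example runs end to end.

```python
# A minimal sketch (not the author's implementation) of the frame-parallel
# structure: because correspondence is solved against a single shared
# template and never between frames, every scan or performance frame can
# be processed independently.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def fit_template_to_frame(template_vertices, frame_images):
    """Hypothetical per-frame solve standing in for the landmark fitting,
    optical-flow warping, and Laplacian regularization of Chapter III.
    Here it simply returns a copy of the template vertices so the sketch
    runs end to end; the real solve returns deformed vertices in the same
    fixed topology and UV layout."""
    return np.array(template_vertices, dtype=np.float64, copy=True)

def reconstruct_performance(template_vertices, frames, workers=4):
    # Static scans and dynamic frames are treated identically; the output
    # is one vertex array per frame, all sharing the template's topology.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fit_template_to_frame,
                             [template_vertices] * len(frames), frames))
```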
Our second work addresses a technique for estimating dense 3D human motion. A recent trend in
markerless human capture in high-end production continues to employ more and more cameras to
meet the increasing demands in high quality digital avatars for AR/VR applications, which allow
intimate interactions with the users. Such camera array systems for 3D scanning are becoming more
popular as cameras become cheaper and stable implementations of 3D reconstruction software become
widely available (e.g. [3, 117]). Figure I.4 shows a camera array consisting of
50 HD video cameras used in our body performance capture project [79]. Some commercial sys-
tems consist of more than 100 DSLR cameras for high resolution scanning (e.g. [99, 71]). These
camera arrays provide billions of pixel counts as a measurement, from which a high resolution
3D shape can be derived. As the shape estimation already employs this much resolution, it is
natural to expect that the future dynamic performance capture needs to solve for motion at the
same high resolution. Current high fidelity performance capture systems rely on regularization
(e.g. Laplacian priors) and target specific models, such as statistical priors (collectively we refer
Figure I.4: An example of camera arrays for full body performance consisting of 50 HD video
cameras [79].
to them as constraints) to account for occlusion and outliers in data. High quality priors are the
key for successful performance capture, and building them needs a truthful measurement (ideally
one that does not have any contamination or bias). However, currently there is no technique which
can provide motion as a measurement since existing methods rely on priors, which have a nor-
malization effect, and tend to destroy characteristics in the data. Furthermore, the regularization
is generally related to a global optimization that involves solving possibly millions of unknowns,
and is a slow process. Such a process may not scale well if it needs to deal with a massive number
of pixels. In the second work, we propose a constraint-free scene flow technique for a massively
multi-view and high resolution camera array. Our optimization framework fully respects high
quality data from multi-view input, and does not rely on regularization or smoothness constraints.
Since our method does not involve an expensive global optimization, it is fully parallelizable and
scales efficiently for today’s massively high resolution camera arrays. If desired, our method
can be integrated to existing multi-view performance capture systems to improve the fidelity of
estimated motions.
Our third work presents a technique to significantly increase the resolution of dynamic facial
scans. The above-mentioned modern facial scanning provides submillimeter precision, record-
ing facial mesostructures, such as pores and fine creases. However human skin details continue
to be present to the level of microstructures, which are at the scale of microns. Graham et al.
[60] showed a synthesis-based approach to increase the resolution of the facial scans to the level
of skin microstructures. They demonstrated that the skin microstructure significantly improves
the realism of the static facial appearance, as it provides sharp high frequency reflections and
spatially-varying surface roughness. One question that naturally arises is what might happen to
the skin microstructure as a face makes different expressions. Figure I.5 shows close-up pho-
tographs of a young subject’s cheek (first row) and crow's feet (second row) regions. As shown
in the real photographs, the skin microstructure is remarkably dynamic, and stretching and com-
pression on it significantly changes the skin surface appearance. We wish to model these dynamic
effects of skin microstructure; however, representing the skin details at the level of microstructure
requires billions of elements just on the surface of the face. Any traditional simulation approach
would be prohibitively expensive due to the sheer number of elements. To solve this problem, we
propose a new simulation approach to synthesize the dynamic skin microstructure for animated
facial rendering. Our method efficiently simulates skin microstructures stretching and compres-
sion, and can significantly improve the realism in both real-time and offline facial animations.
When combined with the facial performance capture, and high resolution facial scanning, our
method can generate compelling dynamic facial rendering that holds up in extreme close-up.
Figure I.5: First row: puffing the cheek exerts surface tensions on the cheek and changes the
skin microstructure, resulting in much smoother surface reflection on the right than the neutral on
the left. Second row: compressing the crow's feet region exhibits very evident anisotropic skin
textures on the right, which produces significantly different appearance to the neutral on the left.
I.1 Contributions
High Resolution Correspondence from a Single Consistent Topology We present a multi-
view stereo reconstruction technique that directly produces a complete high fidelity head model
with consistent facial mesh topology. While existing techniques decouple shape estimation and
facial tracking, our framework jointly optimizes for stereo constraints and consistent mesh pa-
rameterization. Thus, our method is free from drift and fully parallelizable for dynamic facial
performance capture. We produce highly detailed facial geometries with artist-quality UV param-
eterization, including secondary elements such as eyeballs, mouth pockets, nostrils, and the back
of the head. Our approach consists of deforming a common template model to match multi-view
input images of the subject, while satisfying cross-view, cross-subject, and cross-pose consis-
tencies using a combination of 2D landmark detection, optical flow, and surface and volumetric
Laplacian regularization.
Figure I.6: (Left) high quality mesh deformed using our technique; (Middle) a reconstructed smile expression with a consistent parameterization, and (Right) high resolution displacement details. Half-face overlay visualization indicates good agreement and topology flow around geometric details within the facial expression.
Since the flow is never computed between frames, our method is trivially parallelized by processing each frame independently. Accurate rigid head pose is extracted
using a PCA-based dimension reduction and denoising scheme. We demonstrate high fidelity
performance capture results with challenging head motion and complex facial expressions around
eye and mouth regions. While the quality of our results is on par with the current state-of-the-art
techniques, our approach can be fully parallelized, does not suffer from drift, and produces face
models with production-quality mesh topologies.
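As an illustration of the dimension-reduction step, the following minimal numpy sketch projects every frame onto a truncated shape basis to suppress temporally inconsistent noise. It assumes the per-frame meshes have already been rigidly aligned, which the full method of Section III.5 solves jointly with this reduction; the component count is a placeholder rather than a value from the thesis.

```python
# Minimal sketch of PCA-based denoising over a set of rigidly aligned
# per-frame meshes in shared topology (illustrative only).
import numpy as np

def pca_denoise(frames_vertices, n_components=20):
    """frames_vertices: array of shape (F, V, 3) -- F aligned frames of a
    V-vertex mesh. Returns the same array with high-frequency, temporally
    inconsistent 'sizzling' suppressed by projecting each frame onto the
    top principal components of the whole sequence."""
    F, V, _ = frames_vertices.shape
    X = frames_vertices.reshape(F, V * 3)
    mean = X.mean(axis=0)
    Xc = X - mean
    # Thin SVD of the centered data; rows of Vt form the shape basis.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = min(n_components, len(S))
    coeffs = Xc @ Vt[:k].T              # per-frame coordinates in the basis
    Xd = coeffs @ Vt[:k] + mean         # reconstruct from the truncated basis
    return Xd.reshape(F, V, 3)
```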
Figure I.6 shows directly deformed artist-quality topology (Left), a captured shape (Middle), and
a detail layer (Right) in a calibrated camera view. Overlaying half the face onto the original
photograph shows high quality topology flows and good agreement for details, such as forehead
wrinkles. Figure I.7 shows corresponded static scans and performance frames of multiple individuals
with a wide variety of facial expressions and head poses, all derived from a single template with our
technique.
Figure I.7: Automatically corresponded static and dynamic frames from a single template.
Constraint-free Scene Flow for Multi-view Dynamic Reconstruction We present a constraint-
free scene flow method for high fidelity deformable surface reconstruction from multi-view input.
Given multi-view images, our method first computes high quality 3D reconstruction of the scene
followed by the 3D scene flow that simultaneously satisfies geometric and photo-consistency for
each surface point. We show that with carefully designed objectives, as well as strategies to handle
large deformations and outliers, it is possible to produce consistent and high quality 3D motion from
multi-view input without explicit priors.
Figure I.8: Our dynamic surface reconstruction takes (a) multi-view images (left) and a reconstructed mesh (right) as input, and computes (b) the scene flow for each vertex. The color code in the mesh (a) represents surface normal orientations, and that in (b) represents scene flow direction overlaid on the input images.
As we do not employ target-specific priors or reg-
ularization, our approach generalizes naturally to various deformable surfaces, including human
skin and clothing at high resolution. The optimization framework consists only of data terms, and
solves a small optimization problem for each vertex that can be fully parallelized and efficiently
processed on multiple CPUs and GPUs. Thus, our method is suitable for massively multi-view
capture setups consisting of dozens or possibly hundreds of high resolution cameras, as required in
high-end production today. We demonstrate that our approach exhibits excellent accuracy and can
handle challenging heterogeneous deformations, such as the interaction between the clothing and
a hand. Figure I.8 shows input RGB-D images on the left and output 3D motion computed with
our constraint-free scene flow method on the right.
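The following illustrative sketch shows the structural property that makes this scaling possible: the objective for each vertex contains only photo-consistency data terms, so every vertex can be solved independently and in parallel. It is a simplification rather than the method of Chapter IV, which additionally uses robust descriptors, confidence weights, and data-driven multi-scale propagation; the camera matrices, image lookups, and brute-force candidate search are illustrative assumptions.

```python
# Illustrative per-vertex scene-flow sketch: data terms only, no smoothness
# coupling between vertices, so the per-vertex solves parallelize trivially.
import numpy as np

def project(P, x):
    """Project 3D point x with a 3x4 camera matrix P; returns (u, v)."""
    h = P @ np.append(x, 1.0)
    return h[:2] / h[2]

def sample(img, uv):
    """Nearest-neighbor image lookup; returns +inf outside the image."""
    u, v = int(round(uv[0])), int(round(uv[1]))
    if 0 <= v < img.shape[0] and 0 <= u < img.shape[1]:
        return float(img[v, u])
    return np.inf

def photo_cost(x, d, cams, imgs_t, imgs_t1):
    # Sum of squared intensity differences between the point at time t and
    # the displaced point at time t+1, accumulated over the views seeing it.
    cost = 0.0
    for P, I0, I1 in zip(cams, imgs_t, imgs_t1):
        a, b = sample(I0, project(P, x)), sample(I1, project(P, x + d))
        if np.isfinite(a) and np.isfinite(b):
            cost += (a - b) ** 2
    return cost

def scene_flow_vertex(x, cams, imgs_t, imgs_t1, radius=2.0, steps=5):
    # Brute-force search over a small displacement grid around zero motion;
    # a real implementation would descend on robust descriptors instead.
    grid = np.linspace(-radius, radius, steps)
    candidates = np.array([[dx, dy, dz] for dx in grid
                           for dy in grid for dz in grid])
    costs = [photo_cost(x, d, cams, imgs_t, imgs_t1) for d in candidates]
    return candidates[int(np.argmin(costs))]
```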
Skin Microstructure Deformation We finally present a technique for synthesizing the effects
of skin microstructure deformation by anisotropically convolving a high resolution displacement
map to match normal distribution changes in measured skin samples. We use a 10-micron res-
olution scanning technique to measure several in vivo skin samples as they are stretched and
compressed in different directions, quantifying how stretching smooths the skin and compression
makes it rougher. We tabulate the resulting surface normal distributions, and show that convolv-
ing a neutral skin microstructure displacement map with blurring and sharpening filters can mimic
normal distribution changes and microstructure deformations. Spatially varying dense strain fields
(Figure I.9) are computed from high quality facial scan correspondence obtained from our tech-
nique, which properly drives the surface microstructure with a model learned from measurement.
Figure I.9: Strain field computed from corresponded facial expressions.
We implement the spatially-varying displacement map filtering on the GPU to interactively render
the effects of dynamic microgeometry on animated faces obtained from high resolution facial
scans (Figure I.10). We also demonstrate that our method can be integrated into a commercially
available offline renderer to improve the appearance of computer animated faces.
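To make the filtering operation concrete, the following is a minimal sketch (in Python, not the implementation used in this thesis) of directionally blurring or sharpening a neutral displacement map according to a single local strain value, so that stretching smooths the microstructure and compression roughens it. The function name, the strain-to-kernel mapping, and the unsharp-masking form of the sharpening are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def filter_microstructure(displacement, strain, axis=1, base_sigma=1.5, gain=1.0):
    """Blur (stretch) or sharpen (compression) a displacement map along one image axis.

    displacement : 2D neutral micro-displacement map.
    strain       : scalar local strain; positive = stretch, negative = compression (assumed convention).
    axis         : image axis aligned with the strain direction (a simplification).
    """
    sigma = base_sigma * min(abs(strain) * gain, 1.0)
    if sigma < 1e-6:
        return displacement.copy()
    blurred = gaussian_filter1d(displacement, sigma=sigma, axis=axis)
    if strain > 0:
        return blurred                                # stretching flattens micro-ridges
    return displacement + (displacement - blurred)    # compression: unsharp masking roughens

# Toy usage on a random neutral patch: std drops under stretch, rises under compression.
rng = np.random.default_rng(0)
neutral = gaussian_filter1d(rng.standard_normal((256, 256)), 2.0, axis=0)
print(neutral.std(), filter_microstructure(neutral, 0.3).std(), filter_microstructure(neutral, -0.3).std())
```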
Figure I.10: (Top) Two real-time renderings from a blendshape animation. (Left) Neutral face ren-
dered with initial microstructure. (Right) Wince expression rendered with deformed microstruc-
ture. Here, the local strain drives the microstructure convolution. (Bottom) Specular-only frames
from the real-time facial animation, corresponding to the expressions in the top row, showing
the gradual development of horizontal microstructure near the eye as the face winces. The bot-
tom frame includes an inset of a real photo taken under similar lighting and expression, showing
similarly deformed anisotropic skin textures.
Summary of Contributions In summary, the contributions of this thesis include:
(i). A framework for reconstructing dynamic faces with dense correspondence on a consistent
parameterization.
A passive multi-view capture system using static blue LEDs for imaging high resolu-
tion skin textures suitable for dynamic surface reconstruction.
A fully parallelizable multi-view stereo facial performance capture pipeline that pro-
duces a high quality facial reconstruction with consistent mesh topology.
An appearance-driven mesh deformation algorithm using optical flow on high resolu-
tion imaging data combined with volumetric Laplacian regularization.
A novel PCA-based pose estimation and denoising technique for general deformable
surfaces.
(ii). A method for estimating high fidelity 3D scene flow for multi-view dynamic surface recon-
struction.
A constraint-free scene flow method for dynamic surface reconstruction that uses ro-
bust descriptors to propagate local discriminative features in coarse-to-fine optimiza-
tion.
A data-driven technique to propagate well-behaved optimization landscapes in multi-
ple pyramid levels to encourage convergence to a correct local minimum.
The design of confidence terms that prune outliers in the data due to occlusions, geo-
metric inconsistency, and non-rigidity.
(iii). Approaches for synthesizing and simulating the effects of skin microstructure deformation
and the resulting skin reflection change.
A 10-micron resolution measurement system for skin microgeometry under deforma-
tion (stretching and compression).
Analyses for the effect of such microgeometry deformation on the resultant anisotropic
BRDF of skin.
An algorithm to efficiently render dynamic deformable microgeometry, which em-
ploys displacement map convolution with appropriate blurring and sharpening filters.
We demonstrate that the above technique can be integrated into both real-time and com-
mercial offline renderers to improve the realism of facial animation rendering.
Chapter II
Background and Related Work
A human face can convey an amazingly wide range of emotions and ideas within a subtle range of
facial expressions without even a word. In doing so, these expressions are universally understood
across different cultures. Thus, being able to digitally reproduce realistic facial appearance and
animation down to the most subtle details is important not only from a scientific point of view,
but also from a cultural standpoint. As virtual human technologies capture progressively more
personal traits and mannerisms in speech, gestures, and eye contact, they will allow for more
complex and socially engaging interactions between real and virtual worlds. The capability to
digitally recreate such dynamic facial details will enable more realistic and natural non-verbal face
to face interactions, while also increasing the realism of the experience in virtual communication.
To that end, the main focus of this thesis is creating realistic dynamic appearances of human
faces. In this chapter, we review related work for digitizing dynamic facial appearance.
Multi-Scale Analysis of Facial Features The human face exhibits a wide variety of features
at different scales, from facial landmarks to the microscale skin geometry variation. Since these
skin features exhibit different optical and visual properties, it is useful to classify the facial features
that matter most for modeling a digital human face. We categorize the skin
Figure II.1: Facial features at different scales.
components based on the taxonomy in [70]: Macrostructure, Mesostructure, and Microstructure,
and clarify what each taxonomy refers to in this thesis. Macrostructures feature the overall shape
of the face and include regions and parts of the face, such as nose, eyes, and jaw lines, and are
approximately on the order of centimeters. In the 3D computer graphics representation of a human
face, macrostructures can be represented by a relatively low resolution 3D polygonal mesh (tens
of thousands of polygons), and can be separated from finer scale features. Mesostructures are on the order of
millimeters down to sub-millimeters, and include fine wrinkles, pores, creases, and other details. These
features can be recorded with modern facial capture methods [150][100][12] as well as high reso-
lution scanning of casts [1], and may be stored in high resolution texture, normal, or displacement
maps. At this scale, the skin details contribute to the appearance of both surface and subsur-
face reflections. Microstructures feature fine geometric variations, such as small deviations inside
pores and wrinkles. In particular, in this work we refer to the structures at the scale of a few microns
conversation distance, they are linked to the luster of the human skin and are important for the
modeling of realistic surface reflection.
II.1 High Fidelity Face Capture
Driving the motion of digital characters with real actor performances has become a common and
effective process for creating realistic facial animation. Performance-driven facial animation dates
back as far as Williams [151], who used facial markers in monocular video to animate and deform
a 3D scanned facial model. Guenter et al. [61] drove a digital character from multi-view video
by 3D tracking a few hundred facial markers seen in six video cameras. Yet, even a dense set of
facial markers can miss subtle facial motion details necessary for conveying the entire meaning of
a performance. Addressing this, Disney’s Human Face Project [160] was perhaps the first to use
dense optical flow on multi-view video of a facial performance to obtain dense facial motion for
animation, setting the stage for the markerless multi-view facial capture system used to animate
realistic digital characters in the “The Matrix” film series [23]. For our multi-view performance
capture in Chapter 3, we use a multi-view video setup to record facial motion, but fit a model to
the images per time instant rather than temporally tracking the performance.
Real-Time Facial Performance Capture Faces assume many shapes but have the same fea-
tures in similar positions. As a result, faces can be modeled with generic templates, such as
morphable models [20]. Such models have proven useful in recent work for real-time perfor-
mance tracking [91, 146, 92, 24, 28, 66, 30, 119], expression transfer [148, 15, 134, 135], and
performance reconstruction from monocular video [52, 129, 122, 54]. Recent work demonstrated
reconstruction of a personalized avatar from mobile free-form videos [69], medium scale dynamic
wrinkles from RGB monocular video [27], and high fidelity mouth animation for head mounted
displays [113]. These template based approaches can provide facial animation in a common artist-
friendly topology with blendshape animations. However, they cannot capture shape details outside
of the assumed linear deformation subspace, which may be important for high quality expressive
facial animation. On the other hand, our multi-view capture technique in Chapter 3 produces ac-
curate 3D shapes comparable to multi-view stereo on a common head topology without the need
for complex facial rigs.
Multi-view Stereo Multi-view stereo approaches [45, 11, 14] remain popular since they yield
verifiable and accurate geometry even though they require offline computation. Our dynamic
performance reconstruction technique in Chapter 3 differs from techniques like [46, 25, 14], in
that we do not begin by solving for independent multi-view stereo geometry at each time instant.
In fact, our method does not require a set of high resolution facial scans (or even a single facial
Figure II.2: Comparison of our result (a) to PMVS2 [45] (b). Overlaying the meshes in (c)
indicates a good geometric match.
scan) of the subject to assist performance tracking as in [4, 67, 52, 5, 49]. Instead, we employ
optical flow and surface/volume Laplacian priors to constrain 3D vertex estimates based on a
template. Figure II.2 compares the 3D geometry obtained with our multi-view capture technique
to previous work [45].
Multi-view Facial Performance Capture Video-based facial performance capture is suscepti-
ble to “drift”, meaning inconsistencies in the relationship between facial features and the mesh
parameterization across different instances in time. For example, a naive algorithm that tracks
vertices from one frame to the next will accumulate error over the duration of a performance.
Previous works have taken measures to mitigate drift in a single performance. However, none
of these approaches lends itself to multiple performance clips or collections of single-frame cap-
tures. Furthermore, previous works addressing fine-scale consistency involve at least one manual
step if high quality topology is desired. Beeler et al. [14] employ a manually selected refer-
ence frame and geometry obtained from stereo reconstruction. If a clean topology is desired, it
is edited manually. The method locates “anchor frames” similar to the reference frame to seg-
ment the performance into short clips, and optical flow tracking is performed within each clip
and across clip seams. The main drawback of this method is that all captures must contain well
distributed anchor frames that resemble the reference, thus limiting the expressive freedom of the
performer. Klaudiny et al. [87] construct a minimum spanning tree in appearance space and em-
ploy non-sequential tracking to reduce drift, combined with temporal tracking to reduce temporal
seams. The user must manually create a mesh for the frame at the root of the tree, based on ge-
ometry obtained from stereo reconstruction. Despite the minimum spanning tree, expressions far
from the root expression still require concatenation of multiple flow fields, accumulating drift. If
single-frame captures are included in the data, it may fail altogether. Valgaerts et al. [140] employ
the first frame of a performance as a template. If clean topology is desired, it must be manually
edited. Sequences are processed from the first frame to the last. Synthetic renderings of the tem-
plate are employed to reduce drift via optical flow correction. This method cannot handle multiple
performance clips or single frame captures in correspondence. Garrido et al. [52] employ a neu-
tral facial scan as a template and require manually refined alignment of the neutral scan to the
starting frame of the performance. The method locates “key frames” resembling the neutral scan
(much like anchor frames) to segment the performance into short clips that are tracked via tem-
poral optical flow, and employs synthetic renderings of the neutral scan to reduce drift via optical
flow correction. This method is unsuitable for collections of multiple performances, as the man-
ual initial alignment required for each performance would be prohibitive and error-prone. Fyffe
et al. [49] employ multiple facial scans, with one neutral scan serving as a template. The neu-
tral scan topology is produced manually and is tracked directly to all performance frames and all
other scans using optical flow, which is combined with temporal optical flow and flows between
a sparse set of frames and automatically selected facial scans to minimize drift. This method
handles multiple performance clips and multiple single-frame captures in correspondence, but re-
quires multiple facial scans spanning the appearance space of the subject’s face, one of which is
manually processed.
Physics Simulation Realistic facial animation may also be generated through physical simu-
lation as in previous work [114, 131]. The computer animation "The Jester" [31] tracked the
performer’s face with a standard set of mocap markers but used finite element simulation to sim-
ulate higher-resolution performance details, such as the skin wrinkling around the eyes. Sifakis
et al. [126] used a bone, flesh, and muscle model of a face to reverse-engineer the muscle activa-
tions which generate the same motion of the face as recorded with mocap markers. Recent work
showed a way to automatically construct a personalized anatomical model for volumetric facial tis-
sue simulations [34]. In our work, we use a volumetric facial model to enable robust model fitting
solutions including occluded regions.
II.2 Types of Priors for Performance Capture
Technologies to digitize real world objects have made significant advances in the past decades
thanks to advanced depth sensing and high resolution/high-framerate imaging. Despite such
progress in the sensing technologies, dynamic surface reconstruction still remains a challeng-
ing problem due to artifacts present in the data (e.g. noise and blur), occlusion, and appearance
changes of a target object. Many researchers have focused on how to overcome these problems
by utilizing either domain-specific or generic priors.
Priors for Face Capture For facial performance, the highest quality results are currently
captured with a multi-view setup. Dense optical flow is often employed to capture a dense motion field
in the multi-view video stream. Even with the high quality input from a multi-view setup, surface
priors [14, 86, 26, 49] and/or volumetric priors [47] are still employed to obtain a well behaved
surface. In Chapter 4, we present a technique that directly estimates dense 3D motion (i.e. scene
flow [142]) that satisfies frame to frame photometric- and geometric-consistency in each view
without any priors. Valgaerts and colleagues [141] presented an unconstrained performance
capture algorithm using scene flow from a binocular stereo rig. The method not only relies on a
global smoothing term in the scene flow estimation, but also needs to interleave a Laplacian mesh
update as in previous works since the scene flow estimation tends to be noisy. In contrast, our
tracking framework produces consistent surface deformation without explicit smoothing thanks
to the robust descriptors, the multi-scale data-driven propagation scheme, and carefully designed
confidence terms. With a limited setup such as a monocular depth or color sensor, stronger
priors are usually employed to constrain the solution to a valid domain. For example,
a blendshape model [89] captures facial deformation in a linear combination of facial muscle
activations, and has been used in real-time facial performance capture [147, 93, 29, 28, 134, 135].
Recent work has shown results of high fidelity from a monocular sensor using a person-specific
linear model [53, 54, 123, 130, 24, 91], a regression framework to synthesize medium-scale details
[27], and an anatomically constrained local model [155].
Priors for Body Capture Capturing body performance is another challenging scenario in that
the presence of limbs and kinematic motion of a body causes unique and severe occlusions.
Therefore, many existing methods are based on a multi-view setup to maximize the coverage
and leverage priors, such as Laplacian priors [37, 38] and linear blend skinning [144, 157], to
fill in the missing data. More recent work uses body-specific priors, such as shape and pose pri-
ors [8, 65, 96], for reconstructing dynamic skin deformation from a sparse marker set [97], from
a multi-view input [115], from a monocular RGB-D camera [22], and for simulating realistic
human breathing [138]. While leveraging a human specific prior is a viable solution for human
digitization, it does not scale to arbitrary surfaces. Our scene flow estimation framework in Chap-
ter 4, on the other hand, never utilizes target-specific priors, and thus naturally generalizes
to arbitrary deformable surfaces.
Priors for Generic Capture To capture a wider variety of real world objects, people have
proposed more general approaches using a template obtained from depth scanning and non-
rigid registration [90, 166]. While template-based tracking methods can employ priors based
on topology such as surface Laplacian and as-rigid-as possible deformation, this constrains the
topology of the target to be fixed during the sequence, limiting the variety of performance to be
captured. To allow for topology change, Collet and colleagues [33] divide a performance into
short segments and perform tracking over the short segment with an updated topology. A more
recent trend is to build a template mesh and track performance simultaneously to handle topology
changes [133, 111, 40, 72]. However, the results lack detail because of aggressive regu-
larization and smoothing. Moreover, such priors usually require solving a global system of equations
over dense sets of unknown points, incurring heavy computational cost.
Priors for Scene Flow Point-based tracking (i.e., scene flow [142]) is free from the mesh-
related issues, such as topological constraints. Nonetheless, estimating dense scene flow fields
involves many challenges due to a lack of texture, appearance changes and occlusion, for which
recent approaches still rely on smoothness priors [139, 19, 10] even when an active lighting setup
is employed [58]. Joo and colleagues [81] decouple scene flow computation into point visibility
estimation and 3D trajectory reconstruction, the latter of which is free from spatial or temporal
regularization. While their method achieves accurate and stable 3D trajectories, the result still
lacks density in the estimated motion field. On the other hand, our constraint-free framework
estimates dense and accurate scene flow from multi-view passive input without any global terms.
Thus, our scene flow estimation runs fully in parallel and scales efficiently to massively multi-view,
high resolution inputs.
II.3 Capturing and Modeling of Dynamic Skin Appearance
Our skin is a complex multilayered organ that plays numerous roles in protection, heat regulation,
sensing, and hydration. As the principal surface we see when we look at each other, it is also
key in human communication, telling others about our health, our age, our physical state, and our
emotions. Since the appearance of the skin is such an important clue for human communication,
it has been studied well in the previous work including the dynamics of the human skin. Here we
summarize related work that addressed the capturing and modeling of its dynamic appearance.
Skin Roughness and Microstructure Skin is elastic, with an ability to safely stretch an average
of 60 to 75 percent [42]. The top layers of the skin, the epidermis, are relatively stiff and achieve
much of their elasticity through a network of fine-scale ridges and grooves which provide a reserve
of tissue which can flatten when pulled [107]. This microstructure varies in scale and texture
throughout the body and is responsible for the specular Bidirectional Reflectance Distribution
Function (BRDF) of the skin. As seen in Figure V .1, the skin’s texture orientation and BRDF
change dramatically as the skin is subjected to stretching and compression.
The physical and mechanical properties of skin have been studied extensively, including the rela-
tionship between deformation and changes in specular reflectance. Ferguson et al. [44] measured
skin roughness under varying strain and quantified how stretching reduces the surface roughness
along that direction (Figure II.3). They also showed a relationship between the surface profile
length and the distribution of surface orientations. Other previous work like [43], [63], and [120]
use surface roughness of soft tissue materials as an indicator of mechanical effects including ap-
plied stresses and strain. Federici et al. [43] and Guzelsu et al. [63] proposed a noninvasive
method to measure the stretch of soft tissues, including skin based on specular reflectivity. They
used polarized light to measure specular reflectance and observed that the reflected light increases
with stretching as the surface becomes smoother. Schulkin et al. [120] extended the measurement
to characterize how subsurface reflectance changes in response to mechanical effects.
Figure II.3: A normal distribution changing under deformation: stretched, neutral, and compressed.
Modeling a Microfacet BRDF In computer graphics, light reflection from rough surfaces is
often modeled by physically-based microfacet distribution models such as [137] and [35]. A mi-
crofacet distribution model simulates roughness by symmetric V-grooves at the microscopic level,
called microfacets, which are assumed to behave like a perfect mirror. More recently, Gaussian
random microfacet models are favored over V-groove models [64]. The change of microfacet
orientations at a micro level changes the resulting surface BRDF and alters the appearance of the
surface. Therefore, it is important to take into account such dynamic distributions in simulating a
surface BRDF.
Accurately modeling and efficiently rendering the subtle reflection effects of surface microstruc-
ture has been an area of significant recent interest [41, 74, 159] for man-made materials. The
technique proposed by [41] admits a microstructure scaling factor which can uniformly reduce or
increase the amplitude of surface microstructure as a material deforms. Our work in Chapter 5
models a richer set of deformation-based reflectance effects by directionally blurring and sharp-
ening the surface microstructure based on local surface strain, and we tune this filtering to match
measurements of real skin patches. However, we do not address efficient anti-aliased rendering
techniques.
Due to computational complexity, however, such techniques have not yet been applied to skin
microstructure at the 10 micron scale over an entire face. Data driven techniques have been
employed to synthesize facial details onto novel face poses using polynomial functions [101] or
statistical models [56], but only at the scale of mesostructure.
Simulation of Human Skin Physically-based simulation techniques have been applied to facial
animation at a range of scales from overall facial shape (e.g. [114, 132, 126, 16]) to surface
mesostructure on the order of forehead furrowing and crow’s feet around the eyes (e.g. [18,
118, 94]). Surface-based physics simulation, such as cloth simulation [9], and a mass-spring
system, or geometric simulation, such as as-rigid-as possible surface modeling [127] could be
employed to simulate the wrinkling of the surfaces. However, since the skin is a complex multi-
layered material consisting of surface and volumetric tissue layers which exhibit very different
mechanical properties, surface-only simulation tends not to capture the full complexity of skin
deformation. Recent work proposed a technique to simulate coupled interactions between surface
and underlying volumetric materials [118, 94]. The simulation is derived from a model that
links the layered material properties, the material thickness, and the frequency of the wrinkles
to simulate more complex skin deformation. The challenge in applying traditional simulation
approaches to the skin microstructure is the sheer number of elements required to represent surface
microstructures on the entire face. Figure II.4 shows the number of elements needed to represent
a portion of the forehead region. To simulate the entire face, the model needs billions of elements
just to represent the surface, for which a traditional simulation can be prohibitive.
Figure II.4: Scales of skin microstructures. (Left) The marked forehead region roughly corresponds
to the 2 by 1 centimeter rectangle (middle) seen in a machine vision camera. The red square in
the middle image corresponds to a 0.5 to 1.0 mm forehead region, which may contain tens of
thousands of points on one surface layer, as shown on the right.
Capturing Skin Microstructure Recent work in material acquisition has noted the importance
of capturing surface microstructure using techniques such as computational tomography [164]
and an elastomeric sensor [76]. In our work, we wish to capture in vivo skin microstructure
without contacting the surface and adapt the microstructure acquisition process of [60], which
showed that constrained texture synthesis could be used to create microstructure for an entire
facial model based on a set of discrete microstructure patches. Example data of polarized spherical
gradient images used to capture skin microstructures are shown in Figure II.5. von der Pahlen et al.
[145] showed a real-time implementation of skin microstructure using procedural noise functions
tailored to match measured skin samples.
Figure II.5: Neutral skin microstructure under polarized spherical gradient illumination. Gradi-
ents with (first row) a parallel condition, (second row) a cross polarized condition, (third row) the
difference of the first two showing the surface reflection on the skin patch, and (fourth row) output
specular albedo, diffuse albedo, specular normal, and displacement from left to right in this order.
Chapter III
Multi-view Dynamic Facial Capture and Correspondence
III.1 Introduction
Video-based facial performance capture has become a widely established technique for the digiti-
zation and animation of realistic virtual characters in high-end film and game production. While
recent advances in facial tracking research are pushing the boundaries of real-time performance
and robustness in unconstrained capture settings, professional studios still rely on computation-
ally demanding offline solutions with high-resolution imaging. To further avoid the uncanny val-
ley, time-consuming and expensive artist input (e.g. tracking clean-up or key-framing) is often
required to fine-tune the automated tracking results and ensure consistent UV parameterization
across the input frames.
State-of-the-art facial performance capture pipelines are mostly based on a multi-view stereo
setup to capture fine geometric details, and generally decouple the process of model building and
facial tracking. The facial model (often a parametric blendshape model) is designed to reflect the
expressiveness of the actor but also to ensure that any deformation stays within the shape and
expression space during tracking. Because of the complexity of facial expressions and potentially
large deformations, most trackers are initialized from the previous input frames. However, such
sequential approaches cannot be parallelized and naturally result in drift, which requires either
artist-assisted tracking corrections or ad-hoc segmentation of the performance into short temporal
clips.
We show in this work that it is possible to directly obtain, for any frame, a high-resolution facial
model with consistent mesh topology using a passive multi-view capture system with flat illu-
mination and high-resolution input images. We propose a framework that can accurately warp
a reference template model with existing texture parameterization to the face of any person, and
demonstrate successful results on a wide range of subjects and challenging expressions. While
existing multi-view methods either explicitly compute the geometry [14, 87] or implicitly encode
stereo constraints [140], they rely on optical flow or scene-flow to track a face model, for which
computation is only possible sequentially. Breaking up the performance into short clips using
anchor frames or key frames with a common appearance is only a partial solution, as it requires
the subject to return to a common expression repeatedly throughout the performance.
Our objective is to warp a common template model to a different person in arbitrary poses and
different expressions while ensuring consistent anatomical matches between subjects and accu-
rate tracking across frames. The key challenge is to handle the large variations of facial ap-
pearances and geometries, as well as the complexity of facial expression and large deformations.
We propose an appearance-driven mesh deformation approach that produces intermediate warped
photographs for reliable and accurate optical flow computation. Our approach effectively avoids
image discontinuities and artifacts often caused by methods based on synthetic renderings or tex-
ture reprojection.
In a first pass, we compute temporally consistent animations, produced from indepen-
dently computed frames, by deforming a template model to the expressions of each frame while
enforcing consistent cross-subject correspondences. To initialize our face fitting, we leverage re-
cent work in facial landmark detection. In each subsequent phase of our method, the appearance-
based mesh warping is driven by the mesh estimate from the previous phase. We show that even
where the reference and target images exhibit significant differences in appearance (due to signifi-
cant head rotation, different subjects, or expression changes), our warping approach progressively
converges to a high-quality correspondence. Our method does not require a complex facial rig or
blendshape priors. Instead, we deform the full head topology according to the multi-view optical
flow correspondences, and use a combination of surface and volumetric Laplacian regularization
to produce a well-behaved shape, which helps especially in regions that are prone to occlusion
and inter-penetration such as the eyes and mouth pocket.
As the unobserved regions (e.g. the back of the head) are inferred from the Laplacian deforma-
tion, these regions may be temporally inconsistent in the presence of significant head motion or
expression changes. Thus, we introduce a PCA based technique for general deformable surface
alignment to align and denoise the facial meshes over the entire performance, which improves
temporal consistency around the top and back of the head and reduces high-frequency “sizzling”
noise. Unlike [13, 156], our method does not employ any prior knowledge of anatomy. We
then compute a subject-specific template and refine the performance capture in a second pass to
achieve pore-level tracking accuracy.
Our method never computes optical flow between neighboring frames, and never compares a syn-
thetic rendering to a photograph. Thus, our method does not suffer from drift, and accurately
corresponds regions that are difficult to render synthetically such as around the eyes. Our method
can be applied equally well to a set of single-frame expression captures with no temporal con-
tinuity, bringing a wide variety of facial expressions into (u,v) correspondence with pore-level
accuracy. Furthermore, our joint optimization for stereo and fitting constraints also improves the
Figure III.1: Our pipeline proceeds in six phases, illustrated as numbered circles. 1) A common
template is fitted to multi-view imagery of a subject using landmark-based fitting (Section III.3.1).
2) The mesh is refined for every frame using optical flow for coarse-scale consistency and stereo
(Section III.3.2). 3) The meshes of all frames are aligned and denoised using a PCA scheme
(Section III.3.3). 4) A personalized template is extracted and employed to refine the meshes for
fine-scale consistency (Section III.3.4). 5) Final pose estimation and denoising reduces “sizzling”
(Section III.3.5). 6) Details are estimated from the imagery (Section III.3.6).
digitization quality around highly occluded regions, such as mouth, eyes, and nostrils, as they
provide additional reconstruction cues in the form of shape priors.
III.2 Shared Template Mesh
Rather than requiring a manually constructed personalized template for each subject, our method
automatically customizes a generic template, including the eyes and mouth interior. We maintain
a consistent representation of this face mesh throughout our process: a shared template mesh with
its deformation parameterized on the vertices. The original template can be any high-quality artist
mesh with associated multi-view photographs. To enable volumetric regularization, we construct
a tetrahedral mesh for the template using TetGen [125] (Figure III.2). We also symmetrize the
template mesh by averaging each vertex position with that of the mirrored position of the vertex
Figure III.2: Production-quality mesh template and the cross-section of the volumetric template
constructed from the surface.
bilaterally opposite it. This is because we do not want to introduce any facial feature asymme-
tries of the template into the Laplacian shape prior. For operations relating the template back to
its multi-view photographs, we use the original vertex positions. For operations employing the
template as a Laplacian shape prior, we employ the symmetrized vertex positions.
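For illustration, a minimal sketch of the symmetrization step described above, assuming a precomputed correspondence `mirror_idx` that gives, for each vertex, the index of its bilaterally opposite vertex, and assuming the template is roughly aligned so that the sagittal plane is x = 0; both are assumptions of this sketch rather than details stated in the text.

```python
import numpy as np

def symmetrize_template(vertices, mirror_idx):
    """Average each vertex with the mirrored position of its bilateral counterpart.

    vertices   : (N, 3) template vertex positions, assumed roughly aligned so the
                 sagittal plane is x = 0 (an assumption of this sketch).
    mirror_idx : (N,) index of the bilaterally opposite vertex for each vertex.
    """
    mirrored = vertices[mirror_idx].copy()
    mirrored[:, 0] *= -1.0   # reflect the counterpart across the sagittal plane
    return 0.5 * (vertices + mirrored)
```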
We demonstrate initialization using a high-quality artist mesh template constructed from multi-
view photography. We use the freely available “Digital Emily” mesh, photographs, and camera
calibration from [88]. The identity of the template is of no significance, though we purposely
chose a template with no extreme unique facial features. A single template can be reused for
many recordings of different subjects. We also compare to results obtained from a morphable
model [134] with synthetic renderings in place of multi-view photographs.
Figure III.3: (a) Facial landmarks detected on the source subject and (b) the corresponding tem-
plate; (c) The template deformed based on detected landmarks on the template and subject pho-
tographs; (d) Detailed template fitting based on optical flow between the template and subject,
and between views.
III.3 Method Overview
Given an existing template mesh, we can reconstruct multiple video performances by optimizing
photoconsistency cues between different views, across different expressions, and across different
subjects. Our method consists of six sequential phases, illustrated in Figure III.1.
Some phases share the same underlying algorithm; therefore, in this section we provide a short
overview of each phase and give further technical details in Sections III.4 and III.5.
We report run times for each phase based on computers with dual 8-core Intel E5620 processors
and NVidia GTX980 graphics cards. All phases except for rigid alignment are trivially paralleliz-
able across frames.
III.3.1 Landmark-Based Initialization
First, we leverage 2D facial landmark detection to deform the common template and compute
an initial mesh for each frame of the performance. Subsequent optical flow steps require a mesh
estimate that is reasonably close to the true shape. We estimate facial landmark positions on
all frames and views using the method of [83] implemented in the DLib library [85]. We then
triangulate 3D positions with outlier rejection, as the landmark detection can be noisy. We use
the same procedure for the template photographs to locate the template landmark positions. Fig-
ure III.3(a) shows an example with detected landmarks as black dots, and triangulated landmarks
after outlier rejection as white dots. We transform the 3D landmarks of all poses to a common
coordinate system using an approximate rigid registration to the template landmarks. We perform
PCA-based denoising per subject in the registered space to remove any isolated errors, and then
transform the landmarks back into world space. We additionally apply Gaussian smoothing to the
landmark trajectories in each performance sequence. Finally, we compute a smooth deformation
of the template to non-rigidly register it to the world space 3D landmarks of each captured facial
pose using Laplacian mesh deformation. Figure III.3(c) shows an example of the template de-
formed to a subject using only the landmarks. These deformed template meshes form the initial
estimates in our pipeline, and are not required to be entirely accurate. This phase takes only a few
seconds per frame, which are processed in parallel except for the PCA step.
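A minimal sketch of multi-view landmark triangulation with a simple reprojection-based outlier rejection, in the spirit of the initialization above. The DLT formulation, the 3x4 projection-matrix camera model, and the pixel threshold are assumptions of this sketch, not the exact procedure used here.

```python
import numpy as np

def triangulate_dlt(P_list, uv_list):
    """Linear (DLT) triangulation of one landmark from two or more views.

    P_list : list of 3x4 camera projection matrices.
    uv_list: list of (u, v) pixel observations, one per camera.
    """
    A = []
    for P, (u, v) in zip(P_list, uv_list):
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]
    return X[:3] / X[3]

def triangulate_with_rejection(P_list, uv_list, thresh_px=5.0):
    """Triangulate, then drop views whose reprojection error exceeds a pixel threshold."""
    X = triangulate_dlt(P_list, uv_list)
    keep = []
    for P, uv in zip(P_list, uv_list):
        x = P @ np.append(X, 1.0)
        keep.append(np.linalg.norm(x[:2] / x[2] - np.asarray(uv)) < thresh_px)
    if sum(keep) >= 2 and not all(keep):
        inliers = [(P, uv) for P, uv, k in zip(P_list, uv_list, keep) if k]
        X = triangulate_dlt([P for P, _ in inliers], [uv for _, uv in inliers])
    return X
```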
III.3.2 Coarse-Scale Template Warping
Starting from the landmark-based initialization in Section III.3.1, we employ an appearance-
driven mesh deformation scheme to propagate the shared template mesh onto the performance
frames. More details on this algorithm are provided in Section III.4. This phase takes 25 minutes
per frame, which is then processed in parallel. After this phase, the processed facial meshes are
all high quality 3D scans with the same topology, and are consistent at the level of coarse features
such as the eyebrows, eyes, nostrils, and corners of the mouth. Figure III.4 shows the results
at this phase directly deforming the template to multiple poses of the same individual without
Figure III.4: Facial expressions reconstructed without temporal flow.
using any temporal information. Despite significant facial motion, the mesh topology remains
consistent with the template. If only a single pose is desired for each subject, we can stop here.
If sequences or multiple poses were captured, we continue with the remaining phases to improve
consistency across poses.
III.3.3 Pose Estimation, Denoising, and Template Personalization
The face mesh estimates from Section III.3.2 are reasonably good facial scans, but they exhibit
two sources of distracting temporal noise. First, they lack fine-scale consistency in the UV do-
main, and second, any vertices that are extrapolated in place of missing data may differ consid-
erably from frame to frame (for example, around the back of the head). The primary purpose of
this phase is to produce a mesh sequence that is temporally smooth, with a plausible deformation
basis, and closer to the true face sequence than the original estimate in Section III.3.1. We wish to
project the meshes into a reduced dimensional deformation basis to remove some of the temporal
noise, which requires the meshes to be registered to a rigidly aligned head pose space rather than
roaming free in world space. Typically, this is accomplished through iterative schemes, alternat-
ing between pose estimation and deformation basis estimation. In Section III.5 we describe a
method to decouple the pose from the deformation basis, allowing us to first remove the relative
rotation from the meshes without knowledge of the deformation basis, then remove the relative
translation, and finally compute the deformation basis via PCA. We truncate the basis retaining
95% of the variance, which reduces temporal noise without requiring frame-to-frame smoothing.
Finally, we identify the frame whose shape is closest to the mean shape in the PCA basis, and let
this frame be the personalized template frame for the subsequent phases. This phase, which is not
easily parallelized, takes about 8 seconds per frame.
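A minimal sketch of the variance-truncated PCA step, assuming the per-frame meshes have already been rigidly aligned and flattened into row vectors; the SVD-based formulation is one straightforward way to realize it.

```python
import numpy as np

def pca_denoise(frames, keep_variance=0.95):
    """Project rigidly aligned mesh frames onto a truncated PCA basis and reconstruct.

    frames : (F, 3N) array, each row a flattened, rigidly aligned mesh.
    Returns the denoised frames and the number of retained modes.
    """
    mean = frames.mean(axis=0)
    centered = frames - mean
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    var = S ** 2
    cum = np.cumsum(var) / var.sum()
    k = int(np.searchsorted(cum, keep_variance)) + 1   # smallest basis reaching the target variance
    denoised = (U[:, :k] * S[:k]) @ Vt[:k] + mean       # higher modes carry temporal noise
    return denoised, k
```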
III.3.4 Fine-Scale Template Warping
This phase is nearly identical to Section III.3.2, except that we propagate the shared template
only to the personalized template frame identified in Section III.3.3 (per subject), after which
the updated personalized template becomes the new template for the remaining frames (again,
per subject). This enables fine-scale consistency to be obtained from optical flow, as the pores,
blemishes, and fine wrinkles on a subject’s skin provide ample registration markers across poses.
Further, we start from the denoised estimates from Section III.3.3 instead of the landmark based
estimates of Section III.3.1, which are much closer to the actual face shape of each frame, reducing
the likelihood of false matches in the optical flow.
III.3.5 Final Pose Estimation and Denoising
After the consistent mesh has been computed for all frames, we perform a final step of rigid reg-
istration to the personalized template and PCA denoising, similar to Section III.3.3 but retaining
99% of the variance. We found this helps remove “sizzling” noise produced by variations in the
optical flow. We also denoise the eye gaze animation using a simple Gaussian filter. More details
on this phase are provided in Section III.5.
Figure III.5: (a) Dense base mesh; (b) Proposed detail enhancement; (c) “Dark is deep” detail
enhancement.
III.3.6 Detail Enhancement
Finally, we extract texture maps for each frame, and employ the high frequency information to
enhance the surface detail already computed on the dense mesh in Section III.4.3, in a similar
manner as [11]. We make the additional observation that the sequence of texture maps holds an
additional cue: when wrinkles appear on the face, they tend to make the surface shading darker
relative to the neutral state. To exploit this, we compute the difference between the texture of each
frame and the texture of the personalized template, and then filter it with an orientation-sensitive
filter to remove fine pores but retain wrinkles. We call this the wrinkle map, and we employ it
as a medium-frequency displacement, in addition to the high-frequency displacement obtained
from a high-pass filter of all texture details. We call this scheme “darker is deeper”, as opposed
to the “dark is deep” schemes from the literature. Figure III.5 shows the dense mesh, enhanced
details captured by our proposed technique, and details using a method similar to [14]. This step,
including texture extraction and mesh displacement, takes 10 minutes per frame and is trivially
parallelizable.
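A minimal sketch of the "darker is deeper" idea: each frame's texture is compared against the personalized template texture, and the smoothed darkening is used as a medium-frequency displacement on top of a high-pass detail term. The isotropic Gaussian here stands in for the orientation-sensitive filter described above, and the weights are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def detail_displacement(frame_tex, template_tex, w_wrinkle=0.02, w_high=0.005):
    """Combine a wrinkle map and a high-pass detail map into a displacement map.

    frame_tex, template_tex : 2D luminance textures in UV space, same resolution.
    Darkening relative to the template pushes the surface inward (negative displacement).
    """
    diff = frame_tex - template_tex
    # Medium frequency ("darker is deeper"): keep only darkening, smooth away fine pores.
    wrinkle = gaussian_filter(np.minimum(diff, 0.0), sigma=4.0)
    # High frequency ("dark is deep"-style detail): high-pass of the frame texture itself.
    high = frame_tex - gaussian_filter(frame_tex, sigma=2.0)
    return w_wrinkle * wrinkle + w_high * high
```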
III.4 Appearance-Driven Mesh Deformation
We now describe in detail the deformation algorithm mentioned in Sections III.3.2 and III.3.4.
Suppose we have a known reference mesh with vertices represented as $y_i \in Y$ and a set of photographs $I^Y_j \in \mathcal{I}^Y$ corresponding to the reference mesh along with camera calibrations. Now, suppose we also have photographs $I^X_k \in \mathcal{I}^X$ and camera calibrations for some other, unknown mesh with vertices represented as $x_i \in X$. Our goal is to estimate $X$ given $Y$, $\mathcal{I}^Y$, $\mathcal{I}^X$. In other words, we propagate the known reference mesh $Y$ to the unknown configuration $X$ using evidence from the photographs of both. Suppose we have a previous estimate $\hat{X}$, somewhat close to the true $X$. (We explain how to obtain an initial estimate in Section III.3.1.) We can improve the estimate $\hat{X}$ by first updating each vertex estimate $\hat{x}_i \in \hat{X}$ using optical flow (described in Section III.4.2), then updating the entire mesh estimate $\hat{X}$ using Laplacian shape regularization with $Y$ as a reference shape (described in Section III.4.4). Finally, we position the eyeballs based on flow vectors and geometric evidence from the eyelid region (described in Section III.4.5).
III.4.1 Image Warping
Before further discussion, we must address the difficult challenge of computing meaningful opti-
cal flow between pairs of photographs that may differ in viewpoint, in facial expression, in subject
identity, or any combination of the three. We assume high-resolution images and flat illumina-
tion, so different poses of a subject will have generally similar shading and enough fine details for
good registration. Still, if the pose varies significantly or if the subject differs, naive optical flow
estimation will generally fail. For example, Figure III.6(b, f) shows the result of naively warping
one subject to another using optical flow, which would not be useful for facial correspondence
since the flow mostly fails. Even in these cases, we desire a flow field that aligns coarse facial
41
features, even though the fine-scale features will lack meaningful matches, as in Figure III.6(d,
h).
Our solution is to warp the image of one face to resemble the other face before computing optical
flow (and vice-versa to compute optical flow in the other direction). We warp the images based on
the current 3D mesh estimates (first obtained via the initialization in Section III.3.1.) One might
try rendering a synthetic image in the first camera view using the first mesh and texture sourced
from the second image via the second mesh, to produce a warped version of the second image in a
similar configuration to the first. However, this approach would introduce artificial discontinuities
wherever the current mesh estimates are not in precise alignment with the photographs, and such
discontinuities would confuse the optical flow algorithm. Thus, we instead construct a smooth
vector field to serve as an image-space warp that is free of discontinuities. We compute this
vector field by rasterizing the first mesh into the first camera view, but instead of storing pixel
colors we write the second camera’s projected image plane coordinates (obtained via the second
mesh). We skip pixels that are occluded in either view using a z buffer for each camera, and
smoothly interpolate the missing vectors across the entire frame. We then apply a small Gaussian
blur to slightly smooth any discontinuities, and finally warp the image using the smooth vector
field. Examples using our warping scheme are shown in Figure III.6(c, g), and the shape is close
enough to the true shape to produce a relatively successful optical flow result using the method
of [149], shown in Figure III.6(d, h). After computing flow between the warped image and the
target image, we concatenate the vector field warp and the optical flow vector field to produce
the complete flow field. This is implemented simply by warping the vector field using the optical
flow field.
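A minimal sketch of the warp-then-flow idea described above: a dense field of target-view coordinates is assumed to have been rasterized from the current mesh estimates (with a validity mask), missing vectors are filled and slightly blurred, the second image is warped through the field, and the residual optical flow is finally concatenated with the warp. The interpolation scheme and the composition convention are assumptions of this sketch; the rasterizer and the optical flow routine themselves are not shown.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def fill_and_smooth(warp, valid, fill_sigma=25.0, blur_sigma=3.0):
    """Fill missing warp vectors by normalized blurring, then lightly smooth the field.

    warp  : (H, W, 2) target-view coordinates (row, col) for each source pixel.
    valid : (H, W) boolean mask of pixels covered by the rasterized mesh.
    """
    out = np.empty_like(warp)
    w = gaussian_filter(valid.astype(float), fill_sigma) + 1e-6
    for c in range(2):
        filled = gaussian_filter(np.where(valid, warp[..., c], 0.0), fill_sigma) / w
        merged = np.where(valid, warp[..., c], filled)      # keep rasterized vectors where available
        out[..., c] = gaussian_filter(merged, blur_sigma)   # small blur hides discontinuities
    return out

def warp_image(image, warp):
    """Resample a grayscale image at the coordinates stored in the warp field."""
    return map_coordinates(image, [warp[..., 0], warp[..., 1]], order=1, mode='nearest')

def concatenate_flow(warp, flow):
    """Compose the smooth warp with a residual optical flow (row/col offsets per pixel)."""
    h, w = flow.shape[:2]
    rows, cols = np.mgrid[0:h, 0:w].astype(float)
    out = np.empty_like(warp)
    for c in range(2):
        # Look up the warp field at the flow-displaced position of each pixel.
        out[..., c] = map_coordinates(warp[..., c],
                                      [rows + flow[..., 0], cols + flow[..., 1]],
                                      order=1, mode='nearest')
    return out
```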
Figure III.6: Optical flow between photographs of different subjects (a) and (e) performs poorly,
producing the warped images (b) and (f). Using 3D mesh estimates (e.g. a template deformed
based on facial landmarks), we compute a smooth vector field to produce the warped images (c)
and (g). Optical flow between the original images (a, e) and the warped images (c, g) produces
the relatively successful final warped images (d) and (h).
III.4.2 Optical Flow Based Update
Within the set of images $\mathcal{I}^Y, \mathcal{I}^X$, we may find cues about the shape of the unknown mesh $X$ and the relationship between $X$ and the reference mesh $Y$. We employ a stereo cue between images from the same time instant, and a reference cue between images from different time instants or different subjects, using optical flow and triangulation (Figure III.7). First, consider a stereo cue. Given an estimate $p^k_i = P^k(\hat{x}_i) \approx P^k(x_i)$, with $P^k$ representing the projection from world space to the image plane coordinates of view $k$ of $X$, we can employ an optical flow field $F^l_k$ between views $k$ and $l$ of $X$ to estimate $p^l_i = F^l_k(p^k_i) \approx P^l(x_i)$. Defining $D^k(p^k_i) = I - d^k(p^k_i)\, d^k(p^k_i)^T$, where $d^k(p^k_i)$ is the world space view vector passing through image plane coordinate $p^k_i$ of the camera of view $k$ of $X$, and $c^k$ the center of projection of the lens of the same camera (and likewise for $l$), we may employ $p^k_i$ and $p^l_i$ together to triangulate $\hat{x}_i$ as:

$$\hat{x}_i \leftarrow \operatorname*{argmin}_{x_i} \left\| D^k(p^k_i)(x_i - c^k) \right\|^2 + \left\| D^l(p^l_i)(x_i - c^l) \right\|^2, \tag{III.1}$$
which may be solved in closed form. Next, consider a reference cue, which is a cue involving
the reference mesh $Y$ being either a shared template or another pose of the same subject. Given a known or estimated $q^j_i = Q^j(y_i)$, with $Q^j$ representing the projection from world space to the image plane coordinates of view $j$ of $Y$, we can employ an optical flow field $G^k_j$ between view $j$ of $Y$ and view $k$ of $X$ to estimate $p^k_i = G^k_j(q^j_i) \approx P^k(x_i)$. Next, we use $F^l_k$ to obtain $p^l_i$ from $p^k_i$ as we did for the stereo cue, and triangulate $\hat{x}_i$ as before. However, instead of triangulating all these different cues separately, we combine them into a single triangulation, introducing a scalar field $r^k_j$ representing the optical flow confidence for flow field $G^k_j$, and $s^l_k$ representing the optical flow confidence for flow field $F^l_k$.

Figure III.7: (a) A stereo cue. An estimated point $x$ is projected to the 2D point $p^k$ in view $k$. A flow field $F^l_k$ transfers the 2D point to $p^l$ in a second view $l$. The point $x$ is updated by triangulating the rays through $p^k$ and $p^l$. (b) A reference cue. An estimated point $y$ is projected to the 2D point $q^j$ in view $j$. A flow field $G^k_j$ transfers the 2D point to $p^k$ in view $k$ of a different subject or different time. A second flow field $F^l_k$ transfers the 2D point to $p^l$ in view $l$, and then point $x$ is estimated by triangulating the rays through $p^k$ and $p^l$.

Including one more parameter $\gamma$ to balance between stereo cues and reference cues, the combined triangulation becomes:
$$\hat{x}_i \leftarrow \operatorname*{argmin}_{x_i}\;
\gamma \sum_{k,l} {s^l_k}^2(p^k_i)\left[ \left\| D^k(p^k_i)(x_i - c^k) \right\|^2 + \left\| D^l(F^l_k(p^k_i))(x_i - c^l) \right\|^2 \right]
+ (1-\gamma)\sum_{j,k,l} r^k_j(q^j_i)\, s^l_k(G^k_j(q^j_i))\left[ \left\| D^k(G^k_j(q^j_i))(x_i - c^k) \right\|^2 + \left\| D^l(F^l_k(G^k_j(q^j_i)))(x_i - c^l) \right\|^2 \right].
\tag{III.2}$$
This differs from [49] in that reference flows are employed as a lookup into stereo flows instead of
attempting to triangulate pairs of reference flows, and differs from [14] in that the geometry is not
computed beforehand; rather the stereo and consistency are satisfied together. While (III.2) can be
trivially solved in closed form, the flow field is dependent on the previous estimate $\hat{x}_i$, and hence
we perform several iterations of optical flow updates interleaved with Laplacian regularization
for the entire face. We schedule the parameter $\gamma$ to range from 0 in the first iteration to 1 in the
last iteration, so that the solution respects the reference most at the beginning, and respects stereo
most at the end. We find five iterations to generally be sufficient, and we recompute the optical
flow fields after the second iteration as the mesh will be closer to the true shape, and a better flow
may be obtained.
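For concreteness, a minimal sketch of the closed-form solve behind Eqs. (III.1) and (III.2): each cue contributes a weighted term of the form ||D(x - c)||^2 with D = I - d d^T, so the minimizer satisfies a 3x3 linear system accumulated over all cues. The helper names and the toy two-ray example are illustrative.

```python
import numpy as np

def ray_projector(direction):
    """D = I - d d^T projects onto the plane orthogonal to the (unit) ray direction."""
    d = np.asarray(direction, float)
    d = d / np.linalg.norm(d)
    return np.eye(3) - np.outer(d, d)

def triangulate_weighted(rays, weights=None):
    """Minimize sum_m w_m ||D_m (x - c_m)||^2 over x, in closed form.

    rays    : list of (center, direction) pairs, one per cue.
    weights : optional per-cue confidences (e.g. the s and r fields above).
    """
    if weights is None:
        weights = np.ones(len(rays))
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for (c, d), w in zip(rays, weights):
        D = ray_projector(d)
        # D is symmetric and idempotent, so the normal equations accumulate w*D and w*D*c.
        A += w * D
        b += w * D @ np.asarray(c, float)
    return np.linalg.solve(A, b)

# Toy example: two rays that intersect at the origin.
x = triangulate_weighted([(np.array([0.0, 0.0, -5.0]), np.array([0.0, 0.0, 1.0])),
                          (np.array([3.0, 0.0, -4.0]), np.array([-0.6, 0.0, 0.8]))])
print(np.round(x, 6))  # ~ [0, 0, 0]
```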
The optical flow confidence fields $r^k_j$ and $s^l_k$ are vitally important to the success of the method. The optical flow implementation we use provides an estimate of flow confidence [149] based on the optical flow matching term, which we extend in a few ways. First, we compute flows both ways between each pair of images, and multiply the confidence by an exponentially decaying function of the round-trip distance. Specifically, in $s^l_k(p^k_i)$ we include a factor $\exp\!\left(-\alpha \left\| p^k_i - F^k_l(F^l_k(p^k_i)) \right\|^2\right)$ (with normalized image coordinates), where $\alpha = 20$ is a parameter controlling round-trip strictness, and analogously for $r^k_j$. Since we utilize both directions of the flow fields anyway, this
adds little computational overhead. For stereo flows (i.e. flows between views of the same pose)
we include an additional factor penalizing epipolar disagreement, including in $s^l_k(p^k_i)$ the factor $\exp\!\left(-\beta\, \mathrm{dl}^2\!\left(c^k, d^k(p^k_i), c^l, d^l(F^l_k(p^k_i))\right)\right)$, where $\mathrm{dl}(o_1, d_1, o_2, d_2)$ is the closest distance between the ray defined by origin $o_1$ and direction $d_1$ and the ray defined by origin $o_2$ and direction $d_2$, and $\beta = 500$ is a parameter controlling epipolar strictness. Penalizing epipolar disagreement, rather than searching strictly on epipolar lines, allows our method to find correspondences even in the presence of noise in the camera calibrations. In consideration of visibility and occlusion, we employ the current estimate $\hat{X}$ to compute per-vertex visibility in each view of $X$ using a z-buffer and back-face culling on the GPU, and likewise for each view of $Y$. If vertex $i$ is not visible in view $k$, we set $s^l_k(p^k_i)$ to 0. Otherwise, we include a factor of $(n_i \cdot d^k(p^k_i))^2$ to soften the visibility based on the current surface normal estimate $n_i$. We include a similar factor for view $l$ in $s^l_k(p^k_i)$, and for view $j$ of $Y$ and view $k$ of $X$ in $r^k_j$. As an optimization, we omit flow fields altogether if the current estimated head pose relative to the camera differs significantly between the two views to be flowed. We compute the closest rigid transform between the two mesh estimates in their respective camera coordinates, and skip the flow field computation if the relative transform includes a rotation of more than twenty degrees.
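A minimal sketch of the two confidence factors just described: the round-trip consistency factor and the epipolar-disagreement factor. The symbols alpha and beta follow the reconstruction above, the flow fields are represented as callables, and the ray-distance helper is the standard closest distance between two lines; all of this is illustrative rather than the exact implementation.

```python
import numpy as np

def roundtrip_factor(p, flow_fwd, flow_bwd, alpha=20.0):
    """exp(-alpha * ||p - F_back(F_fwd(p))||^2), with normalized image coordinates.

    flow_fwd, flow_bwd : callables mapping a 2D point from one view into the other.
    """
    q = flow_bwd(flow_fwd(np.asarray(p, float)))
    return float(np.exp(-alpha * np.sum((np.asarray(p, float) - np.asarray(q, float)) ** 2)))

def ray_distance(o1, d1, o2, d2):
    """Closest distance between two rays given by (origin, direction)."""
    n = np.cross(d1, d2)
    norm = np.linalg.norm(n)
    if norm < 1e-12:  # near-parallel rays: distance from o2 to the first line
        return np.linalg.norm(np.cross(o2 - o1, d1)) / np.linalg.norm(d1)
    return abs(np.dot(o2 - o1, n)) / norm

def epipolar_factor(c_k, d_k, c_l, d_l, beta=500.0):
    """exp(-beta * dl^2): down-weight stereo matches whose rays do not nearly intersect."""
    return float(np.exp(-beta * ray_distance(c_k, d_k, c_l, d_l) ** 2))
```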
III.4.3 Dense Mesh Representation
The optical flow fields used in Section III.4.2 contain dense information about the facial shape,
yet we compute our solution only on the vertices of an artist-quality mesh. It would be wasteful
to discard the unused information in the dense flow fields. Indeed, we note that sampling the flow
fields only at the artist mesh vertices in Section III.4.2 introduces some amount of aliasing, as
flow field values in between vertices are ignored. We therefore compute an auxiliary dense mesh with
262,144 vertices parameterized on a 512 × 512 vertex grid in UV space. Optical flow updates are
applied to all vertices of the dense mesh as in Section III.4.2. We then regularize the dense mesh
using the surface Laplacian terms from Section III.4.4, but omit the volumetric terms as they are
prohibitive for such a dense mesh. The surface Laplacian terms, on the other hand, are easily
expressed and solved on the dense grid parameterization.
This dense mesh provides two benefits. First, it provides an intermediate estimate that is free of
aliasing, which we utilize by looking up the dense vertex position at the same UV coordinate as
each artist mesh vertex. This estimate lacks volumetric regularization, but that is applied next in
Section III.4.4. Second, the dense mesh contains surface detail at finer scales than the artist mesh
vertices, and so we employ the dense mesh in Section III.3.6 as a base for detailed displacement
map estimation.
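A minimal sketch of the aliasing-free lookup described above: because the dense mesh is parameterized on a regular UV lattice, each artist-mesh vertex can simply sample the grid bilinearly at its own UV coordinate. The bilinear interpolation and the array layout are assumptions of this sketch.

```python
import numpy as np

def sample_dense_mesh(dense_grid, uv):
    """Bilinearly sample an (R, R, 3) dense-vertex grid at UV coordinates in [0, 1].

    dense_grid : vertex positions parameterized on a regular UV lattice (e.g. 512 x 512).
    uv         : (M, 2) UV coordinates of the artist-mesh vertices.
    """
    r = dense_grid.shape[0]
    coords = np.clip(np.asarray(uv, float) * (r - 1), 0, r - 1)
    i0 = np.floor(coords).astype(int)
    i1 = np.minimum(i0 + 1, r - 1)
    f = coords - i0
    # Four surrounding grid vertices (grid indexed as [v, u]).
    p00 = dense_grid[i0[:, 1], i0[:, 0]]
    p10 = dense_grid[i0[:, 1], i1[:, 0]]
    p01 = dense_grid[i1[:, 1], i0[:, 0]]
    p11 = dense_grid[i1[:, 1], i1[:, 0]]
    wu, wv = f[:, 0:1], f[:, 1:2]
    return (p00 * (1 - wu) * (1 - wv) + p10 * wu * (1 - wv)
            + p01 * (1 - wu) * wv + p11 * wu * wv)
```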
III.4.4 Laplacian Regularization
After the optical flow update, we update the entire face mesh using Laplacian regularization, using
the position estimates from Section III.4.3 as a target constraint. We use the framework of [165],
wherein we update the mesh estimate as follows:
$$\hat{X} \leftarrow \operatorname*{argmin}_{X} \sum_{i \in S} \lambda_i \left\| x_i - \hat{x}_i \right\|^2 + \sum_{i \in S} \left\| L_S(x_i) - \delta_i \right\|^2 + \mu \sum_{i \in V} \left\| L_V(x_i) - \Delta_i \right\|^2, \tag{III.3}$$

Figure III.8: Laplacian regularization results. Left: surface regularization only. Right: surface and volumetric regularization.

where $\lambda_i = w \left( \gamma \sum_{k,l} {s^l_k}^2(p^k_i) + (1-\gamma) \sum_{j,k,l} r^k_j(q^j_i)\, s^l_k(G^k_j(q^j_i)) \right)$ is the constraint strength for vertex $i$ derived from the optical flow confidence, with $w = 15$ being an overall constraint weight, $S$ is the set of surface vertices, $L_S$ is the surface Laplace operator, $\delta_i$ is the surface Laplacian coordinate of vertex $i$ in the rest pose, $V$ is the set of volume vertices, $L_V$ is the volume Laplace operator, $\Delta_i$ is the volume Laplacian coordinate of vertex $i$ in the rest pose, and $\mu$ is a parameter balancing $L_S$ and $L_V$ as in [165]. In our framework, the rest pose is the common template in early phases,
or the personalized template in later phases. We solve this sparse linear problem using the sparse
normal Cholesky routines implemented in the Ceres solver [2]. We also estimate local rotation to
approximate as-rigid-as-possible deformation [128]. We locally rotate the Laplacian coordinate
frame to fit the current mesh estimate in the neighborhood of $\hat{x}_i$, and iterate the solve ten times to
allow the local rotations to converge. Figure III.8 illustrates the effect of including the volumetric
Laplacian term. While previous works employ only the surface Laplacian term [140], we find
the volumetric term is vitally important for producing good results in regions with missing or
occluded data such as the back of the head or the interior of the mouth, which are otherwise prone
to exaggerated extrapolation or interpenetration.
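A minimal sketch of the data-plus-Laplacian least-squares solve of Eq. (III.3), restricted to the surface term for brevity (the volumetric term adds analogous rows over the tetrahedral mesh), and using a uniform graph Laplacian with scipy's sparse least-squares routine rather than the Ceres-based solver described above.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def uniform_laplacian(edges, n):
    """Simple uniform graph Laplacian (degree matrix minus adjacency) from an edge list."""
    i, j = np.asarray(edges).T
    adj = sp.coo_matrix((np.ones(len(i)), (i, j)), shape=(n, n))
    adj = adj + adj.T
    deg = np.asarray(adj.sum(axis=1)).ravel()
    return (sp.diags(deg) - adj).tocsr()

def laplacian_regularize(targets, weights, L, delta):
    """Solve min_X sum_i w_i ||x_i - t_i||^2 + ||L X - delta||^2, one coordinate at a time.

    targets : (N, 3) flow-triangulated vertex positions (the x-hat of Eq. III.3).
    weights : (N,)  per-vertex confidences (the lambda_i of Eq. III.3).
    L       : (N, N) sparse surface Laplacian.
    delta   : (N, 3) Laplacian coordinates of the rest pose (template).
    """
    sqrt_w = np.sqrt(weights)
    A = sp.vstack([sp.diags(sqrt_w), L]).tocsr()     # stacked linear least-squares system
    X = np.zeros_like(targets, dtype=float)
    for c in range(3):                               # x, y, z decouple
        b = np.concatenate([sqrt_w * targets[:, c], delta[:, c]])
        X[:, c] = lsqr(A, b)[0]
    return X
```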
III.4.5 Updating Eyeballs and Eye Socket Interiors
Our template represents eyeballs as separate objects from the face mesh, with their own UV
texture parameterizations. We include the eyeball vertices in the optical flow based update, but
not the Laplacian regularization update. Instead, the eyes are treated as rigid objects, using the
closest rigid transform to the updated positions to place the entire eyeball. This alone does not
produce very good results, as the optical flow in the eye region tends to be noisy, partly due to
specular highlights in the eyes. To mitigate this problem, we do two things. First, we apply a
3 × 3 median filter to the face images in the region of the eye, using the current mesh estimate
to rasterize a mask. Second, after each Laplacian regularization step, we consider distance con-
straints connecting the eye pivot points $e_0$ and $e_1$ to the vertices on the entire outer eyelid surfaces, and additional distance constraints connecting the eye pivot points to the vertices lining the inside of the eye socket. These additional constraints appear as:
\[
\sum_{j=0}^{1} \Big[ \sum_{i \in E_j \cup O_j} \big( \|x_i - e_j\| - \|y_i - e_j^Y\| \big)^2 \;+\; \sum_{i \in E_j} \phi\big( \|x_i - e_j\|,\, r \big) \Big],
\]
where $E_0$ and $E_1$ are the sets of left and right eyelid vertices, $O_0$ and $O_1$ are the sets of left and right socket vertices, $e_0^Y$ and $e_1^Y$ are the eye pivot positions in the reference mesh, $\phi$ is a distance constraint with a non-penetration barrier defined as $\phi(a,b) = (a-b)^2 \exp(\kappa(b-a))$, with $\kappa = 10$ being a parameter controlling barrier fall-off, and $r$ is a hard constraint distance representing the radius of the eyeball plus the minimum allowed thickness of the eyelid. The target distance of each constraint is obtained from the reference mesh. We minimize an energy function including both the position (III.3) and the eye
erence mesh. We minimize an energy function including both the position (III.3) and the eye
pivot distance constraints, but update only the eye pivots, interior volume vertices, and vertices
lining the eye sockets, leaving the outer facial surface vertices constant. The distance constraints
render this a nonlinear problem, which we solve using the sparse Levenberg-Marquardt routines
implemented in the Ceres solver [2]. After the eye pivots are computed, we compute a rotation
that points the pupil towards the centroid of the iris vertex positions obtained in the optical flow
update. Although this scheme does not personalize the size and shape of the eyeball, there is
less variation between individuals in the eye than in the rest of the face, and we obtain plausible
results.
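The barrier-augmented distance constraint above can be written compactly; the sketch below (Python) mirrors the definition of $\phi$ with $\kappa = 10$, while the residual wrapper and its name are illustrative assumptions.

```python
import numpy as np

KAPPA = 10.0  # barrier fall-off parameter

def barrier_constraint(a, b):
    """phi(a, b) = (a - b)^2 * exp(kappa * (b - a)).

    Penalizes deviation of the current distance a from the target b, and grows
    rapidly as a drops below b, acting as a soft non-penetration barrier.
    """
    return (a - b) ** 2 * np.exp(KAPPA * (b - a))

def eyelid_residual(x_i, e_j, r):
    """Distance of eyelid vertex x_i to eye pivot e_j, compared against the
    eyeball radius plus minimum eyelid thickness r (hypothetical helper)."""
    return barrier_constraint(np.linalg.norm(x_i - e_j), r)
```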
III.5 PCA-Based Pose Estimation and Denoising
We next describe the pose estimation and denoising algorithm mentioned in Sections III.3.3 and
III.3.5. Estimating the rigid transformation for each frame of a performance serves several pur-
poses. First, it is useful for representing the animation results in standard animation packages.
Second, it allows consistent global 3D localization of the eyeballs, which should not move with
respect to the skull. Third, it enables PCA-based mesh denoising techniques that reduce high-
frequency temporal noise and improve consistency of occluded or unseen regions that are inferred
from the Laplacian regularization. A rigid stabilization technique such as [13] (also employed in [156]) could be used, which involves fitting an anatomical skull template and skin thickness models. Instead, we estimate rigid rotation without any anatomical knowledge, and then estimate
rigid translation using a simple eyelid thickness constraint. Section III.5.1 describes our novel
rotation alignment algorithm for deformable meshes which does not require knowledge of the
deformation basis. Section III.5.2 describes our novel translation alignment algorithm that simul-
taneously estimates per-mesh translation and globally consistent eyeball pivot placement. Section
III.5.3 then describes a straightforward PCA dimension reduction scheme.
We compare our rigid alignment technique to Procrustes analysis [59] as a baseline. Figure III.9
shows several frames from the evaluation of our technique on a sequence with significant head
motion and extreme facial expressions, including wide open mouth. The baseline method exhibits
significant misalignment, especially on wide open mouth expressions, which becomes more ap-
parent when globally consistent eye pivots are included. Our proposed technique stabilizes the
rigid head motion well, enabling globally consistent eye pivots to be employed without interpen-
etration.
III.5.1 Rotation Alignment
We represent the set of facial meshes as a $3N \times M$ matrix $X$, where each column of $X$ is a $3N$-dimensional vector $X_t$, $t = 1 \ldots M$, representing the interleaved $x$, $y$, and $z$ coordinates of the $N$ vertices of mesh $t$. We assume a low-dimensional deformation basis, and hence $X = BW$ in the absence of rigid transformations, where $B$ is a $3N \times K$ basis matrix with $K \ll M$, and $W$ is a $K \times M$ weight matrix with columns $W_t$ corresponding to the basis activations of each $X_t$ (we do not separately add the mean mesh in the deformation model, so it will be included as the first column of $B$). The trouble is that each $X_t$ may actually have a different rigid transform, so that $X_t = R_t B W_t + T_t$ for some unknown rotation $R_t$ and translation $T_t$, rendering $B$ and $W_t$ difficult to discover. Translation can be factored out of the problem by analyzing the mesh edges
Figure III.9: Comparison of rigid mesh alignment techniques on a sequence with significant head
motion and extreme expressions. Top: center view. Middle: Procrustes alignment. Bottom: our
proposed method. The green horizontal strike-through lines indicate the vertical position of the
globally consistent eyeball pivots.
$\tilde{X}_t$ rather than the vertices $X_t$, defining $\tilde{B}$ appropriately so that $\tilde{X}_t = R_t \tilde{B} W_t$, and defining a matrix $\tilde{X}$ having columns $\tilde{X}_t$. This eliminates $T_t$; however, the rotations $R_t$ still obfuscate the solution.
To solve this, we first roughly align the meshes using the Procrustes method so that any remaining relative rotations are small. Recall that for rotations of magnitude $O(\epsilon)$, composing rotations is equal to summing rotations up to an $O(\epsilon^2)$ error, as in the exponential map $R_t = I + r_t^x G_x + r_t^y G_y + r_t^z G_z + O(\|r_t\|^2)$, where $r_t$ is the Rodrigues vector corresponding to $R_t$ and $G_x$, $G_y$, $G_z$ are the generator functions for the $SO(3)$ matrix Lie group:
\[
G_x =
\begin{bmatrix}
0 & 0 & 0\\
0 & 0 & -1\\
0 & 1 & 0
\end{bmatrix},
\qquad
G_y =
\begin{bmatrix}
0 & 0 & 1\\
0 & 0 & 0\\
-1 & 0 & 0
\end{bmatrix},
\qquad
G_z =
\begin{bmatrix}
0 & -1 & 0\\
1 & 0 & 0\\
0 & 0 & 0
\end{bmatrix}. \tag{III.4}
\]
Defining $\mathbf{G}_x$ as a block diagonal matrix with $G_x$ repeated along the diagonal (and likewise for $y$, $z$), we construct the $3N \times 4M$ matrix $\hat{X} = \big[\tilde{X} \;\; \mathbf{G}_x\tilde{X} \;\; \mathbf{G}_y\tilde{X} \;\; \mathbf{G}_z\tilde{X}\big]$, forming a basis that spans the set of meshes (and hence deformation) as well as the local rotation neighborhood, up to the $O(\|r_t\|^2)$ error mentioned previously. Because the last three block columns of this matrix represent small rotation differentials, and composition of small rotations is linear, they lie in the same subspace as any rotational components of the first block column, and therefore performing column-wise principal component analysis on this matrix without mean subtraction separates deformation and rotation bases, as $\hat{X} = \hat{B}\hat{W}$. There are at most $M$ deformation bases in $\hat{B}$ and three times as many rotation bases, so we can assume that 3 out of 4 bases are rotational but need to identify which ones. To do this, we score each basis with a data weight and a rotation weight. The data weight for the basis at column $b$ in $\hat{B}$ is the sum of the squares of the coefficients in the first $M$ columns of row $b$ of $\hat{W}$, and the rotation weight is the sum of the squares of the coefficients in the last $3M$ columns of row $b$ of $\hat{W}$. The rotational score of column $b$ is then the rotation weight divided by the sum of the data weight and rotation weight. We assume the $3M$ columns with the greatest such score represent rotations of meshes and rotations of deformation bases. The remaining $M$ columns form a basis with deformation only (up to $O(\|r_t\|^2)$), which we call the rotation-suppressed basis $\tilde{B}_s$, and use it to suppress rotation by projecting $\tilde{X} \leftarrow \tilde{B}_s \tilde{B}_s^T \tilde{X}$. This may also contain a residual global rotation, so we compute another rigid alignment between the mean over all $\tilde{X}_t$ and (the edges of) our template mesh, and apply this rotation to update all $\tilde{X}_t$, making convergence possible. We iterate the entire procedure starting from the construction of $\hat{X}$ until convergence, which we usually observed in 10 to 20 iterations. Finally, we compute the closest rigid rotation between each original $\tilde{X}_t$ and the final rotation-suppressed $\tilde{X}_t$ to discover $R_t$.
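A compact sketch of one rotation-suppression iteration is shown below (Python with NumPy). The construction of the augmented matrix, the PCA-by-SVD without mean subtraction, and the basis scoring follow the description above; the function and variable names, and the use of SVD for the PCA step, are assumptions.

```python
import numpy as np

# SO(3) generators (Equation III.4).
G = np.array([
    [[0, 0, 0], [0, 0, -1], [0, 1, 0]],   # G_x
    [[0, 0, 1], [0, 0, 0], [-1, 0, 0]],   # G_y
    [[0, -1, 0], [1, 0, 0], [0, 0, 0]],   # G_z
])

def rotation_suppression_iteration(X_edges):
    """One iteration of the rotation/deformation separation.

    X_edges: (3E, M) matrix of interleaved x, y, z edge vectors for M meshes
             (edges rather than vertices, so translation is already factored out).
    Returns the rotation-suppressed edge matrix of the same shape.
    """
    E3, M = X_edges.shape
    edges = X_edges.reshape(-1, 3, M)                       # (E, 3, M)
    blocks = [X_edges]
    for k in range(3):
        # Apply the block-diagonal generator: rotate each 3-vector by G_k.
        blocks.append(np.einsum('ab,ebm->eam', G[k], edges).reshape(E3, M))
    X_aug = np.concatenate(blocks, axis=1)                  # (3E, 4M)

    # Column-wise PCA without mean subtraction via SVD: X_aug = U @ W_hat.
    U, s, Vt = np.linalg.svd(X_aug, full_matrices=False)
    W_hat = s[:, None] * Vt
    data_w = np.sum(W_hat[:, :M] ** 2, axis=1)
    rot_w = np.sum(W_hat[:, M:] ** 2, axis=1)
    score = rot_w / (data_w + rot_w)

    # Keep the M bases with the lowest rotational score as the deformation basis.
    keep = np.argsort(score)[:M]
    B_s = U[:, keep]
    return B_s @ (B_s.T @ X_edges)
```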
III.5.2 Translation Alignment
After $R_t$ and $W_t$ are computed using the method in Section III.5.1, there remains a translation ambiguity in obtaining $B$. We define the rotationally aligned mesh $A_t = R_t^T X_t$, and we define the aligned-space translation $\tau_t = R_t^T T_t$; thus, $A_t = B W_t + \tau_t$. We compute the mean of $A_t$ and compute and apply the closest rigid translation to align it with the template mesh, denoting the result $\bar{A}_t$. We then wish to discover $\tau_t$, $t = 1 \ldots M$ such that each $A_t - \tau_t$ is well aligned to $\bar{A}_t$. Since our model has eyes, we also wish to discover globally consistent eye pivot points in an aligned head pose space, which we call $e_0$ and $e_1$, and allow the eyes to move around slightly relative to the facial surface, while being constrained to the eyelid vertices using the same distance constraints as in Section III.4.5, in order to achieve globally consistent pivot locations. We find $e_0$, $e_1$, and $\tau_t$, $t = 1 \ldots M$ minimizing the following energy function using the Ceres solver [2]:
\[
\sum_{t=1}^{M} \Big[ \sum_{i \in S} \rho\big( \|a_i^t - \tau_t - \bar{a}_i^t\| \big) \;+\; \sum_{j=0}^{1} \sum_{i \in E_j} \phi\big( \|a_i^t - \tau_t - e_j\|,\, \|y_i - e_j^Y\| \big) \Big], \tag{III.5}
\]
where $a_i^t$ is a vertex in mesh $A_t$ (and $\bar{a}_i^t$ in $\bar{A}_t$), and $\rho$ is the Tukey biweight loss function tuned to ignore cumulative error past 1 cm. With $\tau_t$ computed, we let $T_t = R_t \tau_t$.
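For reference, the Tukey biweight loss used in this energy has the standard form sketched below (Python); mapping the 1 cm cutoff to the tuning constant c is an assumption about how the solver's loss is configured.

```python
import numpy as np

def tukey_biweight(r, c=1.0):
    """Tukey biweight loss rho(r): quadratic near zero, constant beyond |r| >= c.

    r: residual (e.g., alignment error in cm); c: cutoff past which additional
    error contributes nothing, so gross outliers such as occluded or badly
    matched vertices do not dominate the solution.
    """
    r = np.asarray(r, dtype=float)
    inside = np.abs(r) < c
    rho = np.full_like(r, (c ** 2) / 6.0)
    rho[inside] = (c ** 2 / 6.0) * (1.0 - (1.0 - (r[inside] / c) ** 2) ** 3)
    return rho
```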
III.5.3 Dimension Reduction
With $R_t$ and $T_t$ computed for all meshes in Sections III.5.1 and III.5.2, we may remove the relative rigid transforms from all meshes to place them into an aligned pose space. We perform a weighted principal component analysis with vertices weighted by the mean of the confidence $\lambda_i$ (see Section III.4.4), producing the basis B and weight matrix W. We truncate the basis to reduce noise and inconsistencies across poses in areas of ambiguous matching to the shared template, and in areas of insufficient data that are essentially inferred by the Laplacian prior, such as the back and top of the head.
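A minimal sketch of this confidence-weighted PCA and truncation is given below (Python with NumPy); the exact weighting scheme, mean handling, and number of retained components are assumptions.

```python
import numpy as np

def weighted_pca_denoise(X_aligned, vertex_conf, n_components=50):
    """Denoise aligned meshes with a truncated, confidence-weighted PCA.

    X_aligned:   (3N, M) matrix of rigidly aligned meshes (columns are meshes).
    vertex_conf: (N,) mean per-vertex confidences (lambda_i averaged over frames).
    Returns the rank-reduced reconstruction of X_aligned.
    """
    # Weight the x, y, z rows of each vertex alike; epsilon avoids division by zero.
    w = np.repeat(np.sqrt(vertex_conf + 1e-8), 3)[:, None]
    Xw = w * X_aligned
    U, s, Vt = np.linalg.svd(Xw, full_matrices=False)
    k = n_components
    Xw_low = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # truncated basis reconstruction
    return Xw_low / w                                # undo the row weighting
```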
III.6 Results
We demonstrate our method for dynamic facial reconstruction with five subjects - three male and
two female. The first four subjects were recorded in an LED sphere under flat-lit static blue lighting. The blue light gives us excellent texture cues for optical flow. We used 12 Ximea monochrome 2048 × 2048 machine vision cameras, as seen in Figure III.10. We synchronized the LEDs with the cameras at 72 Hz, and only exposed the camera shutter for 2 ms to eliminate motion blur as much
as possible. Also pulsing the LEDs for a shorter period of time reduces the perceived brightness
Figure III.10: 12 views of one frame of a performance sequence. Using flat blue lighting provides
sharp imagery.
to the subject, and is more suitable for recording natural facial performance. Though we captured at 72 Hz, we only processed 24 Hz in the results to reduce computation time.
Figure III.11 shows intermediate results from each reconstruction step described in Sections
III.3.2 through III.3.4. This is a particularly challenging case as the initial facial landmark detec-
tor matches large-scale facial proportions, but struggles in the presence of facial hair that partially
occludes the lips and teeth. Artifacts remain even after dense optical flow in Section III.3.2. PCA
based denoising (III.3.3) and fine-scale consistent mesh propagation (III.3.4) fill in more accurate
mouth and eye contours that agree with inset photographs.
An alternative approach would be to initialize image-based mesh warping with a morphable model
[134] in place of the Digital Emily template (Figure III.12). To perform this comparison, we fitted
a morphable model to multi-view imagery of a male subject (Figure III.13(a)) and filled in the rest
Figure III.11: Zoomed renderings of different facial regions of three subjects. Top: results after
coarse-scale fitting using the shared template (Section III.3.2). The landmarks incorrectly located
the eyes or mouth in some frames. Middle: results after pose estimation and denoising (III.3.3).
Bottom: results after fine-scale consistent mesh propagation (III.3.4) showing the recovery of
correct shapes.
of the head topology using Laplacian mesh deformation (Figure III.12(b)). We then generated
synthetic renderings of the morphable model using estimated morphable albedo and inpainted
textures. Our technique reconstructs more geometric details, such as the nasolabial fold on Figure
III.12(c), that are not captured well by the linear deformation model in Figure III.12(a), perhaps
because our method employs stereo cues from multi-view data. Figure III.13(b) illustrates the
warped image to match (e) in the same camera, compared to the warped image (d) starting from
the Digital Emily model (c) in a similar camera view.
Our final geometry closely matches fine details in the original photographs. Figure III.14 (a) and
(b) show automatically deformed high quality topology by overlaying half the face onto calibrated
cameras, showing high quality agreement with the input image. Figure III.14 also shows details
captured on another subject in (c) a scrunched expression, and (d) a close-up of (c). The close-up image shows the mesoscopic details, such as pores and fine wrinkles, reconstructed with our dynamic detail enhancement technique in Section III.3.6.
Figure III.15 shows several frames from a highly expressive facial performance reconstructed
on a high quality template mesh using our pipeline. The reconstructed meshes shown in wire-
frame with and without texture mapping indicate good agreement with the actual performance as
well as texture consistency across frames. The detail enhancement from Section III.3.6 produces
high-resolution dynamic details, such as pores, forehead wrinkles, and crow's feet, adding greater
fidelity to the geometry.
We directly compare our method with [14] based on publicly available video datasets as shown
in Figure III.16. Our method is able to recover significantly greater skin detail and realistic fa-
cial features, particularly around the mouth, eyes, and nose, as well as completing the full head.
Our system also does not rely on temporal flow, making it easier to parallelize each frame in-
dependently for faster processing times. Figure II.2 also illustrates the accuracy of our method
Figure III.12: Comparison using a morphable model of [134] as an initial template. (a) Front face
region captured by the previous technique, (b) stitched on our full head topology. (c) Resulting
geometry from III.3.2 deformed using our method with (b) as a template, compared to (d) the
result of using the Digital Emily template. The linear morphable model misses details in the
nasolabial fold.
compared to the multi-view reconstruction method of [45]. Though we never explicitly compute a
point cloud or depth map, the optical flow computation is closely related to stereo correspondence
and our result is a very close match to the multi-view stereo result (as indicated by the speckle
pattern apparent when overlaying the two meshes with different colors). Unlike [45], our tech-
nique naturally fills in occluded or missing regions, such as the back of the head, and provides
consistent topology across subjects and dynamic sequences. We can reconstruct a single static
frame, as shown in Figure III.4 or an entire consistent sequence.
Since our method reconstructs the shape and deformation on a consistent UV space and topology,
we can transfer attributes, such as appearance or deformation, between subjects. Figure III.17
shows morphing between facial performances of three subjects, with smooth transition from one
subject to the next. Unlike previous performance transfer techniques, the recovered topology is
inherent to the reconstruction and does not require any post processing.
Figure III.13: (a) Synthetic rendering of the morphable model from Figure III.12(b). (b) Result
using our image warping method to warp (a) to match real photograph (e). Similarly the com-
mon template image (c) is warped to match (e), producing plausible coarse-scale facial feature
matching in (d).
III.7 Limitations
While our technique yields a robust system and provides several benefits compared to existing
techniques, it has several limitations. Initial landmark detection may incorrectly locate a land-
mark. For example, sometimes facial hair is interpreted as a mouth. Therefore, improvements
in landmark detection would help here. The coarse-scale template alignment fails in some areas
when the appearance of the subject and the template differ significantly, which can happen in the
presence of facial hair or when the tongue and teeth become visible, as they are not part of the
template (see Figure III.18). While these errors are often mitigated by our denoising technique,
in the future it would be of interest to improve tracking in such regions by providing additional
semantics, such as more detailed facial feature segmentation and classification, or by combining
tracking from more than one template to cover a larger appearance space. While our appearance-
driven mesh deformation warps the image and deforms the personalized template progressively
closer to the solution, registration error could still occur under significant appearance change.
This could particularly occur around the eyes and the mouth due to occlusion. Previous works
are also susceptible to such occlusion artifacts. Our facial surface details come from dynamic
high-frequency appearance changes in the flat-lit video, but as with other passive illumination
techniques, they miss some of the facial texture realism obtainable with active photometric stereo
processes, such as in [55]. If such an active-illumination scan of the subject could be used as
the template mesh, our technique could propagate its high-frequency details to the entire perfor-
mance, and dynamic skin microgeometry could be simulated as in [109]. Furthermore, it would
be of interest to allow the animator to conveniently modify the captured performances. This could
be facilitated by identifying sparse localized deformation components as in [110] or performance
morphing techniques as in [104].
III.8 Discussion
We have presented an entirely automatic method to accurately track facial performance geom-
etry from multi-view video, producing consistent results on an artist-friendly mesh for multiple
subjects from a single template. Unlike previous works that employ temporal optical flow, our
approach simultaneously optimizes stereo and consistency objectives independently for each in-
stant in time. We demonstrated an appearance-driven mesh deformation algorithm that leverages
landmark detection and optical flow techniques, which produces coarse-scale facial feature con-
sistency across subjects and fine-scale consistency across frames of the same subject. Furthermore, we demonstrated a displacement map estimation scheme that compares the UV-space texture of
each frame against an automatically selected neutral frame to produce stronger displacements in
dynamic facial wrinkles. Our method operates solely in the desired artist mesh domain and does
not rely on complex facial rigs or morphable models. While performance retargeting is beyond
the scope of this work, performances captured using our proposed pipeline could be employed
as high-quality inputs into retargeting systems such as [92]. To our knowledge, this is the first
method to produce facial performance capture results with detail on par with multi-view stereo
and pore-level consistent parameterization without temporal optical flow, and could lead to inter-
esting applications in building databases of morphable characters and simpler facial performance
capture pipelines.
Figure III.14: Our method automatically reconstructs dynamic facial models from multi-view
stereo with consistent parameterization. (a) and (b) Facial reconstruction with artist-quality mesh
topology overlaid on the input video. (c) Reconstructed face model with a displacement map
estimated from details in the images. (d) Close-up of fine details, such as pores and dynamic
wrinkles from (c).
Figure III.15: Dynamic face reconstruction from multi-view dataset of a male subject shown
from one of the calibrated cameras (top). Wireframe rendering (second) and per frame texture
rendering (third) from the same camera. Enhanced details captured with our technique (bottom)
shows high quality agreement with the fine-scale details in the photograph.
Figure III.16: Reconstructed mesh (a), enhanced displacement details with our technique (b),
and comparison to previous work (c). Our method automatically captures whole head topology
including nostrils, back of the head, mouth interior, and eyes, as well as skin details.
Figure III.17: Since our method reconstructs the face on a common head topology with coarse-
scale feature consistency across subjects, blending between different facial performances is easy.
Here we transition between facial performances from three different subjects.
Figure III.18: Our system struggles to reconstruct features that are not represented by the template.
For example, visible facial hair or tongue (a) may cause misplacement of the landmarks employed
in III.3.2 (b), which the denoising in III.3.3 may not be able to recover (c), and remain as artifacts
after fine-scale warping in III.3.4 (d).
III.9 Applications
In this chapter we have presented a technique that can automatically correspond multiple indi-
viduals with arbitrary facial expressions and challenging head poses from a single template, and
multi-view data. Such a technique could be an excellent tool to produce high quality corresponded
data, and could open an avenue for new research directions in high quality digital avatar creation.
The corresponded high quality geometry could be an excellent source for building shape and mo-
tion priors. For example, high quality blendshapes could be built from the correspondence on the
artist-friendly topology and the high quality shape comparable to multi-view stereo. The captured
whole head topology could help build complete statistical facial models, such as a morphable
model [20] including eyeballs, eyelids, back of the head, and the mouth interior. Corresponded
data is also suitable for data compression, such as PCA for a lightweight representation, which
could be useful for real-time applications, such as video games, and VR. For facial animation,
a mesh sequence obtained on clean artist-friendly parameterization could help optimize the face
topology to more efficiently represent facial deformation as done in previous work [106].
Figure III.19: From left to right: unconstrained input image, inferred complete albedo, rendering
with a commercially available rendering package, and its close up showing the medium frequency
skin pigmentations captured with the technique.
Since the process also solves for the dense correspondence, the corresponded texture data could
be used for building reflectance priors. Figure III.19 shows the results of [124], which presented
a technique to infer a complete facial albedo texture from a single in-the-wild image using a deep neural network and a multi-scale texture correlation analysis. While that work employed publicly available datasets for learning the albedo, high quality data captured in a controlled environment could improve the quality of the result. Furthermore, when our correspondence technique
is combined with the active photometric stereo techniques [55], it could produce a wider vari-
ety of reflectance datasets including specular albedo, subsurface scattering, and high resolution
displacement maps.
Chapter IV
Estimating 3D Motion from Multi-view Human Performance
IV.1 Introduction
Motion, along with geometry and light, is a fundamental element of the computer graphics
pipeline. The ability to capture motion as a measurement, the way we measure geometry through
depth scanning, and light through imaging, has revolutionized the state-of-the-art in animation,
simulation, motion control, and many other core areas of graphics. Yet, unlike geometry and
light, motion capture has remained a sparsely sampled signal, usually achieved by tracking mark-
ers across time [152, 62, 17, 68]. As markers increase in proximity, their interference progres-
sively limits the signal that can be measured. This limitation is particularly problematic because
motion is a multi-scale signal. Subtle and momentary motion, such as a wrinkle in cloth or a
subtle smirk, often contains information of as much perceptual value as larger-scale deformation.
Recent research into densifying the sampling of 3D motion by tracking image keypoints in high-
resolution cameras has, nearly ubiquitously, employed spatial and/or temporal priors to regularize
3D motion estimates [14, 141, 33].
Our point of departure from previous work is that we consider the use of priors for motion es-
timation to be harmful. We base this assertion on three observations about most priors used in
literature: (1) Priors have a normalizing effect that can dampen details and peculiarities that exist
in the data—often precisely what we want to model or simulate. (2) Priors operate uniformly on
the signal, and current approaches use heuristics to determine where they should be turned on or
off or how they should be weighted. In statistical analysis, principled application of such weight-
ing is learned from unbiased measurements from the sample set, which in turn requires techniques
for prior-free measurement. (3) Priors are usually global terms that link estimation into a single
global optimization, straining memory and compute resources for high density point sets.
In this work, we present a dynamic surface reconstruction algorithm for multi-view capture that is designed to minimize the necessity of prior information in the estimation of dense
motion data. We use a capture system consisting of a large array of high-resolution cameras
to maximize the raw signal that is recorded and to minimize the influence of occlusion (Sec-
tion IV .4). In conjunction, we present a tracking method for generic deformable surfaces from
multi-view image sequences (Section IV .3). Our tracking method does not rely on any priors such
as statistical shape models on target shape/motion or even spatial and/or temporal smoothness. We
demonstrate that through a carefully designed objective function, stable motion estimation can be
achieved without any priors (Section IV .4). The prior-free method has significant generalizability,
and can track not only facial performance, but also subtle skin deformation and interaction of
multiple objects such as a hand folding cloth (Figure VII.1). A significant feature of the prior-free
approach is scalability: our tracking method computes a small-scale independent optimization
per point that can run in parallel, and achieves efficiency and scalability with multiple CPUs and
GPUs. We show the accuracy of our method quantitatively using synthetic facial performance
data, and demonstrate the same algorithm running in several cases, from skin deformation to
cloth folding.
Figure IV.1: An example 40-camera array setup; each of the four rows includes 10 cameras almost uniformly distributed in azimuth angle, covering ear to ear in the case of human face capture. Four
lower cameras are zoomed in to improve the reconstruction around the mouth cavity.
IV.2 System Setup and Preprocessing
The capture system used in this work consists of 40 hardware-synchronized machine vision cam-
eras recording 5120 by 3840 pixel resolution (20 mega-pixel) RGB images at 30 frames per
second and an exposure time of 4ms, surrounding an object at the center lit uniformly from static
LED light sources. An example camera arrangement is shown in Figure IV .1. Intrinsic and ex-
trinsic calibration parameters for each of the cameras are computed using a planar checkerboard
pattern [163].
At every time instance, we perform depth estimation from the 40 cameras using multiview stereo
techniques. In particular, we employ a variant of the PatchMatch-based multiview stereo algo-
rithm proposed by Galliani et al. [51].
To further refine the accuracy of the estimated depth, our implementation incorporates a continuous optimization framework inspired by the GPU-based refinement method of [158]. We employ a
similar optimization strategy that divides each image into smaller regions and runs the Gauss-
Newton optimization for each region. Finally, we obtain a mesh from the fused point clouds
using Poisson reconstruction [84]. Figs. VII.1 (a) and (c) show reconstructions with our method
(right) with corresponding input RGB photographs (left). As can be seen, the reconstruction cap-
tures mesoscopic details including wrinkles in the skin, and detailed folding of the clothing from
passive inputs. Our scene flow method purely operates on points without relying on particular
topology or faces from the Poisson reconstruction, as detailed in the next section. We use the reconstruction mesh only for visualization and for rendering continuous visibility and per-view depth maps, which are useful for the scene flow estimation.
IV.3 Estimating 3D Scene Flow
In the multi-view setup with 40 views focused on the object, the method described above can yield
high fidelity per-frame 3D reconstructions. As such, we separate per-frame 3D reconstruction
from 3D motion fields estimation, whereby we may make use of the reconstruction as manifold
constraints on which the scene flow vectors are constrained to lie. This greatly simplifies the
optimization, and achieves convergence to an accurate solution without relying on regularization.
Although methods that simultaneously estimate scene flow and shape exist (see for example [10]),
these incur additional complexity and do not improve accuracy in reconstruction when sufficient
views are available. In the following, we outline our scene flow method, which individually estimates the temporal 3D trajectory of each vertex of the 3D reconstruction in Section IV.4, starting from
visible points on the starting frame.
For each vertex $X$ in the 3D reconstruction at the initial frame, we directly compute the 3D scene flow $V$ of $X$ by minimizing the sum of the per-pixel photometric consistency term $E_{\mathrm{photo}}$ and optionally the geometric consistency term $E_{\mathrm{geo}}$ over all cameras $C$:
\[
V^* = \arg\min_{V} \sum_{c \in C} \sum_{x \in N(x_c)} E^c_{\mathrm{photo}}(x) + \lambda\, E^c_{\mathrm{geo}}(x), \tag{IV.1}
\]
where $x_c = \Pi_c(X)$ with $\Pi_c$ the perspective projection of camera $c$, $N(x_c)$ is a 2D patch around $x_c$, and $\lambda$ is a weight to balance the two terms. Without loss of generality, we refer to the starting and ending frames of tracking as $i$ and $j$. Each term is formulated as:
\[
E^c_{\mathrm{photo}}(x) = \big\| w_c(x)\,\big(I^c_i(x) - I^c_j(x + u_c(V))\big) \big\|^2, \tag{IV.2}
\]
\[
E^c_{\mathrm{geo}}(x) = \big\| w_c(x)\,\big(G^c_i(x) + V - G^c_j(x + u_c(V))\big) \big\|^2, \tag{IV.3}
\]
where $I^c$ and $G^c$ are respectively the pixel values of the image for camera $c$ and a position map containing in its RGB channels the 3D world location of each point on the mesh rendered from camera $c$. $u_c$ is the 2D optical flow for camera $c$, which is shown here for explanatory purposes but is never explicitly computed in practice. The 2D optical flow $u_c$ is the projection of the 3D scene flow $V$ for a particular camera $c$, i.e., $u_c(V) = \Pi_c(X + V) - x_c$. $w_c$ is a spatially varying weight shared by the photometric and geometric consistency terms, discussed in Section IV.3.2. The photometric consistency term, $E_{\mathrm{photo}}$, penalizes the difference in appearance between the image pair at frames $i$ and $j$, and ensures point correspondence between the different time instances. The geometric consistency term, $E_{\mathrm{geo}}$, favors solutions that lie on the manifold of the reconstructed surface by penalizing the geometric difference between the reconstructions at frames $i$ and $j$ modulo the scene flow $V$. We omit view-to-view photometric consistency as it has already been used in the previous stage to compute the per-frame 3D reconstructions (see Section IV.4). In the multiview setting, the photometric consistency term $E_{\mathrm{photo}}$ is enough to produce 3D motion estimates, but we found
it slightly more accurate when both terms are used. In both terms, it is assumed that the object
is far from the cameras relative to their focal length and that rotations of the patches between
frames are reasonably small. This simplifies the formulation by considering only fixed rectangular
patch sizes and orientations across time. The objective function in Equation (IV .1) is minimized
using Gauss-Newton (GN) optimization in a coarse-to-fine manner with image and position map
pyramids.
The objective in Equation (IV .1) is non-linear with many local minima and suffers from outliers
arising from occlusion and appearance changes. Most of the literature deals with this using priors, but
our goal is to measure the motion with as little bias as possible in the data. To robustify the
optimization, and maximize accuracy in the measurement, we employ the following strategies for
prior-free scene flow estimation. In Section IV .3.1, we present a robust descriptor which has a
wider basin of attraction that encourages the convergence to a global minimum in a coarse-to-fine
image pyramid. In Section IV .3.2 we describe our generic scheme to ignore artifacts and outliers
in the data that accelerates the convergence. Finally in Section IV .3.3 we propose a way to
adaptively scale features based on the photometric reconstruction error that encourages consistent
motion estimates without forcing smoothing on the solution.
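To make the per-vertex objective concrete, the sketch below (Python with NumPy) evaluates the photometric and geometric residuals of Equations (IV.2) and (IV.3) for one candidate scene flow V; the image sampling, the camera data structure, and the single-channel image assumption are simplifications, and the real system minimizes this objective with Gauss-Newton on GPUs rather than evaluating it directly.

```python
import numpy as np

def sample(img, xy):
    """Nearest-neighbor lookup of an (H, W) or (H, W, C) image at continuous coords."""
    h, w = img.shape[:2]
    x = np.clip(np.round(xy[..., 0]).astype(int), 0, w - 1)
    y = np.clip(np.round(xy[..., 1]).astype(int), 0, h - 1)
    return img[y, x]

def per_vertex_energy(X, V, cams, patch, lam=1.0):
    """Sum of E_photo + lam * E_geo over all cameras for a single vertex.

    X, V:  (3,) vertex position and candidate scene flow.
    cams:  list of dicts with 'project' (3D -> 2D pixel), single-channel images
           'I_i', 'I_j', (H, W, 3) position maps 'G_i', 'G_j', and (H, W)
           per-pixel weights 'w' (assumed layout).
    patch: (P, 2) array of pixel offsets defining the local patch N.
    """
    total = 0.0
    for cam in cams:
        x_c = cam['project'](X)                 # projection of X in this view
        u_c = cam['project'](X + V) - x_c       # induced 2D flow for this view
        px_i = patch + x_c                      # patch pixels at frame i
        px_j = patch + x_c + u_c                # corresponding pixels at frame j
        w = sample(cam['w'], px_i)              # (P,) spatially varying weights
        e_photo = w * (sample(cam['I_i'], px_i) - sample(cam['I_j'], px_j))
        g_diff = sample(cam['G_i'], px_i) + V - sample(cam['G_j'], px_j)
        e_geo = w[:, None] * g_diff
        total += np.sum(e_photo ** 2) + lam * np.sum(e_geo ** 2)
    return total
```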
IV.3.1 Robust Descriptor
The naive pixel intensity for the photometric consistency objective in Equation (IV .1) is too sen-
sitive to changes in appearance, which is a commonly encountered problem during image-based
capture. To combat the problem, we employ the following descriptor derived from the image
gradient:
\[
\tilde{I}(x) = \max\!\big(D^T \nabla I(x),\, 0\big), \tag{IV.4}
\]
where $\max(\cdot)$ is an element-wise max operator, $\nabla I$ is the image gradient computed with the Scharr operator, and $D$ encodes the directions of the gradient. The advantage of using gradient information instead of raw intensities is invariance to low-frequency appearance changes due to shading and albedo changes (e.g. blood flow). Our descriptor is computed with Equation (IV.4) using multiple gradient directions covering the entire range of orientations at each pixel $x$, inspired by SIFT [98] and HoG [36]. In our work, we compute the gradient for every $45^\circ$ (i.e. 8 directions over $2\pi$), that is, $D_n = (\cos(n\pi/4), \sin(n\pi/4))^T$, where $D_n$ is the $n$-th column of $D$. The proposed descriptor shares a similar advantage with the distribution field (DF) representation [121] in that it produces a large basin of convergence that guides the gradient-based optimization
to a correct minimum in an image pyramid. An illustration of the proposed descriptor is shown in
Figure IV .2. The eight images on the right top corner show the normalized response of the robust
descriptor in the eight directions recorded in a separate channel, illustrated in a false color, with
input photographs shown on the left side. The key advantage of our descriptor over the commonly
used gradient descriptor is that since our descriptor (first two figures in the fourth row) only keeps
track of the positive response to the corresponding orientation, the peculiar properties in the image
are preserved during smoothing in the image pyramid while it produces a wider and well-behaved
optimization landscape with few local minima. On the other hand, the naive gradient of the image
(the latter two images in the fourth row) contains both positive/negative responses, and details in
the image can be easily lost by smoothing over different features.
In computing $\nabla I$, we found it important to exclude pixels associated with background and occlu-
sion boundaries as these generate strong gradient responses, but correspond to transient features
not attached to a geometrically consistent position between time instances. For example, the oc-
clusion boundary can move between frames, corresponding to different locations on the surface.
To remove these pixels, we build a foreground/background segmentation mask in each view using
the reconstructed mesh, and attenuate the corresponding pixels accordingly. We also apply a very small Gaussian blur to the image to remove camera noise, as the gradient is sensitive to high-frequency information.
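A sketch of this descriptor computation is shown below (Python with OpenCV and NumPy). The Scharr gradient and the eight fixed directions follow the description above, while the blur amount, normalization, and masking details are simplified assumptions.

```python
import cv2
import numpy as np

def robust_descriptor(gray, blur_sigma=0.8):
    """Compute the 8-channel gradient-direction descriptor of Equation (IV.4).

    gray: single-channel image. Returns an (H, W, 8) array where channel n holds
    the positive response of the gradient along direction n * 45 degrees.
    """
    # Small Gaussian blur to suppress camera noise before differentiation.
    img = cv2.GaussianBlur(np.float32(gray), (0, 0), blur_sigma)
    gx = cv2.Scharr(img, cv2.CV_32F, 1, 0)
    gy = cv2.Scharr(img, cv2.CV_32F, 0, 1)
    grad = np.stack([gx, gy], axis=-1)                        # (H, W, 2)
    angles = np.arange(8) * np.pi / 4.0
    D = np.stack([np.cos(angles), np.sin(angles)], axis=0)    # (2, 8) directions
    # Keep only the positive projection onto each direction.
    return np.maximum(grad @ D, 0.0)
```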
Figure IV .2: Robust descriptors. Top and middle: input image and our 8-dimensional robust
descriptors. The arrows indicate directional vectors in D. Bottom: comparisons between our
robust descriptor (one of the eight directions is shown) and $\|\nabla I\|$ at the middle and coarsest levels. Note that $\|\nabla I\|$ is used for visualization purposes, while $\nabla I$ is used in typical approaches.
IV.3.2 Outlier Pruning
Patch-based deformable surface tracking in general is susceptible to occlusion, non-rigid motion, and artifacts in the data (e.g. blur that makes it difficult to match the appearance in the patch). Pruning outliers from the data is key for stable tracking and faster convergence to an accurate solution. To that end, we use the following measures:
\[
w = w_{\mathrm{rig}} \, w_{\mathrm{sup}} \, w_{\mathrm{vis}}, \tag{IV.5}
\]
which are detailed in the following paragraphs. These weights are updated in each iteration of the Gauss-Newton optimization based on the reconstruction error, which influences the successive iterations of the optimization.
Figure IV .3: An illustration of the rigidity weight. The rigidity weight penalizes the large depth
discrepancy due to non-rigid deformation within the patch. We derive the rigidity weight (c) from
the depth similarities between the source (a) and the target (b) patches as in Equation (IV .6). Here,
the higher rigid weight is assigned to the upper lip area while it is lower in the lower lip.
Rigidity  The rigidity weight $w_{\mathrm{rig}}$ accounts for local rigidity inside the patch $N$. Equations (IV.2) and (IV.3) assume that the motion between the two frames is rigid, which is not necessarily true, particularly when the receptive window corresponds to a relatively large part of the surface at the coarse scales of the optimization. This weight is computed as:
\[
w^c_{\mathrm{rig}}(x) = \exp\!\Big(-\tfrac{1}{2}\,\big\|\Delta X_c(x)\big\|^2 / \sigma_{\mathrm{rig}}^2\Big), \tag{IV.6}
\]
where $\Delta X_c(x) = G^c_i(x) + \hat{V} - G^c_j(x + u_c(\hat{V}))$ with the current scene flow estimate $\hat{V}$, and $\sigma_{\mathrm{rig}}$ is a user-specified weight that scales the influence of the weight; the remaining $\sigma$ parameters below are defined analogously.
Figure IV .4: An illustration of the support weight. The support weight prefers the pixel neighbor
which has a similar feature to the center pixel in order to disambiguate pixel neighbors corre-
sponding to a semantically/geometrically different point than the center (e.g. the patch in the left
image contains points corresponding to the nose and the eyes, which may deform differently).
We employ the depth of the patch pixel as the similarity measure, and derive the support weight
shown in the lower right using the Equation (IV .7).
Support  A patch $N$ may contain a surface that corresponds to semantically/geometrically different points of an object. For example, the patch in the left and top right images of Figure IV.4 includes pixels of the eyes, which may move differently than the center pixel (i.e. a point on the nose), thereby influencing the optimization. We are interested in the motion of the pixels that have similar quality to the point of interest, i.e. the patch center pixel $x_c$. Inspired by [161], we approximate the semantic/geometric neighbors using the similarity of the depth values in the patch $N$. Namely, the support weight $w_{\mathrm{sup}}$ respects the pixels with a depth value similar to the point of interest $x_c$, and is computed as:
\[
w^c_{\mathrm{sup}}(x) = \exp\!\Big(-\tfrac{1}{2}\,\big\|G^c_i(x) - G^c_i(x_c)\big\|^2 / \sigma_{\mathrm{sup}}^2\Big). \tag{IV.7}
\]
Visibility  Pixels with a grazing angle often cause outliers in the optimization. The visibility weight $w_{\mathrm{vis}}$ accounts for these outliers and is calculated as:
\[
w^c_{\mathrm{vis}}(x) = \cos^2\theta_c, \tag{IV.8}
\]
where $\cos\theta_c = n^T(C - X)/\|C - X\|$ with the point normal $n$ and the position $C$ of camera $c$.
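The three weights can be evaluated directly from the position maps and normals; the sketch below (Python with NumPy) follows Equations (IV.6)–(IV.8), with the sigma values, array layouts, and the clamping of back-facing points being assumptions.

```python
import numpy as np

def rigidity_weight(G_i, G_j_warped, V_hat, sigma_rig=5.0):
    """w_rig per patch pixel (Equation IV.6): penalize position discrepancy under
    the current scene flow estimate. Inputs are (P, 3) sampled position maps."""
    d = G_i + V_hat - G_j_warped
    return np.exp(-0.5 * np.sum(d ** 2, axis=-1) / sigma_rig ** 2)

def support_weight(G_i, g_center, sigma_sup=5.0):
    """w_sup per patch pixel (Equation IV.7): prefer pixels geometrically close
    to the patch center."""
    d = G_i - g_center
    return np.exp(-0.5 * np.sum(d ** 2, axis=-1) / sigma_sup ** 2)

def visibility_weight(normal, cam_center, X):
    """w_vis for the vertex (Equation IV.8): down-weight grazing views.
    Back-facing points are clamped to zero (an assumption)."""
    v = cam_center - X
    cos_theta = float(normal @ v) / (np.linalg.norm(v) + 1e-12)
    return max(cos_theta, 0.0) ** 2

def combined_weight(G_i, G_j_warped, V_hat, g_center, normal, cam_center, X):
    """Combined per-pixel weight for one camera (Equation IV.5)."""
    return (rigidity_weight(G_i, G_j_warped, V_hat)
            * support_weight(G_i, g_center)
            * visibility_weight(normal, cam_center, X))
```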
IV.3.3 Data-driven Multi-scale Propagation
The above robust descriptor and the outlier pruning significantly improve the tracking when combined with image pyramids. The advantage of a coarse-to-fine optimization is that the objective function at coarser levels has fewer local minima due to smoothing of the data, guiding the solution toward a better local minimum. As the receptive window halves at each successive level, more local and high-frequency features emerge, and the optimization seeks more local matching. However, under large deformation, the solution might be too far away, or the information in the patch may not be reliable due to appearance change, leading to instability in the gradient-based optimization. We further exploit the advantage of a coarse-to-fine optimization, and propose a data-driven technique that propagates information across multiple pyramid levels to better condition the optimization. Consider the regular GN normal equation, defined as follows:
\[
H^l_n\, \Delta V = g^l_n, \tag{IV.9}
\]
where $H^l_n$ and $g^l_n$ are the Hessian matrix and gradient vector at the $n$-th iteration of the GN optimization at the $l$-th level. We incorporate the Hessian matrix $H^{l-1}$ and the solution $V^{l-1}$ from the previous, coarser level into Equation (IV.9), i.e.,
\[
\hat{H}^l_n = (1 - w_h)\, H^l_n + w_h\, H^{l-1}, \tag{IV.10}
\]
\[
\hat{g}^l_n = (1 - w_h)\, g^l_n + w_h\, H^{l-1}\big(V^{l-1} - V^l_{n-1}\big). \tag{IV.11}
\]
$w_h$ is an uncertainty derived from the data to balance the information at the current and previous levels, and is computed based on the photometric consistency term as
\[
w_h = 1 - \frac{1}{|C|}\sum_{c \in C} \exp\!\Big(-\frac{1}{2\sigma_h^2}\sum_{x} \hat{E}^c_{\mathrm{photo}}(x)\Big), \tag{IV.12}
\]
where $\hat{E}^c_{\mathrm{photo}}$ is computed via Equation (IV.2) with $W^c_{\mathrm{photo}} = I$. When there is no meaningful correspondence in the patch due to appearance change or large deformation, the scheme automatically blends in the information at coarser scales based on the uncertainty. A key difference to traditional regularization is that it does not enforce smoothing on the data. It hints where a correct solution might be when the optimization is uncertain, and when fully certain it only respects the information at the current level. In fact, if the uncertainty in Equation (IV.12) is zero (i.e. the photometric error $\hat{E}^c_{\mathrm{photo}}$ is zero), the optimization only considers the patch information at the current level. Furthermore, unlike most smoothing priors, which typically involve solving a global system, our formulation allows us to keep the optimization fully local. This encourages locally consistent motion, and prevents the optimization from diverging at the finer levels. For the coarsest level, we use the conventional GN optimization, i.e., $w_h = 0$, and initialize the flow at the next level with the estimate from the previous level. We initialize the optimization with $V^0_0 = 0$, but may employ rigid ICP [32] for a more robust initialization. For each pyramid level, we iterate the GN optimization 10 times, and halve the patch size every time we proceed to the successive level of the pyramid.
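One Gauss-Newton step with the coarse-level blending of Equations (IV.9)–(IV.12) can be sketched as follows (Python with NumPy); the function boundaries, the sign convention of the gradient, and the sigma value are assumptions.

```python
import numpy as np

def blended_gn_step(H_l, g_l, H_coarse, V_coarse, V_current, w_h):
    """One Gauss-Newton update blending current-level and coarser-level information.

    H_l, g_l:  3x3 Hessian and 3-vector gradient accumulated at the current level.
    H_coarse:  3x3 Hessian from the previous (coarser) pyramid level.
    V_coarse:  scene flow solution at the coarser level.
    V_current: current scene flow estimate at this level.
    w_h:       data-driven uncertainty in [0, 1] from Equation (IV.12).
    """
    H_hat = (1.0 - w_h) * H_l + w_h * H_coarse
    g_hat = (1.0 - w_h) * g_l + w_h * H_coarse @ (V_coarse - V_current)
    delta = np.linalg.solve(H_hat, g_hat)          # 3x3 system (Equation IV.9)
    return V_current + delta

def uncertainty(photo_errors, sigma_h=0.1):
    """w_h = 1 - mean_c exp(-sum_x E_photo^c / (2 sigma_h^2)) over cameras
    (Equation IV.12). photo_errors is a list of per-camera summed patch errors."""
    e = np.asarray(photo_errors, dtype=float)
    return 1.0 - float(np.mean(np.exp(-e / (2.0 * sigma_h ** 2))))
```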
IV.4 Results
We implemented the scene flow method of Section IV.3 on a CentOS Linux machine with 48 Intel Xeon CPU E5-2680 v3 @ 2.5 GHz cores, 263 GB of RAM, and 8 Nvidia Quadro M6000 graphics cards. Since the optimization in Equation (IV.1) is fully parallelizable per vertex, we distributed the GN optimization over multiple CPUs and GPUs. Since Equation (IV.9) is merely a 3 × 3 system, it is straightforward to solve. Populating the Hessian matrix and the gradient vector involves many small arithmetic operations over views and over pixels in the patches, but these can be fully parallelized on GPUs. Overall, we achieved a processing time of 15 minutes per frame with forty 20-megapixel cameras. In our experiments we used a pyramid depth of six, going from the full resolution of 5120 by 3840 down to 160 by 120, with a square patch of size 21 to 31 pixels. For tracking we manually specified an initial tracking frame, and tracked all points on the mesh individually and sequentially, updating the appearance of the patch. We terminate the tracking of a point when we reach the maximum number of GN iterations or the point is no longer seen from multiple cameras. Points whose tracking stops before reaching the maximum number of iterations are simply removed from subsequent iterations and from the final visualization. In this section we demonstrate the results of our prior-free approach on skin and cloth deformation, including a challenging interaction between a hand and folding cloth.
Figure IV .5: Result of tracking Subject 2’s mouth corner pulling. Top: tracked point clouds, and
bottom: point trajectories overlaid on images. Result of tracking Subject 1’s cheek puffing. Top:
tracked point clouds, and bottom: point trajectories overlaid on images.
Figure IV .6: Result of tracking skin deformation of an arm. Top: tracked point clouds. Bottom:
point trajectories overlaid on images.
Figure IV .7: Result of tracking Subject 2’s eye blinking. (a) tracked point clouds, and (b) point
trajectories overlaid on images.
Figure IV .8: Dynamic surface mesh of clothing (top) and an arm example (bottom) obtained with
our method. This mesh based visualization shows the 3D template mesh obtained with Section
IV .4 (first column), and the mesh sequence obtained by applying our 3D scene flow in Section
IV .3 on the template (second column and later). The vertices that are no longer tracked (i.e. due
to occlusion) are not shown here. Though our method estimates the 3D scene flow independently
on each vertex, it can produce clean 3D surface for the large portion of the mesh.
Figure IV .9: (a) An input frame from a facial performance sequence, the tracking result (b) with-
out and (c) with robust descriptors (ours), and (d) close views around the nose. Zoom in views in
(d) show that the tracking suffers from artifacts due to the shading in the nasal fold if naive RGB
intensity is used (top) while ours does not (bottom).
Figure IV .10: (a) Result obtained using only robust descriptors in Section IV .3.1, (b) (a) with
data-driven multi-scale propagation in Section IV .3.3, and (c) (b) with all weights in Section
IV .3.2 added (ours).
Figure IV.11: Alignment error comparison with a global optimization using Laplacian regularization. The graph shows the evolution of the error in temporal tracking, with the horizontal axis being the frame count and the vertical axis the accumulated error.
Comparison with global regularization  In Figure VII.1, we show our tracking results on human facial performance, and on the interaction between a hand and cloth. (b) and (d) illustrate the estimated 3D scene flow without priors, with the length and color of the lines respectively indicating the magnitude and orientation of the flow, alongside the corresponding RGB input and the 3D mesh reconstructed in Section IV.4 shown in (a) and (c). The colors of the 3D reconstructions in (a) and (c) encode the surface normal orientations in the RGB channels, and show that our reconstruction successfully captures
mesoscopic details such as fine wrinkles on the face, and foldings on the clothing. Smoothly
varying magnitude and color of the flow in (b) and (d) indicates consistency of estimated motion
with our method. Note that our method operates on points without relying on specific topology
or targets, and can still produce consistent 3D motion without regularization.
Figure IV.5 shows, in the even rows, scene flow visualizations similar to Figure VII.1, as well as tracked point clouds in the odd rows. Our method successfully captures the dynamics of the cheek puffing (first two rows), including the secondary motion. The result is best seen in the accompanying video. The third and fourth rows show the tracking result of Subject 2's mouth corner pulling. We can clearly see how the muscles around the right mouth corner are activated, starting with the muscles of the right cheek, then around the right chin. The last two rows of Figure IV.5 show the tracking of arm muscle deformation. Unlike the previous results, this sequence involves significant global motion, illustrated by the sudden flow color change to purple in the last column of the image, as well as local skin deformation due to muscle activity. Our tracking successfully captures both rigid and non-rigid motion on human skin.
Figure IV.7 shows the tracking result of Subject 2's eye blinking. While the motion is subtle, our method extends to the tracking of salient features such as eye blinking.
Figure IV.8 shows the dynamic reconstruction of the clothing in Figure VII.1 and of the arm in Figure IV.5 in mesh views with raw 3D displacement. While some points are lost due to occlusions, the method produces a well-behaved surface for the most part, with mesoscopic details and without surface smoothing, unlike previous work [26, 14, 141, 47].
Evaluation  Figure IV.9 compares the effect of the robust descriptor on a facial performance example. A facial expression undergoing deformation of the cheek causes significant shading change on the sides of the nose (a), which causes instability in the tracking when naive RGB intensity is used (b). On the other hand, our robust descriptor (c), derived from multiple image gradients, effectively ignores such low-frequency shading change, and produces a high quality surface even where some parts are severely occluded (d).
Figure IV.10 demonstrates the effect of each component proposed in Section IV.3, with false color indicating reconstruction errors. While the robust descriptor in Section IV.3.1 achieves accurate registration overall, it exhibits errors around the occlusion boundaries (e.g. the side of the nose and around the neck) (first from left). With the data-driven multiscale propagation in Section IV.3.3 (second from the left), it achieves overall more consistent and lower errors around the neck. With the support weight in Section IV.3.2, it improves some of the occlusion boundaries, such as the side of the nose, by preferring similar geometric neighbors in the patch (third from left). Finally, with the rigidity weight $w_{\mathrm{rig}}$, it mostly removes the occlusion boundary artifacts around the nose, and produces well-behaved errors over the entire face.
We quantify the accuracy of our tracking method using multiview synthetic data. We generated multiview synthetic renderings based upon the result of the state-of-the-art multiview facial performance tracking of [14]. To avoid artificial misalignment in the data, we generated a ground truth texture map from the first frame, and rendered multiview synthetic images via reprojection to 18 novel views. We processed frames sequentially for 300 frames from the first frame, and achieved on average 0.1 mm of registration error. In order to evaluate the accuracy of our prior-free approach, we compared our method with a global regularization method where a surface Laplacian term $E_L$ between the target and reference frames is added to Equation (IV.1):
\[
E_L = \sum_{X} \big\| L(X + V) - L X_0 \big\|^2, \tag{IV.13}
\]
where $L$ is a Laplacian operator, and $X_0$ is the vertex position in the reference frame. Note that the surface Laplacian term requires all $X$ to be solved simultaneously. Figure IV.11 shows the plot illustrating the average per-frame reconstruction error between the ground truth shape and the tracking result. Laplacian-1 is the average error computed over all points on the mesh, including ones that are partially or fully regularized, while ours and Laplacian-2 are computed
over a set of common points about which the proposed tracker is reasonably confident. When a point is not well constrained (e.g. under occlusion), the regularizer tends to bias the solution, and does not help improve the reconstruction error at all. Our prior-free method performed equally well to Laplacian-2. This is evidence that with enough multiview constraints and a carefully designed optimization scheme, it is possible to obtain a solution that is partially better than what the global optimization can offer. Furthermore, while the Laplacian variant took 240 seconds to compute a frame, our method took 90 seconds to process the 18-view input at 1280 by 960 resolution. On the real data the Laplacian variant took more than 100 minutes to process one frame, while ours took 15 minutes.
IV.5 Discussions and Conclusion
In this chapter, we presented a scene flow method for generic deformable surfaces from multiview images and reconstructed meshes without any priors. We formulate the geometric and photometric consistency energies with robust descriptors from image gradients, a data-driven multiscale propagation technique to adaptively blend optimization landscapes at multiple pyramid levels, and confidences to handle outliers in the data based on geometric inconsistency, non-rigid deformation, visibility, and motion blur. We implemented our method on multiple CPUs and GPUs, yielding greater efficiency and scalability than global optimization with a Laplacian prior, at comparable quality. The experiments show that our method achieves high accuracy and has significant generalizability, tracking various deformable surfaces including the interaction between clothing and a hand.
While our approach demonstrates excellent generalizability and accuracy comparable to global optimization, it has limitations. Since we initialize tracking with a fixed set of points, our method does not track a point that newly appears in a later frame. This could be addressed by looking at multiple frames and fusing a canonical model as done in previous work [111, 40, 72]. Though our topology-free method tracks arbitrary points on surfaces undergoing deformation, it cannot re-associate a point that has been occluded when the point reappears; however, this limitation is shared with much previous work. Our method generates low-level motion estimates that are free from bias or contamination from regularization. We believe that low-level data generated from genuine motion signals could be useful for many graphics applications, including performance capture, quad mesh extraction [105], and rig creation such as blendshapes [89], extracting skinning parameters [38], and statistical models for body shape [115]. For example, the bias-free data could be used to generate high quality motion priors for more challenging performance capture scenarios such as constrained monocular capture.
Chapter V
Synthesizing Dynamic Skin Details for Facial Animation
V.1 Introduction
Simulating the appearance of human skin is important for rendering realistic digital human char-
acters for simulation, education, and entertainment applications. Skin exhibits great variation in
color, surface roughness, and translucency over different parts of the body, between different in-
dividuals, and when it’s transformed by articulation and deformation. But as variable as skin can
be, human perception is remarkably attuned to the subtleties of skin appearance, as attested to by
the vast array of makeup products designed to enhance and embellish it.
Advances in measuring and simulating the scattering of light beneath the surface of the skin
[75, 150, 39] have made it possible to render convincingly realistic human characters whose skin
appears to be fleshy and organic. Today’s high-resolution facial scanning techniques (e.g. [100,
12, 55]) record facial geometry, surface coloration, and surface mesostructure details at the level
of skin pores and fine creases to a resolution of up to a tenth of a millimeter. By recording a
sequence of such scans [14] or performing blendshape animation using scans of different high-
res expressions (e.g. [7, 50]), the effects of dynamic mesostructure – pore stretching and skin
furrowing – can be recorded and reproduced on a digital character.
Recently, Graham et al. [60] recorded skin microstructure at a level of detail below a tenth of
a millimeter for sets of skin patches on a face, and showed that texture synthesis could be used
to increase the resolution of a mesostructure-resolution facial scan to one with microstructure
everywhere. They demonstrated that skin microstructure makes a significant difference in the
appearance of skin, as it gives rise to a face’s characteristic pattern of spatially-varying surface
roughness. However, they recorded skin microstructure only for static patches from neutral fa-
cial expressions, and did not record the dynamics of skin microstructure as skin stretches and
compresses.
Figure V .1: Three real forehead expressions (surprised, neutral, and perplexed) made by the same
subject showing anisotropic deformations in microstructure.
Skin microstructure, however, is remarkably dynamic as a face makes different expressions. Figure V.1 shows a person's forehead as they make surprised, neutral, and perplexed expressions. In
the neutral expression (center), the rough surface microstructure is relatively isotropic. When the
brow is raised (left), there are not only mesostructure furrows but the microstructure also devel-
ops a pattern of horizontal ridges less than 0.1 mm across. In the perplexed expression (right),
the knitted brow forms vertical anisotropic structures in its microstructure. Seen face to face or
filmed in closeup, such dynamic microstructure is a noticeable aspect of human expression, and
the anisotropic changes in surface roughness affect the appearance of specular highlights, even
from a distance.
Dynamic skin microstructure results from the epidermal skin layers being stretched and com-
pressed by motion of the tissues underneath. Since the skin surface is relatively stiff, it develops
a rough microstructure to effectively store a reserve of surface area to prevent rupturing when
extended. Thus, parts of the skin which stretch and compress significantly (e.g. the forehead and
around the eyes) are typically rougher than parts which are mostly static, like the tip of the nose
or the top of the head. When skin stretches, the microstructure flattens out and the surface appears
less rough as the reserves of tissue are called into action. Under compression, the microstructure
bunches up, creating micro-furrows which exhibit anisotropic roughness. Often, stretching in one
dimension is accompanied by compression in the perpendicular direction to maintain the area of
the surface or the volume of tissues below. A balloon in Figure V .2 provides a clear example of
roughness changes under deformation: the surface is diffuse at first, and becomes shiny when
inflated.
While it would be desirable to simulate these changes in appearance during facial animation,
current techniques do not record or simulate dynamic surface microstructure for facial animation.
One reason is scale: taking the facial surface to be 25cm × 25cm, recording facial shape at 10
micron resolution would require real-time gigapixel imaging beyond the capabilities of today's
camera arrays. Simulating a billion triangles of skin surface, let alone several billion tetrahedra
of volume underneath, would be computationally very expensive using finite element techniques.
Figure II.4 illustrates the ratio of the area that 10-micron resolution imaging could record to the overall
face, and the number of geometric elements required to represent such data.

Figure V.2: Tension on the balloon surface changes its surface appearance.
In this work, we approximate the first-order effects of dynamic skin microstructure by performing
fast image processing on a high-resolution skin microstructure displacement map obtained as in
[60]. Then, as the skin surface deforms, we blur the displacement map along the direction of
stretching, and sharpen it along the direction of compression. On a modern GPU, this can be
performed at interactive rates, even for facial skin microstructure at ten micron resolution. We
determine the degree of blurring and sharpening by measuring in vivo surface microstructure of
several skin patches under a range of stretching and compression, tabulating the changes in their
surface normal distributions. We then choose the amount of blurring or sharpening to affect a
similar change in surface normal distribution on the microstructure displacement map. While
our technique falls short of simulating all the observable effects of dynamic microstructure, it
produces measurement-based changes in surface roughness and anisotropic changes in surface
microstructure orientation that is consistent with real skin deformation. For validation, we com-
pare renderings using our technique to real photographs of faces making similar expressions.
V.2 Basic Approach
Figure V.3: Stretching and compressing a measured OCT skin profile, with and without convolution filters to maintain surface length. From left to right: compressed and sharpened, compressed linearly, neutral skin, stretched linearly, stretched and blurred.
Our approach to synthesizing skin microstructure deformation is to directionally blur and sharpen
a high-resolution surface displacement map according to the amount of stretching or compression.
We can visualize this along a one-dimensional cross-section, as in Figure V .3. In the center is a
3mm wide cross-section of ventral forearm skin measured in [95] with optical coherence tomog-
raphy (OCT). To its left and right are compressed and stretched versions with no modification
to surface height, decreasing the surface length by 20% and increasing it by 35%, respectively.
More realistically, the surface would deform to minimize strain. We can approximate this effect
by smoothing the height map of the surface under stretching as seen at the right, and sharpening
the height map under compression as seen at the left. In both cases, the surface length of the
neutral profile is now maintained, causing a greater effect on the distribution of surface normals.
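To make the length-preservation intuition concrete, the following is a minimal NumPy sketch (our own illustration, not part of the thesis pipeline; the sinusoidal profile is a hypothetical stand-in for the OCT cross-section) showing that blurring a stretched 1D height profile brings its arc length back toward the neutral length:

import numpy as np

def surface_length(x, h):
    # Arc length of a 1D height profile h sampled at positions x.
    return np.sum(np.hypot(np.diff(x), np.diff(h)))

def gaussian_blur_1d(h, sigma):
    # Blur with a normalized discrete Gaussian kernel (sigma in samples).
    radius = int(np.ceil(3 * sigma))
    t = np.arange(-radius, radius + 1)
    g = np.exp(-0.5 * (t / sigma) ** 2)
    g /= g.sum()
    return np.convolve(h, g, mode="same")

# Hypothetical neutral profile standing in for the 3mm-wide OCT cross-section.
x = np.linspace(0.0, 3.0, 600)
h = 0.02 * np.sin(40 * x) + 0.01 * np.sin(90 * x + 1.3)

stretch = 1.35                         # 35% extension, as in Figure V.3
h_blurred = gaussian_blur_1d(h, sigma=4.0)

print("neutral length:             ", surface_length(x, h))
print("stretched, height unchanged:", surface_length(stretch * x, h))   # length grows
print("stretched and blurred:      ", surface_length(stretch * x, h_blurred))  # closer to neutral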
Figure V .4 shows this microstructure convolution technique applied to the surface of a deforming
sphere. In the top row, the sphere is rendered with a microstructure displacement map generated
with a volume noise function. When shrunk, an isotropic sharpening filter is applied to the dis-
placement map, making it bumpier and giving the sphere a rougher surface reflectance. When
expanded, the sphere’s displacement map is blurred, giving it a smoother, shinier appearance,
as when inflating a balloon. The effects are better seen in the expanded insets of areas near the
specular reflections. In the bottom row, the sphere is textured with a displacement map from a
real skin microgeometry sample, expanded to the sphere using texture synthesis. The sphere is
squashed and stretched to produce anisotropic surface strain, which causes anisotropic filtering
of the displacement map: blurring in one dimension and sharpening in the other. This results
in anisotropic micro-ridges, similar to those seen in real human skin in Figure V .1 during facial
expression.
For the sphere example, the amount of sharpening or blurring proportional to the surface strain was
chosen by the user to create an appealing dynamic appearance. For rendering a realistic human
face, it would be desirable for the filter kernel to be driven according to the behavior of real skin.
To this end, we use a measurement apparatus to record the behavior of skin microstructure under
stretching and compression as described in the next section.
V.3 Measurement
We record the surface microstructure of various skin patches at 10 micron resolution with a setup
similar to [60] which uses a set of differently lit photos taken with polarized gradient illumination
[100]. The sample patches are scanned in different deformed states using the lighting apparatus
with a custom stretching measuring device consisting of a caliper and a 3D printed stretching
aperture. The aperture of the patch holder is set to 8 mm for the neutral deformation state and is
Figure V.4: Deforming sphere with dynamic microgeometry. (Top) The microstructure becomes rougher through displacement map sharpening when shrunk, and smoother through blurring when expanded. The insets show details of the specular highlights. (Bottom) Anisotropic compression and stretching yields anisotropic microstructure.
placed 30 cm away from a Ximea machine vision camera, which records monochrome 2048 × 2048 pixel images with a Nikon 105 mm macro lens at f/16, so that each pixel covers a 6 micron square of skin. The 16 polarized spherical lighting
conditions allow the isolation and measurement of specular surface normals, resulting in a per-
pixel surface normal map. We integrate the surface normal map to compute a displacement map
and use a high pass filter to remove surface detail greater than the scale of a millimeter to remove
surface bulging.
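As an illustration of the last step, here is a minimal sketch (assuming SciPy, and assuming the normals have already been integrated into a displacement map held as a 2D array) of removing surface detail coarser than roughly a millimeter with a Gaussian high-pass filter; the pixel size is the 6 micron value quoted above and the cutoff is an assumed value:

import numpy as np
from scipy.ndimage import gaussian_filter

def highpass_displacement(disp, pixel_size_um=6.0, cutoff_um=1000.0):
    # Estimate the low-frequency bulging with a wide Gaussian blur and
    # subtract it, keeping only the pore-scale microstructure.
    sigma_px = cutoff_um / pixel_size_um
    low_frequency = gaussian_filter(disp, sigma=sigma_px)
    return disp - low_frequency

# Stand-in for an integrated 2048 x 2048 displacement map (heights in microns).
disp = np.random.randn(2048, 2048).astype(np.float32)
micro = highpass_displacement(disp)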
Figure V.5: Microstructure acquisition in a polarized LED sphere with macro camera and articulated skin deformer.
Each skin patch, such as part of the forehead, cheek, or chin, is coupled to the caliper aperture
using 3M double-sided adhesive tape, and each scan lasts about half a second. After performing
the neutral scan, the calipers are narrowed by 0.8mm and the first compressed scan is taken.
This process continues until the skin inside the aperture buckles significantly. Then, the calipers
are returned to neutral, and scans are taken with progressively increased stretching until the skin
detaches from the double-stick tape. Figure V .6 shows surface normals and displacements of skin
samples in five different states of strain. Our capture process also provides specular and diffuse
albedo, from which we can produce data-driven rendering of the sample undergoing stretching
and compressions as in Figure V .7. The calipers can be rotated to different angles, allowing the
same patch of skin to be recorded in up to four different orientations, such as the forehead sample
seen in Figure V .8.
Figure V.6: Texture-aligned surface normal (top) and displacement (bottom) maps of a skin patch under vertical compression and stretching: (a) full compression, (b) medium compression, (c) neutral, (d) medium stretching, and (e) full stretching.
With skin patch data acquired, we now wish to characterize how surface microfacet distributions
change under compression and stretching. After applying a denoising filter to the displacement
maps to reduce camera noise, we create a histogram of the surface orientations observed across
the skin patch under its range of strain. Several such histograms are visualized in Figure V.8
next to their corresponding skin samples, and can also be thought of as the specular lobe which
would reflect off the patch. As can be seen, stretched skin becomes anisotropically shinier in the
direction of the stretch, and anisotropically rougher in the direction of compression. For some
samples, such as the chin in Figure V.8(g,h), we observed that the amount of change in the normal
distributions depends somewhat on the stretching direction. However, we do not yet account for the
effect of the stretching direction in our model.
The variances in x and y of the surface normal distribution quantify the degree of surface smoothing
or roughening according to the amount of strain put on the sample. Figure V.9 plots the
changes in surface normal distribution in the direction of the strain for several facial skin patches.
Again, stretched skin becomes shinier, and compressed skin becomes rougher.
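A minimal sketch of this analysis (our own illustration, not the GPU implementation used in the thesis): compute tangent-space normals from a displacement map and then either histogram them or measure the variance of their x and y components:

import numpy as np

def tangent_space_normals(disp, pixel_size=0.006):
    # Unit surface normals of a height field; disp and pixel_size in mm.
    gy, gx = np.gradient(disp, pixel_size)
    n = np.dstack([-gx, -gy, np.ones_like(disp)])
    return n / np.linalg.norm(n, axis=2, keepdims=True)

def normal_histogram(disp, bins=64):
    # bins x bins histogram of the (x, y) normal components, as in Figure V.8.
    n = tangent_space_normals(disp)
    hist, _, _ = np.histogram2d(n[..., 0].ravel(), n[..., 1].ravel(),
                                bins=bins, range=[[-1, 1], [-1, 1]])
    return hist / hist.sum()

def normal_variances(disp):
    # Variances of the x and y normal components; larger values mean a rougher surface.
    n = tangent_space_normals(disp)
    return n[..., 0].var(), n[..., 1].var()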
Figure V.7: Dynamic microgeometry rendering of a forehead skin patch under a point light, under stretching (left) and compression (right). The diffuse albedo is artistically colorized for visualization purposes.
V.4 Microstructure Analysis and Synthesis
Based on the skin patch data obtained as in Section V .3, we model the relationship between the
measured surface normal distribution and surface stretching or compression (collectively defor-
mation). We observe that the change in surface detail in Figure V .6 from neutral (c) to stretched
(e) qualitatively resembles a blurring filter in the direction of stretch, and from neutral (c) to com-
pressed (a) qualitatively resembles a sharpening filter in the direction of compression, perhaps
with some blurring in the perpendicular direction. These qualitative observations are consistent
with the surface normal distribution plots in Figure V .8. Such filters are also inexpensive to com-
pute on GPU hardware. Hence, we design a method to synthesize the microgeometry of skin
under deformation using a microgeometry displacement map of the skin in a neutral state and a
parametric family of convolution filters ranging continuously from sharpening to blurring. We
then synthesize spatially and temporally varying microstructure for faces undergoing dynamic
Figure V.8: Each column shows measured 8mm wide facial skin patches under different amounts of stretching and compression, with a histogram of the corresponding surface normal distributions shown to the right of each sample: (a) forehead horizontal, (b) forehead vertical, (c) forehead 45°, (d) forehead 135°, (e) cheek horizontal, (f) cheek vertical, (g) chin horizontal, (h) chin vertical.
Figure V.9: Surface normal distribution variance (roughness) plotted against the amount of strain (compression to stretching) for several skin patches: chin, cheek, forehead (horizontal), and forehead (vertical).
deformation at interactive rates by driving the filter parameters by local deformation fields. Our
framework is generic in the sense that the neutral microgeometry displacement map can come from
any source, such as noise functions [145] or data-driven image analogy techniques [60].
Deformation Model We formulate our model as a convolution in the 2D texture coordinate
space associated with the surface geometry. Given a high-resolution microstructure displacement
map D for the neutral pose, we synthesize the deformed microdisplacement map D′ as:

D′(u,v) = (D ∗ K_{u,v})(u,v),   (V.1)

where (u,v) are texture space coordinates, K_{u,v} is the convolution kernel for coordinate (u,v), and ∗ represents discrete convolution:

(D ∗ K)(u,v) = Σ_i Σ_j D(u−i, v−j) K(i,j).   (V.2)

Supposing K were constant over the entire surface, and linearly separable over some perpendicular
axes s and t, we could write D′ = D ∗ K = (D ∗[s] k_s) ∗[t] k_t, where k_s and k_t are 1D kernels
and ∗[a] represents convolution along some axis a = (a_u, a_v):

(D ∗[a] k)(u,v) = Σ_i D(u − i a_u, v − i a_v) k(i).   (V.3)

This allows efficient computation as a sequence of two 1D convolutions, reducing the computational
cost from O(N²) to O(2N) for an N × N kernel. Unfortunately, in general, the kernel
is spatially varying. Nonetheless, if the kernel varies gradually, we may still approximate the
spatially varying 2D convolution as a sequence of two spatially varying 1D convolutions, as illustrated
in Figure V.10:

D′(u,v) ≈ (D_s ∗[t_{u,v}] k^t_{u,v})(u,v),   (V.4)

where D_s(u,v) = (D ∗[s_{u,v}] k^s_{u,v})(u,v).   (V.5)

In practice we align the axis s to the direction of stretch (if any) and t to the direction of compression
(if any), where s and t are always mutually perpendicular. We employ a 2-parameter family
of 1D convolution kernels encompassing sharpening and blurring:

k = (1 − α) δ + α G(σ),   (V.6)
Figure V.10: Displacement map pixels are sampled along the principal directions of strain with a separable filter for convolution. D_s is computed from D, then D′ is computed from D_s.
where δ is the discrete delta function, −2 ≤ α ≤ 1 is the filter strength, and σ is the standard deviation
of the normalized discrete Gaussian kernel G. With α > 0 the filter blurs the signal; with α < 0
it sharpens it; and with α = 0 it preserves the signal.
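The kernel family of (V.6) is simple to construct; the sketch below (a non-authoritative illustration, with the tap spacing set to the assumed 6 µm pixel size of our scans) builds a 1D kernel for given α and σ. Applying it along s and then along t gives the two-pass convolution of (V.4)–(V.5):

import numpy as np

def deformation_kernel(alpha, sigma_um, step_um=6.0):
    # k = (1 - alpha) * delta + alpha * G(sigma), eq. (V.6).
    # alpha > 0 blurs, alpha < 0 sharpens, alpha = 0 leaves the signal unchanged.
    radius = max(1, int(np.ceil(3.0 * sigma_um / step_um)))
    t = np.arange(-radius, radius + 1) * step_um
    g = np.exp(-0.5 * (t / sigma_um) ** 2)
    g /= g.sum()                       # normalized discrete Gaussian
    delta = np.zeros_like(g)
    delta[radius] = 1.0                # discrete delta function
    return (1.0 - alpha) * delta + alpha * g

k_blur = deformation_kernel(alpha=0.5, sigma_um=15.0)      # stretching: blur
k_sharpen = deformation_kernel(alpha=-1.0, sigma_um=7.0)   # compression: sharpen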
Parameter Fitting We estimate the parameters (α_s, σ_s) defining k_s and (α_t, σ_t) defining k_t for
each measured skin patch using a brute-force approach, where the principal directions s and t
are known for each patch. We assume these parameters to be constant over the extent of each
patch. We search for α ∈ (−2 ... 1) with a granularity of 0.01 and σ ∈ (1 µm ... 100 µm) with a granularity
of 0.5 µm. We find the parameters that minimize the total variation between surface normal
histograms of the ground truth displacement map and the convolved neutral displacement map.
The surface normals are computed in the target deformation coordinates, meaning the neutral displacement
map is stretched or compressed after convolution to match the deformed shape. We
compute the surface normal histograms on the GPU by splatting the surface normal computed
at each pixel in the displacement map into a grid of buckets, based on the u and v components
of the normal in tangent space. Care must be taken that enough samples are available for the
number of buckets used, in order to avoid bucket aliasing, which may misguide the optimization.
We used patches with up to 2000 × 2000 pixels and 64 × 64 buckets. For efficiency, we optimize
only (α_s, σ_s) or (α_t, σ_t) for the kernel in the direction of caliper movement, omitting the kernel
in the other direction. The total variation metric is simply the absolute difference between
the ground truth histogram and the histogram of the convolved neutral patch, summed over the
buckets:

(α, σ) = argmin_{α̂, σ̂} Σ_{b∈B} |H_b − Ĥ_b|,   (V.7)

where B is the set of histogram buckets, H is the ground truth histogram, and Ĥ is the histogram
obtained by 1D convolution of the neutral skin patch using parameters α̂, σ̂ in the direction of
movement. Figure V.11 tabulates the fitted parameters for a patch of forehead skin undergoing a
range of stretching and compression.
r       0.73   0.79   0.88   0.94   1.08   1.15   1.22   1.28
α       −2     −1.76  −0.2   −0.02  0.54   0.27   1      1
σ (µm)  6.5    6.5    13     22     14.5   19     17     24

Figure V.11: Fitted kernel parameters α and σ (µm) for a patch of forehead skin undergoing a range of stretching and compression. r is the stretch ratio, with r > 1 stretching and r < 1 compressing.
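A brute-force fit over the same parameter grid could look like the sketch below (illustrative only; it reuses the normal_histogram and deformation_kernel helpers sketched earlier, convolves along a single image axis, runs slowly on the CPU whereas the thesis evaluates histograms on the GPU, and omits the resampling of the convolved map into the deformed coordinates for brevity):

import numpy as np
from scipy.ndimage import convolve1d

def fit_kernel_parameters(neutral_disp, deformed_disp, axis,
                          normal_histogram, deformation_kernel):
    # Exhaustive search for (alpha, sigma) minimizing eq. (V.7).
    target = normal_histogram(deformed_disp)
    best_alpha, best_sigma, best_err = 0.0, 1.0, np.inf
    for alpha in np.arange(-2.0, 1.0 + 1e-9, 0.01):
        for sigma in np.arange(1.0, 100.0 + 1e-9, 0.5):
            k = deformation_kernel(alpha, sigma)
            candidate = convolve1d(neutral_disp, k, axis=axis, mode="reflect")
            err = np.abs(normal_histogram(candidate) - target).sum()
            if err < best_err:
                best_alpha, best_sigma, best_err = alpha, sigma, err
    return best_alpha, best_sigma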
Kernel Table Construction In order to apply our model to skin undergoing a novel deformation,
we must first establish a relationship between the deformation and the kernel parameters,
using the table of parameters fitted to sample patches. For each measured skin patch, the principal
axes and the stretch ratio r are known, defining the deformation. When r > 1, the patch is
undergoing stretching, and when r < 1, the patch is undergoing compression. Figure V.12 plots
the fitted kernel parameters α and σ against the stretch ratio r along the primary axis of stretch
or compression. Based on these data points, we fit piecewise linear models relating α and σ to
r. We partition the domain of r into line segments by manually inspecting the data. For α, we
use connected line segments to enforce smooth transitions between compression and stretching,
and we constrain one segment to pass through the neutral case (r = 1, α = 0) exactly. For σ, we fit one
line segment to the compression samples and another line segment to the stretch samples, with a
discontinuity at r = 1. For example, the forehead patch yields the following model:

α = min(1, 15.4r − 13.8, 3.09(r − 1)),   (V.8)

σ = (38.2r − 26.5) µm if r ≥ 1, and (70.5r − 46.9) µm otherwise.   (V.9)

As we are not assigning any physically meaningful interpretation to α and σ, the fitted functions
serve merely as a rapid lookup to compute parameters from stretch ratios. Other functions could
be employed if desired, but it is important that α = 0 in the neutral case, hence a single line fit to α
would not fit both the stretching behavior and the compression behavior well.
Figure V.12: Fitted kernel parameters α and σ plotted against the stretch ratio r. Dots represent parameters fitted to sample patches, and lines represent the piecewise linear fits in (V.8), (V.9).
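For instance, the forehead model (V.8)–(V.9) reduces to two small lookup functions; the sketch below is our own illustration, and the printed values are approximate:

def forehead_alpha(r):
    # Filter strength for the forehead patch, eq. (V.8).
    return min(1.0, 15.4 * r - 13.8, 3.09 * (r - 1.0))

def forehead_sigma_um(r):
    # Gaussian standard deviation in microns, eq. (V.9).
    return 38.2 * r - 26.5 if r >= 1.0 else 70.5 * r - 46.9

print(forehead_alpha(1.10), forehead_sigma_um(1.10))   # ~0.31 (blur), ~15.5 um
print(forehead_alpha(0.90), forehead_sigma_um(0.90))   # ~-0.31 (sharpen), ~16.6 um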
Applying to a Novel Surface Animation Given a triangular mesh undergoing novel deformation,
we may estimate the principal directions and stretch ratios of the local deformation at every
vertex on the mesh, and in turn compute the associated kernel parameters, allowing the deformed
displacement map to be synthesized using the two-pass separable convolution technique in a GPU
fragment shader. Our generic framework allows deformation from physical simulation, keyframe
animation, or facial performance capture. Consider a triangle P with v_0, v_1, and v_2 being vertices
on the neutral face, and the deformed counterpart P′ likewise with v′_0, v′_1, and v′_2 being the vertices.
We compute the rotation R that maps the triangle P in 3D space to the so-called "tangent
space". Tangent space may be determined by first computing the affine 2×3 transform Q that
maps world coordinates to UV texture coordinates, and then computing R as the closest rotation
matrix to Q (using e.g. SVD or QR decomposition), first concatenating to it a third row that is
the cross product of the first two rows. As the last component in tangent space is irrelevant, we
truncate the last row of R, again leaving a 2×3 matrix. We likewise compute R′ to map P′ into
2D tangent space. We then compute a linear transformation T that maps the 2D neutral triangle
RP onto the deformed triangle R′P′:

[e′_1 e′_2] = T [e_1 e_2],   (V.10)

where e_i is an edge from Rv_0 to Rv_i, the deformed edges e′_i are analogous, and the 2×2 linear transformation
matrix T can be trivially computed. Such a T can be found per deformed triangle. In practice, we
want as smooth a 2D deformation field as possible in order to drive the per-pixel displacement
map without visible seams. To that end, we average the linear transformation T of each face
attached to a vertex, and then interpolate the per-vertex T within a GPU fragment shader, and
perform a Singular Value Decomposition of the form:

T = U Σ Vᵀ,   (V.11)

per-fragment, with

Σ = diag(r_s, r_t),   (V.12)

where r_s and r_t are the stretch ratios (r_s > r_t). With a 2×2 matrix, such an SVD can be trivially
computed in closed form in a GPU fragment shader, providing smooth spatially varying 2D deformation
fields as illustrated in Figure V.13. (Alternatively, if the mesh animation is produced
using a physics simulation such as a finite element model, the stretch and strain could be obtained
more directly from the simulation [73].) The transform from principal deformation axes to UV
coordinates is then:

S = Q Rᵀ V,   (V.13)

and hence the principal axes s and t in UV coordinates are the first and second columns of S, normalized,
and the magnitudes of the columns serve as the conversion factor required for converting
from world distance to UV distance for convolution. Indeed, the convolution may be performed
without conversion if s and t are taken as the un-normalized columns of S. Substituting the values
r_s and r_t into the parametric kernel model (V.8), (V.9) produces the kernels k_s and k_t at every
point on a deforming surface, allowing the deformed displacement map to be synthesized for the
entire face.
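The per-triangle part of this computation is compact; the following CPU sketch (an illustration of the math above, not the GPU shader itself) recovers the stretch ratios and principal axes from a neutral and a deformed triangle already expressed in 2D tangent space:

import numpy as np

def triangle_stretch_ratios(tri_neutral_2d, tri_deformed_2d):
    # tri_*_2d: 3 x 2 arrays of tangent-space vertex positions (RP and R'P').
    e = (tri_neutral_2d[1:] - tri_neutral_2d[0]).T        # 2 x 2 neutral edge matrix
    e_def = (tri_deformed_2d[1:] - tri_deformed_2d[0]).T  # 2 x 2 deformed edge matrix
    T = e_def @ np.linalg.inv(e)                          # eq. (V.10)
    U, s, Vt = np.linalg.svd(T)                           # eqs. (V.11)-(V.12)
    return s[0], s[1], Vt.T                               # r_s >= r_t, axes are columns of V

neutral = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
deformed = neutral * np.array([1.2, 0.9])                 # 20% stretch in x, 10% compression in y
r_s, r_t, axes = triangle_stretch_ratios(neutral, deformed)
print(r_s, r_t)                                           # 1.2, 0.9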
V.5 Results
Figure V .4 shows two deforming spheres with microstructure convolved according to local surface
deformation, as described in Section V .2.
Figure V .14 shows frames from a sequence of a 1cm wide digitized skin patch being deformed by
an invisible probe. It uses a relatively low-resolution finite element volumetric mesh with 25,000
Figure V.13: Strain field visualization for a smile expression (top row) and a sad expression (bottom row), with the first stress eigenvalue (a), (e), the second eigenvalue (b), (f), and strain direction visualizations (c), (d), (g), and (h).
tetrahedra to simulate the mesostructure which in turn drives the dynamic microstructure convolution.
The neutral microstructure was recorded using the system in Figure V.5 at 10 micron
resolution from the forehead of a young adult male, and its microstructure is convolved with parameters
fit to match its own surface normal distribution changes under deformation as described
in Sec. V.4. The rendering was made using the V-Ray package to simulate subsurface scattering.
As seen in the accompanying video in [109], the skin microstructure bunches up and flattens out
as the surface deforms at a resolution much greater than the FEM simulation.
Figure V .14: A sampled skin patch is deformed with FEM which drives microstructure convolu-
tion, rendered with path tracing.
Figures V.15 and V.16 highlight the effect of using no microgeometry, static microgeome-
try, and dynamic microgeometry simulated using displacement map convolution with a real-time
rendering. Rendering only with 4K resolution mesostructure from a standard facial scan produces
too polished an appearance at this scale. Adding static microstructure computed at 16K resolution
using a texture synthesis technique [60] increases visual detail but produces conflicting surface
strain cues in the compressed and stretched areas. Convolving the static microstructure according
to the surface strain using normal distribution curves from a related skin patch as in Section V .4
Figure V.15: A rendered facial expression with (a) mesostructure only, (b) static microstructure from a neutral expression, and (c) dynamic microstructure from convolving the neutral microstructure according to local surface strain, compared to (d) a reference photograph of a similar expression. The insets show detail from the lower-left area.
produces anisotropic skin microstructure consistent with the expression deformation and a more
convincing sense of skin under tension as can be observed in the reference photograph of a sim-
ilar expression (Figure V .15 (d)). Renderings are made with a real-time hybrid normal shading
technique [100].
Figure V .17 compares the real-time renderings with static (left) and dynamic microstructure
(right) from a facial performance sequence of a female subject. Again, the dynamic microstruc-
ture rendering creates anisotropic dynamic microstructure, providing a visceral sense of surface
tension around the cheek region undergoing a smile expression. The details can be seen in motion
(and, ideally, full-screen) in the accompanying video in [109].
Figure V.18 shows a zoomed-in view of facial details with the specular channel on its own to highlight the dynamic microstructure, which introduces varying angles of anisotropic surface texture in the areas undergoing stretching and compression.
Figure V .19 (a) through (c) show qualitative validation of our technique (b) compared with a
reference photograph of a similar facial region and expression. Facial details with the specular
channel on its own (a) highlight the dynamic microstructure which introduces qualitatively similar
varying angles of anisotropic surface texture in the areas undergoing stretching and compression.
Figure V .20 (a) through (c) show additional renderings of an older subject, exhibiting plausible
dynamic surface details under natural facial expressions.
Figure V.16: A rendered facial expression with (a) mesostructure only, (b) static microstructure from a neutral expression, and (c) dynamic microstructure from convolving the neutral microstructure according to local surface strain, compared to (d) a reference photograph. The insets show detail from the upper-left area.
Figure V .17: Real-time rendering of the cheek region from a facial performance animation with
enhanced dynamic surface details (right) compared to the static microstructure rendering (left).
The dynamic microstructure provides an additional indication of the deformation on the cheek
when the subject makes a smile expression.
Figure V .18: Specular-only real-time renderings of a nose from a blendshape animation, showing
anisotropic dynamic microstructure at different orientations in the expression (right) compared
to the neutral (left). The dynamic microstructure breaks up the smooth specular highlight on the
bridge of the nose.
Figure V.19: Real-time comparison renderings from a blendshape animation (middle) compared to reference photographs of a similar expression (right). Specular-only real-time renderings show anisotropic dynamic microstructure at different orientations in the expressions (left). Top two rows: the young male subject's crow's feet region with the eyes shut tightly, and the stretched cheek region when the mouth is pulled left. Bottom two rows: the young female subject's crow's feet region, and the nose under a squint expression.
Figure V.20: Real-time renderings of an older subject's mouth with a smile expression (a), the forehead pulled down (b), and the stretched cheek and mouth regions (c).
V.6 Discussion
While in several ways our skin microstructure deformation technique produces plausible results,
it is important to note that it is at best an approximation to the complex tissue dynamics which
occur at the microscale. It does, however, get two perceptually important aspects correct: the
local orientation of the anisotropic microstructure, and, by construction, the anisotropic surface
normal distributions. As a result, given the relative efficiency of image convolution, this technique
may prove useful for increasing the realism and skin-like quality of virtual characters for both
interactive and offline rendering.
V.7 Future Work
Our technique of simulating dynamic skin microstructure leaves open several areas for future
work. First, we currently employ simple linear kernels to filter the displacement map to approx-
imate the effects of stretching and compression. A limitation of linear filters is that they apply
the same amount of smoothing or sharpening to both ridges and grooves, whereas a ridge, filled
with tissue, is likely to deform less than a groove, filled with air. One can imagine developing
non-linear filters for microstructure deformation which might better simulate the local shape of
the deformed skin surface.
Second, while the results in this paper were made using supersampling for antialiasing, efficient
rendering of dynamic microstructure should leverage multiresolution surface detail rendering
techniques such as LEAN [112] or LEADR [41] mapping. LEADR mapping may be especially
relevant since it is designed to work with deformable animated surfaces and includes a microfacet
BRDF model with masking and shadowing. Ideally, the efficient resampling schemes of LEADR mapping could be augmented to accommodate the spatially-varying convolutions performed in our microstructure deformation process.
V.8 Conclusion
We have presented a fast approximate approach to simulating the effects of deforming surface
microstructure under compression and stretching where a high-resolution displacement map is
blurred and sharpened according to local surface strain. We measured normal distributions of real
skin samples under stretching and compression to drive the amount of blurring and sharpening
for animating faces. The results show a greater visual indication of surface tension seen in the
surface reflections than using a static microstructure displacement map, and more skin-like be-
havior for deforming surfaces. Since the technique can be implemented on the GPU at interactive
rates, it may be useful for rendering high-quality animated characters both for pre-rendered and
interactive applications.
V.9 Integrating Skin Microstructure Deformation for an Animated Character
To demonstrate the effectiveness of the skin microstructure deformation technique, we integrated
the technique with an animated digital character. First, we conducted high resolution facial scanning
to accurately record facial geometry and reflectance [55], storing the reflectance maps in
square 6K textures at submillimeter accuracy. Additionally, we employed the skin microstructure
capture technique of [60], and synthesized the skin microstructure for the entire neutral
face. The synthesized microstructure data was stored in a separate square 16K displacement
map. Based on reference photos, digital artists modeled hair, eyeballs including tears, eyelashes,
and eyebrows. The scan of the tongue was obtained in the face scanning session, and the teeth
geometry was recorded by scanning a teeth cast. From this data, we built a shader network for
displacement (Figure V.22) and for the other reflectance channels (Figure V.23). We implemented the shader in
the V-Ray production renderer. Figure I.3 shows a static rendering and the input photograph side by
side, using all the reflectance data shown in Figures V.22 and V.23.

Figure V.21: Recorded 3D geometry and reflectance maps.
We demonstrated our techniques with two types of animation: blendshape animation and captured
facial animation. In both cases, as a pre-process, we computed the anisotropic surface
stress field (Figure V.24) as described in Section V.4, and from this, we simulated microstructure
deformation using displacement map convolution. For the blendshape animation, we computed a
single dynamic microstructure map for each expression, and linearly blended them in the shader. For
the captured facial animation, after we computed the anisotropic stress field, we computed the dynamic
microstructure for every frame of the animation, and stored the results on disk. Since converting
a stress field to dynamic microstructure requires only common shader and image processing
Figure V .22: Shader network for displacement.
Figure V .23: Shader network for reflectance.
operations (e.g. a Gaussian blur), it is possible to implement this step within a rendering pipeline as done in the real-time demonstration in Section V.5.
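As an illustration of the offline path, the batch step can be as simple as the hypothetical loop below; the helper names and file naming are ours, standing in for the strain-field and convolution steps of Section V.4, not the production code:

import numpy as np

def bake_dynamic_microstructure(num_frames, neutral_disp,
                                load_strain_field, convolve_displacement):
    # Per frame: convert the precomputed strain field into a convolved
    # displacement map and store it for the production renderer.
    for frame in range(num_frames):
        strain = load_strain_field(frame)                        # per-pixel (r_s, r_t, axes)
        deformed = convolve_displacement(neutral_disp, strain)   # Section V.4
        np.save(f"dynamic_microstructure_{frame:04d}.npy", deformed)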
Figure V .24: Visualization of anisotropic stress field (bottom) and corresponding surface reflec-
tion showing spatially varying stretching and compression effects on the face.
Figure V.25: Demonstration of a fully integrated digital character with realistic eyes and hair, exhibiting dynamic changes in both facial textures and reflections. The face was rendered under an HDR sunset environment.
Figure V.26: Blendshape animation of the forehead raising the eyebrows up (top) and pulling them down (bottom).
Figure V.27: Blendshape animation of a cheek region showing a neutral state (top) and a puffed cheek (bottom).
Figure V .28: Simulating dynamic skin microstructure with captured facial performance (top) in-
creases the visual realism in the dynamic surface reflection (bottom). A stronger sense of surface
tension on the cheek provides a more sincere smile expression. The specular reflection image on
the bottom is brightened up for visualization purposes.
V.10 Appendix: Driving Skin BRDF Roughness from Deformation
In this chapter we have presented a technique to synthesize skin microstructure under deformation
by anisotropically convolving a displacement map. As demonstrated in a variety of examples,
the technique can significantly improve the realism of dynamic face rendering, especially when
a face is shown in close-up. The simulated dynamic microstructure changes the surface normal
distribution, providing realistic skin roughness variation when rendered with a proper
microfacet BRDF model. While simulating the dynamic surface roughness through this geometry
change is a viable approach, directly driving the BRDF roughness parameter is more efficient
when the rendered object is far enough away that the benefit of rendering the individual microstructure
is limited.
In this section, we discuss experimental results obtained by directly altering the BRDF roughness
parameter to match the skin patch measurements in Section V.3. Our measurement in Figure
V.9 provides the surface normal distribution change relative to the neutral state along with the
known amount of strain. Given this relation, we can look up the variance corresponding
to a particular strain amount, and scale the deformed roughness parameter relative to the
neutral one. Figure V.29 shows three different visualizations of surface strain fields on a deforming
sphere (first two rows), and on corresponding facial expressions (last two rows). We computed the 2D
per-pixel surface strain as done in Section V.4, and determined the corresponding variance value based on
the strain in the measurement (Figure V.9).
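One possible reading of this lookup, as a sketch (the sample curve below is illustrative, loosely following the trend of Figure V.9, not measured data): interpolate the measured variance at the current strain and scale the neutral roughness by its ratio to the neutral variance:

import numpy as np

def roughness_scale(strain, strain_samples, variance_samples):
    # Ratio of the normal-distribution variance at the given strain to the
    # variance at neutral (strain = 0); strain_samples must be increasing.
    var_neutral = np.interp(0.0, strain_samples, variance_samples)
    var_current = np.interp(strain, strain_samples, variance_samples)
    return var_current / var_neutral

# Illustrative curve: compression (negative strain) roughens, stretching smooths.
s = np.array([-0.10, -0.05, 0.0, 0.05, 0.10])
v = np.array([3.6e-3, 3.0e-3, 2.4e-3, 2.0e-3, 1.7e-3])

alpha_neutral = 0.35                                       # assumed base BRDF roughness
alpha_compressed = alpha_neutral * roughness_scale(-0.08, s, v)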
Figure V .30 compares surface reflection rendered on a deforming sphere with dynamic (a) and
static (b) BRDF roughness parameters. When squashed, the sphere rendered with the dynamic
microfacet distributions in (a) provides more plausible anisotropic surface reflection while the
surface highlight stays static without the dynamic microfacet distributions in (b).
Figure V.29: (a) Major strain magnitude, (b) surface area change, and (c) strain anisotropy.
Figure V.31 (a) and (b) show surface reflection renderings of the crow's feet on a smile expression rendered by the dynamic and static microfacet models, respectively. As the skin surface is crinkled, it exhibits more variation of surface orientations in the compressed direction, making the surface rougher in that direction. The dynamic model produces believably rougher specular highlights over the wrinkles.
Figure V.30: Rendered surface reflectance of a squashed sphere with dynamic (a) and static (b) microfacet distributions.
Figure V.31: Rendered surface reflectance of the crow's feet on a smile expression with dynamic (a) and static (b) microfacet distributions.
Chapter VI
Conclusion and Future Work
VI.1 Conclusion
As the static realism of digital characters increases, as we have shown in Figure I.3, it is
increasingly important to develop techniques that achieve dynamic realism comparable to the
static realism in order to avoid the Uncanny Valley. To this end, this thesis has presented three principal
techniques for achieving digital avatars which exhibit realistic dynamic appearance at multiple scales.

Our first contribution is a unified framework to solve for dense correspondence and temporally
consistent performance tracking from multi-view input. Our technique jointly solves for consistent
parameterization and stereo cues, directly warping an artist-friendly topology to the target.
Thus our technique does not involve expensive temporal tracking and is fully parallelizable.
Because every frame is processed independently, our process does not suffer from drift,
and it does not distinguish between individual static scans and performance sequences: both can be
processed together in the same framework. A benefit of such a framework is that model
building and performance tracking can be solved simultaneously, tasks which currently rely on
time-consuming manual effort and expensive sequential tracking. We have also presented a
technique that can produce medium- and high-frequency dynamic details to enhance the realism
of dynamic faces from passive, flat-lit performances.
Second, we have presented an acquisition system and a simulation algorithm for skin microstructure
deformation. We measure the surface microstructure deformation with a custom 10-micron
resolution scanner, and quantify how surface stretching and compression alter the apparent
surface reflection. Our basic approach for simulating skin microstructure consists of anisotropically
blurring and sharpening high resolution surface geometry to mimic surface stretching
and compression. Since our method only requires image filtering on a displacement map, it
can efficiently simulate the dynamic surface microstructure over the entire face along with
the facial animation. We demonstrated our technique in both real-time and offline rendering systems,
and showed how the skin microstructure can significantly improve the realism of animated faces.
Our third contribution is a constraint-free scene flow estimation for massively multi-view capture
systems which may consist of tens or hundreds of high resolution cameras. As high-end
productions today adopt extremely high resolution photogrammetry systems to meet the
demand for high quality digital characters, we need a motion estimation technique that scales
well to such high resolution input. Our constraint-free scene flow estimation technique does not rely
on regularization or target-specific models, and estimates 3D scene flow entirely from data. Since
our method does not involve global regularization, the motion estimation at each vertex is fully
decoupled. Thus, our method can solve for the 3D motion of each vertex in a fully parallel fashion.
Another benefit of a constraint-free approach is great generalizability to a variety of surfaces, as
the method assumes little about the target surface. To make our scene flow estimation technique
more robust, we have presented three strategies. The first is a robust descriptor representation of
the input image, which effectively ignores low frequency shading changes and greatly stabilizes
the tracking. The second is the design of weight terms for outlier pruning to favor good
inlier data. The third is a data-driven multi-scale propagation technique that propagates
a well-conditioned optimization landscape in a coarse-to-fine optimization, which ensures
a consistent motion field estimate and accounts for large deformations. Finally, our generic scene
flow estimation technique can be incorporated into any multi-view performance capture method to
improve the fidelity of the motion estimate.
VI.2 Future Work
We believe that the research conducted in this thesis opens avenues for new research directions in
high quality digital humans. Several problem-specific suggestions can be found at the end of each
chapter. In this section we highlight a few general directions.

The methods presented in this thesis have advanced the realism of digital faces significantly.
However, the techniques presented here are computationally demanding, and only accessible for
applications which can afford offline computation, such as film production. Implementing some
of the methods presented here in real time would facilitate the wider deployment of digital
humans in real-world applications, and unleash their full potential.
An example platform for implementing digital humans is an interactive 3D display. 3D displays,
including VR head mounted displays (HMDs), create immersive experiences where the 3D virtual
environment blends seamlessly with physical space. Three-dimensionally displayed virtual humans
will achieve a more realistic presence for users when they exhibit accurate eye gaze and body
language. This will allow for more effective and realistic interactions between real and virtual
participants, with appropriate facial expressions and gestures, actions, and reactions to verbal and
non-verbal stimuli. Specifically, the goal in using automultiscopic displays is to show the virtual
world with accurate perspective and 3D depth to multiple simultaneous viewers without the need
for special glasses. To that end, we present our prototypes of automultiscopic projector arrays for
interactive digital humans in the appendix (Chapter VII). We present two prototypes of the display:
one optimized for a life-size face, and another optimized for a life-size body. In the first prototype,
we demonstrate photorealistic rendering of a floating 3D face at an interactive rate by combining
high resolution facial scans. The result could be further improved by a real-time implementation
of the skin microstructure deformation of Chapter V to improve the realism of the character.
The same technique could be implemented in a VR headset to achieve more intimate interactions
with digital characters. Our second prototype, optimized for a life-size body, features interactive
rendering by resampling light field videos. The input light field video is recorded in an LED dome
consisting of a large number of HD video cameras (Figure IV.1). If 3D motion estimation is desired,
our constraint-free scene flow method of Chapter IV could serve as a solution for such high
resolution input. When combined with high quality facial performance capture as presented in
Chapter III, the 3D displays can provide immersive live teleconferencing as demonstrated in Jones
et al. [77].
Generally, traditional approaches to synthesizing a realistic animated face break the problem up
into multiple stages (i.e. measurement, modeling, animation, and simulation). Since each stage
has a different objective, and introduces errors, it is difficult to ensure that the resulting image
converges to the global objective (i.e. that the synthetic output closely resembles the input photograph).
A promising approach to quickly generate plausible content could be a learning-based generative
model such as [57]. While their results are currently limited in resolution, such a top-down approach
could be a viable way to provide plausible content on a limited budget. Important factors
for the success of such a learning-based method are the quantity and the quality of the input data. A
new method we have presented in Chapter III allows us to fully automatically correspond faces of
different individuals with arbitrary expressions on a single template. The resulting model maintains
pore-level correspondence, and our method could be used to automatically generate high
quality data for rapidly advancing machine learning.
Beyond a Photograph In this thesis, we have presented multiple techniques for synthesizing the
photorealistic appearance of animated faces. A consistent objective employed in the thesis is that
we minimize errors with photography as the reference, and thus the result is photoreal. However,
as there are many styles for representing digital characters (as we have reviewed in the Introduction),
"photorealism" is merely one of many ways to do so, and the "attractiveness" of the character
is not necessarily on the same axis as how photoreal it is, as suggested by the Uncanny Valley.

To date, there has been a great deal of research on creating photorealistic computer graphics, including
digital humans. The primary goal has been to match a photograph, as it is one of the most
reliable and measurable ways to achieve this objective. An advantage of the capture techniques is
that even if we do not know anything about the ingredients that make a person look like him or
her, we can blindly copy every single detail down to the microstructure (Chapter V), and can achieve
a photorealistic look. I believe that one important purpose of this thesis has been to provide a
technical solution so that if a photorealistic look is desired, a tool is already available for digital
creators.
However, the current digital character pipeline still relies on manual labor and extensive artistic
input, as matching a photograph is often not the true objective of digital character creation.
What this thesis did not study is which components are most essential in conveying the impression
of a person. This is an ill-defined problem, as the answer depends on the observer.
Nonetheless, experienced artists are good at extracting the unique characteristics of a person, and can
effectively enhance them, or, if necessary, abstract away unimportant details to stylize. A good example
is a caricature, where person-specific characteristics are amplified to make it more like the
face than the face itself. The quality of a digital character owes much to the excellence of the digital
artistry, as an attractive digital character is not necessarily a reproduction of the real world.
This is why a computer graphics pipeline always needs artists in the loop, and this is how digital
humans can go beyond a photograph.

As we become better equipped with the technical capability to create realistic digital characters, we become
more able to explore the true objective for creating believable digital characters. This may involve
more studies in human perception and psychology. Most importantly, this can only be done
by working with digital artists, who have been optimizing their approach for that objective.
Chapter VII
Appendix: Automultiscopic Projector Arrays for Interactive Digital Humans
Figure VII.1: (left) A 3D face is shown on our autostereoscopic 3D projector array. (center, right) The display combines autostereoscopic horizontal parallax with vertical tracking to generate different horizontal and vertical views simultaneously over a full 110° field of view. The stereo pairs are left-right reversed for cross-fused stereo viewing.
VII.1 Introduction
So far in this thesis, we have proposed techniques to create content for photorealistic digital characters.
Another technical challenge in achieving a realistic three-dimensional digital character lies
in the presentation method. When a realistic character is seen from correct perspectives with proper
eye gaze and body language in real space, it generates immersive and engaging interactions
with users. Such a tool can be a powerful platform for simulating social interactions and human
communication. However, unlike physical humanoid robots, the three-dimensional appearance
of digital characters needs to be simulated by mimicking view-dependent shapes and reflections
for any user perspective. While there is no single display form factor suitable for all
content types and applications, some ideal features for displaying realistic 3D digital characters
include:
- depth cues: the display covers basic depth cues, such as stereo parallax, motion parallax, focus, occlusion, lighting and shading, field of view, etc. [21].
- autostereoscopic: the display does not require any special glasses.
- multiple users: the display is capable of providing view-dependent content over a wide field of view to accommodate multiple users at the same time.
- interactive: the display allows real-time interaction with the character.
- life-size: the character appears life-size, providing a realistic sense of scale.
- full color: the character appears rich in color.
To explore a platform suitable for displaying realistic digital humans, we present automultiscopic
(autostereoscopic and multi-view [167]) displays which exhibit the above properties, and demon-
strate how high quality digital humans can be rendered in 3D on these displays. We propose two
prototypes of automultiscopic projector array systems. Our first prototype, optimized for displaying
a life-size human face, displays a 3D floating face with hybrid vertical and horizontal parallax.
Since the number of views that need to be generated scales proportionally to the number of users,
most autostereoscopic displays are limited to horizontal parallax only (HPO). While this is a reasonable
assumption, as most viewers move side-to-side, additional vertical parallax can show imagery
to multiple users at different heights and distances, and allows multiple users to gather around
the displayed character as if he or she were actually present. To that end, we propose a novel
vertical parallax rendering algorithm that renders optimized vertical perspectives for multiple
users. We also demonstrate that, by taking advantage of high quality dynamic facial scans, our display
can show a dynamic, realistic human face at an interactive frame rate. We also show different
implementations of the system, including rear and front projection, and examine how different
screen surface magnification factors influence the system's optical properties. In the second prototype,
optimized for displaying a life-size human body, we explore a light field rendering framework
for automultiscopic displays which can show the realistic appearance of a human performance
in real time.
Projector arrays are well suited for 3D displays because of their ability to generate dense and
steerable arrangements of pixels. As video projectors continue to shrink in size, power consumption,
and cost, it is possible to stack projectors closely so that their lenses are almost continuous.
We present a new HPO display utilizing a single dense row of projectors. A vertically anisotropic
screen turns the glow of each lens into a vertical stripe while preserving horizontal angular variation.
The viewer's eye perceives several stripes from multiple projectors that combine to form
a seamless 3D image. Rendering to such a display requires the generation of multiple-center-of-projection
(MCOP) imagery, as different projector pixels diverge to different viewer positions.
Jones et al. [78, 77] proposed an MCOP rendering solution in the context of high-speed video
projected onto a spinning anisotropic mirror. A front-mounted projector array can be seen as an
unfolded spinning mirror display where each high-speed frame corresponds to a different discrete
projector position. In this chapter, we extend this framework for use with both front- and
rear-projection 3D projector arrays.
As every viewer around an HPO display perceives a different 3D image, it is possible to customize
each view with a different vertical perspective. Such a setup has the unique advantage that every
viewer can have a unique height while experiencing instantaneous horizontal parallax. Given a
sparse set of tracked viewer positions, the challenge is to create a continuous estimate of viewer
height and distance for all potential viewing angles that provides consistent vertical perspective
to both tracked and untracked viewers. The set of viewers is dynamic, and it is possible that
the tracker misses a viewer particularly as viewers enter or leave the viewing volume. Previous
techniques for rendering MCOP images for autostereoscopic displays [78, 77] assume a constant
viewer height and distance for each projector frame. In practice, this limitation can result in
visible distortion, tearing, and crosstalk where a viewer sees slices of multiple projector frames
rendered with inconsistent vertical perspective. This is especially visible when two viewers are
close together but have different heights. We solve this problem by dynamically interpolating
multiple viewer heights and distances within each projector frame as part of a per-vertex MCOP
computation and compare different interpolation functions. Our algorithm can handle both flat
screens as well as convex mirrored screens that further increase the ray spread from each projector.
The primary contributions are:
(i). An autostereoscopic 3D projector array display built with off-the-shelf components
(ii). A new per-vertex projection algorithm for rendering MCOP imagery on standard graphics hardware
(iii). An interpolation algorithm for computing multiple vertical perspectives for each projector
(iv). An analysis of curved mirrored screens for autostereoscopic projector arrays
VII.2 Life-size Facial Display
To achieve maximum angular resolution, it is preferable to stack projectors as closely as possible.
Our projector array system consists of 72 Texas Instruments DLP Pico projectors, each of which
has 480×320 resolution. The projectors are evenly spaced along a 124cm curve with a radius of
60cm. This setup provides an angular resolution of 1.66° between views. At the center focus
of the curve, we place a 30cm × 30cm vertically anisotropic screen. The ideal screen material
should have a wide vertical diffuse lobe so the 3D image can be seen from multiple heights, and
a narrow horizontal reflection that directs different projector pixels to varying horizontal views.
When a projected image passes through the vertically anisotropic screen, it forms a series of
vertical stripes. Without any horizontal diffusion, the stripe width is equivalent to the width of the
projector lens. Each projector is 1.42cm wide with a 4mm lens; so to eliminate any gap between
stripes would require stacking several hundred projectors in overlapping vertical rows [82].
We found that acceptable image quality could be achieved with a single row of projectors if we
use a holographic diffuser to generate 1-2 degrees of horizontal diffusion and stack the projectors
with a 2mm gap (see Figure VII.2). The width of the diffusion lobe should be equal to the angle
between projectors.
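The quoted view spacing follows directly from this geometry; the quick check below (our own arithmetic, dividing the arc by the 71 gaps between 72 projectors) roughly reproduces the ~1.66° figure:

import math

num_projectors = 72
arc_cm, radius_cm = 124.0, 60.0

arc_deg = math.degrees(arc_cm / radius_cm)        # ~118.4 degree total fan
spacing_deg = arc_deg / (num_projectors - 1)      # ~1.67 degrees between views
print(arc_deg, spacing_deg)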
For rear-projection setups, we use a single holographic diffuser with a horizontal 2° and vertical
60° scattering profile (see Figure VII.3). Alternatively, for front projection, we reflect off a fine
lenticular screen behind the holographic diffuser (see Figure VII.1). The lenticular screen is
painted matte black on the reverse flat side to reduce ambient light and improve black levels.
As the light passes twice through the diffuser, only 1° of horizontal diffusion is required. All
our screen components are currently available off-the-shelf. The holographic diffusers are from
Figure VII.2: The anisotropic screen forms a series of vertical lines, each corresponding to a projector lens. A 1-2° horizontal diffuser is used to blend the lines and create a continuous image (columns: without and with the diffuser). The top row shows stripes and imagery reflected on a flat anisotropic screen. The bottom row shows imagery reflected on a convex anisotropic screen. By varying the curvature of a mirrored anisotropic screen, we can decrease the pitch between reflected projector stripes. This increases the spatial resolution at the screen depth at the expense of overall angular resolution and depth of field.
Luminit Co. The lenticular material is a 60lpi 3D60 plastic screen from Microlens Technology.
Other common anisotropic materials such as brushed metal can achieve similar vertical scattering,
but have limited contrast ratio and would not work for rear-projection setups.
We drive our projector array using a single computer with 24 video outputs from four ATI Eye-
finity graphics cards. We then split each video signal using 24 Matrox TripleHeadToGo video
splitters, each of which supports three HDMI outputs. To track viewer positions, a Microsoft
Kinect camera is mounted directly above the screen. The depth camera is ideal as it provides both
viewer height and distance; however, our interpolation method would work with other 3D tracking
methods.
VII.2.1 Calibration
Even with a machined projector mount, there is still noticeable geometric and radiometric vari-
ation between projectors. We automate the geometric calibration of the projectors using a 2D
rectification process to align projector images to the plane of the screen. We first place a dif-
fuse surface in front of the screen, then sequentially project and photograph an AR marker from
each projector [154] (see Figure VII.3). We then compute a 2D homography matrix that maps
the detected marker corners in each projector to a physical printed marker also attached to the
screen. At the same time, we also measure the average intensity of light from each projector on
the diffuse surface and compute a per-projector scale factor to achieve consistent brightness and
color temperature across the entire projector array.
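The alignment and brightness correction above can be sketched with OpenCV as follows. This is a
minimal illustration, not the calibration code itself: the corner coordinates and measured intensities are
hypothetical placeholders, whereas the real pipeline detects the AR marker corners from the calibration
photographs.

import cv2
import numpy as np

# corners of the projected AR marker expressed in this projector's pixel coordinates
# (hypothetical values; detected from the calibration photograph in practice)
proj_corners = np.float32([[120, 80], [360, 90], [355, 300], [125, 310]])
# corners of the physical printed marker attached to the screen, in screen units
screen_corners = np.float32([[0, 0], [100, 0], [100, 100], [0, 100]])

# 2D homography mapping this projector's pixels onto the plane of the screen
H, _ = cv2.findHomography(proj_corners, screen_corners)

# radiometric correction: scale each projector so its average output on the diffuse
# surface matches a common target brightness and color temperature
measured_rgb = np.array([0.82, 0.78, 0.74])   # hypothetical per-channel average
target_rgb = np.array([0.75, 0.75, 0.75])
per_projector_scale = target_rgb / measured_rgb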
Figure VII.3: (left) Photograph of our rear-mounted projector array setup. (right) Photograph of
the calibration setup for front-mounted projector array. To compute relative projector alignment,
we sequentially correspond a virtual AR marker generated by each projector with a printed AR
marker.
VII.2.2 Viewer Interpolation
As described above, anisotropic screens do not preserve a one-to-one mapping between projectors
and views as projector rays diverge to multiple viewers at potentially different angles, heights,
and distances. To generate an image that can be viewed with a single perspective, we must render
MCOP images that combine multiple viewing positions. A brute force solution would be to pre-
render out a large number of views with regular perspective and resample these images based on
the rays generated by the device as done by Rademacher et al. [116]. A variant of this technique
was implemented earlier [78] for existing photographic datasets. However, resampling introduces
a loss of resolution, and this method does not scale well to large, dense projector arrays with many
potential horizontal and vertical positions.
For scenes with known geometry, an alternative approach is to replace the standard perspective
projection matrix with a custom MCOP projection operator that maps each 3D vertex point to
2D projector coordinates based on a different viewing transform. A per-vertex geometric warp
can capture smooth distortion effects and be efficiently implemented in a standard GPU vertex
shader. Our method is based on a similar approach used in [78] to generate MCOP renders.
The first step is to project each 3D vertex onto the screen. For each vertex (Q), we trace a ray
from the current projector position (P) through the current vertex (Q). We intersect this first ray
(PQ) with the screen surface to find a screen point (S_P). The second step is to compute a viewing
position associated with this vertex. The set of viewers can be seen as defining a continuous
manifold spanning multiple heights and distances. In the general case, the intersection of the ray
(PQ) with this manifold defines the corresponding viewer (V). Finally, we trace a second
ray from the viewer (VQ) back to the current vertex (Q) and intersect with the screen a second
time (S_V). The actual screen position uses the horizontal position of S_P and the vertical position
of S_V. This entire process can be implemented in a single GPU vertex shader. In essence, the
horizontal screen position is determined by projecting a ray from the projector position, while
the vertical screen position is based on casting a ray from the viewer position. The difference
in behavior is due to the fact that the anisotropic screen acts as a diffuse surface in the vertical
dimension. We multiply the final screen position by the current calibrated projector homography
to find the corresponding projector pixel position for the current vertex.
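The following Python sketch illustrates the two-ray construction for a single vertex. The simplifying
assumptions are ours, not the implementation's: the screen is taken to be the plane z = 0, the viewer
manifold is a single vertical cylinder of fixed radius, and the viewer height is a constant.

import numpy as np

def intersect_ray_plane_z0(origin, direction):
    # screen assumed to lie in the plane z = 0
    t = -origin[2] / direction[2]
    return origin + t * direction

def intersect_ray_cylinder(origin, direction, radius):
    # vertical viewing cylinder of the given radius about the y (up) axis
    a = direction[0]**2 + direction[2]**2
    b = 2.0 * (origin[0]*direction[0] + origin[2]*direction[2])
    c = origin[0]**2 + origin[2]**2 - radius**2
    t = (-b + np.sqrt(b*b - 4.0*a*c)) / (2.0*a)   # far root; projector lies inside the cylinder
    return origin + t * direction

def project_vertex(Q, P, viewer_radius, viewer_height):
    PQ = Q - P
    S_P = intersect_ray_plane_z0(P, PQ)                 # horizontal screen position
    V = intersect_ray_cylinder(P, PQ, viewer_radius)    # viewer hit by the projector ray
    V[1] = viewer_height                                # constant-height viewer manifold
    S_V = intersect_ray_plane_z0(V, Q - V)              # vertical screen position
    return np.array([S_P[0], S_V[1], 0.0])              # x from S_P, y from S_V

# e.g. a rear projector 0.6 m behind the screen and viewers on a 2 m circle at 1.6 m eye height:
# project_vertex(np.array([0.05, 0.1, 0.1]), np.array([0.0, 0.3, -0.6]), 2.0, 1.6)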
In practice, it is not easy to define the manifold of viewer positions as we only know a few sparse
tracked positions. We propose a closed-form method for approximating this manifold as a series
of concentric rings. Previously Jones et al. [78] assumed that the viewing height and distance was
constant for all projectors. This arrangement corresponds to a single circle of potential viewers.
As the viewers are restricted to lie on this circle, the viewpoints represented by each rendered
MCOP image vary only in their horizontal angle, with no variation in height and distance. One
trivial extension would be to interpolate the viewer height and distance once per projector
frame and adjust the radius of the viewing circle. In our comparison images we refer to this
method as "per-projector" interpolation. However, when tracking multiple viewers close together,
it is possible for the viewing height to change rapidly within the width of a single MCOP projector
frame.
We solve this issue by interpolating the viewer height and distance per-vertex. We pass the nearest
two tracked viewer positions to the vertex shader. Each viewer height and distance defines a
cylinder with a different viewing radius. We then intersect the current projector-vertex ray (PQ)
with both cylinders. To determine the final viewing position for this ray, we compute a weighted
average of the viewer heights and distances. The interpolation weight of each tracked viewer
position is a function of the angle (θ_1, θ_2) between the cylinder intersection and the original
tracked point. A top-down view of these intersections is shown on the left side of Figure VII.4.
When computing the weighted average, we also add a third default value with very low weight.
This value corresponds to the average user height and distance appropriate for untracked viewers.
As the viewer’s angle from each tracked point increases, the influence of the point decays and the
viewer returns to the default height and distance (right side of Figure VII.4). We implemented
two different falloff functions centered around each tracked point: a normalized gaussian and a
constant step hat function. The gaussian function smoothly decays as you move away from a
tracked viewer position, while the hat function has a sharp cutoff. For all the results shown in
the paper, the width of the gaussian and hat functions was 10 degrees. The weight function can
be further modulated by the confidence of the given tracked viewer position. This decay makes
the system more robust to viewers missed by the tracking system or new viewers that suddenly
enter the viewing volume. If two viewers overlap or stand right above each other, then their
average height and distance is used. In the worst case, the system reverts back to a standard
autostereoscopic display with only horizontal parallax. Pseudocode for computing per-vertex
projection and viewer interpolation can be found in Table VII.1. The final rendered frames appear
warped as different parts of each frame smoothly blend between multiple horizontal and vertical
perspectives. Figure VII.5 shows a subset of these MCOP frames before they are sent to the
projector.
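A minimal Python sketch of this weighting scheme is given below. The 10-degree falloff width matches
the value above, while the default height and default weight are placeholder values of ours.

import numpy as np

def gaussian_falloff(angle_deg, width_deg=10.0):
    # normalized gaussian centered on the tracked viewer
    return float(np.exp(-0.5 * (angle_deg / width_deg) ** 2))

def hat_falloff(angle_deg, width_deg=10.0):
    # constant step "hat": full weight inside the window, zero outside
    return 1.0 if abs(angle_deg) <= width_deg else 0.0

def interpolate_height(angles_deg, heights, confidences,
                       default_height=1.7, default_weight=0.05,
                       falloff=gaussian_falloff):
    # weighted average of tracked viewer heights plus a low-weight default viewer
    weights = [falloff(a) * c for a, c in zip(angles_deg, confidences)]
    numerator = sum(w * h for w, h in zip(weights, heights)) + default_weight * default_height
    denominator = sum(weights) + default_weight
    return numerator / denominator

# a viewer 4 degrees from the intersection (confidence 0.9) and another 25 degrees away:
# interpolate_height([4.0, 25.0], [1.55, 1.85], [0.9, 1.0])   # result stays close to the nearby viewer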
A related MCOP rendering algorithm was also proposed in [77]. In this later work, the entire
per-vertex projection operator from 3D vertices to 2D projector positions was precomputed as
a 6D lookup table based on 3D vertex position, mirror angle, and viewer height and distance.
The lookup table was designed to handle more complex conical anisotropic reflections that occur
when projector rays are no longer perpendicular to the mirror’s vertical anisotropic axis. For
projector arrays, we found that rays scattered by the screen remained mostly planar with very
little conical curvature. Furthermore, the lookup table’s height and distance was indexed based on
a single reflection angle per projector frame. This assumption is analogous to our "per-projector"
interpolation examples. Instead of modeling different heights and distances for each projector,
Jones et al. used a concave anisotropic mirror to optically refocus the projector rays back towards
a single viewer. Such an approach would not work for a rear-mounted projector array where there
is no mirror element. Our software solution is more general and allows for a wider range of screen
shapes.
[Figure VII.4 diagrams: projectors, screen, vertex Q, screen point S, and tracked viewers V_1 and V_2 with angles θ_1 and θ_2 (left); the interpolated viewer curve and the default viewer height and distance (right).]
Figure VII.4: (left) Diagram showing how we compute the corresponding viewing position for
each vertex by tracing a ray from the projector position P through the vertex Q. We intersect the
ray with the screen to find the horizontal screen position S. We intersect the ray with viewing
circles defined by the nearest two tracked viewers (V_1, V_2). We interpolate between the tracked
viewer positions based on their angular distance from the intersection points. (right) Diagram
showing the continuous curve formed by the interpolated viewer positions. The curve returns to
a default viewer height and distance away from tracked viewer positions (V_1, V_2).
VII.2.3 Convex Screen Projection
For a front-mounted array, the pitch between reflected stripes can also be reduced by using a
convex reflective anisotropic screen. The convex shape magnifies the reflected projector positions
so they are closer to the screen, with narrower spacing and a wider field of view (Figure VII.2).
As less horizontal diffusion is required to blend between stripes, objects at the depth of the screen
have greater spatial resolution. This improved spatial resolution comes at the cost of angular
resolution and lower spatial resolution further from the screen. Zwicker et al. [167] provide a
[Figure VII.5 panels: flat screen and convex screen.]
Figure VII.5: Our algorithm renders out a different MCOP image for each of 72 projectors. This
is a sampling of the generated images using per-vertex vertical parallax blending with a gaussian
falloff function. Each image smoothly blends between multiple horizontal and vertical viewer po-
sitions which gives the appearance of an unwrapped object. Flat front and rear projection screens
produce almost identical imagery. Convex screens have additional warping as each projector
spans a wider set of viewers.
framework for defining the depth of field and display bandwidth of anisotropic displays. For a
given initial spatial and angular resolution, the effective spatial resolution is halved every time the
distance from the screen doubles. In Figure VII.6 we plot the tradeoff between spatial resolution
and depth of field given the initial specifications of our projector prototype. A curved mirror is
also preferable for 360° applications where an anisotropic cylinder can be used to reflect in all
directions without any visible seams.
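Read as a rule of thumb, this bandwidth bound can be written as follows. The 300 stripes-per-meter
on-screen density used in the example is an assumed value for illustration, not a measurement of the
prototype.

import math

def effective_resolution(distance_m, on_screen_stripes_per_m, angular_spacing_deg):
    # Away from the screen plane, resolvable detail is limited by the angular sampling:
    # roughly 1 / (d * delta_theta) stripes per meter at distance d, capped by the on-screen
    # stripe density. This reproduces "halved every time the distance doubles."
    if distance_m <= 0:
        return on_screen_stripes_per_m
    angular_limit = 1.0 / (distance_m * math.radians(angular_spacing_deg))
    return min(on_screen_stripes_per_m, angular_limit)

# effective_resolution(0.5, 300, 1.66) -> about 69 stripes/m
# effective_resolution(1.0, 300, 1.66) -> about 35 stripes/m (halved)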
As a convex mirror effectively increases each projector's field of view, each projector frame must
represent a wider range of viewer positions. It becomes more critical to compute multiple viewer
heights per projector frame. In general, the projection algorithm is the same as for a flat screen, with
two modifications. For front-mounted setups that use a mirrored anisotropic screen, we first un-
fold the optical path to find reflected projector positions. The rays reflected off a convex mirror do
not always reconverge at a single reflected projector position (Figure VII.6 (right)). A first order
approximation is to sample multiple reflected rays and average the resulting intersection points.
This can still result in distortion for extreme viewing angles. A more accurate projector position
can be computed per-vertex. In the per-vertex projection, we use the average reflected projector
to compute an initial intersection point with the screen. Based on the local curvature around this
screen point, we can then interpolate a second more accurate reflected projector position that is
accurate for that part of the screen. This process could be iterated to further refine the reflected
projector position, though in our tests the reflected positions converged in a single iteration. Sec-
ondly, we discard rays that reflect off the convex mirror near grazing angles as these regions are
extremely distorted and are very sensitive to small errors. In comparison to a flat screen, a con-
vex screen requires rendered frames covering a wider variation of views and greater per-vertex
warping (Figure VII.5).
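The reflected-projector estimate can be sketched in 2D (top-down) as below. Instead of averaging
pairwise intersections of the sampled reflected rays, this version solves for the least-squares point
closest to all of them, which serves the same purpose; the mirror geometry and sample angles are
illustrative assumptions.

import numpy as np

def reflect(d, n):
    # reflect direction d about unit normal n
    return d - 2.0 * np.dot(d, n) * n

def virtual_projector_2d(P, mirror_center, mirror_radius, sample_angles):
    # least-squares point closest to all reflected rays: an approximate virtual projector position
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for theta in sample_angles:
        n = np.array([np.cos(theta), np.sin(theta)])   # outward mirror normal at the sample
        x = mirror_center + mirror_radius * n          # sample point on the convex mirror
        d = x - P
        d = d / np.linalg.norm(d)
        r = reflect(d, n)                              # reflected ray direction
        M = np.eye(2) - np.outer(r, r)                 # normal equations for distance to the line (x, r)
        A += M
        b += M @ x
    return np.linalg.solve(A, b)

# e.g. a projector 0.6 m from a convex mirror section with a 0.5 m radius of curvature:
# virtual_projector_2d(np.array([0.0, 0.6]), np.array([0.0, -0.5]), 0.5,
#                      np.radians([80, 85, 90, 95, 100]))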
VII.2.4 Results
Our display can generate stereo and motion parallax over a wide field of view. While it is difficult
to communicate the full 3D effect with 2D figures, Figure VII.1 shows several stereo photographs
printed for cross-fuse stereo viewing. The motion parallax can also be seen throughout the ac-
companying video in [80].
To evaluate our view interpolation algorithms, we render multiple geometric models from a vari-
ety of heights and distances (Figure VII.7). We use a checkered cube to identify any changes in
perspective or warping and a spherical model of the Earth to illustrate correct aspect ratio. We
also show two high-resolution face models as examples of organic shapes. We then photographed
the display from three views: a lower left view, a center view, and a high right view. We measured
the real camera positions and rendered matching virtual views to serve as ground-truth validation
(Figure VII.7, row 1). With no vertical tracking, the display only provides horizontal parallax and
[Figure VII.6, left: graph of spatial resolution (stripes per meter) versus distance from screen (meters) for horizontal magnifications x1 (flat) through x5. Right: diagram of projectors, reflected projectors, and screen.]
Figure VII.6: (left) Graph showing tradeoff between spatial magnification using convex mirrored
screens and depth of field. Greater mirror curvature increases density of projector stripes and
spatial resolution at the screen, however spatial resolution falls off rapidly away from the screen.
(right) For a convex mirrored screen, reflected projector rays no longer converge on a single
projector position. We sample multiple points on the mirror surface and compute an average
reflected projector position for each real projector. We then iteratively refine the reflected position
per-vertex by tracing a ray from the average position through the vertex to the convex screen. We
then compute a more accurate reflected position based on the local neighborhood of the screen
point.
all viewers will see a foreshortened image as they move above or below the default height (row
2). Note that the top and bottom faces of the cube are never visible and the facial eye gaze is
inconsistent. If we enable tracking for the left and right cameras (rows 3 through 6), then it is
possible to look up and down on the objects.
In Figure VII.7, we also compare our new interpolation algorithm for handling multiple different
viewers. For rows 5 and 6, we compute a per-vertex viewer height and distance as described in
Section VII.2.2. Per-vertex vertical parallax interpolation produces plausible and consistent per-
spective across the entire photographed view. In contrast, rows 3 and 4 demonstrate interpolation
that uses a constant viewer height and distance per projector. Each projector still interpolates
[Figure VII.7 columns: low left, center, and high right views for each of the three objects.]
Figure VII.7: This figure shows three virtual objects viewed by an untracked center camera and
two tracked left and right cameras. As a ground-truth comparison, we calibrate the positions of
three cameras and render out equivalent virtual views of the objects (1st row), while the remaining
rows show photographs of the actual display prototype. If a single constant viewer height and dis-
tance is used then the viewer sees significant foreshortening from high and low views (2nd row).
We also compare different viewer interpolation functions for interpolating multiple viewer heights
and distances. The tracked view positions are interpolated either with a constant height/distance
per-projector (3rd and 4th rows) or with different height/distance per-vertex (5th and 6th rows).
Photographs taken with per-vertex interpolation show less distortion with consistent vertical
perspective across the entire image, and the untracked center view is not affected by the nearby
left viewer. Photographs with per-projector interpolation exhibit multiple incorrect perspectives
with warped lines and image tearing, and the untracked center view is distorted by the nearby left
viewer. The local weight falloff of each tracked position is implemented as either a normalized
gaussian (3rd and 5th rows) or sharp step hat function (4th and 6th rows). Gaussian interpola-
tion errors appear as incorrect curved lines while errors using a sharp hat falloff result in image
tearing.
the nearest two tracked viewer positions; however, the interpolation weights are uniform across
all vertices. Per-projector interpolation generates significant distortion for all three views where
the vertical perspective on the left side of the image does not match the perspective on the right
side. These errors also depend on the shape of the weight falloff function. Using per-projector
gaussian weights (row 3) makes straight lines curved while a per-projector step hat function (row
4) causes image tearing as the view abruptly changes from one vertical height to another. The
distortion is less visible on organic objects such as a face, though the left and right eyes are no
longer looking in the same direction. Additional results in the video show how this distortion
ripples across the geometry as the camera moves back and forth between two tracked projector
positions.
Another advantage of per-vertex interpolation is that it reduces errors for untracked viewers. Un-
tracked viewers see the 3D scene based on a default height and distance that should not be affected
by the vertical movement of nearby tracked viewers. Despite the fact that each projector frame
may be seen by multiple viewers, by computing multiple vertical perspectives within each pro-
jector frame (per-vertex) we can isolate each viewer (Figure VII.7, rows 5 and 6). In contrast,
using a single height per-projector, the center view of the face appears distorted as this untracked
view shares some projectors with the nearby lower left camera (rows 3 and 4). The same effect
is shown for a dynamic user in the accompanying video of [80]. Without per-vertex interpolation
(time 3:18), the untracked viewer is clearly distorted whenever the tracked viewer is nearby. At
time 3:05 in the video, you can see that with per-vertex interpolation, this extraneous crosstalk is
considerably reduced.
We tested our projection algorithm on a curved mirror with a 10 degree curvature and a mag-
nification factor of 1.43. Figure VII.8 shows imagery on the curved mirror with and without
[Figure VII.8 columns: left low, center, and high right views.]
Figure VII.8: Comparison of different viewer interpolation functions for a convex mirror. The
top row uses per-vertex viewer height and distance with a gaussian falloff. The bottom row uses
constant height and distance per-projector with a gaussian falloff. Photographs taken with per-
vertex interpolation show less distortion with consistent vertical perspective. In contrast, straight
lines appear curved in photographs using constant per-projector interpolation.
per-vertex vertical parallax interpolation. In the latter case, perspective distortion increases signif-
icantly. This distortion could be reduced by using a wider gaussian function so that nearby frames
would be forced to have similar heights, but this would have the negative effect of more crosstalk
with nearby viewers. To validate our projection model for a wider range of screen shapes, we
developed a projector array simulator that can model arbitrary screen curvature and diffusion ma-
terials. The simulator uses the same render engine but projects onto a screen with a simulated
anisotropic BRDF. As shown in the accompanying video (time 1:28), we can maintain a stable
image with correct perspective as a mirror shape changes significantly. We can also determine the
ideal horizontal diffusion width by simulating different anisotropic reflectance lobes.
Our system uses a standard Windows PC with no special memory or CPU. The only requirement
is that the motherboard can accommodate 4 graphics cards. The CPU is primarily used for user-
tracking; all other operations are performed on the GPU. The animations shown in the video used
around 5000 triangles and ran at 100-200fps on an ATI Eyefinity 7800. To achieve maximum
performance, rendering is distributed across all four graphics cards. As not all graphics libraries
provide this low level control, we explicitly create a different application and render context per
GPU, with the different instances synchronized through shared memory. The main bottleneck is
Front Side Bus data transfer as textures and geometry need to be uploaded to all GPUs.
void warpVertex(
    float3 Q : POSITION,           // input 3D vertex position
    uniform float3x3 Homography,   // 2D projector homography
    uniform float3 P,              // current projector position
    uniform float3 T[],            // tracked viewer positions
    uniform float C[],             // tracked viewer confidences
    out float3 oQ : POSITION)      // output 2D screen coordinate
{
    // use reflected projector position if front-projection screen
    P = computeProjectorPosition(P);
    // find viewer that sees current vertex from current projector
    float3 V = interpolateViewer(Q, P, T, C);
    // define ray from viewer position V to vertex Q
    float3 VQ = V - Q;
    // intersect with planar or cylindrical curved screen
    float3 S = intersectRayWithScreen(VQ);
    // apply projector homography to screen coordinate
    oQ = mul(Homography, S);
}

// interpolate tracked viewer positions per-vertex
float3 interpolateViewer(float3 Q, float3 P, float3 T[], float C[])
{
    float sumWeight = 0;
    float3 currentViewer = 0;
    // define ray from (reflected) projector position P to vertex Q
    float3 PQ = Q - P;
    for each tracked viewer t
    {
        // radius of cylinder is tracked viewer distance
        float3 I = intersectRayWithCylinder(PQ, radius(t));
        // compute angle between intersection and tracked viewer
        float angle = computeAngle(T[t], I);
        // falloff function could be gaussian or hat function
        float weight = falloffFunction(angle);
        // also weight by tracking confidence
        weight *= C[t];
        currentViewer += T[t] * weight;
        sumWeight += weight;
    }
    // add in default viewer with low default weight
    currentViewer += default_weight * default_viewer;
    sumWeight += default_weight;
    // compute weighted average of tracked viewer positions
    return currentViewer / sumWeight;
}
Table VII.1: Vertex shader pseudocode that projects a 3D scene vertex into 2D screen coordinates
as described in Section VII.2.2. The code interpolates between multiple tracked viewer positions
per vertex. It assumes helper functions are defined for basic geometric intersection operations.
VII.3 Life-size Full Body Display
VII.3.1 Display Hardware
Previous interactive "digital humans" systems [88, 50] were displayed life-size but using conven-
tional 2D technology such as large LCD displays, semi-transparent projection screens or Pepper’s
ghost displays [89]. While many different types of 3D displays exist, most are limited in size
and/or field of view. Our system utilizes a large automultiscopic 3D projector array display ca-
pable of showing a full human body. Early projector array systems [31, 57] utilized a multilayer
vertical-oriented lenticular screen. The screen optics refracted multiple pixels behind each cylin-
drical lens to multiple view positions. Recent projector arrays [84, 40, 111, 37] utilize different
screens based on vertically scattering anisotropic materials. The vertical scattering component al-
lows the image to be seen by multiple heights. The narrow horizontal scattering allows for greater
angular density as it preserves the angular variation of the original projector spacing. In order to
reproduce full-body scenes, the projectors making up the array require higher pixel resolutions
and brightness. Secondly as full bodies have more overall depth, we must increase the angular
resolution to resolve objects further away from the projection screen. We use LED-powered Qumi
v3 projectors in a portrait orientation, each with 1280 by 800 pixels of image resolution (Figure
VI.2 and VI.3). A total of 216 video projectors are mounted over 135° in a 3.4 meter radius
semi-circle. At this distance, the projected pixels fill a 2 meter tall anisotropic screen with a life-
size human body (Figure VI.3). The narrow 0.625° spacing between projectors provides a large
display depth of field. Objects can be shown within about 0.5 meters of the screen with minimal
aliasing. For convincing stereo and motion parallax, the angular spacing between views was also
chosen to be small enough that several views are presented within the interocular distance. The
screen material is a vertically-anisotropic light shaping diffuser manufactured by Luminit Co.
The material scatters light vertically (60°) so that each pixel can be seen at multiple viewing
heights while maintaining a narrow horizontal blur (1°). From a given viewer position, each
projector contributes a narrow vertical slice taken from the corresponding projector frame. In
Figure VI.3, we only turn on a subset of projectors, allowing these vertical slices to be observed
directly. As the angular resolution increases, the gaps decrease in size. Ideally, the horizontal
screen blur matches the angular separation between projectors thus smoothly filling in the gaps
between the discrete projector positions and forming a seamless 3D image. To maintain modu-
larity and some degree of portability, the projector arc is divided into three separate carts, each
spanning 45° of the field of view. We use six computers (two per cart) to render the projector im-
ages. Each computer contains two ATI Eyefinity 7870 graphics cards with 12 total video outputs.
Each video signal is then divided three ways using a Matrox TripleHead-to-Go video DisplayPort
splitter, so that each computer feeds 36 projectors. A single master server computer sends con-
trol and synchronization commands to all connected carts (see Figure VI.2). Ideally, all projectors
would receive identical HDMI timing signals based on the same internal clock. While adapters
are available to synchronize graphics cards across multiple computers (such as Nvidia’s G-Sync
cards), the Matrox video splitters follow their own internal clocks and the final video signals no
longer have subframe alignment. This effect is only noticeable due to the time-multiplexed color
reproduction on single chip DLP projectors. Short video camera exposures will see different
rainbow striping artifacts for each projector; however, this effect is rarely visible to the human
eye. Designing a more advanced video splitter that maintains the input video timing or accepts
an external sync signal is a subject for future work. We align the projectors with a per-projector
2D homography that maps projector pixels to positions on the anisotropic screen. We compute
the homography based on checker patterns projected onto a diffuse screen placed in front of the
anisotropic surface.
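The claim that several views fall within the interocular distance can be checked with a short
calculation. The 65 mm eye separation and the 2 m example viewing distance below are assumptions of
this sketch; the 0.625° spacing is the value quoted above.

import math

num_projectors = 216
arc_degrees = 135.0
spacing_deg = arc_degrees / num_projectors   # 0.625 degrees, as quoted in the text

def views_within_interocular(viewing_distance_m, eye_separation_m=0.065):
    # angle subtended by the two eyes at the given viewing distance
    subtended_deg = math.degrees(2.0 * math.atan(eye_separation_m / (2.0 * viewing_distance_m)))
    return subtended_deg / spacing_deg

# views_within_interocular(2.0) -> about 3 views presented between the eyes at 2 m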
Figure VII.9: Facial features at different scales.
[Figure VII.10 panels: 216 projector array; anisotropic screen shown at 2.5°, 1.25°, and 0.625° projector spacing.]
Figure VII.10: (left) Photograph showing the 6 computers, 72 video splitters, and 216 video
projectors used to display the subject. (right) The anisotropic screen scatters light from each
projector into a vertical stripe. The individual stripes can be seen if we reduce the angular density
of projectors. Each vertical stripe contains pixels from a different projector.
VII.3.2 Content Capture and Light Field Rendering
We record each subject with an array of Panasonic X900MK cameras, spaced every 6° over a
180° semi-circle and at a distance of 4 meters from the subject. The cameras were chosen as they
can record multiple hours of 60fps HD footage directly to SD cards with MPEG compression. As
the Panasonic cameras were not genlocked, we synchronized our videos within 1/120 of a second
by aligning their corresponding sound waveforms. Since our cameras are much further apart
than the interocular distance, we use a novel bidirectional interpolation algorithm to upsample the
6° angular resolution of the camera array to 0.625° using pair-wise optical flow correspondences
between cameras. As each camera pair is processed independently, the
pipeline can be highly parallelized using GPU optical flow and is faster than traditional stereo
reconstruction. Our view interpolation algorithm maps images directly from the original video
sequences to the projector display in real-time.
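The sketch below shows one simple way to synthesize an in-between view from a neighboring camera
pair using dense optical flow in OpenCV. It is a crude single-flow approximation for illustration only,
not the bidirectional algorithm used in the actual pipeline, and it assumes the two frames are already
color-matched and roughly rectified.

import cv2
import numpy as np

def interpolate_view(img_a, img_b, t):
    # synthesize a view a fraction t of the way from camera A toward camera B
    gray_a = cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY)
    # dense flow from view A to view B (typical Farneback parameters)
    flow = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = gray_a.shape
    grid_x, grid_y = np.meshgrid(np.arange(w, dtype=np.float32),
                                 np.arange(h, dtype=np.float32))
    # warp A forward by t and B backward by (1 - t), assuming the flow at the
    # intermediate pixel is close to the flow sampled at that pixel in A
    warp_a = cv2.remap(img_a, grid_x - t * flow[..., 0],
                       grid_y - t * flow[..., 1], cv2.INTER_LINEAR)
    warp_b = cv2.remap(img_b, grid_x + (1.0 - t) * flow[..., 0],
                       grid_y + (1.0 - t) * flow[..., 1], cv2.INTER_LINEAR)
    return cv2.addWeighted(warp_a, 1.0 - t, warp_b, t, 0.0)

# e.g. in-between views for a 6-degree camera pair at roughly 0.625-degree steps:
# frames = [interpolate_view(cam_i, cam_j, k / 9.6) for k in range(1, 10)]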
VII.3.3 Result
The first full application of this technology was to preserve the experience of in-person interac-
tions with Holocaust survivors. Our first subject was Pinchas Gutter. Mr. Gutter was born in Poland
in 1932, lived in the Warsaw ghetto, and survived six concentration camps before being liberated by
the Russians in 1945. The interview script was based on the top 500 questions typically asked of
Holocaust survivors, along with stories catered to his particular life story. The full dataset includes
1897 questions totaling 18 hours of dialog. These questions are linked to 10492 training questions
providing enough variation to simulate spontaneous and usefully informative conversations. For
this paper, we conducted two additional short interviews with standing subjects. Each interview
was limited to 20-30 questions over 2 hours, but still allowed for short moderated conversations.
Figure VII.11 shows stereo photographs of all three subjects on the display. The accompanying
video shows several 3D conversations with live natural language recognition and playback.
Figure VII.11: Stereo photograph of subjects on the display, left-right reversed for cross-fused
stereo viewing. Each subject is shown from three positions.
BIBLIOGRAPHY
[1] Gino Acevedo, Sergei Nevshupov, Jess Cowely, and Kevin Norris. An accurate method
for acquiring high resolution skin displacement maps. In ACM SIGGRAPH 2010 Talks,
SIGGRAPH ’10, pages 4:1–4:1, New York, NY , USA, 2010. ACM.
[2] Sameer Agarwal, Keir Mierle, and Others. Ceres solver. http://ceres-solver.org.
[3] Agisoft. Agisoft photoscan. http://www.agisoft.com/, 2017.
[4] O. Alexander, M. Rogers, W. Lambeth, M. Chiang, and P. Debevec. Creating a photoreal
digital actor: The digital emily project. In Visual Media Production, 2009. CVMP ’09.
Conference for, pages 176–187, Nov 2009.
[5] Oleg Alexander, Graham Fyffe, Jay Busch, Xueming Yu, Ryosuke Ichikari, Paul Graham,
Koki Nagano, Andrew Jones, Paul Debevec, Joe Alter, Jorge Jimenez, Etienne Danvoye,
Bernardo Antionazzi, Mike Eheler, Zybnek Kysela, Xian-Chun Wu, and Javier von der
Pahlen. Digital Ira: High-resolution facial performance playback. In ACM SIGGRAPH
2013 Computer Animation Festival, SIGGRAPH ’13, pages 1–1, New York, NY , USA,
2013. ACM.
[6] Oleg Alexander, Graham Fyffe, Jay Busch, Xueming Yu, Ryosuke Ichikari, Andrew Jones,
Paul Debevec, Jorge Jimenez, Etienne Danvoye, Bernardo Antionazzi, Mike Eheler, Zyb-
nek Kysela, and Javier von der Pahlen. Digital ira: Creating a real-time photoreal digital
actor. In ACM SIGGRAPH 2013 Posters, SIGGRAPH ’13, 2013.
[7] Oleg Alexander, Mike Rogers, William Lambeth, Jen-Yuan Chiang, Wan-Chun Ma,
Chuan-Chang Wang, and Paul Debevec. The Digital Emily Project: Achieving a photoreal
digital actor. IEEE Computer Graphics and Applications, 30:20–31, July 2010.
[8] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers,
and James Davis. SCAPE: shape completion and animation of people. ACM Trans. Graph.,
24(3):408–416, July 2005.
[9] David Baraff and Andrew Witkin. Large steps in cloth simulation. In Proceedings of the
25th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH
’98, pages 43–54, New York, NY , USA, 1998. ACM.
[10] Tali Basha, Yael Moses, and Nahum Kiryati. Multi-view scene flow estimation: A view
centered variational approach. International Journal of Computer Vision, 101(1):6–21,
2013.
[11] Thabo Beeler, Bernd Bickel, Paul Beardsley, Bob Sumner, and Markus Gross. High-quality
single-shot capture of facial geometry. ACM Trans. Graph., 29(4):40:1–40:9, July 2010.
[12] Thabo Beeler, Bernd Bickel, Paul Beardsley, Bob Sumner, and Markus Gross. High-quality
single-shot capture of facial geometry. ACM Trans. Graph., 29:40:1–40:9, July 2010.
[13] Thabo Beeler and Derek Bradley. Rigid stabilization of facial expressions. ACM Trans.
Graph., 33(4):44:1–44:9, July 2014.
[14] Thabo Beeler, Fabian Hahn, Derek Bradley, Bernd Bickel, Paul Beardsley, Craig Gotsman,
Robert W. Sumner, and Markus Gross. High-quality passive facial performance capture
using anchor frames. In ACM SIGGRAPH 2011 papers, SIGGRAPH ’11, pages 75:1–
75:10, New York, NY , USA, 2011. ACM.
[15] Kiran S. Bhat, Rony Goldenthal, Yuting Ye, Ronald Mallet, and Michael Koperwas. High
fidelity facial animation capture and retargeting with contours. In Proceedings of the 12th
ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA ’13, pages 7–
14, New York, NY , USA, 2013. ACM.
[16] Bernd Bickel, Moritz Bächer, Miguel A. Otaduy, Wojciech Matusik, Hanspeter Pfister, and
Markus Gross. Capture and modeling of non-linear heterogeneous soft tissue. ACM Trans.
Graph., 28(3):89:1–89:9, July 2009.
[17] Bernd Bickel, Mario Botsch, Roland Angst, Wojciech Matusik, Miguel Otaduy, Hanspeter
Pfister, and Markus Gross. Multi-scale capture of facial geometry and motion. ACM TOG,
26:33:1–33:10, 2007.
[18] Bernd Bickel, Peter Kaufmann, Mélina Skouras, Bernhard Thomaszewski, Derek Bradley,
Thabo Beeler, Phil Jackson, Steve Marschner, Wojciech Matusik, and Markus Gross. Phys-
ical face cloning. ACM Trans. Graph., 31(4):118:1–118:10, July 2012.
[19] N. Birkbeck, D. Cobzas, and M. Jägersand. Basis constrained 3D scene flow on a dynamic
proxy. In Proc. ICCV, 2011.
[20] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. In
SIGGRAPH ’99: Proceedings of the 26th annual conference on Computer graphics and
interactive techniques, pages 187–194, New York, NY , USA, 1999. ACM Press/Addison-
Wesley Publishing Co.
[21] Bruce Block and Philip Mcnally. 3D Storytelling: How Stereoscopic 3D Works and How
to Use It. Focal Press, 2013.
[22] Federica Bogo, Michael J. Black, Matthew Loper, and Javier Romero. Detailed full-body
reconstructions of moving people from monocular RGB-D sequences. In Proc. IEEE In-
ternational Conference on Computer Vision, December 2015.
[23] George Borshukov, Dan Piponi, Oystein Larsen, J. P. Lewis, and Christina Tempelaar-
Lietz. Universal capture - image-based facial animation for ”the matrix reloaded”. In ACM
SIGGRAPH 2005 Courses, SIGGRAPH ’05, New York, NY , USA, 2005. ACM.
[24] Sofien Bouaziz, Yangang Wang, and Mark Pauly. Online modeling for realtime facial
animation. ACM Trans. Graph., 32(4):40:1–40:10, July 2013.
[25] Derek Bradley, Wolfgang Heidrich, Tiberiu Popa, and Alla Sheffer. High resolution passive
facial performance capture. ACM Trans. Graph., 29(4):41:1–41:10, July 2010.
[26] Derek Bradley, Wolfgang Heidrich, Tiberiu Popa, and Alla Sheffer. High resolution passive
facial performance capture. ACM TOG (Proc. SIGGRAPH), 29:41:1–41:10, 2010.
[27] Chen Cao, Derek Bradley, Kun Zhou, and Thabo Beeler. Real-time high-fidelity facial
performance capture. ACM Trans. Graph., 34(4):46:1–46:9, July 2015.
[28] Chen Cao, Qiming Hou, and Kun Zhou. Displaced dynamic expression regression for
real-time facial tracking and animation. ACM Trans. Graph., 33(4):43:1–43:10, July 2014.
[29] Chen Cao, Yanlin Weng, Stephen Lin, and Kun Zhou. 3D shape regression for real-time
facial animation. ACM Trans. Graph., 32(4), July 2013.
[30] Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. Real-time facial
animation with image-based dynamic avatars. ACM Trans. Graph., 35(4):126:1–126:12,
July 2016.
[31] Paul Charette, Mark Sagar, Greg DeCamp, and Jessica Vallot. The jester. In ACM SIG-
GRAPH 99 Electronic Art and Animation Catalog, SIGGRAPH ’99, pages 151–, New
York, NY , USA, 1999. ACM.
[32] Yang Chen and Gérard Medioni. Object modelling by registration of multiple range im-
ages. Image Vision Comput., 10(3):145–155, April 1992.
[33] Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese,
Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-quality streamable free-viewpoint
video. ACM Trans. Graph., 34(4):69:1–69:13, July 2015.
[34] Matthew Cong, Michael Bao, Jane L. E, Kiran S. Bhat, and Ronald Fedkiw. Fully auto-
matic generation of anatomical face simulation models. In Proceedings of the 14th ACM
SIGGRAPH / Eurographics Symposium on Computer Animation, SCA ’15, pages 175–183,
New York, NY , USA, 2015. ACM.
[35] R. L. Cook and K. E. Torrance. A reflectance model for computer graphics. ACM Trans.
Graph., 1(1):7–24, January 1982.
[36] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc.
IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 886–893,
June 2005.
[37] E. de Aguiar, C. Theobalt, C. Stoll, and H. P. Seidel. Marker-less deformable mesh tracking
for human shape and motion capture. In 2007 IEEE Conference on Computer Vision and
Pattern Recognition, pages 1–8, June 2007.
[38] Edilson de Aguiar, Carsten Stoll, Christian Theobalt, Naveed Ahmed, Hans-Peter Seidel,
and Sebastian Thrun. Performance capture from sparse multi-view video. ACM Trans.
Graph., 27(3):98:1–98:10, August 2008.
[39] Eugene d’Eon, David Luebke, and Eric Enderton. Efficient rendering of human skin. In
Proceedings of the 18th Eurographics Conference on Rendering Techniques, EGSR’07,
pages 147–157, Aire-la-Ville, Switzerland, Switzerland, 2007. Eurographics Association.
[40] Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip Davidson, Sean Ryan Fanello,
Adarsh Kowdle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor,
Pushmeet Kohli, Vladimir Tankovich, and Shahram Izadi. Fusion4d: Real-time perfor-
mance capture of challenging scenes. ACM Trans. Graph., 35(4):114:1–114:13, July 2016.
[41] Jonathan Dupuy, Eric Heitz, Jean-Claude Iehl, Pierre Poulin, Fabrice Neyret, and Victor
Ostromoukhov. Linear efficient antialiased displacement and reflectance mapping. ACM
Trans. Graph., 32(6):211:1–211:11, November 2013.
[42] Christopher Edwards and Ronald Marks. Evaluation of biomechanical properties of human
skin. Clinics in Dermatology, 13:375–380, 1995.
[43] John F. Federici, Nejat Guzelsu, Hee C. Lim, Glen Jannuzzi, Tom Findley, Hans R.
Chaudhry, and Art B. Ritter. Noninvasive light-reflection technique for measuring soft-
tissue stretch. Appl. Opt., 38(31):6653–6660, Nov 1999.
[44] J. Ferguson and J.C. Barbenel. Skin surface patterns and the directional mechanical prop-
erties of the dermis. In R. Marks and P.A. Payne, editors, Bioengineering and the Skin,
pages 83–92. Springer Netherlands, 1981.
[45] Y . Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. Pattern
Analysis and Machine Intelligence, IEEE Transactions on, 32(8):1362–1376, Aug 2010.
[46] Yasutaka Furukawa and Jean Ponce. Dense 3d motion capture for human faces. In 2009
IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR
2009), 20-25 June 2009, Miami, Florida, USA, pages 1674–1681, 2009.
[47] G. Fyffe, K. Nagano, L. Huynh, S. Saito, J. Busch, A. Jones, H. Li, and P. Debevec. Multi-
view stereo on consistent face topology. Computer Graphics Forum, 2017.
[48] Graham Fyffe, Tim Hawkins, Chris Watts, Wan-Chun Ma, and Paul Debevec. Comprehen-
sive facial performance capture. Computer Graphics Forum (Proc. EUROGRAPHICS),
30(2), 2011.
[49] Graham Fyffe, Andrew Jones, Oleg Alexander, Ryosuke Ichikari, and Paul Debevec. Driv-
ing high-resolution facial scans with video performance capture. ACM Trans. Graph.,
34(1):8:1–8:14, December 2014.
[50] Graham Fyffe, Andrew Jones, Oleg Alexander, Ryosuke Ichikari, and Paul Debevec. Driv-
ing high-resolution facial scans with video performance capture. ACM Trans. Graph.,
34(1):8:1–8:14, December 2014.
[51] Silvano Galliani, Katrin Lasinger, and Konrad Schindler. Massively parallel multiview
stereopsis by surface normal diffusion. In Proceedings of the IEEE International Confer-
ence on Computer Vision, pages 873–881, 2015.
[52] Pablo Garrido, Levi Valgaert, Chenglei Wu, and Christian Theobalt. Reconstructing de-
tailed dynamic face geometry from monocular video. ACM Trans. Graph., 32(6):158:1–
158:10, November 2013.
[53] Pablo Garrido, Levi Valgaerts, Chenglei Wu, and Christian Theobalt. Reconstructing de-
tailed dynamic face geometry from monocular video. ACM Transactions on Graphics,
32(6):158:1–158:10, 2013.
[54] Pablo Garrido, Michael Zollhöfer, Dan Casas, Levi Valgaerts, Kiran Varanasi, Patrick
Pérez, and Christian Theobalt. Reconstruction of personalized 3d face rigs from monocular
video. ACM Trans. Graph., 35(3):28:1–28:15, May 2016.
[55] Abhijeet Ghosh, Graham Fyffe, Borom Tunwattanapong, Jay Busch, Xueming Yu, and
Paul Debevec. Multiview face capture using polarized spherical gradient illumination.
ACM Trans. Graph., 30(6):129:1–129:10, December 2011.
[56] Aleksey Golovinskiy, Wojciech Matusik, Hanspeter Pfister, Szymon Rusinkiewicz, and
Thomas Funkhouser. A statistical model for synthesis of detailed facial geometry. ACM
Trans. Graph., 25(3):1025–1034, July 2006.
[57] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahra-
mani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances
in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc.,
2014.
[58] P. F. U. Gotardo, T. Simon, Y . Sheikh, and I. Matthews. Photogeometric scene flow for
high-detail dynamic 3d reconstruction. In Proc. IEEE International Conference on Com-
puter Vision, pages 846–854, Dec 2015.
[59] J. C. Gower. Generalized procrustes analysis. Psychometrika, 40(1):33–51, 1975.
[60] Paul Graham, Borom Tunwattanapong, Jay Busch, Xueming Yu, Andrew Jones, Paul De-
bevec, and Abhijeet Ghosh. Measurement-based synthesis of facial microgeometry. Com-
puter Graphics Forum, 32(2pt3):335–344, 2013.
[61] Brian Guenter, Cindy Grimm, Daniel Wood, Henrique Malvar, and Fredric Pighin. Making
faces. In Proceedings of the 25th Annual Conference on Computer Graphics and Interac-
tive Techniques, SIGGRAPH ’98, pages 55–66, New York, NY , USA, 1998. ACM.
[62] Brian Guenter, Cindy Grimm, Daniel Wood, Henrique Malvar, and Fredric Pighin. Making
faces. In Proc. SIGGRAPH, pages 55–66. ACM, 1998.
[63] Nejat Guzelsu, John F. Federici, Hee C. Lim, Hans R. Chauhdry, Art B. Ritter, and Tom
Findley. Measurement of skin stretch via light reflection. Journal of Biomedical Optics,
8(1):80–86, 2003.
[64] Eric Heitz. Understanding the masking-shadowing function in microfacet-based brdfs.
Journal of Computer Graphics Techniques (JCGT), 3(2):32–91, June 2014.
[65] David A. Hirshberg, Matthew Loper, Eric Rachlin, and Michael J. Black. Coregistration:
Simultaneous alignment and modeling of articulated 3d shape. In Andrew W. Fitzgibbon,
Svetlana Lazebnik, Pietro Perona, Yoichi Sato, and Cordelia Schmid, editors, ECCV (6),
volume 7577 of Lecture Notes in Computer Science, pages 242–255. Springer, 2012.
[66] Pei-Lun Hsieh, Chongyang Ma, Jihun Yu, and Hao Li. Unconstrained realtime facial per-
formance capture. In CVPR, pages 1675–1683. IEEE Computer Society, 2015.
[67] Haoda Huang, Jinxiang Chai, Xin Tong, and Hsiang-Tao Wu. Leveraging motion capture
and 3d scanning for high-fidelity facial performance acquisition. ACM Trans. Graph.,
30(4):74:1–74:10, July 2011.
[68] Haoda Huang, Jinxiang Chai, Xin Tong, and Hsiang-Tao Wu. Leveraging motion capture
and 3D scanning for high-fidelity facial performance acquisition. ACM TOG, 30:74:1–
74:10, 2011.
[69] Alexandru Eugen Ichim, Sofien Bouaziz, and Mark Pauly. Dynamic 3d avatar creation
from hand-held video input. ACM Trans. Graph., 34(4):45:1–45:14, July 2015.
[70] Takanori Igarashi, Ko Nishino, and Shree K. Nayar. The appearance of human skin: A
survey. Foundations and Trends in Computer Graphics and Vision, 3(1):1–95, 2007.
[71] Infinite-Realities. Infinite-realities. http://ir-ltd.net/, 2017.
[72] Matthias Innmann, Michael Zollhöfer, Matthias Nießner, Christian Theobalt, and Marc
Stamminger. VolumeDeform: Real-time volumetric non-rigid reconstruction. In Proceed-
ings of the European Conference on Computer Vision (ECCV), 2016.
[73] G. Irving, J. Teran, and R. Fedkiw. Invertible finite elements for robust simulation of large
deformation. In Proceedings of the 2004 ACM SIGGRAPH/Eurographics Symposium on
Computer Animation, SCA ’04, pages 131–140, Aire-la-Ville, Switzerland, Switzerland,
2004. Eurographics Association.
[74] Wenzel Jakob, Miloš Hašan, Ling-Qi Yan, Jason Lawrence, Ravi Ramamoorthi, and Steve
Marschner. Discrete stochastic microfacet models. ACM Trans. Graph., 33(4):115:1–
115:10, July 2014.
[75] Henrik Wann Jensen, Stephen R. Marschner, Marc Levoy, and Pat Hanrahan. A practical
model for subsurface light transport. In Proceedings of ACM SIGGRAPH 2001, pages
511–518, 2001.
[76] Micah K. Johnson, Forrester Cole, Alvin Raj, and Edward H. Adelson. Microgeometry
capture using an elastomeric sensor. ACM Trans. Graph., 30:46:1–46:8, August 2011.
[77] A. Jones, M. Lang, G. Fyffe, X. Yu, J. Busch, I. McDowall, M. Bolas, and P. Debevec.
Achieving eye contact in a one-to-many 3d video teleconferencing system. In ACM Trans-
actions on Graphics (TOG), volume 28, page 64. ACM, 2009.
[78] A. Jones, I. McDowall, H. Yamada, M. Bolas, and P. Debevec. Rendering for an interactive
360 light field display. ACM Transactions on Graphics (TOG), 26(3):40, 2007.
[79] A. Jones, K. Nagano, J. Busch, X. Yu, H. Y. Peng, J. Barreto, O. Alexander, M. Bo-
las, P. Debevec, and J. Unger. Time-offset conversations on a life-sized automultiscopic
projector array. In 2016 IEEE Conference on Computer Vision and Pattern Recognition
Workshops (CVPRW), pages 927–935, June 2016.
[80] Andrew Jones, Koki Nagano, Jing Liu, Jay Busch, Xueming Yu, Mark Bolas, and Paul De-
bevec. Interpolating vertical parallax for an autostereoscopic three-dimensional projector
array. Journal of Electronic Imaging, 23(1), March 2014.
[81] Hanbyul Joo, Hyun Soo Park, and Yaser Sheikh. MAP visibility estimation for large-scale
dynamic 3D reconstruction. In Proc. IEEE Conference on Computer Vision and Pattern
Recognition, 2014.
[82] J. Jurik, A. Jones, M. Bolas, and P. Debevec. Prototyping a light field display involving
direct observation of a video projector array. In Computer Vision and Pattern Recognition
Workshops (CVPRW), 2011 IEEE Computer Society Conference on, pages 15–20. IEEE,
2011.
[83] Vahid Kazemi and Josephine Sullivan. One millisecond face alignment with an ensemble
of regression trees. In Proceedings of the 2014 IEEE Conference on Computer Vision and
Pattern Recognition, CVPR ’14, pages 1867–1874, Washington, DC, USA, 2014. IEEE
Computer Society.
[84] Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. Poisson surface reconstruction.
In Proc. Symposium on Geometry Processing, pages 61–70, 2006.
[85] Davis E. King. Dlib-ml: A machine learning toolkit. J. Mach. Learn. Res., 10:1755–1758,
December 2009.
[86] M. Klaudiny and A. Hilton. High-detail 3D capture and non-sequential alignment of facial
performance. In Proc. 3DIMPVT, pages 17–24, Oct 2012.
[87] Martin Klaudiny and Adrian Hilton. High-fidelity facial performance capture with non-
sequential temporal alignment. In Proceedings of the 3rd Symposium on Facial Analysis
and Animation, FAA ’12, pages 3:1–3:1, New York, NY , USA, 2012. ACM.
[88] Digital Human League. The wikihuman project. http://gl.ict.usc.edu/Research/DigitalEmily2/,
2015. Accessed: 2015-12-01.
[89] JP Lewis and Ken-ichi Anjyo. Direct manipulation blendshapes. IEEE Comp. Graphics
and Applications, 30(4):42–50, 2010.
[90] Hao Li, Bart Adams, Leonidas J. Guibas, and Mark Pauly. Robust single-view geometry
and motion reconstruction. ACM Trans. Graph., 28(5), December 2009.
[91] Hao Li, Thibaut Weise, and Mark Pauly. Example-based facial rigging. ACM Trans.
Graph., 29(4):32:1–32:6, July 2010.
[92] Hao Li, Jihun Yu, Yuting Ye, and Chris Bregler. Realtime facial animation with on-the-fly
correctives. ACM Trans. Graph., 32(4):42:1–42:10, July 2013.
[93] Hao Li, Jihun Yu, Yuting Ye, and Chris Bregler. Realtime facial animation with on-the-fly
correctives. ACM Trans. Graph., 32(4), July 2013.
[94] Pengbo Li and Paul G. Kry. Multi-layer skin simulation with adaptive constraints. In
Proceedings of the Seventh International Conference on Motion in Games, MIG ’14, pages
171–176, New York, NY , USA, 2014. ACM.
[95] Yih Miin Liew, Robert A. McLaughlin, Fiona M. Wood, and David D. Sampson. Reduc-
tion of image artifacts in three-dimensional optical coherence tomography of skin in vivo.
Journal of Biomedical Optics, 16(11):116018–116018–10, 2011.
[96] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J.
Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIG-
GRAPH Asia), 34(6):248:1–248:16, October 2015.
[97] Matthew M. Loper, Naureen Mahmood, and Michael J. Black. MoSh: Motion and shape
capture from sparse markers. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia),
33(6):220:1–220:13, November 2014.
[98] David G Lowe. Distinctive image features from scale-invariant keypoints. International
journal of computer vision, 60(2):91–110, 2004.
[99] Ten24 Media LTD. Ten 24 digital capture. http://ten24.info/, 2017.
[100] Wan-Chun Ma, Tim Hawkins, Pieter Peers, Charles-Felix Chabert, Malte Weiss, and Paul
Debevec. Rapid acquisition of specular and diffuse normal maps from polarized spherical
gradient illumination. In Rendering Techniques, pages 183–194, 2007.
[101] Wan-Chun Ma, Andrew Jones, Jen-Yuan Chiang, Tim Hawkins, Sune Frederiksen, Pieter
Peers, Marko Vukovic, Ming Ouhyoung, and Paul Debevec. Facial performance syn-
thesis using deformation-driven polynomial displacement maps. ACM Trans. Graph.,
27(5):121:1–121:10, December 2008.
[102] Wan-Chun Ma, Andrew Jones, Tim Hawkins, Jen-Yuan Chiang, and Paul Debevec. A
high-resolution geometry capture system for facial performance. In ACM SIGGRAPH 2008
talks, SIGGRAPH ’08, pages 3:1–3:1, New York, NY , USA, 2008. ACM.
[103] Karl F MacDorman. Subjective ratings of robot video clips for human likeness, familiarity,
and eeriness: An exploration of the uncanny valley. ICCS/CogSci-2006 long symposium:
Toward social mechanisms of android science, pages 26–29, 2006.
[104] C. Malleson, J. C. Bazin, O. Wang, D. Bradley, T. Beeler, A. Hilton, and A. Sorkine-
Hornung. Facedirector: Continuous control of facial performance in video. In 2015 IEEE
International Conference on Computer Vision (ICCV), pages 3979–3987, Dec 2015.
[105] Giorgio Marcias, Nico Pietroni, Daniele Panozzo, Enrico Puppo, and Olga Sorkine.
Animation-aware quadrangulation. Computer Graphics Forum, 2013.
[106] Giorgio Marcias, Nico Pietroni, Daniele Panozzo, Enrico Puppo, and Olga Sorkine-
Hornung. Animation-aware quadrangulation. Computer Graphics Forum (proceedings
of EUROGRAPHICS/ACM SIGGRAPH Symposium on Geometry Processing), 32(5):167–
175, 2013.
[107] William Montagna and Paul F. Parakkal. The structure and function of skin, chapter 2.
Academic Press, New York, NY , USA, 1974.
[108] M. Mori, K. F. MacDorman, and N. Kageki. The uncanny valley [from the field]. IEEE
Robotics Automation Magazine, 19(2):98–100, June 2012.
[109] Koki Nagano, Graham Fyffe, Oleg Alexander, Jernej Barbič, Hao Li, Abhijeet Ghosh, and
Paul Debevec. Skin microstructure deformation with displacement map convolution. ACM
Trans. Graph., 34(4):109:1–109:10, July 2015.
[110] Thomas Neumann, Kiran Varanasi, Stephan Wenger, Markus Wacker, Marcus Magnor,
and Christian Theobalt. Sparse localized deformation components. ACM Trans. Graph.,
32(6):179:1–179:10, November 2013.
[111] Richard A. Newcombe, Dieter Fox, and Steven M. Seitz. DynamicFusion: reconstruction
and tracking of non-rigid scenes in real-time. In Proc. IEEE Conference on Computer
Vision and Pattern Recognition, pages 343–352, 2015.
[112] Marc Olano and Dan Baker. Lean mapping. In Proceedings of the 2010 ACM SIGGRAPH
Symposium on Interactive 3D Graphics and Games, I3D ’10, pages 181–188, New York,
NY , USA, 2010. ACM.
[113] Kyle Olszewski, Joseph J. Lim, Shunsuke Saito, and Hao Li. High-fidelity facial and
speech animation for vr hmds. ACM Trans. Graph., 35(6):221:1–221:14, November 2016.
[114] Stephen M. Platt and Norman I. Badler. Animating facial expressions. SIGGRAPH Com-
put. Graph., 15(3):245–252, August 1981.
[115] Gerard Pons-Moll, Javier Romero, Naureen Mahmood, and Michael J. Black. Dyna: A
model of dynamic human shape in motion. ACM Trans. Graph., 34(4):120:1–120:14, July
2015.
[116] Paul Rademacher and Gary Bishop. Multiple-center-of-projection images. In Proceed-
ings of ACM SIGGRAPH 98, Computer Graphics Proceedings, Annual Conference Series,
pages 199–206, July 1998.
[117] Capturing Reality. Realitycapture. https://www.capturingreality.com/,
2017.
[118] Olivier R´ emillard and Paul G. Kry. Embedded thin shells for wrinkle simulation. ACM
Trans. Graph., 32(4):50:1–50:8, July 2013.
[119] Shunsuke Saito, Tianye Li, and Hao Li. Real-time facial segmentation and performance
capture from RGB input. In Computer Vision - ECCV 2016 - 14th European Conference,
Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, pages 244–
261, 2016.
[120] Brian Schulkin, Hee C. Lim, Nejat Guzelsu, Glen Jannuzzi, and John F. Federici. Polarized
light reflection from strained sinusoidal surfaces. Appl. Opt., 42(25):5198–5208, Sep 2003.
[121] Laura Sevilla-Lara, Deqing Sun, Erik G. Learned-Miller, and Michael J. Black. Optical
flow estimation with channel constancy. In Proc. European Conference on Computer Vi-
sion, pages 423–438, June 2014.
[122] Fuhao Shi, Hsiang-Tao Wu, Xin Tong, and Jinxiang Chai. Automatic acquisition of high-
fidelity facial performances using monocular videos. ACM Trans. Graph., 33(6):222:1–
222:13, November 2014.
[123] Fuhao Shi, Hsiang-Tao Wu, Xin Tong, and Jinxiang Chai. Automatic acquisition of high-
fidelity facial performances using monocular videos. ACM Transactions on Graphics,
33(6):222:1–222:13, 2014.
[124] Shunsuke Saito, Lingyu Wei, Liwen Hu, Koki Nagano, and Hao Li. Photorealistic facial
texture inference using deep neural networks. In The IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), June 2017.
[125] Hang Si. Tetgen, a delaunay-based quality tetrahedral mesh generator. ACM Trans. Math.
Softw., 41(2):11:1–11:36, February 2015.
[126] Eftychios Sifakis, Igor Neverov, and Ronald Fedkiw. Automatic determination of facial
muscle activations from sparse motion capture marker data. In ACM SIGGRAPH 2005
Papers, SIGGRAPH ’05, pages 417–425, New York, NY , USA, 2005. ACM.
[127] Olga Sorkine and Marc Alexa. As-rigid-as-possible surface modeling. In Proceedings
of EUROGRAPHICS/ACM SIGGRAPH Symposium on Geometry Processing, pages 109–
116, 2007.
[128] Olga Sorkine and Marc Alexa. As-rigid-as-possible surface modeling. In Proceedings
of the Fifth Eurographics Symposium on Geometry Processing, SGP ’07, pages 109–116,
Aire-la-Ville, Switzerland, Switzerland, 2007. Eurographics Association.
[129] Supasorn Suwajanakorn, Ira Kemelmacher-Shlizerman, and Steven M. Seitz. Total moving
face reconstruction. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich,
Switzerland, September 6-12, 2014, Proceedings, Part IV, pages 796–812, 2014.
[130] Supasorn Suwajanakorn, Ira Kemelmacher-Shlizerman, and Steven M. Seitz. Total moving
face reconstruction. In Proc. ECCV, 2014.
[131] D. Terzopoulos and K. Waters. Analysis and synthesis of facial image sequences using
physical and anatomical models. IEEE Trans. Pattern Anal. Mach. Intell., 15(6):569–579,
June 1993.
[132] D. Terzopoulos and K. Waters. Analysis and synthesis of facial image sequences using
physical and anatomical models. IEEE Trans. Pattern Anal. Mach. Intell., 15(6):569–579,
June 1993.
[133] Art Tevs, Alexander Berner, Michael Wand, Ivo Ihrke, Martin Bokeloh, Jens Kerber, and
Hans-Peter Seidel. Animation cartography—intrinsic reconstruction of shape and
motion. ACM Trans. Graph., 31(2):12:1–12:15, April 2012.
[134] Justus Thies, Michael Zollhofer, Matthias Niessner, Levi Valgaerts, Marc Stamminger, and
Christian Theobalt. Real-time expression transfer for facial reenactment. ACM Trans.
Graph., 34(6):183:1–183:14, October 2015.
[135] Justus Thies, Michael Zollh¨ ofer, Marc Stamminger, Christian Theobalt, and Matthias
Nießner. Face2face: Real-time face capture and reenactment of RGB videos. In 2016
IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas,
NV , USA, June 27-30, 2016, pages 2387–2395, 2016.
[136] Angela Tinwell. The Uncanny Valley in Games and Animation. CRC Press, 2014.
[137] K. E. Torrance and E. M. Sparrow. Theory of off-specular reflection from roughened sur-
faces. J. Opt. Soc. Am., 57:1104–1114, 1967.
169
[138] Aggeliki Tsoli, Naureen Mahmood, and Michael J. Black. Breathing life into shape: Cap-
turing, modeling and animating 3d human breathing. ACM Trans. Graph., 33(4):52:1–
52:11, July 2014.
[139] L. Valgaerts, A. Bruhn, H. Zimmer, J. Weickert, C. Stoll, and C. Theobalt. Joint estimation
of motion, structure and geometry from stereo sequences. In Proc. ECCV, volume 6314,
pages 568–581, 2010.
[140] Levi Valgaerts, Chenglei Wu, Andr´ es Bruhn, Hans-Peter Seidel, and Christian Theobalt.
Lightweight binocular facial performance capture under uncontrolled lighting. ACM Trans.
Graph., 31(6):187:1–187:11, November 2012.
[141] Levi Valgaerts, Chenglei Wu, Andrs Bruhn, Hans-Peter Seidel, and Christian Theobalt.
Lightweight binocular facial performance capture under uncontrolled lighting. ACM Trans.
Graph., 31(6):187:1–187:11, November 2012.
[142] S. Vedula, P. Rander, R. Collins, and T. Kanade. Three-dimensional scene flow. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 27(3):475–480, March 2005.
[143] V Vinayagamoorthy, A Steed, and M Slater. Building characters: Lessons drawn from
virtual environments. Toward Social Mechanisms of Android Science: A CogSci 2005
Workshop, pages 119–126, 2005.
[144] Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan Popovi´ c. Articulated mesh ani-
mation from multi-view silhouettes. ACM Trans. Graph., 27(3):97:1–97:9, August 2008.
[145] Javier von der Pahlen, Jorge Jimenez, Etienne Danvoye, Paul Debevec, Graham Fyffe, and
Oleg Alexander. Digital ira and beyond: Creating real-time photoreal digital actors. In
ACM SIGGRAPH 2014 Courses, SIGGRAPH ’14, pages 1:1–1:384, New York, NY , USA,
2014. ACM.
[146] Thibaut Weise, Sofien Bouaziz, Hao Li, and Mark Pauly. Realtime performance-based
facial animation. ACM Trans. Graph., 30(4):77:1–77:10, July 2011.
[147] Thibaut Weise, Sofien Bouaziz, Hao Li, and Mark Pauly. Realtime performance-based
facial animation. ACM TOG, 30:77:1–77:10, 2011.
[148] Thibaut Weise, Hao Li, Luc Van Gool, and Mark Pauly. Face/off: Live facial puppetry. In
Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer Anima-
tion, SCA ’09, pages 7–16, New York, NY , USA, 2009. ACM.
[149] Manuel Werlberger, Werner Trobin, Thomas Pock, Andreas Wedel, Daniel Cremers, and
Horst Bischof. Anisotropic huber-l1 optical flow. In British Machine Vision Conference,
BMVC 2009, London, UK, September 7-10, 2009. Proceedings, pages 1–11, 2009.
[150] Tim Weyrich, Wojciech Matusik, Hanspeter Pfister, Bernd Bickel, Craig Donner, Chien
Tu, Janet McAndless, Jinho Lee, Addy Ngan, Henrik Wann Jensen, and Markus Gross.
170
Analysis of human faces using a measurement-based skin reflectance model. ACM Trans.
Graph., 25(3):1013–1024, July 2006.
[151] Lance Williams. Performance-driven facial animation. In Proceedings of the 17th Annual
Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’90, pages
235–242, New York, NY , USA, 1990. ACM.
[152] Lance Williams. Performance-driven facial animation. In Proc. SIGGRAPH, pages 235–
242. ACM, 1990.
[153] Cyrus A. Wilson, Abhijeet Ghosh, Pieter Peers, Jen-Yuan Chiang, Jay Busch, and Paul
Debevec. Temporal upsampling of performance geometry using photometric alignment.
ACM Trans. Graph., 29(2):1–11, 2010.
[154] Eric Woods, Paul Mason, and Mark Billinghurst. Magicmouse: an inexpensive 6-degree-
of-freedom mouse. In GRAPHITE 2003, pages 285–286, 2003.
[155] Chenglei Wu, Derek Bradley, Markus Gross, and Thabo Beeler. An anatomically-
constrained local deformation model for monocular face capture. ACM Trans. Graph.,
35(4):115:1–115:12, July 2016.
[156] Chenglei Wu, Derek Bradley, Markus Gross, and Thabo Beeler. An anatomically-
constrained local deformation model for monocular face capture. ACM Trans. Graph.,
35(4):115:1–115:12, July 2016.
[157] Chenglei Wu, Kiran Varanasi, and Christian Theobalt. Full body performance capture un-
der uncontrolled and varying illumination: A shading-based approach. In Proc. European
Conference on Computer Vision, pages 757–770, 2012.
[158] Chenglei Wu, Michael Zollh¨ ofer, Matthias Nießner, Marc Stamminger, Shahram Izadi,
and Christian Theobalt. Real-time shading-based refinement for consumer depth cameras.
Proc. SIGGRAPH Asia, 2014.
[159] Ling-Qi Yan, Miloˇ s Haˇ san, Wenzel Jakob, Jason Lawrence, Steve Marschner, and Ravi Ra-
mamoorthi. Rendering glints on high-resolution normal-mapped specular surfaces. ACM
Trans. Graph., 33(4):116:1–116:9, July 2014.
[160] Hoyt Yeatman. Human face project. In Proceedings of the 29th International Conference
on Computer Graphics and Interactive Techniques. Electronic Art and Animation Catalog.,
SIGGRAPH ’02, pages 162–162, New York, NY , USA, 2002. ACM.
[161] Kuk-Jin Yoon and In So Kweon. Adaptive support-weight approach for correspondence
search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):650–656,
April 2006.
171
[162] Eduard Zell, Carlos Aliaga, Adrian Jarabo, Katja Zibrek, Diego Gutierrez, Rachel McDon-
nell, and Mario Botsch. To stylize or not to stylize?: The effect of shape and material styl-
ization on the perception of computer-generated faces. ACM Trans. Graph., 34(6):184:1–
184:12, October 2015.
[163] Zhengyou Zhang. A flexible new technique for camera calibration. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 22(11):1330–1334, Nov 2000.
[164] Shuang Zhao, Wenzel Jakob, Steve Marschner, and Kavita Bala. Building volumetric ap-
pearance models of fabric using micro ct imaging. ACM Trans. Graph., 30(4):44:1–44:10,
July 2011.
[165] Kun Zhou, Jin Huang, John Snyder, Xinguo Liu, Hujun Bao, Baining Guo, and Heung-
Yeung Shum. Large mesh deformation using the volumetric graph laplacian. In ACM
SIGGRAPH 2005 Papers, SIGGRAPH ’05, pages 496–503, New York, NY , USA, 2005.
ACM.
[166] Michael Zollh¨ ofer, Matthias Nießner, Shahram Izadi, Christoph Rehmann, Christo-
pher Zach, Matthew Fisher, Chenglei Wu, Andrew Fitzgibbon, Charles Loop, Christian
Theobalt, and Marc Stamminger. Real-time non-rigid reconstruction using an RGB-D
camera. ACM Trans. Graphics (Proc. SIGGRAPH), 33(4):156:1–156:12, 2014.
[167] Matthias Zwicker, Wojciech Matusik, Fr´ edo Durand, Hanspeter Pfister, and Clifton For-
lines. Antialiasing for automultiscopic 3d displays. In ACM SIGGRAPH 2006 Sketches,
SIGGRAPH ’06, New York, NY , USA, 2006. ACM.
172
Abstract
Digitally creating a virtual human indistinguishable from a real human has been one of the central goals of Computer Graphics, Human-Computer Interaction, and Artificial Intelligence. Such digital characters are not only a primary creative vessel for immersive storytellers and filmmakers, but also a key technology for understanding how humans think, see, and communicate in social environments. For digital character creation techniques to be valuable in simulating and understanding humans, the hardest challenge is for them to appear believably realistic from any point of view in any environment, and to behave and interact in a convincing manner.

Creating a photorealistic rendering of a digital avatar is increasingly accessible thanks to rapid advances in sensing technologies and rendering techniques. However, generating realistic movement and dynamic details compatible with such a photorealistic appearance still relies on manual work by experts, which hinders the potential impact of digital avatar technologies in real-world applications. Generating dynamic details is especially important for facial animation, as humans are extremely attuned to sensing people's intentions from facial expressions.

This dissertation proposes systems and approaches for capturing appearance and motion to reproduce high-fidelity digital avatars that are rich in subtle motion and appearance details. We aim for a framework that can generate consistent dynamic detail and motion at the resolution of skin pores and fine wrinkles, and can provide extremely high-resolution microstructure deformation for use in cinematic storytelling or immersive virtual reality environments.

This thesis presents three principal techniques for achieving multi-scale dynamic capture for digital humans. The first is a multi-view capture system and a stereo reconstruction technique that directly produces a complete high-fidelity head model with consistent facial mesh topology. Our method jointly solves for stereo constraints and consistent mesh parameterization from static scans or a dynamic performance, producing dense correspondences on an artist-quality template. Additionally, we propose a technique to add dynamic per-frame high- and middle-frequency details from the flat-lit performance video. Second, we propose a technique to estimate high-fidelity 3D scene flow from multi-view video. The motion estimation fully respects the high-quality data from the multi-view input and can be incorporated into any facial performance capture pipeline to improve the fidelity of the facial motion. Since the motion can be estimated without relying on any domain-specific priors or regularization, our method scales well to modern systems with many high-resolution cameras. Third, we present a technique to synthesize dynamic skin microstructure details to produce convincing facial animation. We measure and quantify how skin microstructure deformation contributes to dynamic skin appearance, and present an efficient way to simulate dynamic skin microstructure. When combined with state-of-the-art performance capture and face scanning techniques, it can significantly improve the realism of animated faces for virtual reality, video games, and visual effects.
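The third contribution is summarized above only at a high level. As a rough illustration of the general idea of strain-driven microstructure deformation (a minimal sketch, not the dissertation's actual algorithm), the Python snippet below modulates a neutral micro-displacement map per texture region: stretched skin is attenuated and blurred along the stretch axis, while compressed skin has its residual high frequencies amplified. The function name, the scipy-based filtering, and all parameter values are illustrative assumptions.

# Hypothetical sketch: strain-modulated skin micro-displacement.
# Assumes a neutral-pose micro-displacement map and per-texel in-plane
# stretch ratios along the texture axes (1.0 = rest, >1 stretched, <1 compressed).
import numpy as np
from scipy.ndimage import gaussian_filter1d

def deform_microstructure(disp, strain_u, strain_v, blur_scale=2.0, gain=0.6):
    out = disp.astype(np.float64).copy()
    # For simplicity, derive a single blur amount per axis from the mean strain.
    # Stretched skin flattens: blur the displacement along the stretched axis.
    sigma_u = np.clip(strain_u.mean() - 1.0, 0.0, 1.0) * blur_scale
    sigma_v = np.clip(strain_v.mean() - 1.0, 0.0, 1.0) * blur_scale
    if sigma_u > 0:
        out = gaussian_filter1d(out, sigma=sigma_u, axis=1)  # u runs along columns
    if sigma_v > 0:
        out = gaussian_filter1d(out, sigma=sigma_v, axis=0)  # v runs along rows
    # Compressed skin bunches up: amplify the remaining high-frequency detail.
    compression = np.clip(1.0 - 0.5 * (strain_u + strain_v), 0.0, 1.0)
    low_freq = gaussian_filter1d(gaussian_filter1d(disp, 2.0, axis=0), 2.0, axis=1)
    out += gain * compression * (disp - low_freq)
    return out

In practice the dissertation measures real microstructure under stretch and compression and uses those measurements to drive the synthesis; the sketch above only conveys the qualitative behavior (flattening under tension, sharpening under compression) with generic image filtering.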
Asset Metadata
Creator: Nagano, Koki (author)
Core Title: Multi-scale dynamic capture for high quality digital humans
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 08/07/2017
Defense Date: 04/19/2017
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: 3D display, 4D capture, appearance capture, autostereoscopic display, BRDF, computer animation, computer graphics, computer vision, deformation, digital human, dynamic reconstruction, facial performance capture, microfacet, microstructure, multi-view reconstruction, OAI-PMH Harvest, projector array, scene flow
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Debevec, Paul (committee chair), Barbic, Jernej (committee member), Li, Hao (committee member), Nakano, Aiichiro (committee member), Povinelli, Michelle Lynn (committee member)
Creator Email: knagano@usc.edu, koki.nagano0219@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-422687
Unique Identifier: UC11265093
Identifier: etd-NaganoKoki-5681.pdf (filename), usctheses-c40-422687 (legacy record id)
Legacy Identifier: etd-NaganoKoki-5681.pdf
Dmrecord: 422687
Document Type: Dissertation
Rights: Nagano, Koki
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA