DEEP GENERATIVE MODELS FOR IMAGE TRANSLATION
by
Chao Yang
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2019
Copyright 2019 Chao Yang
Acknowledgments
Ever since I graduated with my bachelor's degree, I have spent a significant number of years pursuing the PhD degree. Now that I look back at the endeavor, I do not see it as a painful or draining experience at all. On the contrary, I feel extremely fortunate to have embarked on such an exciting journey, to work in one of the most important and most active fields, and to get to solve interesting yet challenging problems in AI. I am also grateful to have had many thrilling experiences working with different people at different places. Below are my deepest thanks to the people who have made my past years enjoyable and unforgettable.
First and foremost, I am deeply grateful to my PhD advisor, Prof. C.-C. Jay Kuo, for his unconditional care and support in numerous ways. It would have been impossible for me to stand where I am today without Prof. Kuo, and I would be unlikely to further pursue an academic career if not for his immeasurable help and the way he made my PhD such a positive experience. It has always been delightful to talk with Prof. Kuo, and his wisdom and persistence in pursuing truth and knowledge have set a role model for me. Besides being an awesome mentor who guided me in doing good research, Prof. Kuo is also one of the most generous and kind-hearted people I have met. He is always nice to everybody, especially the students. Prof. Kuo is like family to me, and I am forever proud of being his PhD advisee and part of the great MCL lab.
I also want to thank my family, especially my parents, who have always given me support and understanding. Studying overseas limits the time I can spend with my family, but they not only respect my every decision but also support me and enable me to patiently finish my studies even when there are obstacles and setbacks.
Many of my friends have provided insightful discussions and emotional support during my PhD study. I want to thank my labmates Yuhang Song, Bing Li, Jiali Duan, Qin Huang, Yuanhang Su, Ye Wang and others for the collaborations and friendship. The times we hung out together in Los Angeles and during conferences are among the best memories of my life. Other friends at USC, especially Weiyue Wang and Yi Zhou, have also generously helped my research and accompanied me through ups and downs. I also want to thank my friends outside USC, especially Xiaofeng Liu of CMU, Gerry Che at MILA and Qingming Tang at TTIC, who discussed research and other topics with me almost on a daily basis. I would not have been able to survive the PhD without the mutual support and companionship. I also want to highlight my special thanks to my friends Guilin Liu at NVIDIA and Prof. Jonghye Woo at Harvard for their generous help with my career.
My collaborator Prof. Qixing Huang at UT Austin, my internship advisor at Adobe, Zhe Lin, as well as my manager at Facebook, Ser-Nam Lim, have all given me tremendous support in research and career. I also want to thank other senior colleagues who have worked with me in the past, including Prof. Hao Li, Prof. Xin Chen and Dr. Taehwan Kim. I am also thankful to Lizsl, Tracy, and Jennifer of USC for helping me with the logistics during my PhD program.
Finally, I would like to thank my qualifying and thesis committee members: Prof.
Keith Jenkins, Prof. Ram Nevatia, Prof. Cyrus Shahabi, Prof. Ulrich Neumann, and
Prof. Jernej Barbic. Their feedback has been extremely helpful for my research.
Contents
Acknowledgments ii
List of Tables vii
List of Figures viii
Abstract xiii
1 Introduction 1
1.1 Significance of the Research . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions of the Research . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 High-Resolution Image Inpainting . . . . . . . . . . . . . . . . 4
1.2.2 Unsupervised Image Translation . . . . . . . . . . . . . . . . . 5
1.2.3 Multi-view Learning for Human Retargeting . . . . . . . . . . 5
1.2.4 Unconstrained Facial Expression Transfer . . . . . . . . . . . . 6
1.3 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . 7
2 High-Resolution Image Inpainting using Multi-Scale Neural Patch Synthe-
sis 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 The Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Framework Overview . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 The Joint Loss Function . . . . . . . . . . . . . . . . . . . . . 15
2.3.3 The Content Network . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.4 The Texture Network . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3 Image Inpainting using Block-wise Procedural Training with Annealed Adver-
sarial Counterpart 26
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Our Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.1 The Generator Head . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.2 The Training Losses . . . . . . . . . . . . . . . . . . . . . . . 33
3.3.3 Blockwise Procedural Training with Adversarial Loss Annealing 36
3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.2 Comparison with Existing Methods . . . . . . . . . . . . . . . 39
3.4.3 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.4 Interactive Guided Inpainting . . . . . . . . . . . . . . . . . . 44
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 Show, Attend and Translate: Unsupervised Image Translation with Self-
Regularization and Attention 47
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Our Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4.1 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4.2 Quantitative Results . . . . . . . . . . . . . . . . . . . . . . . 66
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5 Towards Disentangled Representations for Human Retargeting by Multi-
view Learning 69
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.3 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.1 Learning Domain-Invariant Representations . . . . . . . . . . . 76
5.3.2 Baseline: Conditional Variational Autoencoder . . . . . . . . . 77
5.3.3 Disentangling with Multi-view Information . . . . . . . . . . . 78
5.3.4 Detailed Implementation . . . . . . . . . . . . . . . . . . . . . 81
5.4 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4.2 Cross-identity Image Translation . . . . . . . . . . . . . . . . . 83
5.4.3 Identity-Invariant Representations . . . . . . . . . . . . . . . . 87
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6 Unconstrained Facial Expression Transfer using Style-based Generator 91
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.2.1 Face Reconstruction and Rendering . . . . . . . . . . . . . . . 94
6.2.2 Face Retargeting and Reenactment . . . . . . . . . . . . . . . . 95
6.2.3 Deep Generative Model for Image Synthesis and Disentanglement 96
6.3 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3.2 Face Detect and Normalize . . . . . . . . . . . . . . . . . . . . 97
6.3.3 Style Inference . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3.4 Style Fuse and Regenerate . . . . . . . . . . . . . . . . . . . . 100
6.3.5 Warp and Blend . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.4.1 Experiment Setup and Results . . . . . . . . . . . . . . . . . . 101
6.4.2 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.4.3 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7 Conclusion and Future Work 112
7.1 Summary of the Research . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . 114
Bibliography 116
List of Tables
2.1 Numerical comparison on Paris StreetView dataset. Higher PSNR value
is better. Note % in the Table is to facilitate reading. . . . . . . . . . . . 20
3.1 Comparison of training losses used in different methods. . . . . . . . . 33
3.2 Numerical comparison between CAF, CE and GLI, our generator head
results and our final results. Up/down are results of center/random region
completion. Note that for SSIM, larger values mean greater similarity
in terms of content structure and indicate better performance. . . . . . . 42
4.1 User study results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2 Unsupervised map prediction accuracy. . . . . . . . . . . . . . . . . . 65
4.3 Unsupervised classification results. . . . . . . . . . . . . . . . . . . . . 65
4.4 Unsupervised 3DMM prediction results (MSE). . . . . . . . . . . . . . 66
5.1 Numerical comparisons. Our results have favorable quality (as shown in AE and classification error) and best preserve the semantics (as shown in ℓ2 error). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.1 Numerical comparisons with cGAN and direct transfer. . . . . . . . . . 103
List of Figures
2.1 Qualitative illustration of the task. Given an image (512×512) with a missing hole (256×256) (a), our algorithm can synthesize sharper and more coherent hole content (d) compared with Context Encoder [148] (b) and Content-Aware Fill using PatchMatch [1] (c). . . . . . . . . . . 9
2.2 Framework Overview. Our method solves for an unknown image x using two loss functions, the holistic content loss (E_c) and the local texture loss (E_t). At the smallest scale, the holistic content loss is conditioned on the output of the pre-trained content network given the input x_0 (f(x_0)). The local texture loss is derived by feeding x into a pre-trained network (the texture network) and comparing the local neural patches between R (the hole) and the boundary. . . . . . . . . . . . . . 13
2.3 The network architecture for structured content prediction. Unlike the ℓ2 loss architecture presented in [148], we replaced all ReLU/leaky ReLU layers with the ELU layer [32] and adopted fully-connected layers instead of channel-wise fully-connected layers. The ELU unit makes the regression network training more stable than the leaky ReLU layers as it can handle large negative responses during the training process. . . 15
2.4 Comparison with Context Encoder (ℓ2 loss), Context Encoder (ℓ2 loss + adversarial loss) and Content-Aware Fill. We can see that our approach fixes the wrong textures generated by Content-Aware Fill, and is also clearer than the output of Context Encoder. . . . . . . . . . . . . . 20
2.5 Visual comparisons of ImageNet results. From top to bottom: input image, Content-Aware Fill, Context Encoder (ℓ2 and adversarial loss), our result. All images are scaled from 512×512 to fit the page size. . . 21
2.6 The effect of different texture weight. . . . . . . . . . . . . . . . . . 22
2.7 (a) Output of content network trained with ℓ2 loss. (b) The final result using (a). (c) Output of content network trained with ℓ2 and adversarial loss. (d) The final result using (c). . . . . . . . . . . . . . . . . . . . . 22
2.8 Evaluation of different components. (a) input image. (b) result without
using content constraint. (c) our result. . . . . . . . . . . . . . . . . . 23
2.9 Failure cases of our method. . . . . . . . . . . . . . . . . . . . . . . . 23
2.10 Visual comparisons of Paris StreetView results. From top to bottom: input image, Content-Aware Fill, Context Encoder (ℓ2 and adversarial loss) and our result. All images are scaled from 512×512 to fit the page size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.11 Arbitrary object removal. From left to right: input image, object mask,
Content-Aware Fill result, our result. . . . . . . . . . . . . . . . . . . . 25
3.1 An example of inpainting (top) and harmonization (bottom) compared with state-of-the-art methods. Zoom in for best viewing quality. . . . . 27
3.2 Generator head and the training losses. We only illustrate one scale of
Patch Adversarial Loss and Patch Perceptual Loss. Note that to compute
the Patch Adversarial Loss, we need to use the mask to find out which
patch overlaps with the hole. . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Illustration of block-wise procedural training. The yellow residual blocks
refer to the generator head which is trained first. The green residual
blocks are progressively added one at a time. We also draw the skip
connections between the already trained residual blocks and the up-
sampling back end. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Fixed hole result comparisons. GH are results generated by generator
head, and Final are results generated with block procedural training. All
images have original size 256x256. Zoom in for best viewing quality. . 40
3.5 Random hole completion comparison. GH are results generated by gen-
erator head, and Final are results generated with block procedural train-
ing. All images have original size 256x256. Zoom in for best viewing
quality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.6 Face completion results comparing with [117]. . . . . . . . . . . . . . 42
3.7 Visual comparison of a result using different types of convolutional lay-
ers. (a) Input; (b) Original Conv; (c) Interpolated Conv; (d) Dilated+Interpolated
Conv; (e) Dilated Conv (ours). . . . . . . . . . . . . . . . . . . . . . . 43
3.8 Visual comparison of results by directly training a network of 12 blocks
((b),(e)) and procedural training ((c), (f)). . . . . . . . . . . . . . . . . 44
3.9 Visual comparison of results by training training without and with ALA. 44
3.10 Examples of image harmonization results. For (a) and (d), the microwave
and the zebra on the back have unusual color. Our method correctly
adjusts their appearance and makes the images coherent and realistic. . 45
3.11 Examples of interactive guided inpainting result. The segmentation
mask is given by our foreground/background segmentation network trained
on COCO. The final result combines the outputs of harmonization and
inpainting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1 Horse→zebra image translation. Our model learns to predict an attention map (b) and translates the horse to zebra while keeping the background untouched (c). By comparison, CycleGAN [225] significantly alters the appearance of the background together with the horse (d). . . 47
4.2 Model overview. Our generator G consists of a vanilla generator G_0 and an attention branch G_attn. We train the model using a self-regularization perceptual loss and an adversarial loss. . . . . . . . . . . . . . . . . . . 54
4.3 Image translation results of horse to zebra [85] and comparison with
UNIT and CycleGAN. . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4 Image translation results on more datasets. From top to bottom: apple
to orange [85], dog to cat [147], photo to DSLR [85], Yosemite summer
to winter [85]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.5 More image translation results. From left to right: edges to shoes [85];
edges to handbags [85]; SYNTHIA to cityscape [161, 33]. Given the
source and target domains are globally different, the initial translation
and final result are similar with the attention maps focusing on the entire
images. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.6 Comparing our results w/o attention with baselines. From top to bottom:
dawn to night (SYNTHIA [161]), non-smile to smile (CelebA [124])
and photos to Van Gogh [85]. . . . . . . . . . . . . . . . . . . . . . . . 63
4.7 Failure case of the attention map: it did not detect the ROI correctly and
removed the zebra stripes. . . . . . . . . . . . . . . . . . . . . . . . . 64
4.8 Effects of using different layers as feature extractors. From left to right:
input (a), using the first two layers of VGG (b), using the last two layers
of VGG (c) and using the first three layers of VGG (d). . . . . . . . . . 64
4.9 Unsupervised map prediction visualization. . . . . . . . . . . . . . . . 65
4.10 Visualization of image translation from MNIST (a),(d) to USPS (b),(e)
and MNIST-M (c),(f). . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.11 Visualization of rendered face to real face translation. (a)(d): input ren-
dered faces; (b)(e): CycleGAN results; (c)(f): Our results. . . . . . . . . 66
5.1 Visual results of face (with head-mounted displays) and body retarget-
ing. The leftmost column is the input, and the right four columns are the
output after changing the identity labels. After learning the disentangled
representations of the expressions and poses from unannotated data, we
can switch the identities and keep the expressions or poses the same. . . 70
5.2 The baseline CVAE model. The decoder is conditioned on identity c, which is encoded as a one-hot vector. . . . . . . . . . . . . . . . . . . . 75
5.3 Two variants of our models. Left: jointly train two CVAEs for images and keypoints and enforce a latent-consistency constraint. Right: train a single CVAE but generate the keypoints alongside the image. . . . . . . 76
5.4 Top: results of the image-based CVAE baseline. Middle: results of the keypoints-based CVAE. Bottom: results of jointly training image and keypoints with the latent-consistency constraint. We can see that for this example the image-based CVAE fails to encode the facial expression into the latent representation, while the keypoints-based CVAE successfully models and transfers the expressions. By jointly training the two CVAEs we can encourage the latent code of images to preserve the identity-invariant semantics as well. . . . . . . . . . . . . . . . . . . . . 78
5.5 Visual results of HMD image translation between two identities. The
top row are the source id with different expressions (col 1-5) and the
target id (col 6). Note how CycleGAN sometimes generates completely
wrong expressions (e.g. row 4, col 3). . . . . . . . . . . . . . . . . . . 84
5.6 Visual results of Panoptic image translation. The top row are the source
id with different poses (col 1-5) and the target id (col 6). . . . . . . . . 85
5.7 Interpolation between expressions of two different ids. We mark the
source images with colored borders. The rest are interpolation results. . 86
5.8 Generalizing to new identities. (a) is the target person not seen during
training. (b) and (d) are source expressions. (c) and (e) are the retargeted
faces where the decoder takes the regressed id of (a) as the conditioning
label. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.9 t-SNE visualization of the embeddings of five identities and four expressions (shown above). t-SNE top: ours. t-SNE bottom: CVAE baseline. We color the points by both identities (left) and expressions (right). . . . 89
5.10 Driving the 3D avatar of another person in VR while wearing the head-
set. From left to right: mouth, left eye, right eye, and the rendered 3D
avatar of target identity (different from the source id wearing the head-
set). Note the left and right are mirrored between the image and the
avatar. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.1 Examples of facial expression transfer between two images. Our method
could take any two face images as input and combine the appearance (a)
and expression (b) to synthesize realistic-looking reenactment result (c). 91
6.2 Our system pipeline. . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.3 StyleGAN architecture. Figure courtesy of [93]. . . . . . . . . . . . . . 98
6.4 The loss curve as we iteratively optimize Eq. 6.1. . . . . . . . . . . . . 99
6.5 Visualization of g(s) as we iteratively optimize Eq. 6.1. . . . . . . . . . 100
6.6 Post-processing using warping and blending. (a) generated I_0 = g(s_0); (b) warped I_0; (c) facial mask; (d) original I_1; (e) final composite. . . . 101
6.7 Expression transfer of videos. (a) target identity; (b) top: source expres-
sions; bottom: transferred results. . . . . . . . . . . . . . . . . . . . . . 103
6.8 paGAN [140] comparison. (a) Source identity; (b) Source expressions
with shadow (above) and occlusion (below); (c) Texture reconstructions
with paGAN; (d) Our expression transfer results. . . . . . . . . . . . . 104
6.9 Comparison with cGAN. (a) target identity; (b) source expression; (c)
their result; (d) our results. Note their mouth interiors are directly copied
from source expression. . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.10 Comparison with DeepFake. (a) source identity; (b) source expression;
(c) DeepFake result; (d) our result. . . . . . . . . . . . . . . . . . . . . 105
6.11 Comparison with DVP and Face2Face. (a) source expression; (b) Face2Face;
(c) DVP; (d) ours. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.12 Comparison with [7]. (a) target identity; (b) source expression; (c) their
result; (d) ours. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.13 Comparison with [104]. (a) source expression; (b) [104] face swap
result using cage-net or swift-net; (c) random image of Nicolas Cage or
Taylor Swift; (d) ours transfer result. . . . . . . . . . . . . . . . . . . . 108
6.14 Top row: two images I_1 and I_2 as input. Bottom rows: replace i layers starting from the j-th layer of s_1 (inferred from I_1) with those of s_2 (inferred from I_2) and regenerate the image. j = -1 (last column) indicates replacing the i layers at the end. . . . . . . . . . . . . . . . . . . . . . . 109
6.15 Reconstruction result when using different layer L of the VGG network
as perceptual loss during style vector inference. . . . . . . . . . . . . . 110
6.16 Examples of failure cases. Each set consists of target identity (left),
source expression (middle) and result (right). . . . . . . . . . . . . . . 110
6.17 Examples of visual results. Each set consists of identity (left), expres-
sion (middle) and result (right). Test images from top to bottom: CelebA-
HQ images, FFHQ images, random web images. Note the StyleGAN
model is only trained with FFHQ dataset. . . . . . . . . . . . . . . . . 111
Abstract
Deep learning has been an extremely powerful tool in many computer vision tasks. For
example, image classification, segmentation, detection, and other problems have all ben-
efited greatly from the invention of powerful deep learning models and advanced train-
ing schemes. A particular sub-category of computer vision that has been revolutionized
by deep learning is image translation, where the goal is to generate a new image given
an input source image. Deep generative models and their variants have been highly suc-
cessful to solve image translation given its ability to model complex data distributions.
Based on whether we have access to paired training data, image translation can be super-
vised or unsupervised. In this dissertation, several image translation related problems
are examined: 1) High-resolution image inpainting; 2) Unsupervised image translation
and 3) Cross-identity facial expression and body pose transfer.
Learning supervised image translation can be directly applied to training a genera-
tive model for applications like image inpainting. However, it is difficult to model the
precise distributions of natural image pixels and therefore the results may be prone to
artifacts or noise. We propose a two-stage approach for image inpainting. First, we
design novel network architectures and training losses to improve the quality of the
results when directly training an inpainting network. By using progressive training and
multi-scale perceptual losses we are able to generate images that are of higher percep-
tual and quantitative quality. In the second stage, we propose an optimization-based
approach that is inspired by PatchMatch, where we try to optimize the neural patches
inside the hole such that their distance from the neural patches outside the hole is mini-
mized. Experiments show that we are able to scale to high-resolution inpainting with the
optimization-based approach, surpassing the state-of-the-art results by a large margin.
The second research problem is the unsupervised image translation between differ-
ent domains. The problem is motivated by the need for modeling the distribution depen-
dencies between two related domains such as horse and zebra or MNIST and USPS.
To reduce ambiguity, existing methods have been focusing on using cycle-consistency
or a shared latent space for unsupervised image translation. We propose a much simpler image translation network that translates images from one domain to another by leveraging a self-regularization loss and an adversarial loss. We further combine the translation module with an attention module to jointly learn the attention mask and focus the translation on the specific region. Extensive experiments have shown that our approach outperforms existing methods when evaluated visually or quantitatively.
Finally, we tackle the problem of translating faces and bodies between different
identities without paired training data: we cannot directly train a translation module
using supervised signals in this case. Instead, we propose to train a conditional varia-
tional auto-encoder (CVAE) to disentangle different latent factors such as identity and expression. In order to achieve effective disentanglement, we further use multi-view information such as keypoints and facial landmarks to train multiple CVAEs. By relying on these simplified representations of the data, we use a more easily disentangled representation to guide the disentanglement of the image itself. Experiments demonstrate the effectiveness of our method on multiple face and body datasets.
To address the issue of scaling to new identities and also generate better-quality
results, we further propose an alternative approach that uses self-supervised learning
based on StyleGAN to factorize out different attributes of face images, such as hair
color, facial expressions, skin color, and others. Using a pre-trained StyleGAN combined with iterative style inference, we can easily manipulate the facial expressions or combine the facial expressions of any two people, without the need to train a specific new model for each of the identities involved. This is one of the first scalable and high-quality approaches for generating DeepFake data, which serves as a critical first step toward learning a robust classifier against adversarial examples.
Chapter 1
Introduction
1.1 Significance of the Research
Deep learning has been highly successful in many computer vision applications, even
outperforming humans in a variety of tasks, such as classification on ImageNet [74],
playing Go [169] or mastering Texas Hold’em poker [139]. All these tasks fall into the
category of discriminative models where we rely on labeled knowledge or reinforcement
learning to optimize the decision boundaries. Another category of tasks is to create or generate data by learning from known distributions. As the famous saying goes, "What I cannot create, I do not understand." Therefore, in order to measure the true intelligence of computers, part of the criteria would be how well they can create data similar to real-world examples. In the area of computer vision, generating photo-realistic images has been a hot yet challenging research topic that draws much attention from the community [19, 146, 93].
Natural images obey intrinsic invariances that have historically been difficult to quan-
tify. Traditionally, it involves extremely complicated processes to achieve photo-realistic
image rendering using standard graphics techniques, as we need to take geometry, mate-
rials, light transport, and many other factors into account [77, 57]. On the other hand,
rendering photo-realistic images using a model learned from data turns the process of
graphics rendering into a learning and inference problem. The models that learn the data distribution and draw samples from it are called generative models, which aim to discover the essence of data and find the best distribution to represent it. Deep generative models
leverage the representation power of deep neural networks and have been able to more
accurately capture the statistics of training data and generate better synthetic images
than ever before.
Two classes of deep generative models have been most successful in modeling natural image distributions and synthesizing new images. The first is based on autoencoders and variational inference. Autoencoders are designed to generate data by extracting internal regularities from the training set [198]. Variational Autoencoders (VAEs) [100] extend this framework and assume that the code layer, learned as the low-dimensional representation of the data, comes from a Gaussian distribution with mean μ and variance σ². Variational autoencoders maximize a variational lower bound on the log-likelihood of the training data; they can be trained easily but introduce restrictive assumptions about the approximate posterior distribution. In addition, given the ℓ2 reconstruction loss used, VAEs tend to generate blurry and low-quality images. The Generative Adversarial Network (GAN) [66] is another type of deep generative model, based on a game-theoretic scenario called the minimax game, in which a discriminator D and a generator G compete against each other. The scenario is modeled as a zero-sum game where the carefully constructed rewards of the two networks are proven to be equivalent to the JS divergence between the synthetic data and the real data. Recent work has shown that GANs can create convincing image samples even at high resolutions and over a high variety of class labels [19, 146, 93]. However, unlike VAEs, GANs do not model the posterior likelihood directly but instead only provide a proxy to sample from it. They are also more challenging to train, as the model suffers from training instability and mode collapse [164].
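To make these two training objectives concrete, the minimal PyTorch sketch below contrasts a VAE step, which minimizes the negative ELBO (reconstruction plus KL term), with a GAN step, which plays the minimax game between D and G. The networks (`encoder`, `decoder`, `D`, `G`) are placeholders and the discriminator is assumed to output probabilities in (0, 1); this illustrates the two objectives only and is not the model used in later chapters.

```python
import torch
import torch.nn.functional as F

# --- VAE step: minimize the negative evidence lower bound (ELBO) ---
def vae_loss(encoder, decoder, x):
    mu, logvar = encoder(x)                                 # q(z|x) = N(mu, sigma^2)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()    # reparameterization trick
    recon = decoder(z)
    recon_loss = F.mse_loss(recon, x, reduction='sum')      # ell_2-style reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl                                  # negative ELBO

# --- GAN step: minimax game between discriminator D and generator G ---
def gan_losses(D, G, x_real, z):
    x_fake = G(z)
    eps = 1e-8
    d_loss = -(torch.log(D(x_real) + eps).mean()            # D maximizes log D(real)
               + torch.log(1 - D(x_fake.detach()) + eps).mean())  # ... + log(1 - D(fake))
    g_loss = -torch.log(D(x_fake) + eps).mean()             # non-saturating generator objective
    return d_loss, g_loss
```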
This dissertation focuses on conditional image generation using deep generative
models: rather than generating images from scratch, we generate images by condition-
ing on another image as input. This covers a variety of important applications such as
style transfer [87], image inpainting [149], face manipulation/expression transfer [143]
and others. This problem, also known as image translation, has different settings based
on whether the task is supervised or whether example image pairs are available. If we
have access to paired images as training data, we could leverage conditional models
such as conditional GAN [86] to enable end-to-end training and synthesis. However,
obtaining paired training data can be difficult and expensive. In the unsupervised setting
where paired data are not present, several techniques have been proposed to amortize the
ambiguity of the problem, by assuming cycle-consistency or shared latent space based
on GANs [225] or VAEs [120]. However, it is still difficult to scale to a large number of datasets or to create high-quality translation results. The general image translation problem has largely remained unsolved, and we demonstrate several novel perspectives and
techniques that combine domain knowledge with deep generative models to tackle the
problem.
1.2 Contributions of the Research
In this dissertation, we address four image translation problems and show the advantage of combining specific domain knowledge with deep generative models to achieve high-quality results. First, in the supervised image translation setting, we design a two-stage system that consists of a content network and a texture network to generate high-resolution image inpainting contents. Second, in the dual-domain unsupervised image translation setting, we propose an extremely simple network that learns the mapping between two arbitrary, unpaired datasets, which is further combined with an attention module to jointly learn the attention mask and translation function. Third, in the multi-domain unsupervised image translation setting, we propose a multi-view CVAE model that leverages multiple representations of human faces to guide the disentanglement of appearance and expression and enable multi-identity expression transfer. Finally,
in the self-supervised setting, we show how we can disentangle different attributes of a
face image using a pre-trained model which enables photo-realistic, unconstrained facial
expression transfer between two arbitrary images.
1.2.1 High-Resolution Image Inpainting
We study the problem of completing a high-resolution image with photo-realistic and
plausible contents. We notice that a neural network trained directly end-to-end has difficulty generating high-resolution contents even with adversarial training. On the other
hand, traditional patch-based methods are capable of filling large holes of an image with
visually plausible inpaintings but are prone to errors in terms of semantics or global
structures. We propose to combine the strength of the learning-based method and the
patch-based method to account for both the correctness of the semantics and the quality
of textures:
• We propose a two-stage approach for image inpainting that combines content prediction and texture optimization.
• The first stage is a content network that uses a trainable network to predict rough contents inside the hole. The second stage is multi-scale texture optimization, which uses a pre-trained VGG network to match the neural patches inside the hole to their nearest neighbors outside the hole.
• We show that features extracted from middle layers of the neural network are useful to synthesize realistic image contents and textures.
1.2.2 Unsupervised Image Translation
We examine the problem of unsupervised image translation, where we have a collection
of image data from two separate domains, and we aim to translate the images from one
domain to another. These two domains can be horse and zebra, MNIST and USPS,
synthetic and real data as well as a variety of other loosely related concepts. We notice
that with unpaired training data, existing methods take advantage of cycle-consistency or
shared latent space assumptions. We find that by regularizing the output to be similar to the input locally while resembling the target domain globally, we can train a simple one-way generator to learn the mapping from the source domain to the target domain:
• We propose an extremely simple framework that learns the mapping from the source domain to the target domain in an unsupervised fashion. The framework consists of a simple one-way generator trained with adversarial loss and self-regularization loss.
• We further propose to combine an attention module with the generator so that we can jointly learn the attention mask prediction as well as image translation, which significantly improves the final results.
• Finally, we show how our image translation framework can be applied to domain adaptation and improve state-of-the-art unsupervised classification results.
1.2.3 Multi-view Learning for Human Retargeting
We address the problem of unsupervised disentanglement and human retargeting
(expression transfer, pose retargeting, etc.) across multiple domains. We notice that
existing approaches can rarely handle large collections of unregistered domains. Cycle-
GAN or UNIT, for example, can only handle two domains. StarGAN is an extension
of CycleGAN for multiple domains but is unable to generate high-quality results when the number of domains becomes large (in our case, we have 100+ identities, where each identity is taken as a domain). We propose a multi-view learning framework based on the conditional variational autoencoder (CVAE), which learns latent factor disentanglement and image translation simultaneously. The process can be summarized as
the following steps:
• We propose to train an end-to-end system based on the conditional variational autoencoder (CVAE) to learn disentangled representations of human faces/bodies. The CVAE model is conditioned on identity labels, and the latent code learned aims to encode domain-invariant expression/pose information.
• Observing the difficult trade-off between disentanglement and reconstruction quality, we propose to train a multi-view system that trains multiple CVAEs at the same time and takes both image and keypoints as input. We encourage their latent codes to be consistent with each other so that the latent code of the image is "supervised" by that of the keypoints.
• As the last step, we show that we not only achieve state-of-the-art image translation results but also learn a better identity-invariant representation that can be directly applied to other downstream tasks.
1.2.4 Unconstrained Facial Expression Transfer
As the last topic, we propose to learn unconstrained facial expression transfer using
the style-based generator. We observe that multi-view learning for human retargeting is still unable to generate high-quality images, nor can it easily scale to new identities. We propose a new framework that uses a pre-trained StyleGAN generator
to achieve high-quality, unconstrained facial expression transfer between two images.
We notice that the StyleGAN generator learns hierarchical disentanglement of facial
attributes from fine local details to global appearances, and can be directly applied to
combine the appearance and expression of two images:
• Given two arbitrary images as the identity and expression respectively, we first detect, crop and normalize the face regions from the images.
• We then propose an iterative optimization approach to infer the style vector given the normalized face and a pre-trained StyleGAN generator.
• Finally, we use an integer linear programming method to combine the two style vectors, which is then given as input to the StyleGAN generator to generate a new face that combines the expression of one image and the identity of the other.
1.3 Organization of the Dissertation
The rest of the dissertation is organized as follows. In Chapter 2, we introduce our
work in supervised image translation, where we develop a framework for high-resolution image inpainting. In Chapter 3, we describe an alternative feed-forward framework for image inpainting which achieves comparable quality but runs in only a fraction of the time.
In Chapter 4, we describe our research in unsupervised image translation between two
domains using a simple one-way translation module combined with another attention
module. In Chapter 5, we introduce our work in multi-domain face/body translation
where multi-view learning is applied to take advantage of multiple sources of data repre-
sentations. In Chapter 6, we present our work in self-supervised image translation where
an image is first disentangled into hierarchical style attributes, and then two styles are
combined in a way such that the final output inherits the appearance of one face and the
expression of another. Finally, we will summarize the work and discuss future research
directions in Chapter 7.
Chapter 2
High-Resolution Image Inpainting
using Multi-Scale Neural Patch
Synthesis
2.1 Introduction
Figure 2.1: Qualitative illustration of the task. From left to right: (a) Input Image; (b) Context Encoder; (c) PatchMatch; (d) Our Result. Given an image (512×512) with a missing hole (256×256) (a), our algorithm can synthesize sharper and more coherent hole content (d) compared with Context Encoder [148] (b) and Content-Aware Fill using PatchMatch [1] (c).
Before sharing a photo, users may want to make modifications such as erasing dis-
tracting scene elements, adjusting object positions in an image for better composition,
or recovering the image content in occluded image areas. These, and many other edit-
ing operations, require automated hole-filling (image completion), which has been an
active research topic in the computer vision and graphics communities for the past few
decades. Due to its inherent ambiguity and the complexity of natural images, general
hole-filling remains challenging.
Existing methods that address the hole-filling problem fall into two groups. The
first group of approaches relies on texture synthesis techniques, which fills in the hole
by extending textures from surrounding regions [51, 50, 109, 108, 34, 47, 202, 204,
101, 103, 9]. A common idea in these techniques is to synthesize the content of the hole
region in a coarse to fine manner, using patches of similar textures. In [47, 204], multiple
scales and orientations are introduced to find better matching patches. Barnes et al. [9]
proposed PatchMatch as a fast approximate nearest neighbor patch search algorithm.
Although such methods are good at propagating high-frequency texture details, they
do not capture the semantics or global structure of the image. The second group of
approaches hallucinates missing image regions in a data-driven fashion, leveraging large
external databases. These approaches assume that regions surrounded by similar context
likely possess similar content [71]. This approach is very effective when it finds an
example image with sufficient visual similarity to the query but could fail when the
query image is not well represented in the database. Additionally, such methods require
access to the external database, which greatly restricts possible application scenarios.
More recently, deep neural networks have been introduced for texture synthesis and image stylization [60, 62, 114, 25, 196, 88]. In particular, Pathak et al. [148] trained an encoder-decoder CNN (Context Encoder) with a combined ℓ2 and adversarial loss [67]
to directly predict missing image regions. This work is able to predict plausible image
structures, and is very fast to evaluate, as the hole region is predicted in a single forward
pass. Although the results are encouraging, the inpainting results of this method some-
times lack fine texture details, which creates visible artifacts around the border of the
hole. This method is also unable to handle high-resolution images due to the difficulty
of training regarding adversarial loss when the input is large.
In recent work, Li and Wand [114] showed that impressive image stylization results can be achieved by optimizing for an image whose neural response at a mid-layer is similar to that of a content image, and whose local responses at low convolutional layers resemble local responses from a style image. Those local responses were represented by small (typically 3×3) neural patches. This method proves able to transfer high-frequency details from the style image to the content image, and is hence suitable for realistic transfer tasks (e.g., transfer of the look of faces or cars). Nevertheless, the transfer of more artistic styles is better addressed by using Gram matrices of neural responses [60].
To overcome the limitations of the aforementioned methods, we propose a hybrid opti-
mization approach that leverages the structured prediction power of encoder-decoder
CNN and the power of neural patches to synthesize realistic, high-frequency details.
Similar to the style transfer task, our approach treats the encoder-decoder prediction as
the global content constraint, and the local neural patch similarity between the hole and
the known region as the texture constraint.
More specifically, the content constraint can be constructed by training a global content prediction network similar to the Context Encoder, and the texture constraint can be modeled with the image content surrounding the hole, using the patch responses of intermediate layers of a pre-trained classification network. The two constraints can be optimized using backpropagation with limited-memory BFGS. In order to further handle high-resolution images with large holes, we propose a multi-scale neural patch synthesis approach. For simplicity of formulation, we assume the test image is always cropped to 512×512 with a 256×256 hole in the center. We then create a three-level pyramid with step size two, downsizing the image by half at each level. This renders the lowest resolution as a 128×128 image with a 64×64 hole. We then perform the hole-filling task in a coarse-to-fine manner. Initialized with the output of the content prediction network at the lowest level, at each scale we (1) perform the joint optimization to update the hole, and (2) upsample to initialize the joint optimization and set the content constraint for the next scale. We repeat this until the joint optimization is finished at the highest resolution (Sec. 2.3).
We show experimentally that the proposed multi-scale neural patch synthesis
approach can generate more realistic and coherent results preserving both the structure
and texture details. We evaluate the proposed method quantitatively and qualitatively on
two public datasets and demonstrate its effectiveness over various baselines and existing
techniques as shown in Fig. 2.1 (Sec. 2.4).
The main contributions of this chapter are summarized as follows:
• We propose a joint optimization framework that can hallucinate missing image regions by modeling a global content constraint and a local texture constraint with convolutional neural networks.
• We further introduce a multi-scale neural patch synthesis algorithm for high-resolution image inpainting based on the joint optimization framework.
• We show that features extracted from middle layers of the neural network can be used to synthesize realistic image contents and textures, in addition to previous works that use them to transfer artistic styles.
2.2 Related Work
Structure Prediction using Deep Networks Over recent years, convolutional neural networks have significantly advanced image classification performance, as presented
in [105, 179, 180, 76]. Meanwhile, researchers use deep neural networks for structure
prediction [127, 26, 129, 37, 185, 67, 68, 84, 42, 145], semantic segmentation [127, 26,
129], and image generation [67, 68, 37, 145]. We are motivated by the generative power
of deep neural networks and use it as the backbone of our hole-filling approach. Unlike the image generation tasks discussed in [45, 67, 68, 37], where the input is a random noise vector and the output is an image, our goal is to predict the content in the hole, conditioned on the known image regions. Recently, [148] proposed an encoder-decoder network for image inpainting, using the combination of the ℓ2 loss and the adversarial loss (Context Encoder). In our work, we adapt the Context Encoder as the global content prediction network and use its output to initialize our multi-scale neural patch synthesis algorithm at the smallest scale.

Figure 2.2: Framework Overview. Our method solves for an unknown image x using two loss functions, the holistic content loss (E_c) and the local texture loss (E_t). At the smallest scale, the holistic content loss is conditioned on the output of the pre-trained content network given the input x_0 (f(x_0)). The local texture loss is derived by feeding x into a pre-trained network (the texture network) and comparing the local neural patches between R (the hole) and the boundary.
Style Transfer In order to create realistic image textures, our work is motivated by the
recent success of neural style transfer [60, 62, 114, 25, 196, 88]. These approaches are
largely used to generate images combining the “style” of one image and the “content”
of another image. Building on their astounding performance, we show that neural features are also extremely powerful for creating fine textures and high-frequency details of natural images.
2.3 The Approach
2.3.1 Framework Overview
We seek an inpainted image x̃ that minimizes a loss function formulated as a combination of three terms: the holistic content term, the local texture term, and the TV-loss term. The content term is a global structure constraint that captures the semantics and the global structure of the image, and the texture term models the local texture statistics of the input image. We first train the content network and use it to initialize the content term. The texture term is computed using the VGG-19 network [170] (Figure 2.2) pre-trained on ImageNet.
To model the content constraint, we first train the holistic content network f. The input is an image with the central squared region removed and filled with the mean color, and the ground-truth image x_t is the original content in the center. We trained on two datasets, as discussed in Section 2.4. Once the content network is trained, we can use the output of the network f(x_0) as the initial content constraint for joint optimization.
The goal of the texture term is to ensure that the fine details in the missing hole are similar to the details outside of the hole. We define such similarity with neural patches, which have been successfully used in the past to capture image styles. In order to optimize the texture term, we feed the image x into the pre-trained VGG network (which we refer to as the local texture network in this paper) and enforce that the responses of the small (typically 3×3) neural patches inside the hole region are similar to neural patches outside the hole at pre-determined feature layers of the network. In practice we use the combination of the relu3_1 and relu4_1 layers to compute the neural features. We iteratively update x by minimizing the joint content and texture loss using limited-memory BFGS.
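A minimal sketch of this optimization step is given below, assuming PyTorch; `content_loss`, `texture_loss`, and `tv_loss` are placeholder callables for the three terms formalized in Section 2.3.2 (the texture term would internally extract the VGG features), and the weights follow the empirical setting of 5e-6 reported there.

```python
import torch

def joint_optimize(x_init, content_ref, hole_mask, content_loss, texture_loss, tv_loss,
                   alpha=5e-6, beta=5e-6, n_iter=1000):
    """Minimize the joint loss of Eq. 2.1 over the image pixels with L-BFGS."""
    x = x_init.clone().requires_grad_(True)              # unknown image, initialized from f(x_0)
    optimizer = torch.optim.LBFGS([x], max_iter=n_iter)

    def closure():
        optimizer.zero_grad()
        loss = (content_loss(x, content_ref, hole_mask)   # holistic content term E_c
                + alpha * texture_loss(x, hole_mask)      # local texture term E_t
                + beta * tv_loss(x))                      # smoothness (TV) term
        loss.backward()
        return loss

    optimizer.step(closure)
    return x.detach()
```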
Figure 2.3: The network architecture for structured content prediction. Unlike the ℓ2 loss architecture presented in [148], we replaced all ReLU/leaky ReLU layers with the ELU layer [32] and adopted fully-connected layers instead of channel-wise fully-connected layers. The ELU unit makes the regression network training more stable than the leaky ReLU layers as it can handle large negative responses during the training process.
The proposed framework naturally applies to the high-resolution image inpainting problem using a multi-scale scheme. Given a high-resolution image with a large hole, we first downsize the image and obtain a reference content using the prediction of the content network. Given the reference content, we optimize w.r.t. the content and texture constraints at the low resolution. The optimization result is then upsampled and used as the initialization for joint optimization at the finer scales. In practice, we set the number of scales to 3 for images of size 512×512.
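The coarse-to-fine procedure can be summarized with the sketch below. It assumes placeholder components `content_net` (the trained content network of Section 2.3.3) and `optimize_at_scale` (one run of the joint L-BFGS optimization of Section 2.3.2, as sketched earlier); the pooling choices and a float {0,1} hole mask of shape (1, 1, H, W) are illustrative assumptions only.

```python
import torch.nn.functional as F

def multiscale_inpaint(image, hole_mask, content_net, optimize_at_scale, num_scales=3):
    """Coarse-to-fine neural patch synthesis over an image pyramid (512 -> 256 -> 128)."""
    # Build the pyramid, halving the resolution at each level.
    pyramid, masks = [image], [hole_mask]
    for _ in range(num_scales - 1):
        pyramid.append(F.avg_pool2d(pyramid[-1], 2))
        masks.append(F.max_pool2d(masks[-1], 2))
    pyramid, masks = pyramid[::-1], masks[::-1]            # coarsest scale first

    # Initialize the hole at the coarsest scale with the content network prediction f(x_0).
    x = content_net(pyramid[0], masks[0])
    for level, (img, mask) in enumerate(zip(pyramid, masks)):
        x = optimize_at_scale(x, img, mask)                # update the hole at this scale
        if level + 1 < num_scales:                         # upsample to initialize the next scale
            x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
    return x
```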
We describe the details of the three loss terms in the following.
2.3.2 The Joint Loss Function
Given the input image x_0, we would like to find the unknown output image x. We use R to denote the hole region in x, and R^φ to denote the corresponding region in a feature map φ(x) of the VGG-19 network. h(·) defines the operation of extracting a sub-image or sub-feature-map in a rectangular region, i.e., h(x, R) returns the color content of x within R, and h(φ(x), R^φ) returns the content of φ(x) within R^φ, respectively. We denote the content network as f and the texture network as t.
At each scale i = 1, 2, ..., N (N is the number of scales), the optimal reconstruction (hole-filling) result x̃ is obtained by solving the following minimization problem:

\tilde{x}_{i+1} = \arg\min_{x} \; E_c(h(x, R), h(x_i, R)) + \alpha E_t(\phi_t(x), R^{\phi}) + \beta \Upsilon(x)    (2.1)

where h(x_1, R) = f(x_0), φ_t(·) represents a feature map (or a combination of feature maps) at an intermediate layer of the texture network t, and α is a weight reflecting the relative importance of the two terms. Empirically, setting α and β to 5e-6 balances the magnitude of each loss and gives the best results in our experiments.
The first term E_c in Equation 2.1, which models the holistic content constraint, is defined to penalize the ℓ2 difference between the optimization result and the previous content prediction (from the content network or from the result of optimization at the coarser scale):

E_c(h(x, R), h(x_i, R)) = \| h(x, R) - h(x_i, R) \|_2^2    (2.2)
The second term E_t in Equation 2.1 models the local texture constraint, which penalizes the discrepancy of the texture appearance inside and outside the hole. We first choose a certain feature layer (or a combination of feature layers) of the network t, and extract its feature map φ_t. For each local query patch P of size s × s × c in the hole R^φ, we find its most similar patch outside the hole, and compute the loss by averaging the distances between each query patch and its nearest neighbor:

E_t(\phi_t(x), R^{\phi}) = \frac{1}{|R^{\phi}|} \sum_{i \in R^{\phi}} \| h(\phi_t(x), P_i) - h(\phi_t(x), P_{nn(i)}) \|_2^2    (2.3)

where |R^φ| is the number of patches sampled in the region R^φ, P_i is the local neural patch centered at location i, and nn(i) is computed as

nn(i) = \arg\min_{j \in \mathcal{N}(i), \, j \notin R^{\phi}} \| h(\phi_t(x), P_i) - h(\phi_t(x), P_j) \|_2^2    (2.4)

where N(i) is the set of neighboring locations of i, excluding the overlap with R^φ. The nearest neighbor can be computed quickly as a convolutional layer, as shown in [114].
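The texture term can also be expressed with standard tensor operations. The sketch below assumes a single VGG feature map `feat` of shape (1, C, H, W) and a hole mask downsampled to the same resolution; for simplicity it searches all known patches by brute force with `torch.cdist` rather than restricting the search to the neighborhood N(i) or using the convolutional trick of [114].

```python
import torch
import torch.nn.functional as F

def texture_loss(feat, hole_mask, patch_size=3):
    """Local texture term of Eq. 2.3: match neural patches inside the hole to their
    nearest neighbors outside the hole. feat: (1, C, H, W); hole_mask: (1, 1, H, W)."""
    pad = patch_size // 2
    patches = F.unfold(feat, patch_size, padding=pad)       # (1, C*k*k, H*W)
    patches = patches.squeeze(0).t()                        # one row per spatial location
    inside = hole_mask.flatten().bool()                     # locations inside the hole
    hole_patches = patches[inside]                          # query patches
    known_patches = patches[~inside].detach()               # candidate patches (held fixed)

    # Brute-force nearest neighbor in feature space (cf. Eq. 2.4).
    dists = torch.cdist(hole_patches, known_patches)        # (n_hole, n_known)
    nn_idx = dists.argmin(dim=1)
    return ((hole_patches - known_patches[nn_idx]) ** 2).sum(dim=1).mean()
```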
We also add a TV-loss term to encourage smoothness:

\Upsilon(x) = \sum_{i,j} \left( (x_{i,j+1} - x_{i,j})^2 + (x_{i+1,j} - x_{i,j})^2 \right)    (2.5)
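The remaining two terms of Eq. 2.1 are straightforward to write down; the sketch below assumes 4-D image tensors and a binary float hole mask, with `content_ref` denoting the prediction of the content network (or the upsampled result from the coarser scale).

```python
def content_loss(x, content_ref, hole_mask):
    """Holistic content term of Eq. 2.2: squared ell_2 distance inside the hole."""
    return (((x - content_ref) * hole_mask) ** 2).sum()

def tv_loss(x):
    """Total-variation term of Eq. 2.5: encourages a smooth reconstruction."""
    dh = (x[:, :, :, 1:] - x[:, :, :, :-1]) ** 2   # horizontal differences
    dv = (x[:, :, 1:, :] - x[:, :, :-1, :]) ** 2   # vertical differences
    return dh.sum() + dv.sum()
```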
2.3.3 The Content Network
A straightforward way to learn the initial content prediction network is to train a regression network f that uses the response f(x) of an input image x (with the unknown region) to approximate the ground truth x_g at the region R. Recent studies have used various loss functions for image restoration tasks, for instance, ℓ2 loss, SSIM loss [220, 43, 159], ℓ1 loss [220], perceptual loss [88], and adversarial loss [148]. We experimented with the ℓ2 loss and the adversarial loss. For each training image, the ℓ2 loss is defined as:

L_{\ell_2}(x, x_g, R) = \| f(x) - h(x_g, R) \|_2^2    (2.6)

The adversarial loss is defined as:

L_{adv}(x, x_g, R) = \max_{D} \; \mathbb{E}_{x \in \mathcal{X}} \left[ \log(D(h(x_g, R))) + \log(1 - D(f(x))) \right]    (2.7)

where D is the adversarial discriminator.

We use the joint ℓ2 loss and adversarial loss the same way as the Context Encoder [148]:

L = \lambda L_{\ell_2}(x, x_g, R) + (1 - \lambda) L_{adv}(x, x_g, R)    (2.8)

where λ is 0.999 in our implementation.
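A sketch of this training objective is given below, assuming f is the content network, D a discriminator operating on the hole content, and `crop_hole` a placeholder for h(·, R); the generator uses the common non-saturating form of the adversarial term, and the λ = 0.999 weighting follows Eq. 2.8.

```python
import torch

def content_net_losses(f, D, x_masked, x_gt, crop_hole, lam=0.999):
    """Combined ell_2 + adversarial training losses of Eqs. 2.6-2.8 for one batch."""
    pred = f(x_masked)                              # predicted hole content
    target = crop_hole(x_gt)                        # h(x_g, R): ground-truth hole content
    l2 = ((pred - target) ** 2).mean()              # Eq. 2.6

    eps = 1e-8
    d_loss = -(torch.log(D(target) + eps).mean()                   # discriminator ascends Eq. 2.7
               + torch.log(1 - D(pred.detach()) + eps).mean())
    g_adv = -torch.log(D(pred) + eps).mean()        # non-saturating adversarial term for f

    g_loss = lam * l2 + (1 - lam) * g_adv           # Eq. 2.8 with lambda = 0.999
    return g_loss, d_loss
```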
2.3.4 The Texture Network
We use the VGG-19 [170] network pre-trained for ImageNet classification as the texture network, and use the relu3_1 and relu4_1 layers to calculate the texture term. We found that using a combination of relu3_1 and relu4_1 leads to more accurate results than using a single layer. As an alternative, we tried using the content network discussed in the previous section as the texture network, but found that the results are of lower quality than with the pre-trained VGG-19. This can be explained by the fact that the VGG-19 network was trained for semantic classification, so the features of its intermediate layers manifest strong invariance w.r.t. texture distortions. This helps infer a more accurate reconstruction of the hole content.
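As an illustration of how such texture features could be extracted, the sketch below pulls relu3_1 and relu4_1 from a frozen torchvision VGG-19; the slice indices follow the standard torchvision layer ordering and are an assumption, not part of the thesis:

```python
import torch
import torchvision

class VGGTextureFeatures(torch.nn.Module):
    """Extracts relu3_1 and relu4_1 activations from a frozen VGG-19."""
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg19(pretrained=True).features.eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.relu3_1 = vgg[:12]   # conv1_1 ... relu3_1 (index assumed)
        self.relu4_1 = vgg[:21]   # conv1_1 ... relu4_1 (index assumed)

    def forward(self, x):
        # Returns the two feature maps used for the texture term.
        return self.relu3_1(x), self.relu4_1(x)
```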
2.4 Experiments
This section evaluates our proposed approach visually and quantitatively. We first intro-
duce the datasets and then compare our approach with other methods, demonstrating its
effectiveness in high-resolution image inpainting. At the end of this section we show a
real-world application where we remove distractors from photos.
Datasets We evaluate the proposed approach on two different datasets: Paris StreetView [40] and ImageNet [162]. Labels or other information associated with these images are not used. Paris StreetView contains 14,900 training images and 100 test images. ImageNet has 1,260,000 training images, and we use 200 test images randomly picked from the validation set. We also picked 20 images with distractors to test our algorithm for distractor removal.
Experimental Settings We first compare our method with several baseline methods in the low-resolution setting (128×128). First, we compare with the results of Context Encoder trained with the \ell_2 loss. Second, we compare our method with the best results that Context Encoder has achieved using adversarial loss, which is the state of the art in the area of image inpainting using deep learning. Finally, we compare with the results of Content-Aware Fill, which uses the PatchMatch algorithm, from Adobe Photoshop. These comparisons demonstrate the effectiveness of the proposed joint optimization framework.
While the comparisons with baselines show the effectiveness of the overall joint optimization algorithm and the role of the texture network in joint optimization, we further analyze the separate roles of the content network and the texture network by changing their weights in the joint optimization.
Finally, we show our results on high-resolution image inpainting and compare with Content-Aware Fill and Context Encoder (\ell_2 + adversarial loss). Note that for Context Encoder, the high-resolution results are obtained by directly upsampling the low-resolution outputs. Our approach shows significant improvement in terms of visual quality.
Quantitative Comparisons We first compare our method quantitatively with the baseline methods on low-resolution images (128×128) from the Paris StreetView dataset. Results in Table 2.1 show that our method achieves the highest numerical performance. We attribute this to the nature of our method: it can infer the correct structure of the image where Content-Aware Fill fails, and can also synthesize better image details compared with the results of Context Encoder (Fig. 2.4). In addition, we argue that quantitative evaluation may not be the most effective measure for the inpainting task, given that the goal is to generate realistic-looking content rather than exactly the same content that was in the original image.
Method | Mean L1 Loss | Mean L2 Loss | PSNR
Context Encoder (\ell_2 loss) | 10.47% | 2.41% | 17.34 dB
Content-Aware Fill | 12.59% | 3.14% | 16.82 dB
Context Encoder (\ell_2 + adversarial loss) | 10.33% | 2.35% | 17.59 dB
Our Method | 10.01% | 2.21% | 18.00 dB
Table 2.1: Numerical comparison on the Paris StreetView dataset. A higher PSNR value is better. The % sign in the table is included to facilitate reading.
Figure 2.4: Comparison with Context Encoder (\ell_2 loss), Context Encoder (\ell_2 loss + adversarial loss) and Content-Aware Fill. We can see that our approach fixes the wrong textures generated by Content-Aware Fill and is also clearer than the output of Context Encoder.
The effects of content and texture networks One ablation study we did was to drop the content constraint term and only use the texture term in the joint optimization. As shown in Fig. 2.8, without using the content term to guide the optimization, the structure of the inpainting results is completely incorrect. We also adjusted the relative weight between the content term and the texture term. Our finding is that using a stronger content constraint makes the result more consistent with the initial prediction of the content network, but it may lack high-frequency details. Similarly, using a stronger texture term gives sharper results but does not guarantee that the overall image structure is correct (Fig. 2.6).
Figure 2.5: Visual comparisons of ImageNet results. From top to bottom: input image, Content-Aware Fill, Context Encoder (\ell_2 and adversarial loss), our result. All images are scaled from 512×512 to fit the page size.
(a) Input image (b) \alpha = 1e-6 (c) \alpha = 1e-5 (d) \alpha = 4e-5
Figure 2.6: The effect of different texture weights \alpha.
The effect of the adversarial loss We analyze the effect of using the adversarial loss in training the content network. One may argue that without the adversarial loss, the content network is still able to predict the structure of the image, and the joint optimization will calibrate the textures later. However, we found that the quality of the initialization given by the content network is important to the final result. When the initial prediction is blurry (using the \ell_2 loss only), the final result becomes blurrier as well compared with using the content network trained with both \ell_2 and adversarial loss (Fig. 2.7).
(a) (b) (c) (d)
Figure 2.7: (a) Output of the content network trained with \ell_2 loss. (b) The final result using (a). (c) Output of the content network trained with \ell_2 and adversarial loss. (d) The final result using (c).
High-Resolution image inpainting We demonstrate our high-resolution (512×512) inpainting results in Fig. 2.5 and Fig. 2.10 and compare with Content-Aware Fill and Context Encoder (\ell_2 + adversarial loss). Since Context Encoder only works with 128×128 images, when the input is larger we directly upsample its 128×128 output to 512×512 using bilinear interpolation. In most of the results, our multi-scale, iterative approach combines the advantages of the other approaches, producing results with coherent global structure as well as high-frequency details. As shown in the figures, a significant advantage of our approach over Content-Aware Fill is that we are able to generate new textures, as we do not propagate the existing patches directly. However, one disadvantage is that, with our current implementation, our algorithm takes roughly 1 minute to fill a 256×256 hole of a 512×512 image on a Titan X GPU, which is significantly slower than Content-Aware Fill.
(a) (b) (c)
Figure 2.8: Evaluation of different components. (a) Input image. (b) Result without using the content constraint. (c) Our result.
Figure 2.9: Failure cases of our method.
Real-World Distractor Removal Scenario Finally, our algorithm is easily extended to handle holes of arbitrary shape. We first use a bounding rectangle to cover the arbitrary hole, which is again filled with mean-pixel values. After proper cropping and padding such that the rectangle is positioned at the center, the image is given as input to the content network. In the joint optimization, the content constraint is initialized with the output of the content network inside the arbitrary hole. The texture constraint is based on the region outside the hole. Fig. 2.11 shows several examples and their comparison with the Content-Aware Fill algorithm (note that Context Encoder is unable to handle arbitrary holes explicitly, so we do not compare with it here).
Figure 2.10: Visual comparisons of Paris StreetView results. From top to bottom: input image, Content-Aware Fill, Context Encoder (\ell_2 and adversarial loss) and our result. All images are scaled from 512×512 to fit the page size.
Figure 2.11: Arbitrary object removal. From left to right: input image, object mask,
Content-Aware Fill result, our result.
2.5 Conclusion
We have advanced the state of the art in semantic inpainting using neural patch synthesis. The insight is that the texture network is very powerful at generating high-frequency details, while the content network gives a strong prior about the semantics and global structure. This may be potentially useful for other applications such as denoising, super-resolution, retargeting and view/time interpolation. There are cases where our approach introduces discontinuities and artifacts (Fig. 2.9) when the scene is complicated. In addition, speed remains a bottleneck of our algorithm. We aim to address these issues in future work.
Chapter 3
Image Inpainting using Block-wise
Procedural Training with Annealed
Adversarial Counterpart
3.1 Introduction
Image inpainting is the task of filling in the missing part of an image with visually plausible contents. It is one of the most common operations in image editing [59] and low-level computer vision [102, 72]. The goal of image inpainting is to create semantically plausible contents with realistic texture details. The inpainted contents can be consistent with the original image, or different but coherent with the known context. Other than restoring and fixing damaged images, inpainting can also be used to remove unwanted objects or, in the case of guided inpainting, to composite with another guide image. In the latter scenario, we often need inpainting to fill in the gaps and remove discontinuities between the target region of interest on the guide image and the source context. Image harmonization is also required to adjust the appearance of the guide image such that it is compatible with the source, making the final composition appear natural.
Traditional image inpainting methods mostly develop texture synthesis techniques to address the problem of hole filling [11, 102, 203, 10, 12, 205]. In [10], Barnes et al. propose the PatchMatch algorithm, which efficiently searches for the most similar patches to reconstruct the missing regions. Wilczkowiak et al. [205] take further steps and detect desirable search regions to find better matching patches. However, these methods only exploit the low-level signal of the known contexts to hallucinate missing regions and fall short of understanding and predicting high-level semantics. Furthermore, it is often inadequate to capture the global structure of images by simply extending texture from surrounding regions. Another line of work in inpainting aims to fill in holes with contents from another guide image by using composition and harmonization [72, 190]. The guide is often retrieved from a large database based on similarity with the source image and is then combined with the source. Although these methods are able to propagate high-frequency details from the guide image, they often lead to inconsistent regions and gaps which are easily detectable by the human eye.
(a) Input (b) GLI [83] (c) Ours (d) Input (e) DH [190] (f) Ours
Figure 3.1: An example of inpainting (top) and harmonization (bottom) compared with state-of-the-art methods. Zoom in for best viewing quality.
More recently, deep neural networks have shown excellent performance in various image completion tasks, such as texture synthesis and image completion. For inpainting, adversarial training has become the de facto strategy to generate sharp details and natural-looking results [149, 213, 117, 212, 83]. Pathak et al. [149] first propose to train an encoder-decoder model for inpainting using both a reconstruction loss and an adversarial loss. In [213], Yeh et al. use a pre-trained model to find the most similar encoding of a corrupted image and use the found encoding to synthesize what is missing. In [212], Yang et al. propose a multi-scale approach and optimize the hole contents such that neural features extracted from a pre-trained CNN match the features of the surrounding context. The optimization scheme improves the inpainting quality and resolution at the cost of computational efficiency. In [83], Iizuka et al. propose a deep generative model trained with global and local adversarial losses, which achieves good inpainting performance for mid-size images and holes. However, it requires extensive training (two months as described in the paper), and the results often contain excessive artifacts and noise patterns. Another limitation of [212] and [83] is that they are unable to handle perceptual discontinuity, making it necessary to resort to post-processing (e.g., Poisson blending).
In practice, we found that directly training a very deep generative network to synthesize high-frequency details is challenging due to optimization difficulty and unstable adversarial training. As a result, as the network becomes deeper, the inpainting quality may worsen. To overcome this difficulty, we propose a new training scheme that guides and stabilizes the training of a very deep generative model, which, combined with carefully designed training losses, significantly reduces artifacts and improves the quality of the results. The main strategy, referred to as Block-wise Procedural Training (BPT), progressively increases the number of residual blocks and the depth of the network. With BPT, we first train a cGAN-based Generator Head until convergence, followed by adding more residual blocks one at a time, which refines and improves the results. This enables us to train a network deeper than [83] and to generate more realistic-looking details. We also observe that to reduce the noise level, it is essential to steadily reduce the weight of the adversarial loss given to the generator. We refer to this training scheme as Adversarial Loss Annealing (ALA). Finally, we also propose two new losses specifically designed for inpainting: the Patch Perceptual Loss (PPL) and the Multi-Scale Patch Adversarial Loss (MSPAL). Experiments show that these losses work better than traditional losses used for inpainting, such as \ell_2 and the more general adversarial loss.
To evaluate the proposed approach, we conduct extensive experiments on a variety of datasets. As shown by qualitative and quantitative results, as well as by a user study, our approach generates high-quality inpainting results and outperforms other state-of-the-art methods. We also show that our framework, although designed for inpainting, can be directly applied to general image translation tasks such as image harmonization and composition, and easily achieves results superior to other methods (Fig. 3.1). This enables us to train a multi-task model and use it for interactive guided inpainting, which is a very common and useful image editing scenario. We describe it in detail in Sec. 3.4.4.
In summary, in this paper we present:
1. A novel and effective framework that generates high-quality results for several
image editing tasks like inpainting and harmonization. We also provide a thorough
analysis and ablation study about the components in our framework.
2. Extensive qualitative and quantitative evaluations on a variety of datasets, such as
scenes, objects, and faces. We also conduct a user-study to make fair and rigorous
comparisons with other state-of-the-art methods.
3. Results of interactive guided inpainting as a novel and useful application of our
approach.
3.2 Related Work
Deep Image Generation and Manipulation Generative Adversarial Networks (GANs) [66] use a min-max two-player game to alternately train a generator and a discriminator, and have shown an impressive ability to generate natural-looking images of high quality. However, the training instability of the original GAN makes it hard to scale to large images, and many more advanced techniques have been proposed [38, 152, 221, 6, 70]. Recently, Progressive GAN [92] was proposed, which can be trained to generate images of unprecedented quality. Our procedural block-wise training and Progressive GAN share the basic idea of progressively increasing the depth of the network during training. However, both the problems being studied (image generation vs. conditional synthesis) and the model architectures used are different.
Adversarial training has also been applied to many image editing tasks [95, 41, 112, 85, 225]. For image inpainting, many DNN-based approaches achieve good performance using different network topologies and training procedures [149, 212, 213, 83]. For image harmonization, which aims to adjust the appearance of an image composition such that it looks more natural and plausible, recent approaches mostly use deep neural networks to leverage their expressive power and semantic knowledge [223, 190].
Non-neural Inpainting and Harmonization Traditional image completion algorithms are either diffusion-based [11, 53] or patch-based [12, 10]. Diffusion-based methods usually cannot synthesize plausible contents for large holes or textures, because they only propagate low-level features. Patch-based methods, however, largely rely on the assumption that the desired patches exist in the database. For harmonization, traditional methods usually apply color and tone optimization by matching global or multi-scale statistics [154, 175], extracting gradient-domain information [150, 181], utilizing semantic clues [191] or leveraging external data [89].
3.3 Our Method
Our method consists of training a generator head as initialization and optimizing additional residual blocks as refinement. The generator head is trained for inpainting and is based on the conditional GAN (cGAN) framework. A vanilla cGAN for image inpainting consists of a generator G and a discriminator D. The generator learns to predict the hole contents and restore the complete image, while the discriminator learns to distinguish real images
from generated images. Using the original image as ground truth, the model is trained
in a self-supervised manner via the following minimax objective:
\min_G \max_D \; \mathbb{E}_{(s,x)}[\log D(s, x)] + \mathbb{E}_s[\log(1 - D(s, G(s)))]    (3.1)
Here s and x are the incomplete image given as input and the original image as the target. G(s) is the output of the generator, which could be the hole content or the complete image. In the case where G(s) restores the complete image, only the contents inside the hole are kept and combined with the known regions of s.
Figure 3.2: Generator head and the training losses. We only illustrate one scale of the Patch Adversarial Loss and the Patch Perceptual Loss. Note that to compute the Patch Adversarial Loss, we need to use the mask to determine which patches overlap with the hole.
3.3.1 The Generator Head
Existing works have investigated different architectures for G, most notably the encoder-decoder style of [149] and the FCN style of [83]. [83] shows that an FCN generates higher-quality and less blurred results compared with an encoder-decoder, largely because the network used is much deeper and fully convolutional. In contrast, the encoder-decoder in [149] uses an intermediate bottleneck fully-connected layer, which results in size reduction and information loss.
Similar to [83], our generator head is based on an FCN and leverages the many properties of convolutional neural networks, such as translation invariance and parameter sharing. However, a major limitation of an FCN is the constraint of the receptive field size. Given that the convolution layers are locally connected, pixels far away from the hole carry no influence on the hole. We propose several modifications to alleviate this drawback. First, we build our network with three components: a down-sampling front end to reduce the size, followed by multiple residual blocks [75], and an up-sampling back end to restore the full dimension. Using down-sampling increases the receptive field of the residual blocks and makes it easier to learn the transformation at a smaller size. Second, we stack multiple residual blocks to further enlarge the view at later layers. Finally, we adopt dilated convolution [216] in all residual blocks, with the dilation factor set to 2. Dilated convolutions use spaced kernels, computing each output value with a wider view of the input without increasing the number of parameters or the computational burden. From experiments, we observe that increasing the size of the receptive field, especially by using dilated convolution, is critical for enhanced inpainting quality. In contrast, other image translation tasks such as super-resolution, denoising, etc., usually rely more on local statistics than on global context.
The details of our architecture are as follows: the down-sampling front end consists of three convolutional layers, each with a stride of 2. The intermediate part contains 9 residual blocks, and the up-sampling back end consists of three transposed convolutions, also with a stride of 2. Each convolutional layer is followed by batch normalization (BN) and ReLU activation, except for the last layer, which outputs the image. A similar architecture without dilated convolution has been used in [199] for image synthesis. Compared with [83], our receptive field is much larger, as we adopt more down-sampling and dilated layers. This allows us to generate better results with fewer artifacts. As an ablation study, we compare different types of layers and architectures in Sec. 3.4.3.
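To make the architecture concrete, the following is a minimal, non-authoritative PyTorch sketch of such a generator head; the channel widths and the input of an RGB image concatenated with the hole mask are assumptions:

```python
import torch
import torch.nn as nn

class DilatedResBlock(nn.Module):
    """Residual block whose convolutions use a dilation factor of 2."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=2, dilation=2), nn.BatchNorm2d(ch), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, padding=2, dilation=2), nn.BatchNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

class GeneratorHead(nn.Module):
    """Three stride-2 convs -> 9 dilated residual blocks -> three stride-2 transposed convs."""
    def __init__(self, in_ch=4, base=64):   # in_ch = RGB + hole mask; widths are assumptions
        super().__init__()
        layers, ch = [], base
        layers += [nn.Conv2d(in_ch, ch, 3, stride=2, padding=1), nn.BatchNorm2d(ch), nn.ReLU(True)]
        for _ in range(2):                   # remaining two down-sampling convolutions
            layers += [nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1),
                       nn.BatchNorm2d(ch * 2), nn.ReLU(True)]
            ch *= 2
        layers += [DilatedResBlock(ch) for _ in range(9)]
        for _ in range(2):                   # first two up-sampling transposed convolutions
            layers += [nn.ConvTranspose2d(ch, ch // 2, 3, stride=2, padding=1, output_padding=1),
                       nn.BatchNorm2d(ch // 2), nn.ReLU(True)]
            ch //= 2
        # Last layer outputs the RGB image directly (no BN/ReLU).
        layers += [nn.ConvTranspose2d(ch, 3, 3, stride=2, padding=1, output_padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```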
3.3.2 The Training Losses
Different losses have been used to train an inpainting network. These losses can be cast into two categories. The first category, which can be referred to as the similarity loss, measures the similarity between the output and the original image. The second category, which we refer to as the realism loss, measures how realistic-looking the output image is. We summarize the losses used in different inpainting methods in Table 3.1.
Method | Similarity Loss | Realism Loss
Context Encoder (CE) [149] | \ell_2 | Global Adversarial Loss
Global Local Inpainting (GLI) [83] | \ell_2 | Global and Local Adversarial Loss
Our Approach | Patch Perceptual Loss (PPL) | Improved Multi-Scale Adversarial Loss
Table 3.1: Comparison of training losses used in different methods.
Patch Perceptual Loss As shown in Table 3.1, using \ell_2 as the reconstruction loss to measure the difference between the output and the original image has been the default choice of previous inpainting methods. However, it is known that the \ell_2 loss does not correspond well to human perception of visual similarity (Zhang et al. [219]). This is because using \ell_2 assumes each output pixel is conditionally independent of all others, which is not the case. An example is that blurring an image leads to small changes in terms of \ell_2 distance but causes a significant perceptual difference. Recent research suggests that a better metric for perceptual similarity is the internal activations of deep convolutional networks, usually trained on a high-level image classification task. Such a loss is called a "perceptual loss", and is used in various tasks such as neural style transfer [63], image super-resolution [87], and conditional image synthesis [44, 27].
Based on this observation, we propose a new "patch perceptual loss" (PPL) as a substitute for the \ell_2 loss. Traditional perceptual losses typically use VGG-Net and compute the \ell_2 distance of the activations at a few feature layers. Recently, [219] trained a specific perceptual network based on AlexNet to measure the perceptual differences between two image patches, making it an ideal candidate for our task. The perceptual network computes the activations and sums up the \ell_2 distances across all feature layers, each scaled by a learned weight. Furthermore, taking both the local view and the global view into account, we compute the PPL at two scales. The local PPL considers the local hole patch, while the global PPL slightly zooms out to cover a larger contextual area. More formally, our PPL is defined as:

\sum_{p=1,2} \mathrm{PPL}_p(G(s)_p, x_p) = \sum_{p=1,2} \sum_l \frac{1}{H_l W_l} \sum_{h,w} \| w_l^T (\hat{F}(x_p)^l_{hw} - \hat{F}(G(s)_p)^l_{hw}) \|_2^2    (3.2)

Here p refers to the hole patch at the two scales, \hat{F} is the AlexNet, l is the feature layer, and w_l is the layer-wise learned weight. The ablation study in Sec. 3.4.3 shows that our patch perceptual loss gives better inpainting quality than the traditional \ell_2 loss.
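As a rough illustration (not the thesis code), the learned AlexNet-based perceptual distance of [219] is available in the lpips package, and the two-scale patch loss could be sketched as follows; the crop behavior and the extract_patch helper are assumptions:

```python
import lpips
import torch

# Learned perceptual metric of Zhang et al. [219], built on AlexNet.
percep = lpips.LPIPS(net='alex')

def patch_perceptual_loss(output, target, hole_box, zoom=2):
    """Sum of the local PPL (tight crop around the hole) and the global PPL
    (the same crop enlarged by `zoom`). Images are assumed to be in [-1, 1]."""
    loss = 0.0
    for scale in (1, zoom):
        pred_patch = extract_patch(output, hole_box, scale)   # assumed helper: returns NxCxhxw crop
        gt_patch = extract_patch(target, hole_box, scale)
        loss = loss + percep(pred_patch, gt_patch).mean()
    return loss
```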
Multi-Scale Patch Adversarial Loss Adversarial losses are given by trained discriminators that discern whether an image is real or fake. The global adversarial loss of [149] takes the entire image as input and outputs a single real/fake prediction, which does not consider the local realism of holes. The additional local adversarial loss of [83] adds another discriminator specifically for the hole, but it requires the hole to be fixed in shape and size during training to fit the local discriminator. To consider both the global view and the local view, and to be able to use multiple holes of arbitrary shape, we propose to use PatchGAN discriminators [85] at three scales of image resolution. The discriminator at each scale is identical, and only the input is a differently scaled version of the entire image. Each discriminator is a fully convolutional PatchGAN and outputs a vector of real/fake predictions, where each value corresponds to a local image patch. In this way, the discriminators are trained to classify global and local patches across the image at multiple scales, and this enables us to use multiple holes of arbitrary shape since the input is now the entire image rather than the hole itself. However, since the output image is the composition of synthesized holes and the known background, directly using PatchGAN is problematic because it does not differentiate between the hole patches (fake) and the background patches (real). To address this issue, when computing the PatchGAN loss, only the patches that overlap with the holes are labeled as fake. More formally, our patch-wise adversarial loss is defined as:
\min_G \max_{D_1, D_2, D_3} \sum_{k=1,2,3} L_{GAN}(G, D_k)    (3.3)

= \sum_{k=1,2,3} \mathbb{E}_{(s_k, x_k)}[\log D_k(s_k, x_k)] + \mathbb{E}_{s_k}[\log(Q_k - D_k(s_k, G(s_k)))]    (3.4)

Here k refers to the image scale, and Q_k is a patch-wise real/fake label vector determined by whether each patch overlaps with the holes. Using multiple GAN discriminators at the same or different image scales has been proposed for unconditional GANs [49] and conditional GANs [199]. Here we extend the design to take the holes into account, which is critical for obtaining semantically and locally coherent image completion results.
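A hedged sketch of how such hole-aware patch labels could be combined with three PatchGAN discriminators is shown below (binary cross-entropy stands in for the log terms; the scale factors, the 6-channel conditional input, and the pooling of the mask onto the patch grid are assumptions):

```python
import torch
import torch.nn.functional as F

def mspal_d_loss(discs, incomplete, real, fake, hole_mask):
    """Discriminator-side multi-scale patch adversarial loss.
    discs: list of 3 PatchGAN discriminators, each mapping a conditional input
    (incomplete image concatenated with an image) to an N x 1 x h x w grid of
    sigmoid scores. hole_mask: N x 1 x H x W, 1 inside holes."""
    loss = 0.0
    for k, d in enumerate(discs):
        scale = 0.5 ** k                               # scales 1, 1/2, 1/4 (assumed)
        s_k = F.interpolate(incomplete, scale_factor=scale, mode='bilinear', align_corners=False)
        x_k = F.interpolate(real, scale_factor=scale, mode='bilinear', align_corners=False)
        g_k = F.interpolate(fake, scale_factor=scale, mode='bilinear', align_corners=False)

        pred_real = d(torch.cat([s_k, x_k], dim=1))
        pred_fake = d(torch.cat([s_k, g_k], dim=1))

        # Q_k: a patch of the generated image is labeled fake (0) only if it overlaps a hole.
        grid = tuple(pred_fake.shape[-2:])
        q_k = 1.0 - (F.adaptive_max_pool2d(hole_mask, grid) > 0).float()

        loss = loss + F.binary_cross_entropy(pred_real, torch.ones_like(pred_real))
        loss = loss + F.binary_cross_entropy(pred_fake, q_k.expand_as(pred_fake))
    return loss
```

The generator side would use the reversed labels, mirroring the annealed term described in Sec. 3.3.3.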
To summarize, our full objective, combining both losses, is defined as:

\min_G \Big( \big( \max_{D_1, D_2, D_3} \lambda_{Adv} \sum_{k=1,2,3} L_{GAN}(G, D_k) \big) + \lambda_{PPL} \sum_{p=1,2} \mathrm{PPL}_p(G(s)_p, x_p) \Big)    (3.5)

We use \lambda_{Adv} and \lambda_{PPL} to control the importance of the two terms. In our experiments, we set the initial \lambda_{Adv} = 1 and \lambda_{PPL} = 10, as used in [149, 199]. As training progresses, we gradually decrease \lambda_{Adv} based on adversarial loss annealing (Sec. 3.3.3). Fig. 3.2 illustrates the architecture of our generator head and the training losses.
Figure 3.3: Illustration of block-wise procedural training. The yellow residual blocks
refer to the generator head, which is trained first. The green residual blocks are progressively added one at a time. We also draw the skip connections between the already
trained residual blocks and the up-sampling back end.
3.3.3 Blockwise Procedural Training with Adversarial Loss
Annealing
Our experiments show that using the described generator head and training losses already generates results of decent quality. An intuitive effort to improve the results would be to stack more intermediate residual blocks to further expand the receptive view and increase the expressiveness of the model. However, we found that directly stacking more residual blocks makes it more difficult to stabilize the training. As the parameter space becomes much larger, it is also more challenging to find a good local optimum. In the end, the inpainting quality deteriorates as the model depth increases.
To address the challenge of training a deeper network, we propose to use procedural block-wise training to gradually increase the depth of the inpainting network. More specifically, we begin by training the generator head until it converges. Then, we add a new residual block after the already trained residual blocks, right before the back-end up-sampling layers. In order to smoothly introduce the new residual block without breaking the trained model, we add another skip path from the already trained residual blocks to the up-sampling layers. Initially, the weight of the skip path is set to 1, while the weight of the path containing the new block is set to 0. This essentially makes the initial network identical to the already trained network. We then slowly decrease the weight of the skip path and increase the weight of the new residual block as training progresses. In this way, the newly introduced residual block is trained to be a fine-tuning component, which adds another layer of fine details to the original results. This step is repeated multiple times, and each time we deepen the network by adding a new residual block. In our experiments, we found that the results improve significantly over the generator head after training with the first residual block added. The output stabilizes after three residual blocks, and very few changes can be detected if more residual blocks are brought in. We illustrate the procedural training process in Fig. 3.3.
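A minimal sketch of this skip-path blending is given below; the linear ramp schedule is an assumption, since the thesis only states that the weight is moved slowly from one path to the other:

```python
import torch
import torch.nn as nn

class BlendedBlock(nn.Module):
    """Wraps a newly added residual block so that the network starts out identical
    to the already trained one (alpha = 0) and gradually hands control to the new
    block's path (alpha -> 1)."""
    def __init__(self, new_block):
        super().__init__()
        self.new_block = new_block
        self.alpha = 0.0            # weight of the path containing the new block

    def forward(self, x):
        # Skip path keeps weight (1 - alpha); the new block's path gets alpha.
        return (1.0 - self.alpha) * x + self.alpha * self.new_block(x)

def update_alpha(block, step, ramp_steps=1000):
    """Assumed linear ramp from 0 to 1 over `ramp_steps` iterations."""
    block.alpha = min(1.0, step / float(ramp_steps))
```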
Block-wise procedural training has several benefits. First, it guides the training process of a very deep generator. Starting with the generator head and gradually fine-tuning with more residual blocks makes it easier to discover the mapping between the incomplete image and the complete image, even though the search space is huge given the diversity of natural images and the randomness of holes. Another benefit is training efficiency, as we found that decoupling the training of the generator head and the fine-tuning of the additional residual blocks requires significantly less training time than training the whole network at once. A similar idea of progressive training has been used in Progressive GAN [92]; however, both the task and the model are different in our case.
Adversarial Loss Annealing During training, the generator adversarial loss updates the generator weights when the discriminator successfully detects the generated image as fake:

\sum_{k=1,2,3} \mathbb{E}_s[\log(\bar{Q}_k - D_k(s_k, G(s_k)))]    (3.6)

Note that here \bar{Q}_k reverses the labels Q_k of Sec. 3.3.2, as this is the loss given to the generator. We observe that the generator adversarial loss becomes dominant over the PPL as training progresses, since the discriminator becomes increasingly good at detecting fake images during training. This is less of a problem for image synthesis tasks. However, for the inpainting task, which requires the output to be faithful to the original image, the outcome is that the generator deliberately adds noise patterns to confuse the discriminators, leading to artifacts or incorrect textures. Based on this observation, we propose adversarial loss annealing, which decreases the weight of the adversarial loss for the generator at the time of adding new residual blocks. More formally, let the initial weight of the generator adversarial loss be \lambda^0_{adv}, and let the weight after adding the i-th residual block be \lambda^i_{adv}. We found that simply decaying the weight by setting \lambda^i_{adv} = 10^{-i} \lambda^0_{adv} gives results significantly better than using a constant weight. In Sec. 3.4, we analyze the effect of ALA in more detail.
Finally, we summarize the discussed training schemes in Alg. 1.
Algorithm 1 Training the Inpainting Network
1: Set the batch size to 8 and the base learning rate lr_0 \leftarrow 0.0002.
2: Set \lambda_{PPL} \leftarrow 10 and \lambda^0_{adv} \leftarrow 1.
3: Train the Generator Head G_0 using MSPAL and PPL (Sec. 3.3.2) for 150,000 iterations.
4: for i = 1 to 3 do
5:    Add the skip path and the residual block r_i.
6:    Set lr_i \leftarrow 10^{-i} lr_0 and \lambda^i_{adv} \leftarrow 10^{-i} \lambda^0_{adv}.
7:    Train the generator with the added r_i for 1,500 iterations.
8: end for
9: return G_3
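The schedule of Algorithm 1 could be driven by a loop of roughly the following shape; this is a non-authoritative sketch, and train_iters and add_block_with_skip are assumed helpers standing in for the BPT mechanics described above:

```python
import torch

def train_inpainting(G0, discs, data_loader):
    lr0, lam_ppl, lam_adv0 = 2e-4, 10.0, 1.0
    G = G0

    # Stage 0: generator head with MSPAL + PPL.
    opt = torch.optim.Adam(G.parameters(), lr=lr0)
    train_iters(G, discs, opt, data_loader, n_iters=150_000,
                lam_adv=lam_adv0, lam_ppl=lam_ppl)

    # Stages 1-3: add one residual block at a time with annealed settings.
    for i in range(1, 4):
        G = add_block_with_skip(G)                       # assumed helper (BPT skip path)
        opt = torch.optim.Adam(G.parameters(), lr=lr0 * 10 ** (-i))
        train_iters(G, discs, opt, data_loader, n_iters=1_500,
                    lam_adv=lam_adv0 * 10 ** (-i),       # adversarial loss annealing
                    lam_ppl=lam_ppl)
    return G
```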
3.4 Results
In this section, we first describe our datasets and experimental setting (Sec. 3.4.1). We then provide quantitative and qualitative comparisons with several methods, and also report a subjective human perceptual test based on a user study (Sec. 3.4.2). In Sec. 3.4.3, we conduct several ablation studies on the design choices of the framework. Finally, we show how our method can be applied to image harmonization and interactive guided inpainting (Sec. 3.4.4).
3.4.1 Experiment Setup
We evaluate our inpainting method on ImageNet [162], COCO [118], Place2 [222] and
CelebA [228]. CelebA consists of 202,599 face images with various viewpoints and
expressions. For a fair comparison, we follow the standard split with 162,770 images
for training, 19,867 for validation and 19,962 for testing.
In order to compare with existing methods, we train on images of size 256x256.
We also train another network at a larger scale of 512x512 to demonstrate its ability
to handle higher resolutions. For each image, we apply a mask with a single hole or
multiple holes placed at random locations. The size of hole is between 1/4 and 1/2 of
the image’s size. We set the batch size to 8, and regardless of the actual dataset size,
we train 150,000 iterations for the generator head and another 1,500 iterations for each
additional residual block. For each dataset, the training takes around 2 days to finish on
a single Titan X GPU, which is significantly less time than [83].
3.4.2 Comparison with Existing Methods
For ImageNet, COCO and Place2, we compare our results with Content-Aware Fill
(CAF) [10], Context Encoder (CE) [149], Neural Patch Synthesis (NPS) [212] and
Global Local Inpainting (GLI) [83]. For CE, NPS, and GLI, we use pre-trained mod-
els made available by the authors. CE and NPS are only trained with holes with fixed
size and location (image center), while CAF and GLI can handle arbitrary holes. For
fair comparisons, we evaluate on both settings. For center hole completion, we com-
pare with CAF, CE, NPS and GLI on ImageNet [162] test images. For random hole
completion, we compare with CAF and GLI using test images from COCO and Place2.
Only GLI applies Poisson blending as post-processing. We also show the results of our Generator Head for comparison. The images shown in Fig. 3.4 and 3.5 are randomly sampled from the test set.
(a) Input (b) CAF [10] (c) CE [149] (d) NPS [212] (e) GLI [83] (f) GH (g) Final
Figure 3.4: Fixed hole result comparisons. GH are results generated by the generator head, and Final are results generated with block-wise procedural training. All images have original size 256x256. Zoom in for best viewing quality.
We see that, while CAF can generate realistic-looking details using patch propagation, it often fails to capture the global structure or synthesize new contents. CE's results are significantly blurrier as it can only inpaint low-resolution images. NPS's results heavily depend on CE's initialization. Our approach is much faster and generates results of higher visual quality most of the time. Compared with GLI, our results do not need post-processing and are more coherent with fewer artifacts. Our final results also show significant improvement over the Generator Head, demonstrating the effectiveness of block-wise procedural fine-tuning.
(a) Input (b) CAF [10] (c) GLI [83] (d) GH (e) Final
Figure 3.5: Random hole completion comparison. GH are results generated by the generator head, and Final are results generated with block-wise procedural training. All images have original size 256x256. Zoom in for best viewing quality.
For CelebA, we compare our results with Generative Face Completion [117] in Fig. 3.6. Since the model of [117] inpaints images of 128x128, we directly up-sample its results to 256x256 for comparison. The images shown are also chosen at random. We can see that, although [117] is an approach specific to face inpainting, our method for general inpainting tasks actually generates much better face completion results.
Quantitative Evaluation Table 3.2 shows a quantitative comparison between CAF, CE, GLI and our approach. The values are computed on a random subset of 200 images selected from the test set. Both the \ell_1 and \ell_2 errors are computed with pixel values normalized to [0, 1] and are summed over all pixels of the image. We can see that our method performs better than the other methods in terms of SSIM and \ell_1 error. For \ell_2, CAF and GLI have smaller errors. This is understandable as our model is trained with a perceptual loss rather than an \ell_2 loss. Besides, the \ell_2 loss rewards averaged color values and does not faithfully reflect perceptual quality. From the numerical values, we can also see that procedural fine-tuning reduces the errors compared with the generator head.
(a) Input (b) GFC [117] (c) Ours (d) Input (e) GFC [117] (f) Ours
Figure 3.6: Face completion results compared with [117].
Method | Mean \ell_1 Error | Mean \ell_2 Error | SSIM
CAF [10] (center) | 968.8 | 209.5 | 0.9415
CAF [10] (random) | 1660 | 363.5 | 0.9010
CE [149] (center) | 2693 | 545.8 | 0.7719
CE [149] (random) | N/A | N/A | N/A
GLI [83] (center) | 868.6 | 269.7 | 0.9452
GLI [83] (random) | 1640 | 378.7 | 0.9020
Ours, Generator Head (center) | 913.8 | 245.6 | 0.9458
Ours, Generator Head (random) | 1629 | 439.4 | 0.9073
Ours, Final (center) | 838.3 | 253.3 | 0.9486
Ours, Final (random) | 1609 | 427.0 | 0.9090
Table 3.2: Numerical comparison between CAF, CE, GLI, our generator head results and our final results. Rows marked (center)/(random) correspond to center/random region completion. Note that for SSIM, larger values mean greater similarity in terms of content structure and indicate better performance.
User Study To more rigorously evaluate the performance, we conduct a user study based on the random hole results. We asked for feedback from 20 users, giving each user 30 tuples of images to rank. Each tuple contains results of NPS, GLI, and ours. The user study shows that our method has the highest scores, outperforming NPS and GLI by a large margin. Among 600 total comparisons, our results are ranked the best 72.3% of the time. Our results are overwhelmingly better than those of NPS, as our results are ranked better 95.8% of the time. Compared with GLI, our results are ranked better 73.5% of the time and are ranked the same 15.5% of the time.
3.4.3 Ablation Study
Comparison of Convolutional Layers Choosing the proper convolutional layer improves the inpainting quality and also reduces the noise. We consider three types of convolutional layers: original, dilated [216] and interpolated [141]. We train three networks under the same setting to specifically test their effects. We observe that using dilation significantly improves the inpainting quality, while using interpolated convolution tends to generate over-smoothed results. Fig. 3.7 shows a qualitative comparison.
(a) (b) (c) (d) (e)
Figure 3.7: Visual comparison of a result using different types of convolutional layers.
(a) Input; (b) Original Conv; (c) Interpolated Conv; (d) Dilated+Interpolated Conv; (e)
Dilated Conv (ours).
Effect of Procedural Training The qualitative comparisons in Sec. 3.4.2 show that procedurally adding residual blocks for fine-tuning improves the results. However, one may wonder whether we can directly train a deeper network. To find out, we trained a network with 12 residual blocks from scratch, keeping all other hyper-parameters the same. Fig. 3.8 shows two examples of the resulting comparisons. We found that further increasing the number of residual blocks has a negative effect on inpainting, possibly due to a variety of factors such as gradient vanishing, optimization difficulty, etc. This demonstrates the necessity and importance of the procedural training scheme.
(a) Input (b) (c) (d) Input (e) (f)
Figure 3.8: Visual comparison of results by directly training a network of 12 blocks
((b),(e)) and procedural training ((c), (f)).
Effect of Adversarial Loss Annealing We also investigate the effect of adversarial
loss annealing and train another network without using ALA. Visually, we observe that
using adversarial loss annealing reduces the noise level without sacrificing the overall
sharpness and realism. We show two randomly selected examples in Fig. 3.9 to illustrate
the effects.
(a) Input (b) w/o ALA (c) ALA (d) Input (e) w/o ALA (f) ALA
Figure 3.9: Visual comparison of results trained without and with ALA.
3.4.4 Interactive Guided Inpainting
In this section, we consider object-based guided inpainting as a practical scenario. Specifically, we assume the user would like to add objects from another guide image to the input image. Given an input image I and a guide image I_g, we allow the user to select a region of I_g by dragging a bounding box B_g containing the object they desire to add. Note that unlike previous settings of image composition, we do not require the user to accurately segment the object; providing a bounding box is sufficient. This greatly reduces the workload and simplifies the task. Next, we use a segmentation network to extract the object foreground. The segmentation network is adapted from Mask R-CNN [73] and trained on COCO, but is agnostic to object categories and only considers the foreground/background classification. Finally, we resize and paste B_g onto I. To make the composition look natural and realistic, we need to remove and inpaint the background of B_g and also perform harmonization on the foreground object such that it is coherent with the overall image appearance.
To accomplish this task with known segmentation, we need to address two separate problems: inpainting and harmonization. Our model can easily be extended to train for both tasks at the same time, requiring only changes to the input and output. To train this model, the data acquisition is similar to [190], where we create artificially composited images using statistical color transfer [154] and another randomly chosen guide image. The network jointly trained for inpainting and harmonization performs well in both tasks. Fig. 3.10 shows that it generates harmonization results better than [190], which is the state-of-the-art deep harmonization method. Finally, in Fig. 3.11 we show several examples of interactive guided inpainting results.
(a) Input (b) DH [190] (c) Ours (d) Input (e) DH [190] (f) Ours
Figure 3.10: Examples of image harmonization results. For (a) and (d), the microwave and the zebra in the back have unusual colors. Our method correctly adjusts their appearance and makes the images coherent and realistic.
(a) Input (b) Segmentation (c) Inpainting (d) Final result
Figure 3.11: Examples of interactive guided inpainting result. The segmentation mask
is given by our foreground/background segmentation network trained on COCO. The
final result combines the outputs of harmonization and inpainting.
3.5 Conclusion
We propose a novel model and training scheme to address the inpainting problem and achieve excellent performance on several tasks, including general inpainting, image harmonization and guided inpainting. Although we only explore the ability of our approach in image-editing-related tasks, we believe it is a general methodology that could easily be applied to other generative problems, such as image translation, image synthesis, etc. As future work, we plan to study the effectiveness of the proposed network and losses, and especially the procedural training scheme, on a larger scope.
Chapter 4
Show, Attend and Translate:
Unsupervised Image Translation with
Self-Regularization and Attention
4.1 Introduction
(a) Input image (b) Predicted Attention Map (c) Final result (d) CycleGAN [225]
Figure 4.1: Horse-to-zebra image translation. Our model learns to predict an attention map (b) and translates the horse to a zebra while keeping the background untouched (c). By comparison, CycleGAN [225] significantly alters the appearance of the background together with the horse (d).
Many computer vision problems can be cast as an image-to-image translation prob-
lem: the task is to map an image of one domain to a corresponding image of another
domain. For example, image colorization can be considered as mapping gray-scale
images to corresponding images in RGB space [218]; style transfer can be viewed as
translating images in one style to corresponding images with another style [63, 87, 61].
Other tasks falling into this category include semantic segmentation [128], super-
resolution [112], image manipulation [85], etc. Another important application of image
translation is related to domain adaptation and unsupervised learning: with the rise of
deep learning, it is now considered crucial to have large labeled training datasets. How-
ever, labeling and annotating such large datasets are expensive and thus not scalable.
An alternative is to use synthetic or simulated data for training, whose labels are triv-
ial to acquire [227, 194, 163, 158, 151, 133, 90, 31]. Unfortunately, learning from
synthetic data can be problematic and most of the time does not generalize to real-
world data, due to the data distribution gap between the two domains. Furthermore,
due to the deep neural networks’ capability of learning small details, it is anticipated
that the trained model would easily over-fit to the synthetic domain. In order to close
this gap, we can either find mappings or domain-invariant representations at feature
level [18, 56, 130, 173, 195, 69, 24, 4, 96] or learn to translate images from one domain
to another domain to create “fake” labeled data for training [17, 225, 122, 112, 123, 215].
In the latter case, we usually hope to learn a mapping that preserves the labels as well as
the attributes we care about.
Typically, there exist two settings for image translation given two domains X and Y. The first setting is supervised, where example image pairs (x, y) are available. This means that in the training data, for each image x_i ∈ X there is a corresponding y_i ∈ Y, and we wish to find a translator G : X → Y such that G(x_i) ≈ y_i. Representative translation systems in the supervised setting include domain-specific works [52, 77, 110, 167, 128, 201, 207, 218] and the more general Pix2Pix [85, 199]. However, paired training data comes at a premium. For example, for image stylization, obtaining paired data requires lengthy artist authoring and is extremely expensive. For other tasks like object transfiguration, the desired output is not even well defined.
Therefore, we focus on the second setting, which is unsupervised image translation. In the unsupervised setting, X and Y are two independent sets of images, and we do not have access to paired examples showing how an image x_i ∈ X could be translated to an image y_i ∈ Y. Our task is then to seek an algorithm that can learn to translate between X and Y without desired input-output examples. The unsupervised image translation setting has greater potential because of its simplicity and flexibility, but it is also much more difficult. In fact, it is a highly under-constrained and ill-posed problem, since there could be an unlimited number of mappings between X and Y: from the probabilistic view, the challenge is to learn a joint distribution of images in the two domains. As stated by the coupling theory [119], there exists an infinite set of joint distributions that can yield the two given marginal distributions in the two domains. Therefore, additional assumptions and constraints are needed for us to exploit the structure and supervision necessary to learn the mapping.
Existing works that address this problem assume that there are certain relationships between the two domains. For example, CycleGAN [225] assumes cycle-consistency and the existence of an inverse mapping F that translates from Y to X. It then trains two generators which are bijections and inverses of each other, and uses an adversarial constraint [66] to ensure the translated image appears to be drawn from the target domain and a cycle-consistency constraint to ensure the translated image can be mapped back to the original image using the inverse mapping (F(G(x)) ≈ x and G(F(y)) ≈ y). UNIT [122], on the other hand, assumes a shared latent space, meaning a pair of images in different domains can be mapped to a shared latent representation. The model trains two generators G_X, G_Y with shared layers. Both G_X and G_Y map an input to itself, while the domain translation is realized by letting x_i go through part of G_X and part of G_Y to get y_i. The model is trained with an adversarial constraint on the image, a variational constraint on the latent code [100, 155], and another cycle-consistency constraint.
Assuming cycle consistency ensures a one-to-one mapping and avoids mode collapse [164], and both models generate reasonable image translation and domain adaptation results. However, there are several issues with existing approaches. First, such approaches are usually agnostic to the subjects of interest, and there is little guarantee of reaching the desired output. In fact, approaches based on cycle-consistency [225, 121] could theoretically find any arbitrary one-to-one mapping that satisfies the constraints, which renders the training unstable and the results random. This is problematic in many image translation scenarios. For example, when translating a horse image to a zebra image, most likely we only wish to draw the particular black-and-white stripes on top of the horses while keeping everything else unchanged. However, what we observe is that existing approaches [225, 122] do not differentiate the horse/zebra from the scene background, and the colors and appearance of the background often change significantly during translation (Fig. 4.1). Second, most of the time we only care about one-way translation, while existing methods like CycleGAN [225] and UNIT [121] always require training two generators of bijections. This is not only cumbersome, but it is also hard to balance the effects of the two generators. Third, there is a sensitive trade-off between the faithfulness of the translated image to the input image and how closely it resembles the new domain, and it requires excessive manual tuning of the weight between the adversarial loss and the reconstruction loss to get satisfying results.
To address the aforementioned issues, we propose a simpler yet more effective image translation model that consists of a single generator with an attention module. We first re-consider what the desired outcome of an image translation task should be: most of the time, the desired output should not only resemble the target domain but also preserve certain attributes and share a similar visual appearance with the input. For example, in the case of horse-zebra translation [225], the output zebra should be similar to the input horse in terms of the scene background, the location and the shape of the zebra and horse, etc. In the domain adaptation task that translates MNIST [111] to USPS [36], we expect the output to be visually similar to the input in terms of the shape and structure of the digit such that it preserves the label. Based on this observation, our model uses a single generator that maps X to Y and is trained with a self-regularization term that enforces perceptual similarity between the output and the input, together with an adversarial term that enforces the output to appear as if drawn from Y. Furthermore, in order to focus the translation on key components of the image and avoid introducing unnecessary changes to irrelevant parts, we propose to add an attention module that predicts a probability map indicating which parts of the image the model needs to attend to when translating. Such probability maps, which are learned in a completely unsupervised fashion, could further facilitate segmentation or saliency detection (Fig. 4.1). Third, we propose an automatic and principled way to find the optimal weight between the self-regularization term and the adversarial term, so that we do not have to manually search for the best hyper-parameter.
Our model does not rely on the cycle-consistency or shared-representation assumption, and it only learns a one-way mapping. Although the constraint may over-simplify certain scenarios, we found that the model works surprisingly well. With the attention module, our model learns to detect the key objects from the background context and is able to correct artifacts and remove unwanted changes from the translated results. We apply our model to a variety of image translation and domain adaptation tasks and show that our model is not only simpler but also works better than existing methods, achieving superior qualitative and quantitative performance. To demonstrate its application to real-world tasks, we show that our model can be used to improve the accuracy of face 3D morphable model [16] prediction by augmenting the training data of real images with adapted synthetic images.
4.2 Related Work
Generative adversarial networks (GANs) Using the GAN framework [66] for generative image modeling and synthesis has made remarkable progress recently. The basic idea of GAN training is to train a generator and a discriminator jointly such that the generator produces realistic images that confuse the discriminator. It is known that the vanilla GAN suffers from instability in training. Several techniques have been proposed to stabilize the training process and enable it to scale to higher-resolution images, such as DCGAN [152], energy-based GAN [221], Wasserstein GAN (WGAN) [164, 6], WGAN-GP [70], BEGAN [13], LSGAN [135] and Progressive GAN [92]. In our work, adversarial training is the fundamental element which ensures that the output sample from the generator appears as if drawn from the target domain.
Image translation Image translation can be seen as generating an image in a target domain conditioned on an image in the source domain. Similar problems of conditional image generation include text-to-image translation [217, 153], super-resolution [95, 41, 112], style transfer [63, 87, 115, 80], etc. Based on the availability of paired training data, image translation can be either supervised (paired) or unsupervised (unpaired). Isola et al. [85] first propose a unified framework called Pix2Pix for paired image-to-image translation based on conditional GANs. Wang et al. [199] further extend the framework to generate high-resolution images by using deeper, multi-scale networks and improved training losses. [54] uses a variational U-Net instead of a GAN for conditional image generation. UNIT [121] and BicycleGAN [226] incorporate latent code embedding into existing frameworks and enable generating randomly sampled translation results. On the other hand, when paired training data is not available, additional constraints such as a cycle-consistency loss are employed [225, 81]. Such a constraint enforces an image to map to another domain and back to itself to ensure a one-to-one mapping between the two domains. However, such techniques heavily rely on the "laziness" of the generator network and often introduce artifacts or unwanted changes to the results. Our model leverages recent advances in neural network training and employs a perceptual loss [87, 219] as self-regularization, such that cycle-consistency becomes unnecessary and we can also obtain more accurate translation results.
Attention Recently, attention mechanism has been successfully introduced in many
applications in computer vision and language processing, e.g., image captioning [209],
text to image generation [210], visual question answering [208], saliency detection
[106], machine translation [8] and speech recognition [30]. Attention mechanism
helps models to focus on the relevant portion of the input to resolve the corresponding
output without any supervision. In machine translation [8], it attends on relevant words
in the source language to predict the current output in the target language. To generate
an image from text [210], it attends on different words for the corresponding sub-region
of the image. Conversely, for image captioning [209], image sub-regions are attended to when generating the next word. In the same spirit, we propose to use an attention module
to attend to the region of interest for the image translation task in an unsupervised way.
4.3 Our Method
We begin by explaining our model for unsupervised image translation. Let X and Y be two image domains; our goal is to train a generator G_\theta : X → Y, where \theta are the function parameters. For simplicity, we omit \theta and use G instead.
Figure 4.2: Model overview. Our generator G consists of a vanilla generator G_0 and an attention branch G_{attn}. We train the model using a self-regularization perceptual loss and an adversarial loss.
We are given unpaired samples x ∈ X and y ∈ Y, and the unsupervised setting assumes that x and y are independently drawn from the marginal distributions P_{x∼X}(x) and P_{y∼Y}(y). Let y' = G(x) denote the translated image; the key requirement is that y' should appear as if drawn from domain Y, while preserving the low-level visual characteristics of x. The translated images y' can further be used for other downstream tasks such as unsupervised learning. However, in our case, we decouple image translation from its applications.
Based on the requirements described, we propose to learn θ by minimizing the following loss:

L_G = ℓ_adv(G(x), Y) + λ ℓ_reg(x, G(x)).    (4.1)
Here G(x) = G_attn(x) ⊙ G_0(x) + (1 − G_attn(x)) ⊙ x, where ⊙ denotes element-wise multiplication, G_0 is the vanilla generator and G_attn is the attention branch. G_0 outputs a translated image, while G_attn predicts a probability map that is used to composite G_0(x) with x to get the final output. The first part of the loss, ℓ_adv, is the adversarial loss on the image domain that makes sure that G(x) appears to come from domain Y. The second part of the loss, ℓ_reg, makes sure that G(x) is visually similar to x. In our case, ℓ_adv is given by a discriminator D trained jointly with G, and ℓ_reg is measured with a perceptual loss. We illustrate the model in Fig. 4.2.
The model architectures: Our model consists of a generator G and a discriminator D. The generator G has two branches: the vanilla generator G_0 and the attention branch G_attn. G_0 translates the input x as a whole to generate a similar image G_0(x) in the new domain, and G_attn predicts a probability map G_attn(x) as the attention mask. G_attn(x) has the same size as x, and each pixel is a probability value between 0 and 1. In the end, we composite the final image G(x) by combining x and G_0(x) based on the attention mask.
G_0 is based on a Fully Convolutional Network (FCN) and leverages properties of convolutional neural networks such as translation invariance and parameter sharing. Similar to [85, 225], the generator G is built with three components: a down-sampling front-end to reduce the size, followed by multiple residual blocks [75], and an up-sampling back-end to restore the original dimensions. The down-sampling front-end consists of two convolutional blocks, each with a stride of 2. The intermediate part contains nine residual blocks that keep the height/width constant, and the up-sampling back-end consists of two deconvolutional blocks, also with a stride of 2. Each convolutional layer is followed by batch normalization and ReLU activation, except for the last layer, whose output is in the image space. Using down-sampling at the beginning increases the receptive field of the residual blocks and makes it easier to learn the transformation at a smaller scale. Another modification is that we adopt dilated convolutions in all residual blocks, with the dilation factor set to 2. Dilated convolutions use spaced kernels, enabling each output value to be computed with a wider view of the input without increasing the number of parameters or the computational burden. G_attn consists of the initial layers of the VGG-19 network [171] (up to conv3_3), followed by two deconvolutional blocks and a final convolutional layer with sigmoid activation that outputs a single-channel probability map. During training, the VGG-19 layers are warm-started with weights pretrained on ImageNet [162].
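To make the architecture concrete, the following PyTorch-style sketch assembles the vanilla generator (two stride-2 down-sampling blocks, nine dilated residual blocks, two stride-2 up-sampling blocks) and the VGG-19-based attention branch described above. It is a minimal sketch under stated assumptions: the 4×4 kernels, channel widths and the final Tanh are our choices, not the exact configuration used in this work.

import torch.nn as nn
import torchvision

class ResBlock(nn.Module):
    # Residual block with dilation factor 2; height/width are kept constant.
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=2, dilation=2), nn.BatchNorm2d(ch), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, padding=2, dilation=2), nn.BatchNorm2d(ch))
    def forward(self, x):
        return x + self.body(x)

class VanillaGenerator(nn.Module):
    # G_0: down-sample twice (stride 2), nine dilated residual blocks, up-sample twice.
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 4, 2, 1), nn.BatchNorm2d(ch), nn.ReLU(True),
            nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.BatchNorm2d(ch * 2), nn.ReLU(True),
            *[ResBlock(ch * 2) for _ in range(9)],
            nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1), nn.BatchNorm2d(ch), nn.ReLU(True),
            nn.ConvTranspose2d(ch, 3, 4, 2, 1), nn.Tanh())   # final Tanh is an assumption
    def forward(self, x):
        return self.net(x)

class AttentionBranch(nn.Module):
    # G_attn: VGG-19 features up to conv3_3, two up-sampling blocks, sigmoid mask.
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features
        self.backbone = nn.Sequential(*list(vgg.children())[:16])   # through conv3_3 + ReLU
        self.head = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(True),
            nn.Conv2d(64, 1, 3, padding=1), nn.Sigmoid())
    def forward(self, x):
        return self.head(self.backbone(x))

def compose(x, g0, g_attn):
    # Final output: G(x) = A(x) * G0(x) + (1 - A(x)) * x.
    a = g_attn(x)
    return a * g0(x) + (1 - a) * x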
For the discriminator, we use a five-layer convolutional network. The first three layers have a stride of 2 and are followed by two convolution layers with stride 1, which effectively down-samples the input three times. The output is a vector of real/fake predictions, each value corresponding to a patch of the image. Classifying each patch as real/fake follows the PatchGAN design, which has been shown to work better than a global GAN [225, 85].
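A matching PatchGAN-style discriminator (three stride-2 layers followed by two stride-1 layers, one real/fake score per patch) could be sketched as below; the 4×4 kernels and channel widths are assumptions.

import torch.nn as nn

class PatchDiscriminator(nn.Module):
    # Five convolutional layers; the 1-channel output holds one score per image patch.
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch * 2, ch * 4, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch * 4, ch * 8, 4, 1, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch * 8, 1, 4, 1, 1))
    def forward(self, x):
        return self.net(x)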
Adversarial loss: A Generative Adversarial Network [66] plays a two-player min-max game to update the networks G and D. G learns to translate the image x to G(x) so that it appears to come from Y, while D learns to distinguish G(x) from y, a real image drawn from Y. The parameters of D and G are updated alternately. The discriminator D updates its parameters by minimizing the following objective:

L_D = −log(D(y)) − log(1 − D(G(x))).    (4.2)

The adversarial loss used to update the generator G is defined as:

ℓ_adv(G(x), Y) = −log(D(G(x))).    (4.3)

By minimizing this loss, the generator G learns to create translated images that fool the network D into classifying them as drawn from Y.
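In code, the updates in Eqs. (4.2) and (4.3) correspond to standard binary cross-entropy on the discriminator outputs. The sketch below is illustrative only and assumes the discriminator returns raw logits, so that the sigmoid of its output plays the role of D(·) in the equations.

import torch
import torch.nn.functional as F

def d_loss(D, G, x, y):
    # Eq. (4.2): minimize -log D(y) - log(1 - D(G(x))); gradients do not flow into G.
    fake = G(x).detach()
    real_logits, fake_logits = D(y), D(fake)
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

def g_adv_loss(D, G, x):
    # Eq. (4.3): minimize -log D(G(x)); gradients flow into G only.
    fake_logits = D(G(x))
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))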
Self-regularization loss: Theoretically, adversarial training can learn a mapping G that produces outputs identically distributed as the target domain Y. However, if the capacity is large enough, a network can map the input images to any random permutation of images in the target domain. Thus, adversarial losses alone cannot guarantee that the learned function G maps the input to the desired output. To further constrain the learned mapping such that it is meaningful, we argue that G should preserve the visual characteristics of the input image. In other words, the output and the input need to share perceptual similarities, especially regarding the low-level features. Such features may include color, edges, shape, objects, etc. We impose this constraint with the self-regularization term, which is modeled by minimizing the distance between the translated image y′ and the input x: ℓ_reg = d(x, G(x)). Here d is some distance function, which can be ℓ2, ℓ1, SSIM, etc. However, recent research suggests that a perceptual distance based on a pre-trained network corresponds much better to human perception of similarity than traditional distance measures [219]. In particular, we define the perceptual loss as:
ℓ_reg(G(x), x) = Σ_{l=1,2,3} (1 / (H_l W_l)) Σ_{h,w} ‖ w_l ( F̂(x)^l_{hw} − F̂(G(x))^l_{hw} ) ‖²_2.    (4.4)
Here F̂ is a VGG network pretrained on ImageNet, used to extract the neural features; we use l to index each layer, and H_l, W_l are the height and width of the feature map F̂^l. We extract neural features with F̂ across multiple layers, compute the ℓ2 difference at each location (h, w) of F̂^l, and average over the feature height and width. We then scale it with the layer-wise weight w_l. We did extensive experiments with different combinations of feature layers and obtained the best results by only using the first three layers of VGG and setting w_1, w_2, w_3 to 1.0/32, 1.0/16, 1.0/8, respectively. This conforms to the intuition that we would like to preserve the low-level traits of the input during translation. Note that this may not always be desirable (such as in texture transfer), but it is a hyper-parameter that can be easily adjusted for different problem settings. We also experimented with using different pre-trained networks such as AlexNet to extract neural features, as suggested by [219], but did not observe much difference in the results.
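The self-regularization term of Eq. (4.4) can be sketched as below, with the layer weights 1/32, 1/16, 1/8. Which activations count as the "first three layers" of VGG-19 is our interpretation (here the ReLU outputs of conv1_1, conv1_2 and conv2_1), and the inputs are assumed to be already ImageNet-normalized.

import torchvision

_vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features.eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

_LAYERS = [1, 3, 6]                        # ReLU indices of conv1_1, conv1_2, conv2_1 (assumed)
_WEIGHTS = [1.0 / 32, 1.0 / 16, 1.0 / 8]

def perceptual_loss(x, gx):
    # Weighted squared feature differences, summed over channels, averaged over locations.
    loss, fx, fg, idx = 0.0, x, gx, 0
    for i, layer in enumerate(_vgg):
        fx, fg = layer(fx), layer(fg)
        if i == _LAYERS[idx]:
            diff = _WEIGHTS[idx] * (fx - fg)
            loss = loss + (diff ** 2).sum(dim=1).mean()
            idx += 1
            if idx == len(_LAYERS):
                break
    return loss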
Training scheme: In our experiments, we found that training the attention branch and the vanilla generator branch jointly is difficult, as it is hard to balance the learned translation and the mask. In practice, we train the two branches separately. First, we train the vanilla generator G_0 without the attention branch. After it converges, we train the attention branch G_attn while keeping the trained generator G_0 fixed. In the end, we jointly fine-tune them with a smaller learning rate.
Adaptive weight induction: As in other image translation methods, there is a trade-off between resemblance to the new domain and faithfulness to the original image. In our model, it is determined by the weight λ of the self-regularization term relative to the adversarial term. If λ is too large, the translated image will be close to the input but will not look like the new domain. If λ is too small, the translated image will fail to retain the visual traits of the input. Previous approaches usually set this weight heuristically. Here we propose an adaptive scheme to search for the best λ: we start by setting λ = 0, which means we only use the adversarial constraint to train the generator. Then we gradually increase λ. This leads to a decrease of the adversarial loss, as the output shifts away from Y towards X, which makes it easier for D to classify. We stop increasing λ when the adversarial loss sinks below some threshold ℓ_adv^t. We then keep λ constant and continue to train the network until convergence. Using the adaptive weight induction scheme avoids manual tuning of λ for each specific task and gives results that are both similar to the input x and to the new domain Y. Note that we repeat this process both when training G_0 and when training G_attn.
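The adaptive weight induction can be viewed as a simple schedule wrapped around the usual alternating updates. In the sketch below, g_step, d_step and adv_loss_value are placeholder callables for one generator update (using ℓ_adv + λ ℓ_reg), one discriminator update, and the current adversarial loss; the step size and threshold are illustrative values, not ones used in this work.

def train_with_adaptive_lambda(num_iters, g_step, d_step, adv_loss_value,
                               lam_step=0.01, adv_threshold=0.3):
    # Start from lam = 0 (adversarial constraint only) and gradually increase it;
    # freeze lam once the adversarial loss sinks below the threshold, then keep training.
    lam, frozen = 0.0, False
    for _ in range(num_iters):
        d_step()
        g_step(lam)
        if not frozen:
            lam += lam_step
            if adv_loss_value() < adv_threshold:
                frozen = True
    return lam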
Analysis: Our model is related to CycleGAN in that, if we assume a one-to-one mapping, we can define an inverse mapping F : Y → X such that F(G(x)) = x. This satisfies the constraints of CycleGAN in that the cycle-consistency loss is zero, which shows that our learned mapping belongs to the set of possible mappings given by CycleGAN. On the other hand, although CycleGAN tends to learn a mapping such that the visual distance between y′ and x is small, possibly due to the cycle-consistency constraint, it does not guarantee to minimize the perceptual distance between G(x) and x. Compared with UNIT, if we add another constraint that G(y) = y, then our model is a special case of the UNIT model where all layers of the two generators are shared, which leads to a single generator G. In this case, the cycle-consistency constraint is implicit, as G(G(x)) = G(x) and min d(x, G(x)) = min d(x, G(G(x))). However, we observe that adding the additional self-mapping constraint for domain Y does not improve the results.
Even though our approach assumes that the perceptual distance between x and its corresponding y ∈ Y is small, it generalizes well to tasks where the input and output domains are significantly different, such as translation of photo to map or day to night, as long as the assumption generally holds. For example, in the case of photo to map, a park (photo) is labeled as green (map) and water (photo) is labeled as blue (map), which provides certain low-level similarities. Experiments show that even without the attention branch, our model produces results consistently similar to or better than those of other methods. This indicates that the cycle-consistency assumption may not be necessary for image translation. Note that our approach is a meta-algorithm, and we could potentially improve the results by using newer or more advanced components. For example, the generator and discriminator could easily be replaced with the latest GAN architectures such as LSGAN [136] or WGAN-GP [70], or by adding spectral normalization [138]. We may also improve the results by employing a more specific self-regularization term that is fine-tuned on the datasets we work on.
4.4 Results
We tested our model on a variety of datasets and tasks. In the following, we show the
qualitative results of image translation, as well as quantitative results in several domain
adaptation settings.

Figure 4.3: Image translation results of horse to zebra [85] and comparison with UNIT [121] and CycleGAN [225]. Columns: (a) input, (b) initial translation, (c) attention map, (d) final result, (e) UNIT, (f) CycleGAN.

In our experiments, all images are resized to 256×256. We use
Adam solver [98] to update the model weights during training. In order to reduce model
oscillation, we update the discriminators using a history of generated images rather than
the ones produced by the latest generative models [168]: we keep an image buffer that
stores the 50 previously generated images. All networks were trained from scratch with
a learning rate of 0.0002. Starting from the 5k-th iteration, we linearly decay the learning rate
over the remaining 5k iterations. Most of our training takes about 1 day to converge on
a single Titan X GPU.
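The history buffer of 50 generated images used to stabilize the discriminator can be implemented in a few lines. The 50%-replacement policy below follows the common implementation of [168] and is an assumption beyond the buffer size stated above.

import random
import torch

class ImageHistoryBuffer:
    # Stores up to `size` previously generated images; with probability 0.5 a new fake
    # is swapped with a stored one before being shown to the discriminator.
    def __init__(self, size=50):
        self.size, self.images = size, []

    def query(self, fakes):
        out = []
        for img in fakes:                        # iterate over the batch dimension
            img = img.detach().unsqueeze(0)
            if len(self.images) < self.size:
                self.images.append(img)
                out.append(img)
            elif random.random() < 0.5:
                i = random.randrange(self.size)
                out.append(self.images[i])
                self.images[i] = img
            else:
                out.append(img)
        return torch.cat(out, dim=0)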
4.4.1 Qualitative Results
Figure 4.4: Image translation results on more datasets. Columns: (a)/(e) input, (b)/(f) initial translation, (c)/(g) attention map, (d)/(h) final result. From top to bottom: apple to orange [85], dog to cat [147], photo to DSLR [85], Yosemite summer to winter [85].

Figure 4.5: More image translation results. From left to right: edges to shoes [85]; edges to handbags [85]; SYNTHIA to Cityscapes [161, 33]. Given that the source and target domains are globally different, the initial translation and final result are similar, with the attention maps focusing on the entire images.
Fig. 4.3 shows visual results of image translation of horse to zebra. For each image, we show the initial translation G_0(x), the attention map G_attn(x), and the final result G(x) composited from G_0(x) and x based on G_attn(x). We also compare the results with CycleGAN [225] and UNIT [121]; all models are trained using the same number of iterations. For the baselines, we use the original authors' implementations. We can see from the examples that without the attention branch, our simple translation model G_0 already gives results similar to or better than [225, 121]. However, all these results suffer from perturbations of background color/texture and artifacts near the region of interest. With the predicted attention map, which learns to segment the horses, our final results have much higher visual quality, with the background kept untouched and artifacts near the ROI removed (rows 2, 4). Complete results of horse-zebra translations and comparisons are available online at http://www.harryyang.org/img_trans.
Fig. 4.4 shows more results on a variety of datasets. We can see that for all these tasks, our model can learn the region of interest and generate compositions that are not only more faithful to the input but also have fewer artifacts. For example, in dog to cat translation, we notice most attention maps have large values around the eyes, indicating that the eyes are a key ROI for differentiating cats from dogs. In the photo to DSLR examples, the ROI should be the background that we wish to defocus, while the initial translation changes the color of the foreground flower in the photo. The final result, on the other hand, learns to keep the color of the foreground flower. In the second example of summer to winter translation, we notice the initial result incorrectly changes the color of the person. With the guidance of the attention map, the final result removes such artifacts.

In a few scenarios, the attention map is less useful, as the image does not explicitly contain a region of interest and should be translated everywhere. In this case, the composited results largely rely on the initial prediction given by G_0. This is true for tasks like edges to shoes/handbags, SYNTHIA to Cityscapes (Fig. 4.5) and photo to map (Fig. 4.9). Although many of these tasks have very different source and target domains, our method is general and can be applied to obtain satisfying results.
To better demonstrate the effectiveness of our simple model, Fig. 4.6 shows several results before training with the attention branch and compares them with the baselines. We can see that even without the attention branch, our model generates better qualitative results than CycleGAN and UNIT (more samples of photo to Van Gogh are available online at http://www.harryyang.org/img_trans/vangogh).

Figure 4.6: Comparing our results without attention with baselines. Columns: (a) input, (b) CycleGAN, (c) UNIT, (d) ours w/o attention. From top to bottom: dawn to night (SYNTHIA [161]), non-smile to smile (CelebA [124]) and photos to Van Gogh [85].
Figure 4.7: Failure case of the attention map: it did not detect the ROI correctly and removed the zebra stripes.

Figure 4.8: Effects of using different layers as feature extractors. From left to right: (a) input, (b) using the first two layers of VGG, (c) using the last two layers of VGG, and (d) using the first three layers of VGG.

User study: To evaluate the performance more rigorously, we performed a user study to compare the results. The procedure is as follows: we collected feedback from 22 users (all graduate students and researchers). Each user is given 30 sets of images to compare. Each set has 5 images: the input, the initial result (w/o attention), the final result (with attention), the CycleGAN result and the UNIT result. In total there are 300 different image sets randomly selected from the horse to zebra and photo to Van Gogh translation tasks. The images in each set are shown in random order. The user is then asked to rank the four results from highest visual quality to lowest. The user is fully informed about the task and is aware that the goal is to translate the input image into a new domain while avoiding unnecessary changes.
Table 4.1 shows the user-study results. We list the comparisons of CycleGAN vs. ours initial/final, UNIT vs. ours initial/final, and ours initial vs. ours final. We can see that our results, even without applying the attention branch (ours initial), achieve higher ratings than CycleGAN or UNIT. The attention branch further significantly improves the results (ours final). In terms of directly evaluating the effect of the attention branch, ours final is overwhelmingly preferred over ours initial based on user rankings (Table 4.1, row 5). We further examined the few cases where the attention results received lower scores and found that the reason is incorrect attention maps (Fig. 4.7).
Method 1        Method 2        1 better    About same    2 better
Ours initial    CycleGAN        43.6%       30.0%         26.4%
Ours initial    UNIT            77.4%       17.5%         5.1%
Ours final      CycleGAN        63.0%       21.9%         15.1%
Ours final      UNIT            83.8%       14.4%         1.8%
Ours final      Ours initial    74.2%       18.5%         7.3%
Table 4.1: User study results.
Figure 4.9: Unsupervised map prediction visualization. From left to right: input, Pix2Pix, CycleGAN, ours, ground truth (GT).

Method            Accuracy
Pix2Pix [85]      43.18%
CycleGAN [225]    45.91%
Ours              46.72%
Table 4.2: Unsupervised map prediction accuracy.

Figure 4.10: Visualization of image translation from MNIST (a),(d) to USPS (b),(e) and MNIST-M (c),(f).

Method            USPS      MNIST-M
CoGAN [123]       95.65%    -
PixelDA [17]      95.90%    98.20%
UNIT [122]        95.97%    -
CycleGAN [225]    94.28%    93.16%
Target-only       96.50%    96.40%
Ours              96.80%    98.33%
Table 4.3: Unsupervised classification results.
Figure 4.11: Visualization of rendered face to real face translation. (a)(d): input rendered faces; (b)(e): CycleGAN results; (c)(f): our results.

Method            MSE
Baseline          2.26
CycleGAN [225]    2.04
Ours              1.97
Table 4.4: Unsupervised 3DMM prediction results (MSE).
Effects of using different layers as feature extractors: We experimented with using different layers of VGG-19 as feature extractors to measure the perceptual loss. Fig. 4.8 shows visual examples of horse to zebra translation results trained with different perceptual terms. We can see that only using high-level features as regularization leads to results that are almost identical to the input (Fig. 4.8 (c)), while only using low-level features as regularization leads to results that are blurry and noisy (Fig. 4.8 (b)). We find a balance by adopting the first three layers of VGG-19 as the feature extractor, which does a good job of image translation while avoiding the introduction of too much noise or too many artifacts (Fig. 4.8 (d)).
4.4.2 Quantitative Results
Map prediction: We translate images from satellite photos to maps with unpaired training data and compute the pixel accuracy of the predicted maps. The original photo-map dataset consists of 1096 training pairs and 1098 testing pairs, where each pair contains a satellite photo and the corresponding map. To enable unsupervised learning, we take the 1096 photos from the training set and the 1098 maps from the test set and use them as the training data. Note that no attention is used here, since the change is global and we observe that training with attention yields similar results. At test time, we translate the test-set photos to maps and compute the accuracy. If the total RGB difference between the color of a pixel on the predicted map and that on the ground truth is larger than 12, we mark the pixel as wrong. Figure 4.9 and Table 4.2 show the visual results and the accuracy, and we can see that our approach achieves the highest map prediction accuracy. Note that Pix2Pix is trained with paired data.
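The accuracy metric described above (a pixel is correct if its total RGB difference to the ground-truth map is at most 12) can be computed as in the following sketch, assuming 8-bit RGB arrays.

import numpy as np

def map_pixel_accuracy(pred, gt, threshold=12):
    # pred, gt: uint8 arrays of shape (H, W, 3) with the predicted and ground-truth maps.
    diff = np.abs(pred.astype(np.int32) - gt.astype(np.int32)).sum(axis=-1)
    return float((diff <= threshold).mean())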
Unsupervised classification: We show unsupervised classification results on USPS [36] and MNIST-M [56] in Figure 4.10 and Table 4.3. On both tasks, we assume we have access to the labeled MNIST dataset. We first train a generator that maps MNIST to USPS or MNIST-M and then use the translated images and the original labels to train the classifier (we do not apply the attention branch here, as we did not observe much difference after training with attention). We can see from the results that we achieve the highest accuracy on both tasks, advancing the state of the art. The qualitative results clearly show that our MNIST-translated images both preserve the original label and are visually similar to USPS/MNIST-M. We also notice that our model achieves even better results than the model trained on target labels and conjecture that the classifier benefits from the larger training set size of the MNIST dataset.
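The adaptation protocol above amounts to translating each labeled MNIST batch into the target style and training a standard digit classifier on the translated images, as in this sketch; translator and classifier are placeholders for the trained generator G and any digit classifier.

import torch

def train_adapted_classifier(translator, classifier, mnist_loader, optimizer, epochs=10):
    # Translate labeled MNIST images to the target domain (USPS or MNIST-M) and
    # train the classifier on the translated images with the original labels.
    criterion = torch.nn.CrossEntropyLoss()
    translator.eval()
    for _ in range(epochs):
        for images, labels in mnist_loader:
            with torch.no_grad():
                translated = translator(images)
            loss = criterion(classifier(translated), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier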
3DMM face shape prediction: As a real-world application of our approach, we study the problem of estimating 3D face shape, which is modeled with the 3D morphable model (3DMM) [14]. 3DMM is widely used for recognition and reconstruction. For a given face, the model encodes its shape with a 100-dimensional vector. The goal of 3DMM regression is to predict this 100-dimensional vector, and we compare the predictions with the ground truth using mean squared error (MSE). [189] proposes to train a very deep neural network [75] for 3DMM regression. However, in reality, labeled training data for real faces are expensive to collect. We propose to use rendered faces instead, as their 3DMM parameters are readily available. We first rendered 200k faces as the source domain and use 645 human selfie face images we collected as the target domain. For testing, we use 112 3D-scanned faces that we collected. For the purpose of domain adaptation, we first use our model to translate the rendered faces to real faces and use the results as the training data, assuming the 3DMM parameters stay unchanged. The 3DMM regression model is a 102-layer ResNet [75] as in [189], trained with the translated faces. Figure 4.11 and Table 4.4 show the qualitative results and the final accuracy of 3DMM regression. From the visual results, we see that our translated faces preserve the shape of the original rendered faces and have higher quality than those of CycleGAN. We also reduce the 3DMM regression error compared with the baseline (trained on rendered faces and tested on real faces) and with the CycleGAN results.
4.5 Conclusion
We propose to use a simple model with attention for image translation and domain adaptation and achieve superior performance in a variety of tasks, as demonstrated by both qualitative and quantitative measures. The attention module is particularly helpful for focusing the translation on the region of interest, removing unwanted changes or artifacts, and may also be used for unsupervised segmentation or saliency detection. Extensive experiments show that our model is both powerful and general and can easily be applied to solve real-world problems.
Chapter 5
Towards Disentangled Representations
for Human Retargeting by Multi-view
Learning
5.1 Introduction
We consider the problem of learning disentangled representations for sensory data across multiple domains. Ideally, such representations are interpretable and can separate factors of variation. Given that such factors can be domain-specific (such as identities, categories, etc.) or domain-invariant (such as shared expressions, poses, illuminations, etc.), a simpler (and more practical) question to ask is how to separate the domain-invariant factors from the domain-specific factors. This could potentially enable cross-domain image translation or in-domain attribute interpolation and manipulation. The problem then boils down to learning an underlying representation that is invariant to domain changes yet encompasses the entirety of the other factors that describe the data. Unfortunately, without explicit correspondences or prior knowledge, such representations are inherently ambiguous [178].
In this work, we are specifically interested in learning identity-invariant representa-
tions from human captures. The data could come from different sources such as cam-
eras [126] or RGBD sensors [91]. Learning identity-invariant representations is one of the critical steps towards understanding the underlying factors describing human appearances and movements, and could foster applications on multiple fronts. In more detail, our motivations to tackle this problem are: i) such datasets are common and relatively easy to acquire without requiring additional annotations, and can serve as ideal testbeds for generative models; ii) it enables human face/body retargeting and synthesis, which have interesting real-world applications [187, 186, 94, 143]; and iii) the learned identity-invariant representations can be used to train other downstream tasks such as expression/pose detection or 3D avatar driving [126, 116].

Figure 5.1: Visual results of face (with head-mounted displays) and body retargeting. The leftmost column is the input, and the right four columns are the output after changing the identity labels. After learning the disentangled representations of the expressions and poses from unannotated data, we can switch the identities and keep the expressions or poses the same.
Recently, deep learning techniques have led to highly successful image translation and feature disentanglement. Many existing works address cross-domain image translation without explicitly modeling the underlying factors that explain the images [225, 199, 86, 5]. The limitation is that it is impossible to directly manipulate the attributes or use the learned representations as a proxy for other tasks. Other techniques that jointly learn representation disentanglement and image translation require paired data during training [65] to marginalize and factor out independent factors. Several works have been proposed to eliminate the need for paired training data by using adversarial training or the cycle-consistency assumption. However, there is little guarantee that content and style are separated as desired, as the networks can lose essential intra-domain information [113, 137, 122, 81]. One of the works most similar to ours is [120], which augments a CVAE with adversarial training to achieve multi-domain image translation and learn domain-invariant representations. However, our experiments show that intra-domain semantic information such as poses or expressions can easily be lost if we apply an adversarial loss on the latent space.
We observe that a successful representation should meet two requirements. First, it needs to capture intra-domain variances between data within the same domain, and second, it needs to ignore inter-domain variances of data across different domains. In the case of face retargeting, this means that different people making the same expressions should have identical representations, while different expressions should have different representations. A straightforward approach is to learn a variational auto-encoder conditioned on domain labels (CVAE) [172, 211], such that the intermediate latent codes are expected to leave out the domain information. However, as CVAEs achieve this factorization by applying a prior over the posterior, in the form of a Kullback-Leibler divergence with a pre-specified prior distribution [78], this formulation typically trades off domain-invariant consistency against fidelity of reconstruction, depending on the regularization weight chosen. We observe that this trade-off becomes a more severe issue when augmenting the formulation with adversarial regularization [120].
To address this challenge, we notice that for many real-world problems we often have access to multi-view information that can be used. For example, it is easy to detect facial keypoints or body poses from images. The advantage of such auxiliary information is that, by construction, it already encodes semantic information about the visual phenomena, as it is human-defined and annotated. We show that it is easier to learn identity-invariant representations from such simplified data. However, the identity-invariant representations learned from keypoints or poses alone do not contain all the information needed to reconstruct the images; we may lose information such as facial details or global illumination. Therefore, it is critical to extract information from both the image data and the auxiliary data. Our core idea is to leverage the auxiliary data to simplify the task of learning domain-invariant representations from the images. We propose two variants of models that achieve this goal, showing that using multi-view information such as keypoints or poses is useful and can lead to better disentangled representations and image reconstructions.
We demonstrate the effectiveness of our approach on several datasets. The first one is
the large-scale head-mounted captures (HMC) data that we collected. The data consists
of video sequences of more than 100 identities wearing virtual reality (VR) headsets,
each making a variety of facial expressions. The goal is to learn an identity-invariant
representation that encodes the facial expressions. This can be applied to drive 3D
avatars in VR in real time when wearing the headset. We also test our algorithm on the
CMU Panoptic dataset [91] that consists of 30 people making different poses. Similarly,
the goal is to learn an identity-invariant representation that contains body pose infor-
mation. We evaluate our approach by plotting t-SNE [132] of our latent code and also
measuring the reconstruction quality and the semantic correctness. We also compare
our method against several baselines and existing techniques to show its advantage.
We summarize our contributions as follows:
1. We propose two simple CVAE-based frameworks that use multi-view information to learn domain-invariant representations. Our approach (as well as our implementation of the CVAE) uses domain labels as privileged information. In this way, our inference network can still compute representations for unseen images without domain labels.
2. We enable large-scale image translation across multiple domains and achieve state-of-the-art results.
3. We show several novel applications in human retargeting and VR.
5.2 Related Work
Disentangled Representation The problem of learning disentangled representations has long been studied, with a vast literature. Before deep learning, [182, 183] used bilinear models to separate content and style, and [79] used convolution filters to learn a hierarchy of sparse features that is invariant to small shifts and distortions. Many recent developments on representation disentanglement are based on models like the generative adversarial network (GAN) [66, 152] and the variational auto-encoder (VAE) [100]. Representative works include pixel-level domain separation [18] and adaptation [17], video disentanglement [197, 39, 193] and 3D graphics rendering [107, 46]. However, many of the existing works require supervision or semi-supervision [99, 142]. InfoGAN [28] proposes to learn disentangled representations in a completely unsupervised manner, although it is limited to a single domain and cannot easily be extended to describe cross-domain data. Beta-VAE [78] modifies the VAE framework and introduces an adjustable hyper-parameter beta that balances reconstruction accuracy and compactness of the latent code. Nevertheless, it is still hard to achieve satisfying disentangling performance. [178, 137] improve the performance of unsupervised disentanglement by applying adversarial training in the output space. We found that while this usually improves the quality of the generated images, it does not learn better disentangled representations. One of the works most similar to ours is [120], which is based on a conditional VAE (CVAE) trained across multiple domains, with an adversarial loss applied to the latent space. We notice that while this is effective in making the latent code more domain-invariant, it makes the reconstruction less accurate as the number of hidden factors increases.
Image Translation One of the main applications of our disentangled representation is cross-domain image translation. Deep learning-based image translation has motivated much recent research. Isola et al. [86] proposes Pix2Pix, which uses conditional GANs and paired training data for image translation between two domains. As follow-up works, Wang et al. [200] extended the framework to generate high-resolution images conditioned on the source domain, and BicycleGAN [226] adds noise as input and applies it to multimodal image translation. Without paired training data as supervision, additional constraints are used to regularize the translation. For example, CycleGAN [225] proposes to use cycle-consistency to ensure a one-to-one mapping, which generates impressive results on many datasets. Similar ideas have been applied by Kim et al. [96], Yi et al. [214] and Liu et al. [123]. StarGAN [29] recently extends this framework to multi-domain image translation and shows that it can switch diverse attributes of human faces without annotated correspondences. However, these methods do not explicitly learn disentangled latent representations, and their applications are limited.
Other works tackle the problem of image translation and disentangled representation jointly, mostly relying on adversarial training [160, 120, 178] or cycle-consistency constraints [113, 122, 81]. However, they are either limited to dual domains or suffer from the aforementioned reconstruction-disentanglement trade-off.

Figure 5.2: The baseline CVAE model. The decoder is conditioned on the identity c, which is encoded as a 1-hot vector.
Human Retargeting Face retargeting and reenactment attract a lot of research interest given their wide applications [35, 134, 58]. Similar to our work, Olszewski et al. [144] study facial and speech animation in the setting of VR head-mounted displays (HMDs). Face2Face [187] achieves real-time face capture and reenactment through online tracking and expression transfer. The quality of the facial textures is further improved using conditional GANs in [143]. Similarly, [104] achieves face swapping using multi-scale neural networks. For body image synthesis and transfer, [131] applies a two-step process, which generates the global structure first and then synthesizes fine details. [54] devises a framework based on a variational U-Net that separates appearances and poses. However, it requires a neutral representation of the target identity as input and is unable to achieve retargeting directly.
Figure 5.3: Two variants of our model. Left: jointly train two CVAEs for images and keypoints and enforce a latent-consistency constraint. Right: train a single CVAE but generate the keypoints alongside the image.
5.3 Our Approach
5.3.1 Learning Domain-Invariant Representations
We first define the problem of learning domain-invariant representations. Given a collection of image data {X_c}_{c=1}^N across N domains (identities here), the task of learning domain-invariant representations is to learn a low-dimensional embedding z for each image I given its domain id c, where z should capture the semantics shared across domains, such as poses and expressions, and disregard identity-specific attributes. When we know the semantics explicitly, or when we have correspondences of the semantics between different domains as supervision, we can directly infer z to be consistent with the semantics. However, acquiring such supervision is usually expensive. Instead, we are interested in the setting where no explicit supervision is available, except that we know the class label of each image. This becomes a much harder problem, as we need to infer the cross-domain correspondence of the semantics implicitly.
5.3.2 Baseline: Conditional Variational Autoencoder
Conditional Variational Autoencoder (CVAE) is a type of deep generative model, which aims to form probabilistic representations that explain the data given the domain labels. Specifically, for a sample x, the CVAE learns a low-dimensional embedding z that maximizes the likelihood p(x|z, c). It is known that directly maximizing p(x|c) = ∫_z p(x|z, c) p(z|c) dz is intractable; therefore we maximize the Evidence Lower Bound (ELBO) (5.1) instead:

E_{q_φ(z|x)}[log p_θ(x|z, c)] − D_KL(q_φ(z|x) || p_θ(z|c)).    (5.1)

In (5.1), θ are the model parameters of the decoder G (generation network), φ are the model parameters of the encoder E (inference network), and c is a specific identity. p_θ(z|c) is the prior distribution (usually p_θ(z|c) = N(0, 1)) and q_φ(z|x) is the variational posterior that aims at approximating p_θ(z|x, c). The ELBO is a lower bound of log p_θ(x|c) that can be computed and optimized in a tractable way. From a perspective closer to an encoder-decoder, maximizing the ELBO is equivalent to minimizing the following loss function (5.2), assuming the generation function p_θ(x|z, c) is Gaussian with covariance matrix ½ I:

L(θ, φ) = ||G(E(I), c) − I||_1 + KL(q(E(I)) || N(0, 1)).    (5.2)
By training a plain CVAE as in (5.2), we can learn a low-dimensional representation z from the input image I only. We only concatenate the identity c with samples from the latent variable z for reconstruction, so the identity c is actually used as privileged information during learning. Ideally, we wish to learn a latent code z = E(I, c) only pertinent to the domain-invariant factor d_I. This is usually addressed by several techniques: i) increasing the weight of the KL regularization [78]; ii) reducing the dimensionality of z [113]; or iii) applying a domain-adversarial loss on z [178, 120]. All of these can lead to a more compact representation of z and tend to leave out information only correlated with the class label c. However, we experimentally noticed that the major drawback is that they also limit the ability of z to fully encode d and model the intra-domain variances, hence the reconstruction quality suffers. It is also inconvenient, as it requires time-consuming trial-and-error to find the appropriate hyper-parameters. We show that these challenges can be alleviated by leveraging multi-view information from the data.
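For reference, the plain CVAE objective of Eq. (5.2) with the re-parameterization trick can be sketched as below. The encoder is assumed to output a mean and a log-variance, the identity is passed to the decoder directly as a 1-hot vector (a simplification of the embedding described in Sec. 5.3.4), and the KL weight of 0.1 follows the default stated there.

import torch
import torch.nn.functional as F

def cvae_loss(encoder, decoder, img, id_onehot, kl_weight=0.1):
    # Eq. (5.2): L1 reconstruction of the image plus KL(q(z|x) || N(0, I)).
    mu, logvar = encoder(img)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)     # re-parameterization
    recon = decoder(torch.cat([z, id_onehot], dim=1))
    rec_loss = F.l1_loss(recon, img)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec_loss + kl_weight * kl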
5.3.3 Disentangling with Multi-view Information
Figure 5.4: Top: results of the image-based CVAE baseline. Middle: results of the keypoints-based CVAE. Bottom: results of jointly training the image and keypoint CVAEs with the latent-consistency constraint. For this example, the image-based CVAE fails to encode the facial expression into the latent representation, while the keypoints-based CVAE successfully models and transfers the expressions. By jointly training the two CVAEs we encourage the latent code of the images to preserve the identity-invariant semantics as well.
We notice that a major hurdle in learning disentangled, domain-invariant representations from unannotated data is the number of factors that could explain the variations, ranging from expressions, poses, shapes and global illumination to headset locations. Faithfully modeling all these factors while separating them from c is a challenging task. On the other hand, if the variations are reduced to a small number of factors, it becomes much easier to disentangle and learn a z that models d and is independent of c. We notice that for our task of human retargeting, one of the essential intra-domain variations can be explained by facial keypoints or body poses, which are relatively easy to acquire [23, 174]. With the keypoints, we can train a conditional VAE using the same formulation (5.2), but here I is a representation of the keypoints. Fig. 5.4 shows an example where the keypoints-based CVAE successfully models the facial expression changes while the image-based CVAE fails. The latent code z_K learned from keypoints can be viewed as a simplified version of the latent code z_I learned from images. To model the remaining factors of variation, we could then train two CVAEs (one for keypoints and one for images) end-to-end, such that the objective function becomes:
L(θ, φ) = ||G_K(E_K(K), c) − K||_1 + D_KL(q(E_K(K)) || N(0, 1))
         + ||G_I(E_I(I), z_K, c) − I||_1 + D_KL(q(E_I(I)) || N(0, 1)).
Unfortunately, our experiments show that this still suffers from the disentanglement-reconstruction trade-off as before. Alternatively, we propose two unified frameworks that explicitly use multi-view information as constraints and effectively leverage the simplified structure of the keypoints to guide the training of the image-based CVAE (Fig. 5.3).
Multi-view Training and Multi-view Output
We jointly train two CVAE models in parallel, with encoder-decoder (E_I, G_I) for images and encoder-decoder (E_K, G_K) for keypoints. We then encourage their latent codes to be consistent by using ||z_I − z_K||_2 as a regularizer. The overall training objective is:
L(θ, φ) = ||G_K(E_K(K), c) − K||_1 + λ_kl D_KL(q(E_I(I)) || N(0, 1))
         + ||G_I(E_I(I), c) − I||_1 + λ_kl D_KL(q(E_K(K)) || N(0, 1))
         + λ_z ||z_I − z_K||²_2.
As shown in Fig. 5.3 (left), we associate the two CVAEs with the latent-consistency term, such that we explicitly encourage the latent code z_I to contain the domain-invariant semantics within z_K. The hyper-parameter λ_z determines the similarity between z_I and z_K. If λ_z is very large, z_I becomes identical to z_K and is unable to model variations of factors other than the keypoint semantics. On the contrary, if λ_z is too small, it decouples the training of the image and keypoint branches, which leads back to the image-based CVAE. We study the effects of λ_z in the ablation study.
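The multi-view training objective can be written compactly as below; the encoders are assumed to return (mean, log-variance) pairs, c is the identity conditioning code, and the default weights λ_kl = 0.1 and λ_z = 1 follow Sec. 5.3.4.

import torch
import torch.nn.functional as F

def kl_to_standard_normal(mu, logvar):
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

def multiview_loss(enc_img, dec_img, enc_kpt, dec_kpt, img, kpt, c,
                   lam_kl=0.1, lam_z=1.0):
    # Image branch: reconstruct the image from z_I and the identity code.
    mu_i, logvar_i = enc_img(img)
    z_i = mu_i + torch.randn_like(mu_i) * torch.exp(0.5 * logvar_i)
    rec_img = F.l1_loss(dec_img(torch.cat([z_i, c], dim=1)), img)
    # Keypoint branch: reconstruct the keypoints from z_K and the identity code.
    mu_k, logvar_k = enc_kpt(kpt)
    z_k = mu_k + torch.randn_like(mu_k) * torch.exp(0.5 * logvar_k)
    rec_kpt = F.l1_loss(dec_kpt(torch.cat([z_k, c], dim=1)), kpt)
    # Latent-consistency term pulls z_I towards z_K.
    consistency = F.mse_loss(z_i, z_k)
    return (rec_img + rec_kpt
            + lam_kl * (kl_to_standard_normal(mu_i, logvar_i)
                        + kl_to_standard_normal(mu_k, logvar_k))
            + lam_z * consistency)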
Another variant of our model is to train a single CVAE for images, but in addition to the image reconstruction we have a separate decoder G_K that outputs the keypoints, which are compared with the ground truth:

L(θ, φ) = ||G_I(E_I(I), c) − I||_1 + ||G_K(E_I(I), c) − K||_1 + λ_kl D_KL(q(E_I(I)) || N(0, 1)).

By generating the keypoints as well as the images, this also encourages z_I to preserve the semantic information of the keypoints. Of course, we could potentially output other multi-view information such that it is also correctly modeled in the latent code z_I. The framework is illustrated in Fig. 5.3 (right).
The two variants of our model, multi-view training and multi-view output, both significantly improve the disentanglement and reconstruction quality in our experiments compared with the baseline CVAE model, especially in challenging cases where the image-based CVAE fails (Fig. 5.4). Our simple frameworks also outperform other, more complicated methods by virtue of their more direct use of multi-view knowledge (Sec. 5.4). The two frameworks have similar performance in most situations, and we will mainly show and discuss the results of multi-view training. It is important to note that the multi-view information such as keypoints and poses is directly extracted from the images; therefore no additional information is given. However, we show that explicitly encouraging the latent code to model such information is a crucial step towards improving the model performance.
5.3.4 Detailed Implementation
Our image-based CVAE model consists of an encoder and a decoder; each has six convolution or transposed convolution layers [48]. Both the convolution and the transposed convolution layers have the following parameters: kernel size = 4, stride = 2 and padding = 1. For the input, the images are randomly cropped and resized to 256×256. At the end of the encoder, the features are flattened into a 128-dim mean vector and a 128-dim variance vector, from which the 128-dim latent code z_I is sampled using the re-parameterization trick. The domain label is represented as a 1-hot vector, which is first mapped to a 128-dim code z_c with a fully connected layer. During the decoding phase, z_I is first concatenated with z_c and then decoded back to the image. The weights are initialized using Xavier initialization [64], except for the identity-conditioned fully connected layer, whose weights are sampled from N(0, 1). By default, we use ℓ1 as the reconstruction loss, and the weight of the KL regularization is set to λ_kl = 0.1.

For the multi-view model, the keypoints-based CVAE takes a 1-D keypoint vector as input. The keypoint network also consists of an encoder and a decoder; each has four fully connected layers with 500 output channels. The size of the latent code z_K is also 128. We also use ℓ1 to measure the keypoint reconstruction. For the comparison between z_K and z_I, we use the ℓ2 distance. By default, the weight of the latent-consistency loss λ_z is set to 1.
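A minimal version of the image encoder/decoder just described (six stride-2 convolutions down to 128-dim mean/variance vectors, an identity embedding, and six transposed convolutions back to 256×256) could look as follows; the channel widths are our assumptions.

import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    # Six stride-2 convolutions (kernel 4, padding 1): 256x256 -> 4x4 feature map,
    # flattened into a 128-dim mean and a 128-dim log-variance.
    def __init__(self, ch=32, z_dim=128):
        super().__init__()
        chs = [3, ch, ch * 2, ch * 4, ch * 8, ch * 8, ch * 8]
        layers = []
        for cin, cout in zip(chs[:-1], chs[1:]):
            layers += [nn.Conv2d(cin, cout, 4, 2, 1), nn.LeakyReLU(0.2, True)]
        self.conv = nn.Sequential(*layers)
        self.fc_mu = nn.Linear(chs[-1] * 4 * 4, z_dim)
        self.fc_logvar = nn.Linear(chs[-1] * 4 * 4, z_dim)
    def forward(self, x):
        h = self.conv(x).flatten(1)
        return self.fc_mu(h), self.fc_logvar(h)

class ImageDecoder(nn.Module):
    # The 1-hot identity is mapped to a 128-dim code z_c, concatenated with z_I,
    # then decoded with six stride-2 transposed convolutions back to 256x256.
    def __init__(self, num_ids, ch=32, z_dim=128):
        super().__init__()
        self.ch = ch
        self.id_fc = nn.Linear(num_ids, 128)
        self.fc = nn.Linear(z_dim + 128, ch * 8 * 4 * 4)
        chs = [ch * 8, ch * 8, ch * 8, ch * 4, ch * 2, ch]
        layers = []
        for cin, cout in zip(chs[:-1], chs[1:]):
            layers += [nn.ConvTranspose2d(cin, cout, 4, 2, 1), nn.ReLU(True)]
        layers += [nn.ConvTranspose2d(ch, 3, 4, 2, 1)]
        self.deconv = nn.Sequential(*layers)
    def forward(self, z, id_onehot):
        zc = torch.cat([z, self.id_fc(id_onehot)], dim=1)
        h = self.fc(zc).view(-1, self.ch * 8, 4, 4)
        return self.deconv(h)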
5.4 Experiment Results
We conduct experiments on two datasets: the Head-Mounted Display (HMD) dataset and the CMU Panoptic dataset [91]; both have images and keypoints available. Our goal is to learn an identity-invariant latent representation, which can be applied to face/pose retargeting and reenactment. We compare against several methods regarding the quality of the output images and the learned latent codes: the image-based CVAE baseline, UFDN [120], CycleGAN [225] and StarGAN [29]. Finally, as an application, we also show how the latent code can be used to drive 3D avatars in VR and precisely transfer facial expressions and eye movements.
5.4.1 Datasets
HMD Dataset The Head-Mounted Display (HMD) dataset consists of images captured from cameras mounted on a head-mounted display. There are 123 different identities in total. For each identity, a diverse set of expressions and sentences is recorded in frames and labeled as "neutral face", "smile", "frown", "raise cheeks" and others. For each frame, we capture different views such as the mouth, the left eye, and the right eye. During preprocessing, we sample the images near the peak frame of each expression, resulting in around 1,500 images per identity. The images are gray-scale and are normalized and resized to 256×256 before being given to the network as input. We also train a keypoint detector to extract the keypoints near the mouth, nose and eyes.

CMU Panoptic Dataset We use the Range of Motion data in the CMU Panoptic dataset [2]. It consists of 32 identities, where each identity makes different poses under multiple VGA and HD cameras from different viewpoints. We use the images captured with the front-view HD camera; each identity has around 7,000 images. The dataset also provides the 3D pose for each frame, which we transform and project to 2D to align with the images and use as input.
5.4.2 Cross-identity Image Translation
To achieve cross-identity image translation, we first train our multi-view CVAE model with all the identities. At test time, we only need the image-based CVAE component. Given an input image of the source id, we first encode and sample the latent representation. Before decoding, we change the conditional id label to the target id to translate the image to the new identity while keeping the facial expression. In Fig. 5.5, we show examples of HMD results between two identities and compare with other methods. As our baseline, the image-based CVAE delivers neither satisfying image quality nor well-preserved semantics. Applying adversarial training to the latent code [120] improves the quality, but all expressions collapse to neutral faces. CycleGAN [225] generates sharp images thanks to the adversarial training applied to the output image, although the semantics are not well preserved for some frames (columns 3, 4, for example). CycleGAN is also limited to two domains, and it is not obvious how to infer the latent representations. StarGAN [29], the multi-domain version of CycleGAN, produces fuzzier images with many artifacts. Our results, although not as sharp as the CycleGAN results, have fewer artifacts and more accurately preserve the expressions. Similar observations can be made from the results on the Panoptic dataset (Fig. 5.6). Note
that although we only show results between two identities, the image CVAE, UFDN, StarGAN and our model are all multi-domain models and are trained and tested with all the identities in the dataset.

Figure 5.5: Visual results of HMD image translation between two identities. The top row shows the source id with different expressions (col 1-5) and the target id (col 6); the following rows show results of the image CVAE, UFDN [120], CycleGAN [225], StarGAN [29], and ours. Note how CycleGAN sometimes generates completely wrong expressions (e.g., row 4, col 3).
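At test time, retargeting only requires swapping the identity label before decoding. Assuming the encoder/decoder interfaces of the sketch given in Sec. 5.3.4 (the encoder returns mean and log-variance; the decoder takes the latent code and a 1-hot identity), this amounts to:

import torch

def retarget(encoder, decoder, img, target_id, num_ids):
    # Encode the source image, then decode with the 1-hot label of the target identity.
    with torch.no_grad():
        mu, _ = encoder(img)                               # use the mean as the latent code
        onehot = torch.zeros(img.size(0), num_ids, device=img.device)
        onehot[:, target_id] = 1.0
        return decoder(mu, onehot)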
Fig. 5.7 shows examples of expression interpolation between different identities.
Given two source images of different ids, we extract the latent code of each image using
the trained model. Since the codes are identity-invariant, we can interpolate between
them as approximate representations of transitional expressions. We then concatenate the interpolated code with either identity label and decode it to images, obtaining identity-specific interpolated faces between the two source expressions.

Figure 5.6: Visual results of Panoptic image translation. The top row shows the source id with different poses (col 1-5) and the target id (col 6); the following rows show results of the image CVAE, UFDN [120], CycleGAN [225], and ours.
For a new identity not present in the training set, we cannot directly retarget, since the model does not "see" the label corresponding to that identity during training. This can be handled by taking the existing ids as a basis, so that the new id can be viewed as a combination of existing ids. To achieve this, we learn a regressor from the training images to their 1-hot labels. We then regress the images of the new person to a combination of the existing 1-hot labels, which serves as the new label for decoding. Fig. 5.8 shows example results of face retargeting to a new identity.
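The label-regression step for unseen identities can be sketched as follows; the clamping and normalization used to turn the regressor outputs into a convex combination of existing ids are our assumptions.

import torch

def soft_label_for_new_identity(label_regressor, new_person_images):
    # label_regressor: a network trained to map training images to their 1-hot identity
    # labels. For an unseen person, average its predictions over several images and use
    # the normalized result as the conditioning label for decoding.
    with torch.no_grad():
        preds = label_regressor(new_person_images).clamp(min=0.0)   # (N, num_ids)
        soft_label = preds.mean(dim=0, keepdim=True)
        return soft_label / soft_label.sum().clamp(min=1e-8)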
Figure 5.7: Interpolation between expressions of two different ids. We mark the source images with colored borders. The rest are interpolation results.

Figure 5.8: Generalizing to new identities. (a) is the target person not seen during training. (b) and (d) are source expressions. (c) and (e) are the retargeted faces where the decoder takes the regressed id of (a) as the conditioning label.
We evaluate our results quantitatively using both unsupervised and supervised metrics: i) Auto-encoder (AE) error. We train a separate VAE for each identity. Given a translated image with target identity i, we measure the reconstruction error after giving it as input to the i-th VAE. The idea is that a trained VAE should reconstruct an image better when the input is more similar to the images of the specific identity it was trained on. ii) Classification error. We train an identity classification model and use it to compute the cross-entropy error of the translated image w.r.t. the target id. Both i) and ii) evaluate the translated image quality and how closely it resembles the target domain, but do not measure how correctly it preserves the source semantics (expressions and poses). For that purpose, iii) we use the available labelings of peak frames for the expressions and use those frames as the corresponding ground truth to directly measure the ℓ2 error. This metric measures both the image quality and the correctness of the semantics. Results are shown in Table 5.1.
Both CycleGAN and UFDN perform well on the first two metrics. This is expected, as they tend to have high image quality, especially CycleGAN. On the other hand, our results have the smallest ℓ2 error when compared with the ground-truth correspondences. This shows that our results not only have favorable quality but also best preserve the semantics. UFDN tends to lose semantics and generate mostly neutral faces, so its ℓ2 error is large. Compared with the image-based CVAE baseline, which does not use keypoints, our quality and semantics are both significantly better, showing the effectiveness of multi-view training.
Method            AE Error    Classification Error    ℓ2 Error
Ground Truth      0.31        0                       0
Image CVAE        1.62        2.60                    1.15
UFDN [120]        0.56        1.10                    2.21
CycleGAN [225]    0.33        0.18                    0.82
StarGAN [29]      2.02        4.76                    3.22
Ours              0.51        0.36                    0.76
Table 5.1: Numerical comparisons. Our results have favorable quality (as shown by the AE and classification errors) and best preserve the semantics (as shown by the ℓ2 error).
5.4.3 Identity-Invariant Representations
We visualize the latent representations using t-SNE to show their identity-invariance property (Fig. 5.9). We plot the latent codes of images from five identities and four representative expressions, and color the points by both expression and identity. We can see from the visualization that our representations (top) separate different expressions well and cluster images of the same expression from different identities. The distance between points is further shown to be a meaningful metric of how similar two expressions are. For example, "jaw open lips together" is very different from the other expressions, and its points are isolated from the others. As for the image-based CVAE (bottom), the points of different expressions are largely mixed without clear separation.
We show the application of our approach in social virtual reality (VR). Combining the face view with the eye views, we train three multi-view CVAEs, one for the face and two for the eyes. These three models encode the facial expressions and eye movements of any person into a shared latent space. Given the parameters of the 3D avatar of the target id at each frame and the corresponding images, we train a regression model from the latent codes, encoded from the face and eye images of the target person, to the 3D parameters of the avatar. Then, given another person wearing the headset, we encode their face and eyes with the trained CVAE models and map them to the shared latent space, which is further translated to the avatar parameters with the trained regressor. Assuming the latent space is identity-invariant, this allows us to encode and transfer the facial expressions and eye movements of any user to the target avatar. Fig. 5.10 shows some examples of our results.
5.5 Conclusion
We proposed the multi-view CVAE model to learn disentangled feature representations for data across multiple domains. Our model leverages multiple data sources, such as images, keypoints and poses, and formulates them as additional constraints when training the CVAE model. It explicitly guides the learned representation to encode the semantics that are shared across domains while leaving out the domain-specific attributes. We show that our model can be applied to human retargeting and demonstrate the effectiveness of using additional "views" of the data, which leads to improved reconstruction quality and better disentangled representations.
Figure 5.9: t-SNE visualization of the embeddings of five identities and four expressions (shown above). Top: ours. Bottom: CVAE baseline. We color the points by both identities (left) and expressions (right).
Figure 5.10: Driving the 3D avatar of another person in VR while wearing the headset.
From left to right: mouth, left eye, right eye, and the rendered 3D avatar of target identity
(different from the source id wearing the headset). Note the left and right are mirrored
between the image and the avatar.
Chapter 6
Unconstrained Facial Expression
Transfer using Style-based Generator
6.1 Introduction
Figure 6.1: Examples of facial expression transfer between two images. Our method can take any two face images as input and combine the appearance (a) and expression (b) to synthesize a realistic-looking reenactment result (c).
Many methods have been developed in recent years to render realistic faces or edit specific attributes of face images such as appearance or facial expression [94, 7, 22, 213, 104, 187, 140, 143, 35]. Such techniques have found applications in a variety of tasks, including photo editing, visual effects, and social VR and AR, as well as the controversial "DeepFake", where people create fake images or videos of prominent public figures [3], oftentimes with the malicious intent of spreading fake news or misinformation. The availability of social media, online image and video portals, and public datasets has provided easily accessible data that facilitates better understanding and modeling of facial attributes at different levels [92, 93].
We consider the problem of facial expression transfer in the image space: given two face images, we aim to create a realistic-looking image that combines the appearance of one image and the expression of the other. Existing methods mostly fall into two categories. The first is purely geometry-based, fitting a blendshape model [20] or tracking facial keypoints [7] to warp and synthesize the target facial textures. However, such methods require videos as input to fit the 3D model and track the facial transformations. Also, hidden regions such as teeth and wrinkles are hallucinated by directly transferring from the source expressions and therefore do not account for their different appearances. The second category consists of data-driven, learning-based synthesis methods that leverage recent advances in deep neural networks. In addition to facing issues similar to those of geometry-based methods, they also face a scalability issue, as they often require paired training data that are difficult to acquire [143] or need to train separate models for each target identity [104, 3].
To address these issues, we propose a novel expression transfer and reenactment
method based on the recent style-based generative adversarial network (StyleGAN) [93]
that was developed to generate realistic-looking human faces. The approach is motivated
by several observations. First, as shown in [224], deep generative models are capable of
learning a low-dimensional manifold of the data. Editing in the latent manifold instead
of the pixel space ensures that the image does not fall off the manifold and looks natural
and realistic. Second, StyleGAN can learn hierarchical "style" vectors that are shown to
explain attributes at different levels, from fine attributes such as hair color or eyes
open/closed to high-level aspects such as pose, face shape, and eyeglasses. At the core of
our algorithm is an optimization scheme that infers and combines the style vectors to
create a face that fuses the appearance and the expression of two images.
With a pre-trained face model, we can directly apply it to any face and infer its
semantic styles in the latent space. Assuming different style layers capture different
attributes, we propose an integer linear programming (ILP) framework to optimize the
style combination such that the generated image shares the appearance of one face and
the expression of the other. Unlike previous methods, the inference and optimization
scheme generalizes to unseen identities or new faces without re-training the model.
Our end-to-end expression transfer system also does not rely on geometry modeling or
shape/texture separation, making it more widely applicable to difficult poses or extreme
viewpoints. Moreover, our approach is fully automatic and can easily be used to
generate results at scale without compromising quality, owing to the effectiveness of
StyleGAN in modeling natural face distributions.
We test our approach on multiple datasets to show its effectiveness. The results
outperform previous geometry-based and learning-based methods both visually and
quantitatively. In particular, we demonstrate its application to creating large-scale facial
expression transfer or "DeepFake" data, which can potentially be used to train a more
robust detector to safeguard against the misuse of such techniques.
Our contributions can be summarized as follows:
Figure 6.2: Our system pipeline.
1. We introduce an optimization method based on StyleGAN that effectively infers
the latent style of a face image. The style is a hierarchical vector that disentangles
explanatory factors of the face.
2. We propose a unified framework for facial expression transfer based on the
inferred latent styles.
3. We describe a fully automatic, end-to-end system that enables generating high-
quality and scalable face reenactment results.
6.2 Related Work
6.2.1 Face Reconstruction and Rendering
Face reconstruction refers to the task of reconstructing 3D face models of shape and
appearance from visual data, which is often an essential component in animating or
transferring facial expressions. Traditionally, the problem is most commonly tackled
using geometry-based methods, such as fitting a 3D model to a single image [15, 16] or a
video [21, 55, 58, 82, 166, 176, 187, 206]. Recently, due to the effectiveness of deep
neural networks, learning-based methods have become popular. [156, 184, 192, 157, 165]
fit a regressor to predict 3D face shape and appearance from a large corpus of images.
[165] uses an encoder-decoder network to infer a detailed depth image and dense
correspondence map, which serve as a basis for non-rigid deformation. [94] combines
the geometry-based method with learning: it first fits a monocular 3D model to extract
the parameters and then trains a rendering network using the extracted parameters
as input. Unfortunately, its encoder-decoder model is identity-specific and therefore
cannot generalize to new identities. paGAN [140] also uses self-supervised training to
learn a renderer from mesh deformations and depth to textures. However, when a
single image is provided, it requires either a neutral expression or manual initialization.
Besides, the result often lacks realism and fine details due to the process of projecting
the rendered texture onto a 3D model.
6.2.2 Face Retargeting and Reenactment
Facial reenactment transfers expressions from a source actor to a target actor. Most of
these methods require video as input so as to compute the dense motion fields [7, 125,
177] or the warping parameters [187, 188]. Then it uses 2D landmarks or dense 3D
models to track the source and target faces. In the case of a single target image, the
inner mouth interiors and fine details such as wrinkles are hallucinated by copying from
the source, which often leads to uncanny results. Another common approach is to uti-
lize learning-based techniques, especially generative models, for face retargeting. [143]
trains a conditional GAN (cGAN) to synthesize realistic inner face textures. However,
it requires paired training data, which is difficult to acquire. [104] learns face swapping
using convolutional neural networks (CNN) trained with content and style loss. The
swapping model is also identity-specific and needs to be trained for each target actor.
DeepFake [3] on the other hand trains an encoder-decoder for each pair of identities to
be swapped, where the encoder is shared while the decoder is separate for the source
actor and the target actor. For the same reason as [104], it also requires laborious
training to generalize to new people.
6.2.3 Deep Generative Models for Image Synthesis and Disentanglement
Deep generative models such as GANs [66] and VAEs [100] have been very successful
in modeling natural image distributions and synthesizing realistic-looking figures.
Recent advances such as WGAN [6], BigGAN [19], Progressive GAN [92] and
StyleGAN [93] have developed better architectures, losses and training schemes that
achieve synthesis results of higher resolution and impressive quality. Because generative
models learn a low-dimensional manifold from image data, they are often adapted to
disentangle latent explanatory factors. UFDN [120] uses a conditional VAE (cVAE) to
disentangle domain-invariant and domain-specific factors by applying an adversarial loss
on partial latent encodings. StyleGAN [93] modifies the GAN architecture to implicitly
learn hierarchical latent styles that contribute to the generated faces. However, it remains
unclear how to infer these disentangled factors from a given image, and therefore the
model could not be directly applied to manipulating existing data.
6.3 Our Approach
We first define the problem of facial expression transfer. Given an image of the target
identity $I_1$ and another image of the source actor $I_2$, we would like to rewrite the facial
expression of $I_1$ such that it resembles the expression of $I_2$. Note that unlike [143, 3],
we do not modify the facial appearance of the target identity; instead, we only aim to
reenact the facial expression of $I_1$ driven by the source expression $I_2$.
6.3.1 Overview
Our pipeline consists of the following components:
1. Face detect and normalize. We first detect the facial landmarks and use them to
crop and normalize the face regions of $I_1$ and $I_2$.
2. Style inference. With the pre-trained StyleGAN model, we iteratively optimize
and infer the style vectors of the normalized $I_1$ and $I_2$.
3. Style fuse and regenerate. We apply integer linear programming to fuse the two
style vectors and regenerate a new image that combines the appearance of $I_1$
and the expression of $I_2$.
4. Warp and blend. We warp the generated face based on facial landmarks and blend
it with the original $I_1$ to rewrite the facial expression so that it agrees with $I_2$.
The pipeline is illustrated in Fig. 6.2.
6.3.2 Face Detect and Normalize
We first detect the facial landmarks using the Dlib detector [97], which returns 68 2D
keypoints from the image. Given the landmarks, the face can be rectified by computing
the rotation based on the eye-to-eye (horizontal) and eye-to-mouth (vertical) landmark
directions. We then crop the image based on the eye-to-eye and eye-to-mouth distances,
such that the output is a region slightly larger than the face. Finally, we
resize the image to 1024×1024, which is the size of the StyleGAN output image.
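As a concrete illustration of this step, a minimal sketch using dlib and OpenCV is given below. It assumes dlib's standard 68-point shape predictor file is available locally; the crop margin and the exact rotation/scaling details are illustrative choices rather than the precise settings of our system.

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_landmarks(img):
    # Assume one face per image; return the 68 keypoints as a (68, 2) array.
    rect = detector(img, 1)[0]
    shape = predictor(img, rect)
    return np.array([[p.x, p.y] for p in shape.parts()], dtype=np.float32)

def rectify_and_crop(img, out_size=1024, margin=2.0):
    pts = detect_landmarks(img)
    left_eye, right_eye = pts[36:42].mean(0), pts[42:48].mean(0)
    eye_center = (left_eye + right_eye) / 2.0
    mouth_center = pts[48:68].mean(0)

    # Rotation angle that makes the eye-to-eye direction horizontal.
    dx, dy = right_eye - left_eye
    angle = float(np.degrees(np.arctan2(dy, dx)))

    # Crop scale derived from the eye-to-mouth distance (the margin is an assumption).
    size = margin * float(np.linalg.norm(mouth_center - eye_center))
    scale = out_size / (2.0 * size)
    M = cv2.getRotationMatrix2D((float(eye_center[0]), float(eye_center[1])), angle, scale)
    M[:, 2] += out_size / 2.0 - eye_center   # recenter the face in the output crop
    return cv2.warpAffine(img, M, (out_size, out_size))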
6.3.3 Style Inference
Given the cropped and rectified image, our goal is to infer its style vector w.r.t. the
StyleGAN generator [93]. The original StyleGAN (Fig. 6.3) consists of a mapping
Figure 6.3: StyleGAN architecture. Figure courtesy of [93].
network $f$ and a synthesis network $g$. $f$ takes random noise as input and outputs a style
vector $s$, which is modeled as 18 layers where each layer is a 512-dimensional vector. The
synthesis network takes the style vector and fixed noise as input, where the style vector
is used as the parameters of adaptive instance normalization [80] to transform the output
before each convolution layer. [92] shows that using the style vector as layer-wise guidance
not only makes synthesizing high-resolution images easier, but also leads to hierarchical
disentanglement of local and global attributes.
Our goal is to reverse the process: with a pre-trained model, we aim to find the
corresponding style vector $s_I$ of a given image $I$. In this way we could manipulate $s_I$
to change the corresponding attributes of $I$. More formally, our goal is to solve the
following objective:

$s_I = \arg\min_{s} D(g(s), I).$  (6.1)

Here $g$ is the pre-trained synthesis network with fixed weights, and $D$ is the distance
function measuring the similarity between the output image and the original image. Any
distance function such as $\ell_1$ or $\ell_2$ could be applied here; however, we found that using a
pre-trained VGG network [171] to compute the perceptual similarity gives the best
reconstruction results, which is consistent with previous findings that perceptual loss best
reflects the human sense of similarity [87]. In our experiments, we use the mid-level
feature layer of the pre-trained VGG-16 network as the feature extractor before computing
the Euclidean distance. More analysis of the distance function and the choice of VGG-16
layers is given in the ablation study.
Given the image $I$, we iteratively solve for $s_I$ that minimizes Eqn. 6.1. $s$ is first
initialized as a zero-value style vector, so that $g(s)$ is a random face. We then compute
the error function $D(g(s), I)$ and backpropagate the loss through $g$ to update $s$ using
gradient descent. We use learning rate $lr = 1$ and, for each image, we run 1,000
iterations. Fig. 6.4 shows an example loss curve of $D$ during iterative optimization, and
Fig. 6.5 shows how $g(s)$ evolves to become increasingly similar to $I$. After the
optimization is finished, $g(s_I)$ generates a face that is similar to $I$, and $s_I$ can be seen as an
approximation of the underlying style vector of $I$. Specific to our task, where $I_1$ and $I_2$
are defined as the target identity and source expression image respectively, we solve for
their corresponding $s_{I_1}$ and $s_{I_2}$ using the procedure described above.
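For reference, this iterative inference can be written in a few lines; the sketch below assumes a PyTorch implementation of the fixed synthesis network $g$ that accepts a 1 x 18 x 512 style tensor (and uses its own fixed noise internally), and uses a mid-level VGG-16 feature layer as the perceptual distance. Names and the exact layer index are illustrative.

import torch
import torch.nn.functional as F
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

# Mid-level VGG-16 features used as the perceptual feature extractor (frozen).
vgg = models.vgg16(pretrained=True).features[:16].to(device).eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(x, y):
    # Euclidean distance between VGG features of two images, i.e. D(g(s), I).
    return F.mse_loss(vgg(x), vgg(y))

def infer_style(g, image, iters=1000, lr=1.0):
    # image: 1 x 3 x 1024 x 1024 tensor; g: frozen StyleGAN synthesis network.
    s = torch.zeros(1, 18, 512, device=device, requires_grad=True)  # zero-initialized style
    opt = torch.optim.SGD([s], lr=lr)                               # plain gradient descent
    for _ in range(iters):
        opt.zero_grad()
        loss = perceptual_loss(g(s), image)
        loss.backward()   # backpropagate through the fixed generator to update s only
        opt.step()
    return s.detach()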
Figure 6.4: The loss curve as we iteratively optimize Eqn. 6.1.
Figure 6.5: Visualization of $g(s)$ as we iteratively optimize Eqn. 6.1, shown at iterations 0, 10, 50, 100, and 1000.
6.3.4 Style Fuse and Regenerate
The goal of expression transfer is to come up with a style vector $s'$ such that $g(s')$ shares
the appearance of $I_1$ and the expression of $I_2$. More formally, the objective function can
be defined as:

$s' = \arg\min_{s} D_1(g(s), I_1) + D_2(g(s), I_2).$  (6.2)
Here $D_1$ is the appearance distance and $D_2$ is the expression distance. Solving Eqn. 6.2
directly requires well-defined, differentiable expression and appearance distance functions.
In [104], $D_2$ is approximated with an $\ell_2$ content loss, and $D_1$ is approximated
with a style loss based on the Gram matrix. Unfortunately, artifacts in the results show
that this does not account for the complexity of facial appearance/expression separation.
To ease the optimization we constrain the solution space of $s'$ and assume it lies in
the manifold spanned by $s_{I_1}$ and $s_{I_2}$, i.e. $s' = \alpha s_{I_1} + \beta s_{I_2}$, where $\alpha$ and $\beta$ are $18 \times 18$
diagonal matrices. In other words, each layer of $s'$ is a linear combination of $s_{I_1}$ and $s_{I_2}$.
Without defining ad-hoc distance functions, we constrain $\beta$ to be a 0-1 matrix and let
$\alpha + \beta = \mathbf{1}$. This makes Eqn. 6.2 an integer linear programming problem and significantly
reduces the size of the solution space, so that we can heuristically search for the optimal
solution. Converting Eqn. 6.2 to ILP assumes we always take several layers of $s_{I_2}$ and
combine them with the remaining layers of $s_{I_1}$, which is a valid assumption as [93] shows
that different style layers correspond to different attributes, and certain layers could
represent facial expressions. Our experiments show that using a fixed solution such that
$\beta = \mathrm{diag}(0, 0, 0, 1, 1, 0, \dots, 0)$ works surprisingly well on a large variety of images.
We analyze different combinations of style layers and their effects in the ablation study.
Finally, we regenerate the image $I' = g(s')$.
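Once the coefficients are fixed, the fusion step itself is a simple layer-wise splice; a minimal sketch is given below, assuming 1 x 18 x 512 style tensors inferred as in Sec. 6.3.3. The choice of layers 4 and 5 corresponds to the fixed solution above (written 0-indexed here); other index sets can be searched heuristically.

def fuse_styles(s1, s2, expression_layers=(3, 4)):
    # Each layer of s' comes entirely from s1 (identity) or from s2 (expression),
    # i.e. 0-1 coefficients per layer.
    s_fused = s1.clone()
    for i in expression_layers:
        s_fused[:, i, :] = s2[:, i, :]
    return s_fused

# Regenerate the fused face, I' = g(s'):
# s_prime = fuse_styles(s_I1, s_I2)
# I_prime = g(s_prime)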
6.3.5 Warp and Blend
After we regenerate the image $I'$, we compute a transformation based on the facial
landmarks of $I'$ and the original $I_1$ and warp $I'$ to align with the face region of $I_1$. We also
compute a mask by taking the convex hull of the facial landmarks and post-processing it
with a Gaussian blur. Finally, we blend the warped $I'$ with $I_1$ using the mask to rewrite
the facial expression of $I_1$ (Fig. 6.6).
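A minimal sketch of this post-processing step is shown below, using OpenCV; it takes the 68-point landmark arrays of the generated and target images (e.g., from the detector sketched in Sec. 6.3.2), and the blur kernel size is an illustrative choice.

import cv2
import numpy as np

def warp_and_blend(generated, target, pts_gen, pts_tgt):
    # Similarity transform aligning the generated face with the target face region.
    M, _ = cv2.estimateAffinePartial2D(pts_gen.astype(np.float32), pts_tgt.astype(np.float32))
    warped = cv2.warpAffine(generated, M, (target.shape[1], target.shape[0]))

    # Soft mask: convex hull of the target landmarks, feathered with a Gaussian blur.
    mask = np.zeros(target.shape[:2], dtype=np.float32)
    hull = cv2.convexHull(pts_tgt.astype(np.int32))
    cv2.fillConvexPoly(mask, hull, 1.0)
    mask = cv2.GaussianBlur(mask, (31, 31), 0)[..., None]

    # Alpha-blend the warped generated face into the original target image I_1.
    out = warped.astype(np.float32) * mask + target.astype(np.float32) * (1.0 - mask)
    return out.astype(target.dtype)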
Figure 6.6: Post-processing using warping and blending. (a) generated $I' = g(s')$; (b)
warped $I'$; (c) facial mask; (d) original $I_1$; (e) final composite.
6.4 Experiments
6.4.1 Experiment Setup and Results
We tested our method on several datasets for qualitative illustration: FFHQ [93],
CelebA-HQ [92], and web images. Within each dataset, we randomly select pairs of
images as the target identity and the source expression. We then infer their style
vectors using the FFHQ pre-trained StyleGAN model. Note that although the statistics of
images vary between datasets, we found that using a single pre-trained model works
sufficiently well across different data. We then fuse the two style vectors by replacing
the expression layers of the target identity with those of the source expression and
regenerate the image, which is warped and blended with the target identity as the final
output. As discussed in Sec. 6.3.4, we always use layers 4 and 5 of a style vector as the
representation of expressions. It takes on average 2 minutes to process a pair of images,
and most of the time is spent on the iterative style vector inference.
Example visual results are shown in Fig. 6.17. In most cases, our results look like
plausible human faces and accurately transfer the expressions of the source actor to
the target identity. Beyond the mouth region, the method also transfers eye movements and
gaze directions (e.g., Row 3 left). The result keeps the eyeglasses of the target identity
but ignores those of the source expression, showing the model is effective in
disentangling appearance and expression attributes. Furthermore, even when the appearances
of the target identity and the source expression are substantially dissimilar in skin
color, head pose or gender, our model still separates those traits from expressions and
achieves satisfactory transfer results.
For quantitative evaluation, we use the videos from [143]. The videos are captured
by asking different performers to mimic the expressions and speech in an instruction
video, such that the expressions are frame-wise aligned across different videos. We
use one video as the source expressions and the first frame of another video as the target
identity for expression transfer, and then compare the results with the ground-truth frames.
The qualitative illustration in Fig. 6.7 shows that our model captures and transfers subtle
expressions and mouth movements. Table 6.1 shows quantitative results and compares
with [143]. Our results achieve the smallest errors and the highest SSIM, outperforming both
cGAN and direct transfer.
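For completeness, the frame-wise metrics can be computed as in the sketch below. It assumes 8-bit RGB frames of equal size and scikit-image for SSIM; the exact error normalization behind the numbers in Table 6.1 may differ from this per-pixel version.

import numpy as np
from skimage.metrics import structural_similarity as ssim

def frame_metrics(pred, gt):
    # pred, gt: H x W x 3 uint8 frames.
    p = pred.astype(np.float64)
    g = gt.astype(np.float64)
    l1 = np.abs(p - g).mean()              # per-pixel mean absolute error
    l2 = np.sqrt(((p - g) ** 2).mean())    # per-pixel root mean squared error
    s = ssim(pred, gt, channel_axis=-1, data_range=255)
    return l1, l2, s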
Figure 6.7: Expression transfer of videos. (a) target identity; (b) top: source expressions;
bottom: transferred results.
Method                   $\ell_1$ Error   $\ell_2$ Error   SSIM
direct transfer [143]    1790             211              0.815
cGAN [143]               1360             152              0.873
Ours                     1024             136              0.901
Table 6.1: Numerical comparisons with cGAN and direct transfer.
6.4.2 Comparisons
Comparison with paGAN [140] paGAN generates dynamic textures from a single
image using a trained network, and the textures can be applied to produce avatars
or be combined with other faces. Compared with their method, ours does not require the input
to have a neutral expression. Moreover, our model can effectively disentangle expressions
from other attributes and is much more robust to shadows and occlusions. paGAN,
on the other hand, directly transforms the input textures, so the results largely depend
on the input image quality. Fig. 6.8 shows examples where paGAN fails to reconstruct
good textures due to shadow or occlusion while we still manage to capture the
expressions and transfer them to the target identity accurately.
Figure 6.8: paGAN [140] comparison. (a) Source identity; (b) Source expressions with
shadow (above) and occlusion (below); (c) Texture reconstructions with paGAN; (d)
Our expression transfer results.
Comparison with cGAN [143] As described in Sec. 6.4, our results are quantitatively
better than [143] when evaluated on videos. One limitation of [143] is that it trains a
conditional GAN using registered training pairs, which are difficult to acquire. Another
limitation is that it requires the target identity to have a neutral expression, while our
approach does not place any constraint on the facial expression of the target. In addition,
for hidden regions such as the mouth interior, their method directly hallucinates them by
copying from the source expressions. This may lead to artifacts as the content comes from
a different person. Our approach, on the other hand, directly synthesizes the hidden
regions, leading to more coherent and realistic results (Fig. 6.9).
Comparison with DeepFake [3] DeepFake needs to train a separate model, consisting
of a shared encoder and two decoders, for each pair of identities to be swapped, and is
therefore difficult to scale to unseen people. Compared with their results, ours have
higher quality, with more coherent textures and fewer artifacts.
Figure 6.9: Comparison with cGAN. (a) target identity; (b) source expression; (c) their
result; (d) our results. Note their mouth interiors are directly copied from source expres-
sion.
We also better disentangle expressions from other attributes such as eyeglasses and
mustaches (Fig. 6.10).
Figure 6.10: Comparison with DeepFake. (a) source identity; (b) source expression; (c)
DeepFake result; (d) our result.
Comparison with Face2Face [187] and DVP [94] Face2Face and Deep Video Portraits
(DVP) are state-of-the-art facial reenactment approaches. Both methods require a video
sequence as input and use monocular face reconstruction to parameterize face images.
While Face2Face fits a multi-linear PCA model for face synthesis, DVP trains a
rendering-to-video translation network to translate modified parameters to images. Similar
to [3], both methods need to re-train the model for each new identity, which is
cumbersome to scale. Our method, on the other hand, only requires two images, since our
model is a generic style extractor that can be applied to any identity. Fig. 6.11 shows
that although all the methods achieve compelling reenactment effects, the facial
movements of Face2Face and DVP are modest, while our method can synthesize more drastic
expressions.
Figure 6.11: Comparison with DVP and Face2Face. (a) source expression; (b)
Face2Face; (c) DVP; (d) ours.
Comparison with [7] We compare our results with [7], which uses feature tracking and
2D warping to reenact a facial image. Similar to [143], hallucinating hidden regions
such as the mouth interior is problematic here, as [7] directly copies from the source
expression to the target identity, creating inconsistent appearances. Another issue with [7] is
that when the input identity does not have a neutral expression, the reenactment fails as it
is unable to initialize the feature correspondences (Fig. 6.12).
Figure 6.12: Comparison with [7]. (a) target identity; (b) source expression; (c) their
result; (d) ours.
Comparison with Face Swap [104] Face Swap [104] trains an identity-specific network,
such as cage-net or swift-net, to transform any face to appear like the target identity.
Naturally, each trained model is limited to swapping to a fixed identity. Although we
cannot directly compare the results as our goal is not face swapping, we take a random
public image of the target identity and apply expression transfer to compare.
Fig. 6.13 shows that our method faithfully transfers the expression to the target identity,
while their swapped faces fail to preserve the source expressions.
6.4.3 Ablation Study
Different reconstruction losses During iterative style inference, we compute the
perceptual loss between the synthesis network output and the original image and
backpropagate the loss to update the style vector (Sec. 6.3.3). We experiment with different
layers of the VGG-16 network as the feature extractor when computing the loss function.
Fig. 6.15 shows that using the mid-level feature layer (L=9) leads to reconstructions of the
highest quality that are visually most consistent with the original image.
Figure 6.13: Comparison with [104]. (a) source expression; (b) [104] face swap result
using cage-net or swift-net; (c) random image of Nicolas Cage or Taylor Swift; (d) our
transfer result.
Different combinations of style vectors As described in Sec. 6.3.4, we want to search
for $s'$ that is a linear combination of $s_{I_1}$ and $s_{I_2}$, which are both 18-layer style vectors.
Since we constrain the coefficient matrix to be a 0-1 matrix, we heuristically search for
different combinations of the two vectors by splicing certain layers of $s_{I_2}$ and replacing
those of $s_{I_1}$. Fig. 6.14 exhaustively illustrates the regenerated images after we splice the
vectors at different locations, and we can see that the style vector encodes specific attributes
at each layer. For example, the last layer encodes the background style, since whenever
it changes to $s_{I_1}$ or $s_{I_2}$ the image changes to the corresponding background of $I_1$ or $I_2$.
Similarly, the middle layers (layers 8 and 9) can be found to encode the hair color and hat
style. Specific to our task, it can be observed that layer 4 is associated with expressions:
whenever layer 4 is switched to $s_{I_2}$, the regenerated image shows the facial expression
of $I_2$.
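The grid in Fig. 6.14 can be generated with a simple splice routine such as the sketch below, which reuses the 1 x 18 x 512 style tensors from Sec. 6.3.3; the index convention here is 0-based and purely illustrative.

def splice(s1, s2, i, j):
    # Replace i consecutive layers of s1, starting at layer j, with those of s2.
    s = s1.clone()
    start = j if j >= 0 else s1.shape[1] - i   # j = -1 means the last i layers
    s[:, start:start + i, :] = s2[:, start:start + i, :]
    return s

# Example grid mirroring Fig. 6.14:
# images = {(i, j): g(splice(s_I1, s_I2, i, j)) for i in (2, 4, 6, 8) for j in (1, 2, 4, -1)}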
Figure 6.14: Top row: two images $I_1$ and $I_2$ as input. Bottom rows: replace $i$ layers
starting from the $j$-th layer of $s_1$ (inferred from $I_1$) with those of $s_2$ (inferred from $I_2$)
and regenerate the image, for $i \in \{2, 4, 6, 8\}$ (rows) and $j \in \{1, 2, 4, -1\}$ (columns).
$j = -1$ (last column) indicates replacing the $i$ layers at the end.
Failure cases We also observe some failure cases in our results. For example, when a
large portion of the face is occluded, the model fails to recover the complete face and
Figure 6.15: Reconstruction results when using different layers L of the VGG network as
the perceptual loss during style vector inference. From left to right: input, L=1, L=3,
L=5, L=9, L=13.
is unable to generate meaningful expressions (Fig. 6.16 left). The output may also look
uncanny due to excessive shadows in the source identity (Fig. 6.16 right).
Figure 6.16: Examples of failure cases. Each set consists of target identity (left), source
expression (middle) and result (right).
6.5 Conclusion
We propose a simple yet effective expression transfer method based on StyleGAN. Our
method can easily be applied to any pair of arbitrary face images to transfer the facial
expression from one to the other. Our approach not only generates compelling results
but is also highly scalable and fully automatic. As future work, we are interested in
extending our framework to incorporate head pose reenactment to generate more realistic
video results. It would also lead to exciting applications if the time efficiency could be
improved such that expression transfer runs in real time.
Figure 6.17: Examples of visual results. Each set consists of identity (left), expression
(middle) and result (right). Test images from top to bottom: CelebA-HQ images, FFHQ
images, random web images. Note that the StyleGAN model is only trained on the FFHQ
dataset.
Chapter 7
Conclusion and Future Work
7.1 Summary of the Research
In this dissertation, we developed several image translation techniques based on deep
neural networks and showed their applications in different computer vision tasks.
Depending on the availability of training data, an image translation task can be either
unsupervised, meaning no paired training data is available, or supervised, meaning we have
access to paired training data. A more specific category of supervised image translation
is self-supervised image translation, where we use the original image itself as the
supervision signal. Image translation belongs to the broader category of generative modeling
problems, where we try to estimate the density function from an extensive collection of
training data. Moreover, image translation is a conditional generative modeling problem,
in that the model takes an image as input and its output is conditioned on the input image
rather than on random noise.
In Chapters 2 and 3, we developed two state-of-the-art techniques for image inpainting.
First, we proposed an image inpainting framework that combines deep neural networks
with traditional patch-based algorithms and achieves high-quality inpainting results on
high-resolution images. Our observation is that patch-based algorithms can be adapted
and applied to neural features to improve texture quality. Furthermore, to reduce the
running time of iterative texture optimization, we developed a feed-forward neural
network that combines progressive training with adversarial loss annealing. Our model has
a large capacity and can model high-resolution natural image statistics. The inference
is done in a feed-forward pass, which runs in real time. The compromise is that we
observe a few artifacts and noise in the output compared with the patch-based framework,
especially on large images.
In Chapter 4, we proposed a novel unsupervised image translation approach that is
extremely simple yet achieves outstanding performance. The translation can take
place between two arbitrary domains of data, such as horse and zebra, or MNIST and
USPS. Our observation is that even though no supervision signal is available,
we can use the original image itself as the supervision by employing proper layers
of a pre-trained VGG network as perceptual losses. Furthermore, we observe that image
translation is usually coupled with attending to specific regions, which can be directly
modeled with an attention module. We extended the original translation module with
an additional attention branch and let it learn the attention mask in an unsupervised
fashion. We observe that by training the attention and translation modules jointly, we
can learn image segmentation simultaneously with translation, which in turn
significantly improves the final results.
In Chapters 5 and 6, we studied the problem of expression and body pose transfer. We
first proposed a multi-view learning approach that is suitable for expression/body pose
transfer across multiple identities without the need for correspondences during training.
Our observation is that, although training a single conditional variational autoencoder
(CVAE) can achieve partial disentanglement of different latent factors, there exists
tremendous difficulty due to the conflict between reconstruction quality and disentanglement.
However, by employing simpler data views such as keypoints or facial landmarks and
training multiple CVAEs at the same time with a latent-consistency loss, we significantly
improve the final results. To further enhance the quality of the output and also to scale
the model to unseen identities, we proposed another disentanglement and facial expression
transfer technique based on StyleGAN. Our observation is that StyleGAN learns
hierarchical features and the trained model can be used to iteratively infer and disentangle
latent factors. We can achieve unconstrained facial expression transfer using this approach,
which does not require re-training the model for new identities and can achieve
photorealistic results at extremely high resolution. Our approach can be used to generate
DeepFake images at scale, which can be further applied to improve the robustness of
classification models against adversarial examples.
7.2 Future Research Directions
Given the importance of image translation as a research topic and the several generative
modeling techniques we developed, there are multiple future research directions we would
like to pursue:
Better-informed Expression Transfer Current expression transfer based on
StyleGAN relies on the model's ability to disentangle hierarchical features.
However, we notice that the disentanglement is far from perfect, where
the "expression" layers could potentially encode appearance information as well.
Therefore it would be interesting to investigate a more robust and transparent
scheme for learning disentangled attributes. A possible way of enforcing explicit
expression transfer is to use partial semantic segmentations as input at each layer
so that it only sees part of the image.
More Generalized Inpainting Image inpainting is trained in a self-supervised
fashion such that the missing patterns are randomly generated at training and test
time, and the mask of holes is given as an additional input to indicate the missing
locations and shapes. However, we often do not have such information readily
available in real life. For example, in the case of image denoising or unwanted
pattern removal, the patterns may be random and may have a different distribution at
test time. We aim to solve the problem by learning a more generic image inpainting
framework that can detect and handle different types of inpainting regions. A
possible solution is a two-stage approach which first identifies the area to
be inpainted, probably using a recurrent and attention mechanism, and then learns
to inpaint the missing areas.
Natural Image Manifold and Adversarial Attacks With the advances of deep
generative models, we can create higher-quality and photorealistic images, which
poses severe threats of misinformation and fake news. Therefore it becomes crucial
to learn to detect fake images among real images. However, training a naive
classifier is suboptimal, as the distribution of the counterfeit images varies over
time, and it is easy to come up with new attacks once old attack techniques are
exposed. In this scenario, we hope to find a tight boundary that covers
the natural image manifold, which could lead to a generalized approach to handle
different kinds of fake images. Some preliminary results show that the manifold
generated by StyleGAN fake images gives a comprehensive and tight constraint
over the real natural images, and the decision boundary learned is robust even to
images generated by other kinds of GAN models.
Bibliography
[1] https://research.adobe.com/project/content-aware-fill.
[2] Cmu panoptic range of motion. http://domedb.perception.cs.cmu.
edu/range_of_motion.html. Accessed: 2018-11-14.
[3] Deepfacelab. https://github.com/iperov/DeepFaceLab. Accessed:
2019-03-14.
[4] H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, and M. Marchand. Domain-
adversarial neural networks. arXiv preprint arXiv:1412.4446, 2014.
[5] A. Almahairi, S. Rajeswar, A. Sordoni, P. Bachman, and A. Courville. Aug-
mented cyclegan: Learning many-to-many mappings from unpaired data. arXiv
preprint arXiv:1802.10151, 2018.
[6] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint
arXiv:1701.07875, 2017.
[7] H. Averbuch-Elor, D. Cohen-Or, J. Kopf, and M. F. Cohen. Bringing portraits to
life. ACM Transactions on Graphics (TOG), 36(6):196, 2017.
[8] D. Bahdanau, K. Cho, and Y . Bengio. Neural machine translation by jointly
learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[9] C. Barnes, E. Shechtman, A. Finkelstein, and D. Goldman. PatchMatch:
A randomized correspondence algorithm for structural image editing. TOG,
28(3):24:1–24:11, 2009.
[10] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. Patchmatch: A
randomized correspondence algorithm for structural image editing. ACM Trans-
actions on Graphics-TOG, 28(3):24, 2009.
[11] M. Bertalmio, G. Sapiro, V . Caselles, and C. Ballester. Image inpainting. In
Proceedings of the 27th annual conference on Computer graphics and interactive
techniques, pages 417–424. ACM Press/Addison-Wesley Publishing Co., 2000.
[12] M. Bertalmio, L. Vese, G. Sapiro, and S. Osher. Simultaneous structure and
texture image inpainting. IEEE transactions on image processing, 12(8):882–
889, 2003.
[13] D. Berthelot, T. Schumm, and L. Metz. Began: Boundary equilibrium generative
adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
[14] V . Blanz, S. Romdhani, and T. Vetter. Face identification across different poses
and illuminations with a 3d morphable model. In Automatic Face and Gesture
Recognition, 2002. Proceedings. Fifth IEEE International Conference on, pages
202–207. IEEE, 2002.
[15] V . Blanz, K. Scherbaum, T. Vetter, and H.-P. Seidel. Exchanging faces in images.
In Computer Graphics Forum, volume 23, pages 669–676. Wiley Online Library,
2004.
[16] V . Blanz, T. Vetter, et al. A morphable model for the synthesis of 3d faces. In
Siggraph, volume 99, pages 187–194, 1999.
[17] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsuper-
vised pixel-level domain adaptation with generative adversarial networks. In
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol-
ume 1, page 7, 2017.
[18] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain
separation networks. In Advances in Neural Information Processing Systems,
pages 343–351, 2016.
[19] A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity
natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[20] C. Cao, D. Bradley, K. Zhou, and T. Beeler. Real-time high-fidelity facial perfor-
mance capture. ACM Transactions on Graphics (ToG), 34(4):46, 2015.
[21] C. Cao, Y . Weng, S. Zhou, Y . Tong, and K. Zhou. Facewarehouse: A 3d facial
expression database for visual computing. IEEE Transactions on Visualization
and Computer Graphics, 20(3):413–425, 2014.
[22] C. Cao, H. Wu, Y . Weng, T. Shao, and K. Zhou. Real-time facial animation with
image-based dynamic avatars. ACM Transactions on Graphics, 35(4), 2016.
[23] Z. Cao, T. Simon, S.-E. Wei, and Y . Sheikh. Realtime multi-person 2d pose
estimation using part affinity fields. arXiv preprint arXiv:1611.08050, 2016.
[24] R. Caseiro, J. F. Henriques, P. Martins, and J. Batista. Beyond the shortest path:
Unsupervised domain adaptation by sampling subspaces along the spline flow. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 3846–3854, 2015.
[25] A. J. Champandard. Semantic style transfer and turning two-bit doodles into fine
artwork. In arXiv:1603.01768v1, 2016.
[26] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille. Semantic image
segmentation with deep convolutional nets and fully connected crfs. In ICLR,
2015.
[27] Q. Chen and V . Koltun. Photographic image synthesis with cascaded refinement
networks. In The IEEE International Conference on Computer Vision (ICCV),
volume 1, 2017.
[28] X. Chen, Y . Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Info-
gan: Interpretable representation learning by information maximizing generative
adversarial nets. In Advances in neural information processing systems, pages
2172–2180, 2016.
[29] Y . Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified gen-
erative adversarial networks for multi-domain image-to-image translation. arXiv
preprint, 1711, 2017.
[30] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y . Bengio. Attention-
based models for speech recognition. In Advances in neural information process-
ing systems, pages 577–585, 2015.
[31] P. Christiano, Z. Shah, I. Mordatch, J. Schneider, T. Blackwell, J. Tobin,
P. Abbeel, and W. Zaremba. Transfer from simulation to real world through
learning deep inverse dynamics model. arXiv preprint arXiv:1610.03518, 2016.
[32] D. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network
learning by exponential linear units (ELUs). In ICLR, 2016.
[33] M. Cordts, M. Omran, S. Ramos, T. Scharwächter, M. Enzweiler, R. Benenson,
U. Franke, S. Roth, and B. Schiele. The cityscapes dataset. In CVPR Workshop
on the Future of Datasets in Vision, volume 1, page 3, 2015.
[34] A. Criminisi, P. Pérez, and K. Toyama. Object removal by exemplar-based
inpainting. In IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), volume 2, pages II–721 – II–728 vol.2, 2003.
[35] K. Dale, K. Sunkavalli, M. K. Johnson, D. Vlasic, W. Matusik, and H. Pfister.
Video face replacement. In ACM Transactions on Graphics (TOG), volume 30,
page 130. ACM, 2011.
[36] J. S. Denker, W. Gardner, H. P. Graf, D. Henderson, R. Howard, W. Hubbard,
L. D. Jackel, H. S. Baird, and I. Guyon. Neural network recognizer for hand-
written zip code digits. In Advances in neural information processing systems,
pages 323–331, 1989.
[37] E. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models
using a laplacian pyramid of adversarial networks. In arXiv:1506.05751v1, 2015.
[38] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using
a laplacian pyramid of adversarial networks. In Advances in neural information
processing systems, pages 1486–1494, 2015.
[39] E. L. Denton et al. Unsupervised learning of disentangled representations from
video. In Advances in Neural Information Processing Systems, pages 4414–4423,
2017.
[40] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What makes paris look
like paris? TOG, 31(4), 2012.
[41] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network
for image super-resolution. In European Conference on Computer Vision, pages
184–199. Springer, 2014.
[42] A. Dosovitskiy and T. Brox. Inverting visual representations with convolutional
networks. In arXiv:1602.02644v1, 2015.
[43] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics
based on deep networks. In arXiv:1602.02644v1, 2016.
[44] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics
based on deep networks. In Advances in Neural Information Processing Systems,
pages 658–666, 2016.
[45] A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with
convolutional neural networks. In CVPR, pages 1538–1546, 2015.
[46] A. Dosovitskiy, J. Tobias Springenberg, and T. Brox. Learning to generate chairs
with convolutional neural networks. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 1538–1546, 2015.
[47] I. Drori, D. Cohen-Or, and H. Yeshurun. Fragment-based image completion.
TOG, 22(3):303–312, 2003.
[48] V . Dumoulin and F. Visin. A guide to convolution arithmetic for deep learning.
arXiv preprint arXiv:1603.07285, 2016.
[49] I. Durugkar, I. Gemp, and S. Mahadevan. Generative multi-adversarial networks.
arXiv preprint arXiv:1611.01673, 2016.
[50] A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer.
In ACM SIGGRAPH, pages 341–346, 2001.
[51] A. A. Efros and T. K. Leung. Texture synthesis by nonparametric sampling. In
ICCV, pages 1033–1038, 1999.
[52] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels
with a common multi-scale convolutional architecture. In Proceedings of the
IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
[53] M. Elad, J.-L. Starck, P. Querre, and D. L. Donoho. Simultaneous cartoon and
texture image inpainting using morphological component analysis (mca). Applied
and Computational Harmonic Analysis, 19(3):340–358, 2005.
[54] P. Esser, E. Sutter, and B. Ommer. A variational u-net for conditional appearance
and shape generation.
[55] G. Fyffe, A. Jones, O. Alexander, R. Ichikari, and P. Debevec. Driving high-
resolution facial scans with video performance capture. ACM Transactions on
Graphics (TOG), 34(1):8, 2014.
[56] Y . Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette,
M. Marchand, and V . Lempitsky. Domain-adversarial training of neural networks.
The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
[57] K. Garg and S. K. Nayar. Photorealistic rendering of rain streaks. In ACM Trans-
actions on Graphics (TOG), volume 25, pages 996–1002. ACM, 2006.
[58] P. Garrido, L. Valgaerts, O. Rehmsen, T. Thormahlen, P. Perez, and C. Theobalt.
Automatic face reenactment. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pages 4217–4224, 2014.
[59] L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional
neural networks. In Advances in Neural Information Processing Systems, pages
262–270, 2015.
[60] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. In
arXiv:1508.06576v2, 2015.
[61] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style.
arXiv preprint arXiv:1508.06576, 2015.
[62] L. A. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis and the controlled
generation of natural stimuli using convolutional neural networks. In NIPS, 2015.
[63] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional
neural networks. In Computer Vision and Pattern Recognition (CVPR), 2016
IEEE Conference on, pages 2414–2423. IEEE, 2016.
[64] X. Glorot and Y . Bengio. Understanding the difficulty of training deep feedfor-
ward neural networks. In Proceedings of the thirteenth international conference
on artificial intelligence and statistics, pages 249–256, 2010.
[65] A. Gonzalez-Garcia, J. van de Weijer, and Y . Bengio. Image-to-image translation
for cross-domain disentanglement. arXiv preprint arXiv:1805.09730, 2018.
[66] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, and Y . Bengio. Generative adversarial nets. In Advances in neural
information processing systems, pages 2672–2680, 2014.
[67] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, and Y . Bengio. Generative adversarial nets. In NIPS, pages 2672–
2680, 2014.
[68] K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra. DRAW: A
recurrent neural network for image generation. In arXiv:1511.08446v2, 2015.
[69] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A
kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–
773, 2012.
[70] I. Gulrajani, F. Ahmed, M. Arjovsky, V . Dumoulin, and A. C. Courville.
Improved training of wasserstein gans. In Advances in Neural Information Pro-
cessing Systems, pages 5767–5777, 2017.
[71] J. Hays and A. A. Efros. Scene completion using millions of photographs. TOG,
26(3), 2007.
[72] J. Hays and A. A. Efros. Scene completion using millions of photographs. In
ACM Transactions on Graphics (TOG), volume 26, page 4. ACM, 2007.
[73] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Computer Vision
(ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
[74] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing
human-level performance on imagenet classification. In Proceedings of the IEEE
international conference on computer vision, pages 1026–1034, 2015.
[75] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recog-
nition. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 770–778, 2016.
[76] K. He, X. Y . Zhang, S. Ren, and J. Sun. Deep residual learning for image recog-
nition. In arXiv:1409.1556v6, 2016.
[77] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin. Image
analogies. In Proceedings of the 28th annual conference on Computer graphics
and interactive techniques, pages 327–340. ACM, 2001.
[78] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed,
and A. Lerchner. beta-vae: Learning basic visual concepts with a constrained
variational framework. 2016.
[79] F. J. Huang, Y .-L. Boureau, Y . LeCun, et al. Unsupervised learning of invariant
feature hierarchies with applications to object recognition. In Computer Vision
and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pages 1–8. IEEE,
2007.
[80] X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive
instance normalization. In Proceedings of the IEEE International Conference on
Computer Vision, pages 1501–1510, 2017.
[81] X. Huang, M.-Y . Liu, S. Belongie, and J. Kautz. Multimodal unsupervised image-
to-image translation. arXiv preprint arXiv:1804.04732, 2018.
[82] A. E. Ichim, S. Bouaziz, and M. Pauly. Dynamic 3d avatar creation from hand-
held video input. ACM Transactions on Graphics (ToG), 34(4):45, 2015.
[83] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image
completion. ACM Transactions on Graphics (TOG), 36(4):107, 2017.
[84] D. Im, C. Kim, H. Jiang, and R. Memisevic. Generating images with recurrent
adversarial networks. In arXiv:1602.05110v1, 2016.
[85] P. Isola, J.-Y . Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with
conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
[86] P. Isola, J.-Y . Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with
conditional adversarial networks. In Proceedings of the IEEE conference on com-
puter vision and pattern recognition, pages 1125–1134, 2017.
[87] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer
and super-resolution. In European conference on computer vision, pages 694–
711. Springer, 2016.
[88] J. Johnson, A. Alahi, and F. Li. Perceptual losses for real-time style transfer and
super-resolution. In arXiv:1603.08155v1, 2016.
[89] M. K. Johnson, K. Dale, S. Avidan, H. Pfister, W. T. Freeman, and W. Matusik.
Cg2real: Improving the realism of computer generated images using a large
collection of photographs. IEEE Transactions on Visualization and Computer
Graphics, 17(9):1273–1285, 2011.
[90] M. Johnson-Roberson, C. Barto, R. Mehta, S. N. Sridhar, K. Rosaen, and
R. Vasudevan. Driving in the matrix: Can virtual worlds replace human-generated
annotations for real world tasks? In Robotics and Automation (ICRA), 2017 IEEE
International Conference on, pages 746–753. IEEE, 2017.
[91] H. Joo, T. Simon, X. Li, H. Liu, L. Tan, L. Gui, S. Banerjee, T. S. Godisart,
B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y . Sheikh. Panoptic studio:
A massively multiview system for social interaction capture. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 2017.
[92] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for
improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[93] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for genera-
tive adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
[94] H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Niessner, P. Pérez, C. Richardt,
M. Zollhöfer, and C. Theobalt. Deep video portraits. ACM Transactions on
Graphics (TOG), 37(4):163, 2018.
[95] J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using
very deep convolutional networks. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
[96] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain
relations with generative adversarial networks. arXiv preprint arXiv:1703.05192,
2017.
[97] D. E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning
Research, 10:1755–1758, 2009.
[98] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
[99] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised
learning with deep generative models. In Advances in Neural Information Pro-
cessing Systems, pages 3581–3589, 2014.
[100] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114, 2013.
[101] N. Komodakis. Image completion using global optimization. In CVPR, pages
442–452, 2006.
[102] N. Komodakis. Image completion using global optimization. In Computer Vision
and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 1,
pages 442–452. IEEE, 2006.
[103] N. Komodakis and G. Tziritas. Image completion using efficient belief propaga-
tion via priority scheduling and dynamic pruning. TIP, 16(11):2649–2661, 2007.
[104] I. Korshunova, W. Shi, J. Dambre, and L. Theis. Fast face-swap using convolu-
tional neural networks. In Proceedings of the IEEE International Conference on
Computer Vision, pages 3677–3685, 2017.
[105] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep
convolutional neural networks. In NIPS, pages 1106–1114, 2012.
[106] J. Kuen, Z. Wang, and G. Wang. Recurrent attentional networks for saliency
detection. In Proceedings of the IEEE Conference on Computer Vision and Pat-
tern Recognition, pages 3668–3677, 2016.
[107] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional
inverse graphics network. In Advances in neural information processing systems,
pages 2539–2547, 2015.
[108] V . Kwatra, I. Essa, A. Bobick, and N. Kwatra. Texture optimization for example-
based synthesis. TOG, 24(3):795–802, 2005.
[109] V. Kwatra, A. Schödl, I. Essa, G. Turk, and A. Bobick. Graphcut textures: Image
and video synthesis using graph cuts. In ACM SIGGRAPH, pages 277–286, 2003.
[110] P.-Y . Laffont, Z. Ren, X. Tao, C. Qian, and J. Hays. Transient attributes for
high-level understanding and editing of outdoor scenes. ACM Transactions on
Graphics (TOG), 33(4):149, 2014.
[111] Y . LeCun, C. Cortes, and C. Burges. Mnist handwritten digit database. AT&T
Labs [Online]. Available: http://yann. lecun. com/exdb/mnist, 2, 2010.
[112] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken,
A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution
using a generative adversarial network. arXiv preprint, 2016.
[113] H.-Y . Lee, H.-Y . Tseng, J.-B. Huang, M. Singh, and M.-H. Yang. Diverse
image-to-image translation via disentangled representations. arXiv preprint
arXiv:1808.00948, 2018.
[114] C. Li and M. Wand. Combining markov random fields and convolutional neural
networks for image synthesis. In arXiv:1601.04589v1, 2016.
[115] C. Li and M. Wand. Combining markov random fields and convolutional neu-
ral networks for image synthesis. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 2479–2486, 2016.
[116] H. Li, L. Trutoiu, K. Olszewski, L. Wei, T. Trutna, P.-L. Hsieh, A. Nicholls, and
C. Ma. Facial performance sensing head-mounted display. ACM Transactions on
Graphics (ToG), 34(4):47, 2015.
[117] Y . Li, S. Liu, J. Yang, and M.-H. Yang. Generative face completion. In Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
volume 1, page 6, 2017.
[118] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár,
and C. L. Zitnick. Microsoft coco: Common objects in context. In European
conference on computer vision, pages 740–755. Springer, 2014.
[119] T. Lindvall. Lectures on the coupling method. Courier Corporation, 2002.
[120] A. H. Liu, Y .-C. Liu, Y .-Y . Yeh, and Y .-C. F. Wang. A unified feature disentangler
for multi-domain image translation and manipulation. In Advances in Neural
Information Processing Systems, pages 2590–2599, 2018.
[121] M.-Y . Liu. Unsupervised Image-to-Image Translation. https://github.
com/mingyuliutw/UNIT, 2017. [Online; accessed 7-May-2018].
[122] M.-Y . Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation
networks. In Advances in Neural Information Processing Systems, pages 700–
708, 2017.
[123] M.-Y . Liu and O. Tuzel. Coupled generative adversarial networks. In Advances
in neural information processing systems, pages 469–477, 2016.
[124] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild.
In Proceedings of the IEEE International Conference on Computer Vision, pages
3730–3738, 2015.
[125] Z. Liu, Y . Shan, and Z. Zhang. Expressive expression mapping with ratio images.
In Proceedings of the 28th annual conference on Computer graphics and inter-
active techniques, pages 271–276. ACM, 2001.
[126] S. Lombardi, J. Saragih, T. Simon, and Y . Sheikh. Deep appearance models for
face rendering. ACM Transactions on Graphics (TOG), 37(4):68, 2018.
[127] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic
segmentation. In CVPR, pages 1–9, 2015.
[128] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic
segmentation. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 3431–3440, 2015.
[129] J. Long, E. Shelhamer, and T. Darrell. Learning deconvolution network for
semantic segmentation. In ICCV, pages 1520–1528, 2015.
[130] M. Long, Y . Cao, J. Wang, and M. I. Jordan. Learning transferable features with
deep adaptation networks. arXiv preprint arXiv:1502.02791, 2015.
[131] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool. Pose guided
person image generation. In Advances in Neural Information Processing Systems,
pages 406–416, 2017.
[132] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine
learning research, 9(Nov):2579–2605, 2008.
[133] A. Mahendran, H. Bilen, J. F. Henriques, and A. Vedaldi. Research-
doom and cocodoom: learning computer vision with games. arXiv preprint
arXiv:1610.02431, 2016.
[134] C. Malleson, J.-C. Bazin, O. Wang, D. Bradley, T. Beeler, A. Hilton, and
A. Sorkine-Hornung. Facedirector: continuous control of facial performance in
video. In Proceedings of the IEEE International Conference on Computer Vision,
pages 3979–3987, 2015.
[135] X. Mao, Q. Li, H. Xie, R. Y . Lau, Z. Wang, and S. P. Smolley. Least squares
generative adversarial networks. arXiv preprint ArXiv:1611.04076, 2016.
[136] X. Mao, Q. Li, H. Xie, R. Y . Lau, Z. Wang, and S. P. Smolley. Least squares
generative adversarial networks. In Computer Vision (ICCV), 2017 IEEE Inter-
national Conference on, pages 2813–2821. IEEE, 2017.
[137] M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y . LeCun.
Disentangling factors of variation in deep representation using adversarial train-
ing. In Advances in Neural Information Processing Systems, pages 5040–5048,
2016.
[138] T. Miyato, T. Kataoka, M. Koyama, and Y . Yoshida. Spectral normalization for
generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
[139] M. Moravčík, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard, T. Davis,
K. Waugh, M. Johanson, and M. Bowling. Deepstack: Expert-level artificial
intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.
[140] K. Nagano, J. Seo, J. Xing, L. Wei, Z. Li, S. Saito, A. Agarwal, J. Fursund, H. Li,
R. Roberts, et al. pagan: real-time avatars using dynamic textures. In SIGGRAPH
Asia 2018 Technical Papers, page 258. ACM, 2018.
[141] A. Odena, V . Dumoulin, and C. Olah. Deconvolution and checkerboard artifacts.
Distill, 2016.
[142] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary
classifier gans. arXiv preprint arXiv:1610.09585, 2016.
[143] K. Olszewski, Z. Li, C. Yang, Y . Zhou, R. Yu, Z. Huang, S. Xiang, S. Saito,
P. Kohli, and H. Li. Realistic dynamic facial textures from a single image using
gans. In Proceedings of the IEEE International Conference on Computer Vision,
pages 5429–5438, 2017.
[144] K. Olszewski, J. J. Lim, S. Saito, and H. Li. High-fidelity facial and speech
animation for vr hmds. ACM Transactions on Graphics (TOG), 35(6):221, 2016.
[145] A. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks.
In arXiv:1601.06759v1, 2016.
[146] T. Park, M.-Y . Liu, T.-C. Wang, and J.-Y . Zhu. Semantic image synthesis with
spatially-adaptive normalization. arXiv preprint arXiv:1903.07291, 2019.
[147] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V . Jawahar. Cats and dogs. In
IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[148] D. Pathak, P. Kr¨ ahenb¨ uhl, J. Donahue, T. Darrell, and A. Efros. Context encoders:
Feature learning by inpainting. In CVPR, 2016.
[149] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context
encoders: Feature learning by inpainting. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 2536–2544, 2016.
[150] P. P´ erez, M. Gangnet, and A. Blake. Poisson image editing. ACM Transactions
on graphics (TOG), 22(3):313–318, 2003.
[151] W. Qiu and A. Yuille. Unrealcv: Connecting computer vision to unreal engine.
In European Conference on Computer Vision, pages 909–916. Springer, 2016.
127
[152] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learn-
ing with deep convolutional generative adversarial networks. arXiv preprint
arXiv:1511.06434, 2015.
[153] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative
adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
[154] E. Reinhard, M. Adhikhmin, B. Gooch, and P. Shirley. Color transfer between
images. IEEE Computer graphics and applications, 21(5):34–41, 2001.
[155] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and
variational inference in deep latent gaussian models. In International Conference
on Machine Learning, volume 2, 2014.
[156] E. Richardson, M. Sela, and R. Kimmel. 3d face reconstruction by learning from
synthetic data. In 2016 Fourth International Conference on 3D Vision (3DV),
pages 460–469. IEEE, 2016.
[157] E. Richardson, M. Sela, R. Or-El, and R. Kimmel. Learning detailed face recon-
struction from a single image. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pages 1259–1268, 2017.
[158] S. R. Richter, V . Vineet, S. Roth, and V . Koltun. Playing for data: Ground truth
from computer games. In European Conference on Computer Vision, pages 102–
118. Springer, 2016.
[159] K. Ridgeway, J. Snell, B. Roads, R. Zemel, and M. Mozer. Learning to generate
images with perceptual similarity metrics. In arXiv:1511.06409v1, 2015.
[160] S. Rifai, Y . Bengio, A. Courville, P. Vincent, and M. Mirza. Disentangling factors
of variation for facial expression recognition. In Computer Vision–ECCV 2012,
pages 808–822. Springer, 2012.
[161] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The syn-
thia dataset: A large collection of synthetic images for semantic segmentation of
urban scenes. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 3234–3243, 2016.
[162] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recogni-
tion challenge. IJCV, 115(3):211–252, 2015.
[163] A. A. Rusu, M. Vecerik, T. Roth¨ orl, N. Heess, R. Pascanu, and R. Hadsell.
Sim-to-real robot learning from pixels with progressive nets. arXiv preprint
arXiv:1610.04286, 2016.
128
[164] T. Salimans, I. Goodfellow, W. Zaremba, V . Cheung, A. Radford, and X. Chen.
Improved techniques for training gans. In Advances in neural information pro-
cessing systems, pages 2234–2242, 2016.
[165] M. Sela, E. Richardson, and R. Kimmel. Unrestricted facial geometry reconstruc-
tion using image-to-image translation. In Proceedings of the IEEE International
Conference on Computer Vision, pages 1576–1585, 2017.
[166] F. Shi, H.-T. Wu, X. Tong, and J. Chai. Automatic acquisition of high-fidelity
facial performances using monocular videos. ACM Transactions on Graphics
(TOG), 33(6):222, 2014.
[167] Y . Shih, S. Paris, F. Durand, and W. T. Freeman. Data-driven hallucination of dif-
ferent times of day from a single outdoor photo. ACM Transactions on Graphics
(TOG), 32(6):200, 2013.
[168] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learn-
ing from simulated and unsupervised images through adversarial training. In
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol-
ume 3, page 6, 2017.
[169] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez,
T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of go without
human knowledge. Nature, 550(7676):354, 2017.
[170] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale
image recognition. In ICLR, 2014.
[171] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.
[172] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using
deep conditional generative models. In Advances in Neural Information Process-
ing Systems, pages 3483–3491, 2015.
[173] B. Sun, J. Feng, and K. Saenko. Return of frustratingly easy domain adaptation.
In AAAI, volume 6, page 8, 2016.
[174] Y . Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial
point detection. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 3476–3483, 2013.
[175] K. Sunkavalli, M. K. Johnson, W. Matusik, and H. Pfister. Multi-scale image
harmonization. In ACM Transactions on Graphics (TOG), volume 29, page 125.
ACM, 2010.
129
[176] S. Suwajanakorn, I. Kemelmacher-Shlizerman, and S. M. Seitz. Total moving
face reconstruction. In European conference on computer vision, pages 796–812.
Springer, 2014.
[177] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman. What makes tom
hanks look like tom hanks. In Proceedings of the IEEE International Conference
on Computer Vision, pages 3952–3960, 2015.
[178] A. Szab´ o, Q. Hu, T. Portenier, M. Zwicker, and P. Favaro. Challenges in dis-
entangling independent factors of variation. arXiv preprint arXiv:1711.02245,
2017.
[179] C. Szegedy, W. Liu, Y . Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V . Van-
houcke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages
1–9, 2015.
[180] C. Szegedy, W. Liu, Y . Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V . Van-
houcke, and A. Rabinovich. Very deep convolutional networks for large-scale
image recognition. In arXiv:1409.1556v6, 2015.
[181] M. W. Tao, M. K. Johnson, and S. Paris. Error-tolerant image compositing. In
European Conference on Computer Vision, pages 31–44. Springer, 2010.
[182] J. B. Tenenbaum and W. T. Freeman. Separating style and content. In Advances
in neural information processing systems, pages 662–668, 1997.
[183] J. B. Tenenbaum and W. T. Freeman. Separating style and content with bilinear
models. Neural computation, 12(6):1247–1283, 2000.
[184] A. Tewari, M. Zollhofer, H. Kim, P. Garrido, F. Bernard, P. Perez, and
C. Theobalt. Mofa: Model-based deep convolutional face autoencoder for unsu-
pervised monocular reconstruction. In Proceedings of the IEEE International
Conference on Computer Vision, pages 1274–1283, 2017.
[185] L. Theis and M. Bethge. Generative image modeling using spatial LSTMs. In
arXiv:1603.03417v1, 2015.
[186] J. Thies, M. Zollh¨ ofer, M. Nießner, L. Valgaerts, M. Stamminger, and
C. Theobalt. Real-time expression transfer for facial reenactment. ACM Trans.
Graph., 34(6):183–1, 2015.
[187] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner. Face2face:
Real-time face capture and reenactment of rgb videos. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pages 2387–
2395, 2016.
130
[188] J. Thies, M. Zollh¨ ofer, M. Stamminger, C. Theobalt, and M. Nießner. Facevr:
Real-time facial reenactment and eye gaze control in virtual reality. arXiv preprint
arXiv:1610.03151, 2016.
[189] A. T. Tran, T. Hassner, I. Masi, and G. Medioni. Regressing robust and discrim-
inative 3d morphable models with a very deep neural network. In 2017 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pages 1493–
1502. IEEE, 2017.
[190] Y .-H. Tsai, X. Shen, Z. Lin, K. Sunkavalli, X. Lu, and M.-H. Yang. Deep image
harmonization. In IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2017.
[191] Y .-H. Tsai, X. Shen, Z. Lin, K. Sunkavalli, and M.-H. Yang. Sky is not the limit:
semantic-aware sky replacement. ACM Trans. Graph., 35(4):149–1, 2016.
[192] A. Tuan Tran, T. Hassner, I. Masi, and G. Medioni. Regressing robust and dis-
criminative 3d morphable models with a very deep neural network. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
5163–5172, 2017.
[193] S. Tulyakov, M.-Y . Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion
and content for video generation. arXiv preprint arXiv:1707.04993, 2017.
[194] E. Tzeng, C. Devin, J. Hoffman, C. Finn, X. Peng, S. Levine, K. Saenko, and
T. Darrell. Towards adapting deep visuomotor representations from simulated to
real environments. CoRR, abs/1511.07111, 2015.
[195] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer
across domains and tasks. In Computer Vision (ICCV), 2015 IEEE International
Conference on, pages 4068–4076. IEEE, 2015.
[196] D. Ulyanov, V . Lebedev, A. Vedaldi, and V . Lempitsky. Texture networks: Feed-
forward synthesis of textures and stylized images. In arXiv:1603.03417v1, 2016.
[197] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and
content for natural video sequence prediction. arXiv preprint arXiv:1706.08033,
2017.
[198] P. Vincent, H. Larochelle, Y . Bengio, and P.-A. Manzagol. Extracting and com-
posing robust features with denoising autoencoders. In Proceedings of the 25th
international conference on Machine learning, pages 1096–1103. ACM, 2008.
[199] T.-C. Wang, M.-Y . Liu, J.-Y . Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-
resolution image synthesis and semantic manipulation with conditional gans.
arXiv preprint arXiv:1711.11585, 2017.
131
[200] T.-C. Wang, M.-Y . Liu, J.-Y . Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-
resolution image synthesis and semantic manipulation with conditional gans. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni-
tion, 2018.
[201] X. Wang and A. Gupta. Generative image modeling using style and structure
adversarial networks. In European Conference on Computer Vision, pages 318–
335. Springer, 2016.
[202] Y . Wexler, E. Shechtman, and M. Irani. Space-time video completion. In CVPR,
pages 120–127, 2001.
[203] Y . Wexler, E. Shechtman, and M. Irani. Space-time video completion. In Com-
puter Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004
IEEE Computer Society Conference on, volume 1, pages I–I. IEEE, 2004.
[204] M. Wilczkowiak, G. Brostow, B. Tordoff, and R. Cipolla. Hole fill through pho-
tomontage. In BMVC, pages 492–501, 2005.
[205] M. Wilczkowiak, G. J. Brostow, B. Tordoff, and R. Cipolla. Hole filling through
photomontage. In BMVC 2005-Proceedings of the British Machine Vision Con-
ference 2005, 2005.
[206] C. Wu, D. Bradley, M. Gross, and T. Beeler. An anatomically-constrained local
deformation model for monocular face capture. ACM Transactions on Graphics
(TOG), 35(4):115, 2016.
[207] S. Xie and Z. Tu. Holistically-nested edge detection. In Proceedings of the IEEE
international conference on computer vision, pages 1395–1403, 2015.
[208] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial
attention for visual question answering. In European Conference on Computer
Vision, pages 451–466. Springer, 2016.
[209] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and
Y . Bengio. Show, attend and tell: Neural image caption generation with visual
attention. In International conference on machine learning, pages 2048–2057,
2015.
[210] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. Attngan:
Fine-grained text to image generation with attentional generative adversarial net-
works. arXiv preprint, 2017.
[211] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image
generation from visual attributes. In European Conference on Computer Vision,
pages 776–791. Springer, 2016.
132
[212] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. High-resolution image
inpainting using multi-scale neural patch synthesis. In The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), volume 1, page 3, 2017.
[213] R. Yeh, Z. Liu, D. B. Goldman, and A. Agarwala. Semantic facial expression
editing using autoencoded flow. arXiv preprint arXiv:1611.09961, 2016.
[214] Z. Yi, H. R. Zhang, P. Tan, and M. Gong. Dualgan: Unsupervised dual learning
for image-to-image translation. In ICCV, pages 2868–2876, 2017.
[215] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon. Pixel-level domain transfer.
In European Conference on Computer Vision, pages 517–532. Springer, 2016.
[216] F. Yu and V . Koltun. Multi-scale context aggregation by dilated convolutions.
arXiv preprint arXiv:1511.07122, 2015.
[217] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. Stack-
gan: Text to photo-realistic image synthesis with stacked generative adversarial
networks. arXiv preprint arXiv:1612.03242, 2016.
[218] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In European
Conference on Computer Vision, pages 649–666. Springer, 2016.
[219] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unrea-
sonable effectiveness of deep features as a perceptual metric. arXiv preprint
arXiv:1801.03924, 2018.
[220] H. Zhao, O. Gallo, I. Frosio, and J. Kautz. Isl
2
a good loss function for neural
networks for image processing? In arXiv:1511.08861v1, 2015.
[221] J. Zhao, M. Mathieu, and Y . LeCun. Energy-based generative adversarial net-
work. arXiv preprint arXiv:1609.03126, 2016.
[222] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva. Places: An image
database for deep scene understanding. arXiv preprint arXiv:1610.02055, 2016.
[223] J.-Y . Zhu, P. Kr¨ ahenb¨ uhl, E. Shechtman, and A. A. Efros. Learning a discrim-
inative model for the perception of realism in composite images. In Computer
Vision (ICCV), 2015 IEEE International Conference on, 2015.
[224] J.-Y . Zhu, P. Kr¨ ahenb¨ uhl, E. Shechtman, and A. A. Efros. Generative visual
manipulation on the natural image manifold. In European Conference on Com-
puter Vision, pages 597–613. Springer, 2016.
[225] J.-Y . Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image trans-
lation using cycle-consistent adversarial networks. In Proceedings of the IEEE
international conference on computer vision, pages 2223–2232, 2017.
133
[226] J.-Y . Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shecht-
man. Toward multimodal image-to-image translation. In Advances in Neural
Information Processing Systems, pages 465–476, 2017.
[227] Y . Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi.
Target-driven visual navigation in indoor scenes using deep reinforcement learn-
ing. In Robotics and Automation (ICRA), 2017 IEEE International Conference
on, pages 3357–3364. IEEE, 2017.
[228] X. W. Ziwei Liu, Ping Luo and X. Tang. Deep learning face attributes in the wild.
In Proceedings of International Conference on Computer Vision (ICCV), 2015.
134
Abstract
In this thesis, we tackle the problem of translating faces and bodies between different identities without paired training data, a setting in which a translation module cannot be trained directly with supervised signals. Instead, we propose to train a conditional variational auto-encoder (CVAE) to disentangle latent factors such as identity and expression. To achieve effective disentanglement, we further use multi-view information such as keypoints and facial landmarks to train multiple CVAEs; these simplified representations of the data are easier to disentangle and therefore guide the disentanglement of the images themselves. Experiments demonstrate the effectiveness of our method on multiple face and body datasets. We also show that our model is a more robust image classifier and adversarial-example detector compared with traditional multi-class neural networks.

To scale to new identities and to generate higher-quality results, we further propose an alternative approach that uses self-supervised learning based on StyleGAN to factorize out different attributes of face images, such as hair color, facial expression, and skin color. Using a pre-trained StyleGAN combined with iterative style inference, we can manipulate the facial expression of any person, or combine the facial expressions of any two people, without training a new model for each identity involved. This is one of the first scalable, high-quality approaches to generating DeepFake data, which serves as a critical first step toward learning more robust and general classifiers against adversarial examples.
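To make the first approach concrete, the following is a minimal conditional VAE sketch in PyTorch. It only illustrates the general technique summarized above: the fully connected layer sizes, the one-hot identity condition, and the equal weighting of the reconstruction and KL terms are assumptions made for the example, not the architecture or hyperparameters used in the thesis.

```python
# Minimal conditional VAE sketch (illustrative only; hypothetical layer sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalVAE(nn.Module):
    def __init__(self, image_dim=64 * 64 * 3, cond_dim=10, latent_dim=128):
        super().__init__()
        # The encoder sees the image together with the condition (e.g. an
        # identity label), so the latent code can omit that information.
        self.encoder = nn.Sequential(
            nn.Linear(image_dim + cond_dim, 512), nn.ReLU(),
            nn.Linear(512, 2 * latent_dim),  # outputs mean and log-variance
        )
        # The decoder reconstructs the image from the latent code plus the condition.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 512), nn.ReLU(),
            nn.Linear(512, image_dim), nn.Sigmoid(),
        )

    def forward(self, x, c):
        h = self.encoder(torch.cat([x, c], dim=1))
        mu, logvar = h.chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        recon = self.decoder(torch.cat([z, c], dim=1))
        return recon, mu, logvar

def cvae_loss(recon, x, mu, logvar):
    # Standard ELBO: reconstruction term plus KL divergence to a unit Gaussian.
    recon_loss = F.binary_cross_entropy(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl

# Toy usage with random tensors standing in for face images and identity labels.
model = ConditionalVAE()
x = torch.rand(8, 64 * 64 * 3)                           # flattened images in [0, 1]
c = F.one_hot(torch.randint(0, 10, (8,)), 10).float()    # identity condition
recon, mu, logvar = model(x, c)
loss = cvae_loss(recon, x, mu, logvar)
loss.backward()
```

In this toy setup, identity translation would amount to encoding an image and decoding the resulting latent code with a different identity condition; the thesis builds on this basic scheme by training additional CVAEs on keypoints and facial landmarks to guide the disentanglement of the image model.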