DESIGNING NEURAL NETWORKS FROM THE PERSPECTIVE OF SPATIAL REASONING
by
Haiwei Chen
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2024
Copyright 2025 Haiwei Chen
Dedication
To Mom, Dad, Mimmy and Momo.
Acknowledgements
First of all, I would like to thank my advisors Prof. Yajie Zhao, Prof. Randall Hill, Prof. Hao Li and Prof.
Evan Suma for their academic guidance and insights, which have always played a critical role in my growth
as a researcher, and for their continuous support in letting me freely explore the research topics that interest me. I
would also like to thank all members of my thesis committee: Prof. Ram Nevatia and Prof. Andrew Nealen
for their insightful suggestions, and Prof. Andrew Gordon, Prof. Aiichiro Nakano and Prof. Ramesh
Govindan for joining my qualification exam committee. I would also like to thank Kathleen Hasse and
Christian Trejo for their very supportive administration and their always timely response to my needs for
help and advice.
It has been an honor for me to work with many talented researchers: Shichen Liu, Weikai Chen,
Tianye Li, Gonglin Chen, Jiayi Liu, Yunxuan Cai, and Samantha Chen. Many of them not only inspired me
as collaborators, with their insightful suggestions on research problems, but also encouraged me, as true
friends, to tackle the many challenges that I have faced in my research career. I would like to extend my
gratitude to Bo Yang, Jian Wang, Guru Krishman and Sizhuo Ma for the excellent guidance and support
during my internship at Tencent Games and Snap Research. I would like to take this opportunity to thank
my undergraduate mentors, Prof. Henry Fuchs, Prof. Martin Styner, Prof. Arlene Chung, Prof. Hye-Chung Kum and Prof. Gary Bishop, who introduced me to conducting research in computer science and
encouraged me to pursue a PhD degree.
I would like to acknowledge my lab mates at USC ICT for their company during the fun time and the
hardest time in the past few years: Zeng Huang, Zimo Li, Jun Xing, Shunsuke Saito, Xinglei Ren, Bipin
Kishore, Pratusha Prasad, Marcel Ramos, Yi Zhou, Ruilong Li, Yuliang Xiu, Mingming He, Pengda Xiang,
Sitao Xiang, Jing Yang, Yuming Gu, Hanyuan Xiao, Bryce Blinn, Ziqi Zeng and Wenbin Teng.
Finally, I would like to thank my father, mother, and my family for their unwavering support and
unconditional love.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.1 Designing equivariant neural operator in the SE(3) space . . . . . . . . . . . . . . . 5
1.1.2 Designing a generative implicit network for texture synthesis . . . . . . . . . . . . 6
1.1.3 Restricting the receptive field for generative image inpainting . . . . . . . . . . . . 6
Chapter 2: Designing Equivariant Neural Operator in the SE(3) space. . . . . . . . . . . . . . . . . 8
2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Deriving the SE(3) convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 SE(3) Separable Convolution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 SE(3) point convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.2 SE(3) group convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.3 Complexity analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.4 Proof of Equivariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Shape Matching with Attention mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6.1 Experiments on Rotated ModelNet40 . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6.2 Shape Alignment on 3DMatch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6.3 Inference speed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Chapter 3: Designing a Generative Implicit Network for Texture Synthesis. . . . . . . . . . . . . . 33
3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.1 Periodic Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.2 Latent Random Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.3 Conditional IPFN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.4 Network Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.1 Texture Pattern Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.2 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.3 Volumetric Shape Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.4 Application: Seamless 3D Texturing . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.5 Application: 3D Foam with controllable density . . . . . . . . . . . . . . . . . . . . 48
3.4 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Chapter 4: Restricting the Receptive Field for Generative Image Inpainting. . . . . . . . . . . . . . 54
4.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.1 Encoding with Restrictive Convolutions . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.2 Predicting the latent codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.3 Decoding the latent codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4.1 Comparisons to The State of Arts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4.2 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4.3 Limitations of our model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Chapter 5: Conclusion and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.1 Summary of Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2 Open Questions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
List of Tables
2.1 Angular errors in point cloud pose estimation. . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Results on shape classification and retrieval on randomly rotated objects of ModelNet40. . 27
2.3 Results of ablation studies on ModelNet40 dataset. The conv column denotes the
configuration of convolution layers. The global pool column denotes the type of global
pooling method. Loss configuration follows notation from Sec. 2.4. . . . . . . . . . . . . . 28
2.4 Comparisons of average recall of keypoint correspondences on 3DMatch. All baseline
results are tested on the official 3DMatch evaluation set without point normals. . . . . . . 29
3.1 SIFID scores between the exemplars and the generated patterns from ours and different
baselines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Comparisons of inference time and inference memory consumption, measured in
milliseconds (ms) / gigabytes (GB), when patterns of increasing size (top row) are
generated. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1 Comparisons of FID and diversity scores to the baseline methods. Bold text denotes the
best, and blue text denotes the second. Since LaMa [107] does not generate pluralistic
results, and Pluralistic [140] produces degenerate results in the Places Box setting, we
omit their diversity scores in the table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2 Quantitative ablation study. “Temperature” adjustments change the temperature value
in the sampling procedure. “Restrictive Conv” adjustments change the mask update rule
in the restrictive encoder. “Network Design” adjustments replace our designed network
structures with the vanilla ones: for the “Vanilla Encoder” setting, an encoder network
with the regular convolution layers are used; for the “Vanilla Decoder” setting, the
predicted latent codes are directly decoded into an image. In the “Full Model”, we set
t = 1.0 and α = 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3 Comparisons of FID and diversity scores between the restrictive encoder and the miracle
encoder. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
List of Figures
2.1 Illustration of SPConv. Each arrow represents an element in the group and each edge
represents a correlation needed to compute in the convolution operator. We propose to
use two separable convolutions (b)(c) to achieve SE(3) equivariance. The computational
cost is much lower than the naive 6D convolution (a). (d) shows the structure of a basic
SPConv block. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 An illustration of the network architecture used in both ModelNet and 3DMatch experiments. 23
2.3 Percentile of errors comparing KPConv [110] and two equivariant models (Ours-N) varied
in number of SO(3) elements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Classification accuracy based on the attention confidence for each object category. The
attention layer is trained on rotated dataset to learn a canonical orientation for the given
object. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 T-SNE visualization of features learned by our network. Each column contains a pair of
fragments from the same scene. Regions in correspondence are automatically labeled with
similar features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1 Comparison between synthesized honeycomb from a DCGAN convolution generator and
the periodic MLP generator. Seams and intercepting patterns are visible in the former
result due to difficulty for the convolution generator to capture the repeating structure. . . 36
3.2 Overview of our network architecture discussed in Section 3.2.4. . . . . . . . . . . . . . . . 43
3.3 Synthesized honeycomb textures for the ablation study. The blue boxes represent the
learned scale of periodic encoding in ours, where in w/o deformation, the period is default
to 1, which does not match with the repeating structure of the honeycomb pattern and
results in visual artifact. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Main results for 2D texture synthesis with comparisons to Henzler et al. [43], Bergmann
et al. [6], and Zhou et al. [144] on synthesizing two stationary patterns (top four rows)
and two directional patterns (bottom four rows). . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5 Synthesized foam structures with controllable density. a. The grey scale bar controls
the synthesized structure from the highest density (white) to the lowest (black). b.
Smooth interpolation of the guidance factor allows us to synthesize a foam structure with
smoothly changing densities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.6 Main results for 3D volume synthesis. a. Exemplar porous structure. b. Synthesized
structure models interior tunnels. c. Global views of synthesized porous structures. d.
Exemplar foam structure. e. Two scales of noise fields for the foam structure synthesis. f.
Synthesized foam structures. Larger scale of the noise field leads to more isotropic foam
structures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.7 IPFN learns multi-channel textures that are applicable to seamless 3D texturing. The
original 3D texture in this example is not symmetric and therefore visible seams can be
found on the texture-mapped surface and in the closeup view (A in figure). As synthesized
patterns learnt from this exemplar can be tiled in any direction, the mapped surface (B in
figure) is seamless. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.8 The examples demonstrating limitations of our network . . . . . . . . . . . . . . . . . . . 53
4.1 Overall pipeline of our method. Erst denotes our proposed restrictive encoder that
predicts partial tokens from the source image (see Section 4.2.1). The grey square space
in the figure denotes missing tokens, which are iteratively predicted by a bidirectional
transformer (see Section 4.2.2). Eprt denotes an encoder with partial convolution layers,
which processes the source image into complementary features to the predicted tokens.
The coupled features are decoded into a complete image by a generator G (see Section 4.2.3). 59
4.2 A visualization of mask down-sampling, shown on a 16x16 grid on the third column,
from different α values following Equation 4.4. Smaller α values (top two rows) lead the
restrictive encoder to predict tokens for more small mask areas (marked by the red pixels).
Larger α is undesirable (bottom two rows) as it unnecessarily discards useful information
from the image, leading to more inconsistent inpainting results. . . . . . . . . . . . . . . . 61
4.3 A visual comparison between the decoder designs. A. Directly decoding the predicted
latent codes Z with the restrictive encoder E and transformer T, and B. its composition
with the source image XM. C. Our proposed decoding design, where partial image
priors Eprt(XM) are composed with Z through a composition function f described in
Equation.4.7-4.8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4 Detailed network structures for the encoder and decoder. Numbers within each feature
map (e.g. (3,128)) denote the input and output channels. Numbers below each feature map
(e.g. 256x256) denotes the size of the tensor. . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.5 Visual examples on inpainting with both the random masks (upper half) and the
challenging large box mask (lower half), compared to the selected baseline methods. . . . 71
4.6 Further visual examples of inpainting under the large mask setting, compared to the
baseline methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.7 Further visual examples of pluralistic inpainting on the Places Dataset [142], compared to
the baseline methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.8 Further visual examples of pluralistic inpainting on the Places Dataset [142], compared to
the baseline methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.9 Further visual examples of pluralistic inpainting on the CelebA-HQ Dataset [55], compared
to the baseline methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.10 Comparisons of inpainting results with regard to different sampling temperature t and
annealing factors s. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.11 Further visual examples of pluralistic inpainting with respect to different sampling
temperature t and annealing factor s. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.12 Visual comparison between inpainting with the restrictive encoder and a miracle encoder. 77
4.13 Failure cases in our results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Abstract
All visual data, from images to CAD models, live in a 2D or 3D spatial domain. In order to understand
and model the visual data, spatial reasoning has always been fundamental to computer vision algorithms.
Naturally, the practice has been widely extended to the use of artificial neural networks built for visual
analysis. The basic building blocks of a neural network - operators and representations - are means to
learn spatial relationships and therefore are built with spatial properties. In this thesis, we present novel
designs of neural operators and representations in different application contexts, with a unique focus on
how these design choices affect the spatial properties of the neural networks in ways that are beneficial for
the tasks at hand. The first topic explored is the equivariance property, where an SE(3) equivariant convolutional network is designed for 3D pose estimation and scene registration. In this chapter, we show that the
equivariance property of a convolutional neural network can be practically extended to a higher-dimensional
space and proves highly effective for applications that are sensitive not only to translation, but also to 3D
rotations. The second topic explored is the learning of neural operators that approximate a spatially continuous function in a pattern synthesis application context. In this chapter, we explore the combination of
deformable periodic encoding and a continuous latent space, which enables an implicit network, consisting
of multilayer perceptrons, to synthesize diverse, high-quality and infinitely large 2D and 3D patterns. The
unique formulation allows the generative model to be at least 10 times faster and more memory efficient
compared to previous efforts, and marks one of the earliest attempts to adopt the implicit network to
the generative setting. The third topic explored is spatial awareness with regard to incomplete images,
where a generative network model for image inpainting is designed based on an analysis-after-synthesis
principle. In this model, a novel encoder is designed to restrict the receptive field in the analysis step, and
the extracted features serve as priors to a bidirectional generative transformer that synthesizes latent codes
step by step. This novel paradigm demonstrates the effectiveness of disentangling analysis and synthesis
in challenging image inpainting scenarios, as the resulting network model achieves state-of-the-art performance in both diversity and quality, when completing partial images with free-form holes occupying as
large as 70% of the image.
I believe that the topics covered have contributed to a better understanding of neural operator and representation designs for both discriminative and generative learning in computer vision, from a perspective
of identifying the effective ways of spatial reasoning for the targeted visual applications.
Chapter 1
Introduction
Computer vision is the automated analysis of visual patterns. A fundamental characteristic of this study
is its focus on visual data - images, videos, 3D shapes - that possess spatial structures. The ability to
understand and analyze spatial structures from an observed visual pattern has been considered the key to
many computer vision applications, as it, to some extent, mimics the mechanism of the human perceptual
system. For example, a photograph of a chair can be recognized by humans and machines alike as a chair,
because the legs are found beneath the seat, and the seat is positioned horizontally to support its user. A
variety of visual structures exist in different forms of data representations, from regular grid structures
to sparse point clouds. The discovery of these structures is not only meaningful for applications in its
own right (e.g. pose estimation), but also greatly facilitates the understanding of visual contents (e.g. face
recognition, textures). Therefore, I consider computer vision algorithms as a systematic way of spatial
reasoning: how local regions of visual information are combined and encoded, how spatial relationships at
a global scale are modeled, and how spatial transformations influence these computations.
In the past decades, artificial neural networks have become the best performing methods in the majority
of fields in computer vision. This is thanks to the ability of neural networks to approximate any nonlinear
function. Optimization of the neural networks under sufficient data and the right objectives thus allows
them to learn deep features, which, compared to traditional handcrafted, low-level features, carry much
more complex and semantic meanings. Naturally, the neural networks found to be effective for computer
vision applications are composed of spatial neural operators, with the convolution operator being the most
prominent example. The design principle of the convolutional neural network (CNN), specifically, motivated many ideas that are presented in this thesis. The first inspiration comes from CNN's hierarchical
network design: every convolutional layer uses a localized kernel to compute a neural response (a.k.a.
features) that attends to a windowed area in an image. In-between the layers, the computed local features
are aggregated into more global features, either by the convolutional layers themselves (e.g. strided convolutions), or by pooling layers. Secondly, in comparison to the early fully-connected network model, the
CNN maintains the spatial orders of the visual data through its shift equivariant property. At the same
time, it does so with a much reduced number of parameters to achieve the same learning capacity compared to a fully-connected network [62]. Throughout my studies of the convolutional neural network, I
have identified three interconnected properties that, in my opinion, define the spatial reasoning ability of
CNNs and directly motivate the application topics that will be discussed: kernels, receptive fields, and
spatial equivariance.
The convolutional kernel is a set of learnable weights that act as filter functions on the input signal.
Mathematically, given a group of transformations G, a signal function F and a kernel function h, the
output feature of a discrete convolution is computed as a correlation:
(F ∗ h)(x) = Σ_{g∈G} F(x) h(g⁻¹x)   (1.1)
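As a concrete (and entirely illustrative) instance of this correlation, the sketch below evaluates the familiar special case in which G is the set of N×M translational offsets of a 2D kernel, so the output at each location is a sliding dot product between a signal window and the kernel; the array shapes and names are arbitrary choices for the demo, not taken from the thesis.

```python
import numpy as np

def correlate2d(F, h):
    """Discrete correlation of a 2D signal F with an N x M kernel h.

    Here the group G is the set of N x M translational offsets: the output
    at (x, y) is the dot product between the kernel and the signal values
    at the offsets around (x, y)."""
    N, M = h.shape
    H, W = F.shape
    out = np.zeros((H - N + 1, W - M + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            out[x, y] = np.sum(F[x:x + N, y:y + M] * h)
    return out

F = np.random.rand(8, 8)   # toy input signal
h = np.random.rand(3, 3)   # 3 x 3 kernel, i.e. |G| = 9 offsets
print(correlate2d(F, h).shape)  # (6, 6)
```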
I consider the convolutional kernel as the most fundamental element in the spatial characterization
of CNNs. On a 2D regular grid, for instance, the 2D kernel takes the shape of an N×M block, with (N, M)
known as the kernel size. The shape of the kernel not only defines the structure of the kernel matrix, but
also the group of spatial transformations G that the convolution operates on (in the example, G is the set of
N×M translational offsets). Naturally, many key extensions to the convolutional neural network operate on the
spatial configuration of the kernel, such as the dilated convolution [130], the separable convolution [45],
the partial convolution [74], and the sparse convolution [73]. The aforementioned convolutions are not
only proven effective in their targeted application contexts, but serve as important motivations for topics
discussed in this thesis. Specifically, a special kernel for the SE(3) equivariant convolution will be discussed
in the context of separable and sparse kernels (Chapter 2). Similarly, the kernel design for a pluralistic
image inpainting framework will be discussed in the context of the partial convolution (Chapter 4). The
1x1 convolution kernel used to implement the neural implicit network, which allows a CNN to operate as
a coordinate-wise multi-layer perceptron, will be discussed in Chapter 3.
Receptive field is generally defined as “the portion of the sensory space that can elicit neuronal
responses, when stimulated” (see https://en.wikipedia.org/wiki/Receptive_field), or, in describing a neural network, the subregion of the original image
plane that a neuron has access to. The size of the receptive field is directly related to the kernel -
the larger the kernel size, the larger the receptive fields in the network layers trailing it. Analyzing the
receptive field of a hierarchical neural network helps us understand how well the deeper layers model
global information, and how much the shallower layers focus on learning local features. In this thesis,
several ideas proposed are tightly related to the concept of receptive field: in Chapter 2, we study ways to
expand the receptive field on a non-regular, rotational-equivariant point cloud representation; in Chapter
3, we analyze the unique benefits of minimum receptive field of the implicit network; in Chapter 4, we
consider a unique situation where the receptive field is purposefully limited for the purpose of image
inpainting.
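To make the relationship between kernel sizes, strides, and receptive field concrete, here is a small sketch that applies the standard recurrence r ← r + (k − 1)·j and j ← j·s layer by layer; the three-layer configuration is a made-up example, not one of the networks discussed in this thesis.

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers, each given as (kernel, stride).

    r is the receptive field measured in input pixels; j is the cumulative
    stride ("jump") between neighboring outputs of the current layer."""
    r, j = 1, 1
    for k, s in layers:
        r += (k - 1) * j
        j *= s
    return r

# hypothetical stack: two strided 3x3 convs followed by a plain 3x3 conv
print(receptive_field([(3, 2), (3, 2), (3, 1)]))  # -> 15 input pixels
```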
Equivariance, last but not least, is key to understanding the effectiveness of convolutional neural
networks in many computer vision applications. While the formal definition and proof of equivariance will
be given in Chapter 2, spatial equivariance, loosely speaking, is the property that preserves the structure
of the tensors defined on a certain symmetry group. For example, the translation equivariance in 2D
convolution allows CNNs to maintain a spatial grid that is aligned with the original image. However,
since the operator is not designed to have equivariance to other transformations of higher dimensions
(e.g. rotation), CNNs cannot maintain the pixel orders of a rotated image, thus resulting in difficulty
generalizing to unseen rotations during inference. Although it has been a common practice to use data
augmentation to learn robustness to transformations that the network is not equivariant or invariant to,
we will show in Chapter 2 that achieving natural equivariance to rotational transformations from operator
designs can provide better performance in rotation-related tasks than non-equivariant networks trained
with data augmentation alone.
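A quick numerical check of this point, under the simplifying assumption of circular (wrap-around) boundaries so that shifts are exact: correlating a shifted image equals shifting the correlated image, whereas a 90-degree rotation of the input does not commute with a generic kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.random((8, 8))   # toy image
h = rng.random((3, 3))   # generic kernel

def circular_corr(F, h):
    """Correlation with circular padding, so translations wrap around cleanly."""
    out = np.zeros_like(F)
    for dx in range(h.shape[0]):
        for dy in range(h.shape[1]):
            out += h[dx, dy] * np.roll(F, (-dx, -dy), axis=(0, 1))
    return out

shifted = np.roll(F, (2, 3), axis=(0, 1))
# translation equivariance: conv(shift(F)) == shift(conv(F))
print(np.allclose(circular_corr(shifted, h),
                  np.roll(circular_corr(F, h), (2, 3), axis=(0, 1))))   # True
# no rotation equivariance: conv(rot90(F)) != rot90(conv(F)) in general
print(np.allclose(circular_corr(np.rot90(F), h),
                  np.rot90(circular_corr(F, h))))                       # False
```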
This thesis is on the design of neural operators and representations that aim to produce the desirable spatial properties in different applications, or, in simple words, how a neural network is engineered
to “look at” the visual data. In a broader application context, the following chapters together aim to provide practical value for 3D world reconstruction, where designs of the spatial properties of the
neural networks prove highly beneficial for several of its foundational stages: scene registration, terrain
synthesis and completion. In Chapter 2, the focus is on designing an SE(3) equivariant network for 3D
point clouds that is particularly effective in learning descriptors that excel in the downstream applications
of scene alignment and rotation prediction; In Chapter 3, the focus is on periodically encoding representations in an implicit network, such that it learns a continuous function that represents a field of visual
patterns. The superior scalability of the design is shown particularly valuable to large-scale synthesis of
3D patterns; In Chapter 4, the focus is on designing convolutional operators to adjust the receptive field
of a neural network in processing incomplete images, such that it creates a separation between analysis
and synthesis in challenging image inpainting scenarios, and we will show that the design choices lead to
superior performance in synthesizing diverse and cohesive image completion results.
It is worth noting that my design focus is largely local - the adjustment of how a convolution operator
works, for instance, directly affects how local features are computed from a windowed area in the input.
Thanks to the hierarchical nature of the CNNs used in these works, the spatial properties of local features are allowed to propagate and aggregate, and eventually form into meaningful global features for the
downstream applications. This, however, is not without careful design of the aggregation rules (e.g.
pooling layers), which will also be addressed in detail in Chapter 2 and Chapter 4. The implicit network
discussed in Chapter 3 is a unique case where the design is not hierarchical, and no feature propagation or
aggregation occurs. We will detail the reasons behind these design choices with comparisons
to a hierarchical network.
It is also worth noting that this thesis touches on two distinct machine learning paradigms: discriminative learning (Chapters 2 and 4) and generative learning (Chapters 3 and 4). While spatial analysis is more
commonly related to discriminative learning, where the neural network makes inference based on the
structured contents of the input visual data, generative learning can benefit from spatial designs as well.
This is particularly the case when a generative network is designed to model visual data of particular representations (e.g. continuous function, partial images). We show in both Chapter 3 and Chapter 4 that careful
considerations are needed in the designs of spatially structured latent space and the decoder networks in
order to synthesize contents of the desirable visual representations.
1.1 Contributions
1.1.1 Designing equivariant neural operator in the SE(3) space
Motivated by the importance of SE(3) equivariance in 3D point cloud applications, the Equivariant Point
Network (EPN) is proposed to be an efficient and effective architecture for extracting features that are
robust to rigid transformations. The key contribution in this neural network architecture is the construction of the SE(3) equivariant group convolution operator that is practical enough in application settings.
Specifically, the SE(3) space is discretized and processed with a spatial separable convolution termed the
SE(3) separable convolution. Since the formulation allows the network to maintain a discrete SE(3) structure, our second main contribution is the use of this property in pose estimation, a rotation-related task
that cannot be performed with invariant networks. As a general discriminative network, EPN is evaluated on three different point cloud applications in pose estimation, shape classification and large-scene
registration, where it achieves superior performance compared to previous related approaches.
1.1.2 Designing a generative implicit network for texture synthesis
The implicit neural network refers to the use of multi-layer perceptrons (MLPs) as a universal approximator of implicit functions. In this work, we leverage implicit networks to model continuous visual
patterns from exemplars. With a focus on stationary and directional patterns, the implicit network is designed with periodic encoding and a continuous latent field to enable synthesis of visual patterns that are
infinitely expandable, locally diverse and visually authentic. Compared to previous methods, the design
was found to be more effective in both diversity and fidelity measurements and, importantly, much more
scalable in space and computational complexity. Due to the fast, lightweight nature of the implicit network,
we have in addition showcased extension of the model to several applications of 3D synthesis.
1.1.3 Restricting the receptive field for generative image inpainting
Generative image inpainting is a recent branch of methods that utilizes generative networks to synthesize
the missing content of a partial image. This approach is especially suitable for the challenging situation
of inpainting large holes, as the problem becomes increasingly ill-posed for non-generative methods. In this work, we will show that careful neural operator designs are important to the framing of the
inpainting problem as generating missing pixels conditioned on the observable pixels. However, it has
been observed that mixing the analysis (of observable contents) and synthesis (of the missing contents)
together reduces the generative models’ ability to create pluralistic results. To this end, three design contributions are made to encourage a strict separation between the analysis and synthesis steps. Firstly, we
adopt a bidirectional generative transformer framework, based on a discrete latent code representation, to
the synthesis part to ensure more control over diversifying the inpainting outcomes. Secondly, a unique
operator, termed the restrictive convolution operator, is proposed to restrict the receptive field of the image encoder to only the observable pixels in the analysis step. Thirdly, a dedicated decoder is used to
blend the synthesized latent codes with the original contents. Experiments on public benchmarks have
shown that limiting the receptive field to separate analysis and synthesis leads to more favorable results
in pluralistic inpainting, both in terms of diversity and quality of the synthesis.
Chapter 2
Designing Equivariant Neural Operator in the SE(3) space.
The success of 2D CNNs stems in large part from their ability to exploit translational symmetries via
weight sharing and translation equivariance. In this chapter, we extend the 2D translation equivariance to a
higher-dimensional domain of SE(3) and show the practical challenges that emerge in this process. Our efforts
are motivated by replicating the success of 2D CNNs in the 3D domain. The group of transformations in
3D data, however, is more complex compared to that in the 2D images, as 3D entities are often transformed
by arbitrary 3D translations and 3D rotations when observed. Although group-invariant operators could
render identical features under different group transforms, they fail to distinguish distinct instances with
internal symmetries (e.g. the counterparts of “6” and “9” in 3D scenarios regarding rotational symmetry).
In contrast, equivariant features are much more expressive thanks to their ability to retain information
about the input group transform on the feature maps throughout the neural layers.
Despite the importance of deriving SE(3)-equivariant features for point clouds, progress in this regard
remains relatively sparse. The main obstacles arise in two aspects. First, the cost of computing convolutions between 6-dimensional functions over the entire SE(3) space is prohibitive, especially in the presence
of bulky 3D raw scans. Second, compared to the learning of rotation-invariant features, the application
values of their equivariant counterparts remain less explored. A typical example is feature matching: while
invariant descriptors are commonly used in solving the classical correspondence problem, equivariant descriptors, with their preservation of orientation information, may provide a much more direct bridge to the
end goals of camera/pose estimation or shape alignment. However, this typically requires solving a PnP
optimization [34] which is quite costly considering the high dimensionality of the features. In this chapter,
we explore a simple and fast formulation of attention mechanism to exploit equivariant features’ use in
scene matching and pose estimation to highlight the potential benefits of learning equivariant features.
In a general definition of equivariance, we say that a kernel function h is equivariant to a symmetry
group G if it satisfies the following equation:
∀g ∈ G, g(F ∗ h)(x) = (gF ∗ h)(x), (2.1)
where F is the input signal and x is a point defined on the domain of F. In the simple case with
discrete 2D CNNs, G is the group of 2D translations and the kernel function h is an N × M matrix indexed
by a subgroup of the 2D translation (for instance, at index i,j the matrix stores a value that will be multiplied
with the signal that is shifted a distance of i pixels horizontally, and j pixels vertically from the center).
The definition above requires defining the kernel function at a subgroup of G. In the case of equivariance
to 3D translation and rotation, G is a very high-dimensional space (at least 6D). This poses a fundamental
challenge to the computational efficiency of any SE(3)-equivariant neural network model that needs to be
reasonably expressive for use in real-world applications.
A main factor that influences the efficiency of CNNs is the discretization of space. Although the
voxel grid is considered a straightforward representation for 3D objects, which shares the regular grid
structure with 2D CNNs, the extra dimension tremendously increases the computational cost of learning
voxel grid features. Moreover, such a discretization is not required in most cases: as 3D objects are 2D
manifolds embedded in 3-space, the majority of the voxels would be empty. The first step of improving
efficiency in this work is thus to adopt a compact point cloud representation for the 3D object and build
our convolution from the sparse convolution on point clouds. We also discretize the rotation space into a
discrete rotation group, calculated from the icosahedron [15]. The discrete set of points and the discrete
set of rotations together form the discrete SE(3) space that serves as the domain of every tensor in our
neural network, where the convolution operations can be considered extensions of the group equivariant
convolution [14]. Secondly, to reduce the dimensions of the SE(3) convolution kernels, we further propose
a spatial separable convolution designed to partition the SE(3) space. Last but not least, we present
an attention mechanism specially tailored for fusing SE(3)-equivariant features. We observe that while
the commonly used pooling operations, such as max or mean pooling, work well in translation equivariant networks like 2D CNNs, they are not best suited for fusing equivariant features in SO(3) groups.
This is mostly due to the highly sparse and non-linear structure of SO(3) features which poses additional
challenges for max/mean pooling to maintain its unique pattern without losing too much information.
The group attentive pooling (GA pooling) is thus introduced to adaptively fuse rotation-equivariant
features into their invariant counterparts. Trained together with the network, the GA pooling layer implicitly learns an intrinsic local frame of the feature space and generates attention weights to guide the
pooling of rotation-equivariant features. Compared to invariant features, equivariant features preserve,
rather than discard, spatial structure and therefore can be seen as a more discriminative representation.
It is for this reason that translational equivariance has been the premise for convolutional approaches
for detection and instance segmentation [35]. Similarly, through the attention mechanism, the equivariant framework can be utilized for inferring 3D rotations. We demonstrate in the experiments that this
structure significantly outperforms a non-equivariant framework in the shape alignment task.
This proposed framework is evaluated on a variety of practical applications. Experimental results
show that our approach consistently outperforms strong baselines. An ablation analysis and qualitative
visualization are also provided to evaluate the effectiveness of each algorithmic component.
2.1 Related Work
Learning-based Point Descriptor. The seminal work on handling the irregular structure of point clouds
places the main emphasis on permutation-invariant functions [91]. Later works propose shift-equivariant
hierarchical architectures with localized filters to align with the regular grid CNNs [93, 68, 78]. Explicit
convolution kernels have also received tremendous attention in recent years. In particular, various kernel
forms have been proposed, including voxel bins [46], polynomial functions [125] or linear functions [38].
Other works consider different representations of point clouds, noticeably image projection [28, 47, 71]
and voxels [81, 5, 92, 121]. We point interested readers to [40] for a comprehensive survey on point cloud
convolution.
Rotation-invariant point descriptors have been an active research area due to their importance to correspondence matching. While the features extracted by most of the above approaches are permutation-invariant, very few of them can achieve rotation invariance. The Perfect Match [36] incorporates a local
reference frame (LRF) to extract rotation-invariant features from the voxelized point cloud. Similarly, [139]
proposes a capsule network that consumes a point cloud along with the estimated LRF to disentangle shape
and pose information. By taking only point pairs as input, PPF-FoldNet [21] can learn rotation-invariant descriptors using folding-based encoding of point pair features. However, invariant features may be limited
in expressiveness as spatial information is discarded a priori.
Learning Rotation-equivariant Features. Since CNNs are sensitive to rotations, a rapidly growing
body of work focuses on investigating rotation-equivariant variants. Starting from the 2D domain, various
approaches have been proposed to achieve rotation equivariance by applying multiple oriented filters [80],
performing a log-polar transform of the input [31], replacing filters with circular harmonics [120] or rotating the filters [70, 117]. Cohen and Welling later extend the domain of 2D CNNs from translation to finite
groups [14] and further to arbitrary compact groups [16].
When it comes to the domain of 3D rotation, previous efforts can be divided into spectral and non-spectral methods. In the spectral branch, generalized Fourier transforms for S² and SO(3) underlie designs
for rotation-equivariant CNNs. We would like to highlight two seminal works [17], [30] that define convolution operators respectively by spherical (S²) correlation, and SO(3) correlation with circularly symmetric
kernels. The works most relevant to our setting are extensions of the two spectral paradigms to the 3D
spatial domain. A number of works extend spherical CNNs to 3D voxel grids [119, 116, 30, 52]. Yet, research on exploring their potential on point clouds remains sparse, with the exception of a concurrent
work Tensor field network (TFN) [111], which achieves SE(3) equivariance on irregular point clouds.
We found inspiration from the non-spectral group equivariant approaches that have seen recent progress,
extending from the mathematical framework derived in [14, 15]. Specifically, [15] provides a general framework for the practical implementation of convolution on a discretized rotation group, with icosahedral convolution as an exemplar. Discrete group convolution characterizes many recent works on images [32, 64],
spherical signals [105], voxel grids [119] and point clouds [67].
2.2 Deriving the SE(3) convolutions
The Lie group SE(3) is the group of rigid body transformations in three dimensions:

SE(3) = { A | A = [ R t ; 0 1 ], R ∈ SO(3), t ∈ R³ }.

SE(3) is homeomorphic to R³ × SO(3). Therefore, a function that is equivariant to SE(3) must be equivariant
to both 3D translation t ∈ R³ and 3D rotation g ∈ SO(3). Given a spatial point x and a rotation g, let us
first define the continuous feature representation in SE(3) as a function F(x_i, g_j) : R³ × SO(3) → R^D.
Equivariance to SE(3) is expressed as satisfying ∀A ∈ SE(3), A(F ∗ h)(x, g) = (AF ∗ h)(x, g). The SE(3)
equivariant continuous convolutional operator can be defined as

(F ∗ h)(x, g) = ∫_{x_i∈R³} ∫_{g_j∈SO(3)} F(x_i, g_j) h(g⁻¹(x − x_i), g_j⁻¹g) dg_j dx_i,   (2.2)

where h is a kernel h(x, g) : R³ × SO(3) → R^D. The convolution is computed by translating and rotating
the kernel and then computing a dot product with the input function F.
Discretization. To discretize Eq. 2.2, we start with discretizing the SE(3) space into a composition of a
finite set of 3D spatial points P : {x | x ∈ R³} and a finite rotation group G ⊂ SO(3). This leads to a discrete
SE(3) feature mapping function F(x_i, g_j) : P × G → R^D. The discrete convolutional operator in SE(3) is
therefore:

(F ∗ h)(x, g) = Σ_{x_i∈P} Σ_{g_j∈G} F(x_i, g_j) h(g⁻¹(x − x_i), g_j⁻¹g).   (2.3)
We note that such discretization serves as a good approximation of the continuous formulation in Eq. 2.2,
where the approximation error can be further mitigated by the rotation augmentation [1]. If we interpret
P as a set of 3D displacements, this leads to an equivalent definition:

(F ∗ h)(x, g) = Σ_{x_i∈P} Σ_{g_j∈G} F(g⁻¹(x − x_i), g_j⁻¹g) h(x_i, g_j)
             = Σ_{x'_i∈P_g} Σ_{g_j∈G} F(x − x'_i, g_j⁻¹g) h(g x'_i, g_j).   (2.4)

Without loss of generality, let us assume the coordinate is expressed in the local frame of x and therefore
g⁻¹x = x. In the second row of Eq. 2.4, the summation over the set P becomes a summation over a rotated
set P_g : {g⁻¹x | x ∈ P}. When written this way, we can see that the kernel is parameterized by a set of
translation offsets and rotation offsets under the reference frame given by g. We call the discrete set P × G
the domain of the kernel with a kernel size of |P| × |G|.
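The following is a deliberately naive transcription of the discrete operator in Eq. 2.3, meant only to make the index structure and the cost of the quadruple summation explicit. The 4-element rotation group about the z-axis, the scalar features, and the smooth kernel h are stand-ins for illustration; the thesis implementation instead uses the icosahedral rotation group and the separable formulation derived next.

```python
import numpy as np

def rotz(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.], [s, c, 0.], [0., 0., 1.]])

# Stand-in rotation group: 4 rotations about z (the actual network uses the
# 60-element icosahedral group).
G = [rotz(k * np.pi / 2) for k in range(4)]

def naive_se3_conv(P, F, h, G):
    """Eq. 2.3: out(x, g) = sum_{x_i, g_j} F(x_i, g_j) h(g^{-1}(x - x_i), g_j^{-1} g).

    P: (N, 3) points; F: (N, |G|) scalar features; h: kernel callable.
    Note the four nested loops: the cost scales with N^2 |G|^2."""
    out = np.zeros((len(P), len(G)))
    for a, x in enumerate(P):
        for b, g in enumerate(G):
            for i, xi in enumerate(P):
                for j, gj in enumerate(G):
                    out[a, b] += F[i, j] * h(g.T @ (x - xi), gj.T @ g)
    return out

def h(d, r):  # hypothetical smooth kernel on R^3 x SO(3), for the demo only
    return np.exp(-np.dot(d, d)) * np.trace(r)

P = np.random.rand(16, 3)
F = np.random.rand(16, len(G))
print(naive_se3_conv(P, F, h, G).shape)  # (16, 4)
```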
2.3 SE(3) Separable Convolution.
Figure 2.1: Illustration of SPConv. Panels: (a) naïve SE(3) convolution, with cost O(KpKgCN); (b) SE(3) point
convolution, O(KpCN); (c) SE(3) group convolution, O(KgCN); (d) the SPConv block (SE(3) point conv,
BN-ReLU, SE(3) group conv, BN-ReLU). Each arrow represents an element in the group and each edge represents a correlation needed to compute in the convolution operator. We propose to use two separable
convolutions (b)(c) to achieve SE(3) equivariance. The computational cost is much lower than the naive
6D convolution (a). (d) shows the structure of a basic SPConv block.
A key issue with Eq. 2.4 is that the convolution is computed over a 6-dimensional space – a naive implementation would be computationally prohibitive. Inspired by the core idea of separable convolution [12],
we observe that the kernel h with a kernel size |P| × |G| can be separated into two smaller kernels, denoted h1 with a kernel size of |P| × 1 and h2 with a kernel size of 1 × |G|. This divides the domain of the
kernel into two smaller domains: P × {I} for h1, and {0} × G for h2, where I is the identity matrix, and 0
is a zero displacement vector. From here, we are ready to separate Eq. 2.4 into two convolutions:

(F ∗ h1)(x, g) = Σ_{x'_i∈P_g} F(x − x'_i, g) h1(g x'_i, I)   (2.5)

(F ∗ h2)(x, g) = Σ_{g_j∈G} F(x, g_j⁻¹g) h2(0, g_j)   (2.6)
We can see that h1 is a kernel varied only by translation in the reference frame of g, and h2 is a kernel
varied only by the rotation g_j. In the following text, they are simplified to h1(g x'_i) and h2(g_j). The division
here matches the observation that the space SE(3) can be factorized into the two spaces R³ and SO(3).
Sequentially applying the two convolutions in Eqs. (2.5)-(2.6) approximates the 6D convolution in Eq. 2.4
(Fig. 2.1(d)) while maintaining equivariance to SE(3) (proofs provided in Section 2.3.4). The working
principle here is similar to that of the Inception module [108] and its follow-up works [12], which have
shown the promising property of separable convolutions in improving the network performance with
reduced cost. The two consecutive convolutions are named the SE(3) point convolution and the SE(3) group
convolution, respectively, as shown in Fig. 2.1. The combined convolutions will be referred to as the SE(3)
separable point convolution (SPConv). Formally, the original 6D convolution is approximated by: F ∗ h ≈
(F ∗ h1) ∗ h2.
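A minimal sketch of the factorization F ∗ h ≈ (F ∗ h1) ∗ h2, mirroring the two summations above with toy scalar features, a stand-in 4-element rotation group, and hypothetical kernels h1 and h2; it omits the strided, multi-channel layers, batch normalization and activations of the actual SPConv block.

```python
import numpy as np

def rotz(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.], [s, c, 0.], [0., 0., 1.]])

G = [rotz(k * np.pi / 2) for k in range(4)]      # stand-in rotation group

def build_gtable(G):
    """Composition table: gtable[j, b] = index of g_j^{-1} g_b within G."""
    K = len(G)
    tbl = np.zeros((K, K), dtype=int)
    for j in range(K):
        for b in range(K):
            prod = G[j].T @ G[b]
            tbl[j, b] = int(np.argmin([np.linalg.norm(prod - g) for g in G]))
    return tbl

def point_conv(P, F, G, h1, radius=0.5):
    """SE(3) point convolution: spatial correlation in a neighborhood, per rotation g."""
    out = np.zeros_like(F)
    for a, x in enumerate(P):
        for i, xi in enumerate(P):
            d = x - xi
            if np.linalg.norm(d) > radius:
                continue
            for b, g in enumerate(G):
                out[a, b] += F[i, b] * h1(g.T @ d)   # kernel sees g^{-1}(x - x_i)
    return out

def group_conv(F, h2, gtable):
    """SE(3) group convolution: correlation over the rotation group only, per point."""
    K = F.shape[1]
    out = np.zeros_like(F)
    for b in range(K):
        for j in range(K):
            out[:, b] += F[:, gtable[j, b]] * h2(j)
    return out

P = np.random.rand(32, 3)
F = np.random.rand(32, len(G))
h1 = lambda d: np.exp(-np.dot(d, d))   # hypothetical spatial kernel
h2 = lambda j: 1.0 / (j + 1)           # hypothetical group kernel
out = group_conv(point_conv(P, F, G, h1), h2, build_gtable(G))
print(out.shape)  # (32, 4), i.e. (F * h1) * h2 on the discrete P x G domain
```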
2.3.1 SE(3) point convolution
The SE(3) point convolution layer aims at aggregating point spatial correlations locally under a rotation
group element g. Let N_x = {x_i ∈ P | ‖x − x_i‖ ⩽ r} be the set of neighbor points for point x, with a
radius r; the SE(3) point convolution with a localized kernel is:

(F ∗ h1)(x, g) = Σ_{x'_i∈N_{gx}} F(x − x'_i, g) h1(g x'_i),   (2.7)

where N_{gx} = {g⁻¹(x − x_i) | x_i ∈ N_x} is the set of displacements to the neighbor points under a rotation
g. h1 is a kernel defined in a canonical neighbor space B³_r. Given that the convolution is computed as a
spatial correlation under a rotation g, the form of the kernel can be naturally extended from any spatial
kernel function. While this framework is general enough to support various spatial kernel definitions, we
introduce two kernel formulations that are used in our implementation.
Explicit kernels. Given a kernel size K, we can define a set of kernel points {ỹ_k}_K evenly distributed in
B³_r. Each kernel point is associated with a kernel weight matrix W_k ∈ R^{D_in×D_out}, where D_in and D_out
are the input and output channels, respectively. Letting κ(·, ·) be the correlation function between two points,
we have

h1(x_i) = Σ_k^K κ(x_i, ỹ_k) W_k.   (2.8)

The correlation function κ(y, ỹ) can be either linear or Gaussian. For example, in the linear case described
in [111], κ(y, ỹ) = max(0, 1 − ‖y − ỹ‖/σ), where σ adjusts the bandwidth.
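A sketch of the explicit kernel in Eq. 2.8 with the linear correlation function; the kernel-point layout and the weights below are random placeholders rather than the evenly distributed kernel points used in practice.

```python
import numpy as np

def linear_corr(y, y_tilde, sigma=0.3):
    """kappa(y, y~) = max(0, 1 - ||y - y~|| / sigma), the linear correlation."""
    return np.maximum(0.0, 1.0 - np.linalg.norm(y - y_tilde, axis=-1) / sigma)

def explicit_kernel(offset, kernel_pts, W):
    """h1(x) = sum_k kappa(x, y~_k) W_k: a (D_in, D_out) matrix for a given offset."""
    kappa = linear_corr(offset[None, :], kernel_pts)       # (K,)
    return np.einsum('k,kio->io', kappa, W)

K, D_in, D_out = 8, 16, 32
kernel_pts = np.random.uniform(-0.5, 0.5, size=(K, 3))    # placeholder kernel points
W = 0.1 * np.random.randn(K, D_in, D_out)
offset = np.array([0.1, -0.2, 0.05])                       # g^{-1}(x - x_i)
feature = np.random.randn(D_in)
print((feature @ explicit_kernel(offset, kernel_pts, W)).shape)  # (32,)
```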
Implicit kernels. The implicit formulation gives a function on point sets that does not utilize parameterized kernels and is generally not considered a convolutional operation. Rather, spatial correlation is
computed implicitly by concatenating the local-frame coordinates of points to their corresponding features.
In the SE(3) equivariant extension, the local coordinates are also composed with a corresponding rotation g.
The implicit filter for the input signal F is:

h1(F(x, g)) = Σ_{x_i∈N_x} h1(F(x_i, g), g⁻¹x_i) = Σ_{x_i∈N_x} [ F(x_i, g) ; g⁻¹x_i ] W,   (2.9)

where [ · ; · ] denotes concatenation.
We believe that other choices of kernel functions can be naturally extended from these two examples.
In our implementation of the network, we use the explicit kernel formulation in all convolutional layers.
The last layer before the output block of our network filters point features globally and therefore utilizes
the implicit formulation, as it scales better to process a larger set of point features.
2.3.2 SE(3) group convolution
Given a discrete rotation group G, the SE(3) group convolution computes SO(3) correlation between the
input signal and a kernel defined on the group domain.
We define a set of kernel rotations and their associated kernel weight matrices as N_g = {g_j ∈ G}_K
and {W_j ∈ R^{D_in×D_out}}_K, with the kernel size K = |N_g|. Thus the kernel is simply h2(g_j) = W_j. Our
SE(3) group convolution layer aggregates information from neighboring rotation signals within the group,
which is given by

(F ∗ h2)(x, g) = Σ_{g_j∈N_g} F(x, g_j⁻¹g) h2(g_j).   (2.10)
In our implementation, the icosahedron group can be used as the discrete rotation group. The K
neighbor rotations are the subset of the group with the smallest corresponding rotation angles. The computation
can be accelerated by pre-computing the permutation indices and performing only a constant-time query with
an index layer at run time.
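One way the precomputed index can be realized, sketched with a 4-element cyclic stand-in group where the index of g_j⁻¹g_b is simply (b − j) mod K: the table is built once, and the group convolution at run time reduces to an indexed gather followed by a weighted sum (here the kernel neighborhood N_g is taken to be the whole group for brevity).

```python
import numpy as np

K = 4   # stand-in group size; the icosahedral group used in the thesis has 60
# Permutation index of g_j^{-1} g_b for a cyclic group, precomputed once.
perm = (np.arange(K)[None, :] - np.arange(K)[:, None]) % K        # (K, K)

def group_conv_indexed(F, W, perm):
    """F: (N, K, C_in) features; W: (K, C_in, C_out) per-rotation kernel weights.

    out[:, b] = sum_j F[:, perm[j, b]] @ W[j], i.e. Eq. 2.10 as gather + matmul."""
    gathered = F[:, perm, :]                  # (N, K_neighbors, K, C_in)
    return np.einsum('njkc,jcd->nkd', gathered, W)

N, C_in, C_out = 100, 16, 32
F = np.random.randn(N, K, C_in)
W = 0.1 * np.random.randn(K, C_in, C_out)
print(group_conv_indexed(F, W, perm).shape)   # (100, 4, 32)
```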
2.3.3 Complexity analysis.
As illustrated in Fig. 2.1, by combining the two equivariance-preserving convolutions, we can achieve a
similar effect to Eq. 2.3 at a significantly lower computational cost. In particular, suppose we divide the
original number of kernels K into Kp and Kg, the numbers of kernels in the point and group convolutions;
C = CiCo, where Ci and Co are the numbers of input and output channels, and N = NpNa is the product of the
number of points and the number of SO(3) elements in the rotation group. The naive 6D convolution requires
a computational complexity of O(KpKgCN). In contrast, the complexity of our approach is reduced to
O((Kp+Kg)CN), which can be orders of magnitude lower than the naive solution, especially
with large Kp and Kg.
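Putting rough numbers on the reduction (the kernel sizes and channel counts below are illustrative placeholders, not the exact configuration used in the experiments):

```python
# Multiply-accumulate counts for the naive SE(3) convolution vs. SPConv.
Kp, Kg = 24, 12            # hypothetical point- and group-kernel sizes
Ci, Co = 64, 64            # input / output channels
Np, Na = 1024, 60          # points and rotation-group elements

C, N = Ci * Co, Np * Na
naive  = Kp * Kg * C * N   # O(Kp * Kg * C * N)
spconv = (Kp + Kg) * C * N # O((Kp + Kg) * C * N)
print(naive / spconv)      # 8.0: an 8x reduction for this setting
```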
2.3.4 Proof of Equivariance
In this section, we provide proofs of the SE(3) equivariance of the convolutions introduced in the previous passages.
Recall that the SE(3) space can be factorized into the space of 3D rotations {R | R ∈ SO(3)} and 3D translations
{T | T ∈ R³}. A convolution operator equivariant to SE(3) must therefore satisfy:

∀R ∈ SO(3), R(F ∗ h)(x, g) = (RF ∗ h)(x, g),
∀T ∈ R³, T(F ∗ h)(x, g) = (TF ∗ h)(x, g).   (2.11)

Theorem 1. The continuous convolution operator

(F ∗ h)(x, g) = ∫_{x_i∈R³} ∫_{g_j∈SO(3)} F(x_i, g_j) h(g⁻¹(x − x_i), g_j⁻¹g) dg_j dx_i   (2.12)

is equivariant w.r.t. rotation R ∈ SO(3) and translation T ∈ R³.
Proof. Firstly, we prove that Eq. (2.12) is equivariant to 3D rotation. For convenience of notation, let
x'_i = R⁻¹x_i and g'_j = R⁻¹g_j:

R(F ∗ h)(x, g) = (F ∗ h)(Rx, Rg)
  = ∫_{x_i∈R³} ∫_{g_j∈SO(3)} F(x_i, g_j) h((Rg)⁻¹(Rx − x_i), g_j⁻¹Rg) dg_j dx_i
  = ∫_{x_i∈R³} ∫_{g_j∈SO(3)} F(x_i, g_j) h(g⁻¹(x − R⁻¹x_i), (R⁻¹g_j)⁻¹g) dg_j dx_i
  = ∫_{x'_i∈R³} ∫_{g'_j∈SO(3)} F(Rx'_i, Rg'_j) h(g⁻¹(x − x'_i), g'_j⁻¹g) dg_j dx_i
  = (RF ∗ h)(x, g).
Then, we prove that Eq. (2.12) is equivariant to 3D translation. Let x'_i = T⁻¹x_i. Because T(x − x_i) =
x − x_i:

T(F ∗ h)(x, g) = (F ∗ h)(Tx, g)
  = ∫_{x_i∈R³} ∫_{g_j∈SO(3)} F(x_i, g_j) h(g⁻¹(Tx − x_i), g_j⁻¹g) dg_j dx_i
  = ∫_{x_i∈R³} ∫_{g_j∈SO(3)} F(x_i, g_j) h(g⁻¹T(x − T⁻¹x_i), g_j⁻¹g) dg_j dx_i
  = ∫_{x'_i∈R³} ∫_{g_j∈SO(3)} F(Tx'_i, g_j) h(g⁻¹(x − x'_i), g_j⁻¹g) dg_j dx_i
  = (TF ∗ h)(x, g).
The continuous convolution operator is therefore SE(3) equivariant. Given a finite point set P and a
finite rotation group G, the SE(3) separable convolution consists of two discrete convolution operators:

(F ∗ h1)(x, g) = Σ_{x_i∈P} F(x_i) h1(g⁻¹(x − x_i), g)   (2.13)

(F ∗ h2)(x, g) = Σ_{g_j∈G} F(g_j) h2(x, g_j⁻¹g)   (2.14)

For convenience, we use an equivalent definition in the following proof:

(F ∗ h1)(x, g) = Σ_{x_i∈P} F(x_i, g) h1(g⁻¹(x − x_i))   (2.15)

(F ∗ h2)(x, g) = Σ_{g_j∈G} F(x, g_j) h2(g_j⁻¹g)   (2.16)

Theorem 2. The discrete convolution operators given in Eqs. (2.15)-(2.16) are equivariant w.r.t. rotation
R ∈ G and translation T ∈ R³.
Again, we first prove that the two operators are equivariant to 3D rotations in the rotation group G.
Following the notations used in the previous proof, let P_R = {x'_i | x'_i = R⁻¹x, x ∈ P} and G_R = {g'_j | g'_j = R⁻¹g, g ∈ G}:

R(F ∗ h1)(x, g) = (F ∗ h1)(Rx, Rg)
  = Σ_{x_i∈P} F(x_i, Rg) h1((Rg)⁻¹(Rx − x_i))
  = Σ_{x_i∈P} F(x_i, Rg) h1(g⁻¹(x − R⁻¹x_i))
  = Σ_{x'_i∈P_R} F(Rx'_i, Rg) h1(g⁻¹(x − x'_i))
  = (RF ∗ h1)(x, g).

R(F ∗ h2)(x, g) = (F ∗ h2)(Rx, Rg)
  = Σ_{g_j∈G} F(Rx, g_j) h2(g_j⁻¹Rg)
  = Σ_{g'_j∈G_R} F(Rx, Rg'_j) h2(g'_j⁻¹g)
  = (RF ∗ h2)(x, g).
We then prove that the two operators are equivariant to 3D translation. Let x'_i = T⁻¹x_i:

T(F ∗ h1)(x, g) = (F ∗ h1)(Tx, g)
  = Σ_{x_i∈P} F(x_i, g) h1(g⁻¹(Tx − x_i))
  = Σ_{x_i∈P} F(x_i, g) h1(g⁻¹T(x − T⁻¹x_i))
  = Σ_{x_i∈P} F(x_i, g) h1(g⁻¹(x − T⁻¹x_i))
  = Σ_{x'_i∈T⁻¹P} F(Tx'_i, g) h1(g⁻¹(x − x'_i))
  = (TF ∗ h1)(x, g).

T(F ∗ h2)(x, g) = (F ∗ h2)(Tx, g)
  = Σ_{g_j∈G} F(Tx, g_j) h2(g_j⁻¹g)
  = (TF ∗ h2)(x, g).

Since both operators are SO(3) equivariant and translation equivariant, we have:

R((F ∗ h1) ∗ h2)(x, g) = (R(F ∗ h1) ∗ h2)(x, g) = ((RF ∗ h1) ∗ h2)(x, g),
T((F ∗ h1) ∗ h2)(x, g) = (T(F ∗ h1) ∗ h2)(x, g) = ((TF ∗ h1) ∗ h2)(x, g).

Thus, the SE(3) separable convolution is equivariant w.r.t. rotation R ∈ G and translation T ∈ R³,
which approximates equivariance to SE(3).
2.4 Shape Matching with Attention mechanism
In this section, we demonstrate how an attention mechanism can be utilized to harness the power of equivariant features. Given spatially pooled features that are equivariant to SO(3), F(g) : G → R^D, we define a
rotation-based attention A : G → R, A(g) = {a_g | Σ_{g∈G} a_g = 1}.
SO(3) Detection. Suppose a task requires the network to predict the pose R ∈ SO(3) of an input shape. When the attention weight is used as a probability score, the equivariant network turns the pose estimation task into an SO(3) detection task, which is analogous to bounding box detection. Intuitively, each element of the discrete rotation group can be interpreted as an anchor. A two-branch network is used to classify whether an anchor is the "dominant rotation", and every anchor regresses a small rotational offset from its corresponding rotation. The multi-task loss for rotational regression is then given by:
\[
L(a, u, R, R^u) = L_{cls}(a, u) + \lambda [u = 1]\, L_2(R^u R^T) \tag{2.17}
\]
where $a = \{a_g \mid g \in G\}$ are the predicted probabilities and $R$ are the predicted relative rotations. $u = \{u_g \mid g \in G\}$ is the ground-truth label with $u_g = 1$ if $g$ is the nearest rotation to the target ground-truth rotation $R_{GT}$. $R^u = \{R^u_g \mid \forall g \in G,\ R^u_g\, g = R_{GT}\}$ is the ground-truth relative rotation.
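A minimal PyTorch sketch of this multi-task loss is given below. The tensor shapes, the quaternion parameterization of the relative rotations, and the sign-invariant quaternion L2 used in place of $L_2(R^u R^T)$ are assumptions made for illustration, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def so3_detection_loss(logits, pred_rel, u, gt_rel, lam=1.0):
    """Sketch of the multi-task detection loss in Eq. (2.17).
    logits:   (B, |G|)    per-anchor scores a_g
    pred_rel: (B, |G|, 4) predicted relative rotations (quaternions)
    u:        (B,)        index of the anchor nearest to the ground-truth rotation
    gt_rel:   (B, |G|, 4) ground-truth relative rotations R^u (quaternions)
    """
    cls_loss = F.cross_entropy(logits, u)                      # L_cls(a, u)
    idx = u.view(-1, 1, 1).expand(-1, 1, 4)
    pred = F.normalize(torch.gather(pred_rel, 1, idx).squeeze(1), dim=-1)
    gt = torch.gather(gt_rel, 1, idx).squeeze(1)
    # regression only at the positive anchor ([u = 1]); quaternion sign ambiguity handled by min
    reg_loss = torch.min((pred - gt).pow(2).sum(-1), (pred + gt).pow(2).sum(-1)).mean()
    return cls_loss + lam * reg_loss
```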
Group Attentive Pooling. Global pooling layers are integrated as part of the network for spatial reduction of the representation. As many common tasks, such as classification, benefit from rotation invariance of the learned feature, global pooling is utilized by most rotation-equivariant architectures to aggregate information into an invariant representation. To integrate the attention mechanism with global pooling, we propose group attentive pooling (GA pooling), which is given by
\[
F_{inv} = \frac{\sum_g \exp(a_g / T)\, F_G(g)}{\sum_g \exp(a_g / T)}, \tag{2.18}
\]
where $F_G(g)$ and $a_g$ are the input rotation-equivariant feature and the attention weight on rotation g, and T is a temperature that controls the sharpness of the function response. The output feature is invariant given a rotated input point cloud. The confidence weights a can be learned by minimizing the loss $L = L_{task} + \lambda L_{sa}$, where $L_{task}$ is a task-specific loss (e.g. cross-entropy loss for classification and triplet loss for correspondence matching), and $L_{sa}$ is an optional cross-entropy loss that encourages the network to learn the canonical axis from the candidate orientations when a ground-truth canonical pose is available for supervision.
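GA pooling itself is a temperature-controlled softmax average over the group axis. A short sketch, with assumed tensor shapes:

```python
import torch

def group_attentive_pooling(feat_g, attn_logits, T=1.0):
    """Group attentive pooling (Eq. 2.18): a softmax-weighted average over the rotation
    group axis that turns an equivariant feature into an invariant one.
    feat_g:      (B, |G|, D) rotation-equivariant features F_G(g)
    attn_logits: (B, |G|)    attention weights a_g
    """
    w = torch.softmax(attn_logits / T, dim=1)         # exp(a_g / T) / sum_g exp(a_g / T)
    return (w.unsqueeze(-1) * feat_g).sum(dim=1)      # F_inv, shape (B, D)
```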
2.5 Implementation Details
Figure 2.2: An illustration of the network architecture used in both ModelNet and 3DMatch experiments.
The core element of our network is the SPConv block as shown in Fig. 2.1(d). It consists of one SE(3)
point convolution and one SE(3) group convolution operator, with a batch normalization and a leaky ReLU
activation inserted in between and after. We employ a 5-layer hierarchical convolutional network. Each
layer contains two SPConv blocks, with the first one being strided by a factor of 2. The network outputs
spatially pooled features that are equivariant to the rotation group G, which can then be pooled into an invariant
feature through a GA pooling layer. For the classification network, the feature is fed into a fully connected
layer and a softmax layer. For the task of metric learning, the feature is processed with an L2 normalization.
The network architecture used in both experiments is illustrated in Figure 2.2. Input points ($P \in \mathbb{R}^{N \times 3}$) are first lifted to features defined in the SE(3) space ($F(x_i, g_j) : \mathbb{R}^3 \times G \to \mathbb{R}$) by assigning the rotation group to each point and setting its associated features to constant 1s (denoting occupied space). Therefore, in the first layer, the network learns to differentiate the input points through the kernel correlation function in Equation 2.8. The layer after the separable convolutional layers is an MLP layer with a symmetry function (an average function) that aggregates features in the spatial dimension; we have introduced this layer earlier in this chapter as a function with an implicit kernel formulation. Before the fully connected layers, a separate branch of unitary convolution takes the spatially pooled feature defined on SO(3) and outputs the attention confidence (see Section 2.4). The output feature of the network can be further processed by a softmax layer in the classification task, or an L2 normalization in the shape matching task.
In the implementation of the SE(3) point convolution, we follow the design principles in [93] to compute a spatially hierarchical local structure of the points, by subsampling the input points with furthest point sampling and obtaining spatial local neighborhoods with a ball query. For the explicit point kernel function, we select a kernel size of 24, with kernel points evenly distributed inside a ball $B^3_r$. The radius r of the grouping operator is set as $r^2 = d\sigma$, where d is a parameter related to the density of the input points and $\sigma$ is a parameter used in the correlation function $\kappa(y, \tilde{y})$ described earlier in this chapter. The kernel radius $r_k$ is set as $r_k = 0.7r$.
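For reference, the two subsampling primitives can be sketched in a few lines of NumPy; the array shapes and the padding convention are illustrative, and the actual implementation follows [93].

```python
import numpy as np

def furthest_point_sampling(points, m):
    """Greedily pick m indices so that the selected points are maximally spread out."""
    n = points.shape[0]
    chosen = [0]
    dist = np.full(n, np.inf)
    for _ in range(m - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[chosen[-1]], axis=1))
        chosen.append(int(dist.argmax()))
    return np.array(chosen)

def ball_query(points, centers, radius, k):
    """For each center, return up to k neighbor indices within `radius` (padded by repetition)."""
    out = []
    for c in centers:
        idx = np.flatnonzero(np.linalg.norm(points - c, axis=1) < radius)[:k]
        idx = idx if idx.size > 0 else np.array([0])
        out.append(np.pad(idx, (0, k - idx.size), mode='edge'))
    return np.stack(out)
```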
To achieve more effective rotation group convolution, inspired by [32], we choose to sample the rotation group from an axis-aligned 3D model of a regular icosahedron. Each face normal of the icosahedron provides the α and β angles. We additionally sample three γ angles for each face normal, each separated by 120 degrees. The cardinality of the rotation set is thus 20 × 3 = 60. Thanks to the icosahedral symmetry, the set of rotations forms a rotation group G with closure, associativity, identity and invertibility. For band-limited filters, a 12-element subgroup is chosen, which transforms each element of G to its SO(3) neighbors.
It is thus worth noting that while we maintain a sparse representation for the spatial dimension of the point set, which requires online computation to find its local structure, the rotation group naturally possesses a closed, grid-like structure. This greatly facilitates the computation of the band-limited group convolution.
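One way to obtain such a closed 60-element rotation group numerically is to close two icosahedral generators (a 5-fold rotation about a vertex axis and a 3-fold rotation about a face axis) under matrix multiplication. This is a sketch of an alternative construction, not the face-normal sampling described above, but it yields the same chiral icosahedral rotation group.

```python
import numpy as np
from scipy.spatial.transform import Rotation as Rot

def close_group(generators, tol=1e-6):
    """Close a set of rotation matrices under multiplication into a finite group."""
    group = [np.eye(3)]
    frontier = [np.asarray(g) for g in generators]
    while frontier:
        g = frontier.pop()
        if not any(np.allclose(g, h, atol=tol) for h in group):
            group.append(g)
            frontier += [g @ h for h in group] + [h @ g for h in group]
    return np.stack(group)

phi = (1 + 5 ** 0.5) / 2
vertex_axis = np.array([0.0, 1.0, phi]) / np.linalg.norm([0.0, 1.0, phi])  # 5-fold axis of an icosahedron
face_axis = np.ones(3) / np.sqrt(3.0)                                      # 3-fold axis through a face center
gen5 = Rot.from_rotvec(vertex_axis * (2 * np.pi / 5)).as_matrix()
gen3 = Rot.from_rotvec(face_axis * (2 * np.pi / 3)).as_matrix()
G = close_group([gen5, gen3])
print(len(G))   # 60 rotation matrices
```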
2.6 Experiments
The proposed framework is hypothesized to be most suitable for tasks where the objects of interest are
rotated arbitrarily. To this end, we evaluate our approach on two rotation-related datasets: the rotated
ModelNet40 dataset [121] and the 3DMatch dataset [134]. To ensure a fair comparison to previous works,
in all experiments, we use the implementation provided by the authors or the reported statistics if no
source code is available. We provide the training details of the experiments in the supplemental materials.
2.6.1 Experiments on Rotated ModelNet40
Dataset. The official aligned ModelNet40 dataset provides a setting where canonical poses are known,
and therefore it allows us to evaluate the effectiveness of pose supervision. We create the rotated ModelNet40 dataset based on the train/test split of the aligned ModelNet40 dataset [121]. We mainly focus on a
more challenging “rotated” setting where each object is randomly rotated. For each object, we randomly
subsample 1,024 points from the surface of the 3D object and perform random rotation augmentation
before feeding it into the network.
Figure 2.3: Percentile of errors comparing KPConv [110] and two equivariant models (Ours-N) varied in number of SO(3) elements.
Pose Estimation. The pose estimation task predicts the rotation R ∈ SO(3) that aligns a rotated input
shape to its canonical pose. To avoid ambiguities induced by rotationally symmetric objects, we only use the airplane category from the dataset. We train the network with N=1,252 airplane point clouds and test it with N=101 held-out point clouds, each augmented with random rotations. The evaluation compares
equivariant models with KPConv [110], a network that has similar kernel function to our implementation
of point convolution, while not equivariant to 3D rotation. The equivariant models (Ours-N) are varied by
the size of rotation group (N), similar to the setting in [32], and use the multitask detection loss described
in Sec. 2.4. KPConv directly regresses the output rotation. Each model is trained for 80k iterations. The
regressors in all models produce a rotation in the quaternion representation. We evaluate the performance
by measuring angular errors between the predicted rotations and the ground-truth rotations. Tab. 2.1
shows the mean, median and max angular errors in each setting, and Fig. 2.3 plots the error percentile
curves. As shown in the results, the equivariant networks significantly outperform the baseline network,
with Ours-60 having the lowest errors. The equivariant networks are also significantly more stable (max angular errors are kept within 9 degrees), whereas KPConv can produce unstable results for certain inputs. This experiment showcases that a hierarchical rotation model can be much more effective in tasks that require direct prediction of 3D rotation.
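The angular error metric used here can be computed directly from the predicted and ground-truth quaternions; a small NumPy helper, assuming quaternion inputs:

```python
import numpy as np

def quat_angular_error_deg(q_pred, q_gt):
    """Angular error in degrees between rotations given as quaternions (..., 4), sign-invariant."""
    q_pred = q_pred / np.linalg.norm(q_pred, axis=-1, keepdims=True)
    q_gt = q_gt / np.linalg.norm(q_gt, axis=-1, keepdims=True)
    dot = np.clip(np.abs((q_pred * q_gt).sum(-1)), 0.0, 1.0)
    return np.degrees(2.0 * np.arccos(dot))
```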
                Mean (°)   Median (°)   Max (°)
KPConv [110]     11.46       8.06       82.32
Ours-20           1.36       1.16        8.30
Ours-60           1.25       1.11        6.63

Table 2.1: Angular errors in point cloud pose estimation.
Representation   Method             Acc (%)   Retrieval (mAP)
3D surface       RotationNet [54]    80.0      74.2
                 Sph. CNN [30]       86.9      -
Point cloud      QENet [139]         74.4      -
                 PointNet [91]       83.6      -
                 PointNet++ [93]     85.0      70.3
                 DGCNN [91]          81.1      -
                 PointCNN [93]       84.5      -
                 KPConv [110]        86.7      77.5
                 Ours                88.3      79.7

Table 2.2: Results on shape classification and retrieval on randomly rotated objects of ModelNet40.
Classification and Retrieval. The classification and retrieval tasks on ModelNet40 follow the evaluation metric from [121]. In addition, our network is trained with GA pooling and pose supervision introduced in
Sec. 2.4. In Tab. 2.2, we show the results comparing with the state-of-the-art methods in the setting where
models are both trained and tested with rotation augmentation. We categorize the baseline approaches
based on the input 3D representations: 3D surface and point cloud.
In the classification and retrieval tasks, our models also achieve the best performance, as shown in Tab. 2.2. This indicates that our proposed framework can learn more effective and discriminative features even in the challenging case where all objects are randomly rotated.
Ablation Analysis. We further conduct an ablation study to validate the effectiveness of each algorithmic component. In particular, we experiment with five variants of our model by altering key designs in
our network under the same architecture as shown in Tab. 2.3. By using the supervised attentive pooling,
we can improve the classification accuracy with the same number of parameters compared to the max
and mean pooling. However, the unsupervised attentive pooling does not outperform max pooling. This
may be partly due to the difficulty of learning canonical pose in an unsupervised manner. In addition,
Conv             Global pool   Loss         Acc (%)
Separable Conv   Attentive     Lcls + Lsa    88.3
Separable Conv   Attentive     Lcls          87.7
Separable Conv   Max           Lcls          87.7
Separable Conv   Mean          Lcls          87.4
Point Conv       Attentive     Lcls + Lsa    86.1

Table 2.3: Results of ablation studies on the ModelNet40 dataset. The Conv column denotes the configuration of the convolution layers; the Global pool column denotes the type of global pooling method. The loss configuration follows the notation from Sec. 2.4.
only using point convolution will lead to a decline in performance, indicating the effectiveness of group
convolution.
How well does the attention layer learn? It is possible that the performance of GA pooling in distinguishing canonical poses could be compromised by the rotational symmetry of the object. If a shape is
circularly symmetric, and the canonical pose prescribed by the rotational label is aligned with an axis of
symmetry, the attention layer would naturally fail to provide a deterministic prediction. We summarize
the classification accuracy based on the attention confidence for each category of ModelNet objects, as
shown in Fig. 2.4. The results indeed support our intuition: the attention layer is ambiguous on objects
with circular symmetry (e.g. cone and flower pot) and very confident on categories that have distinctive
canonical orientation. On one hand, this shows that when the object of interest is asymmetric in rotation,
the GA pooling does help improve classification performance by establishing a local reference frame. On
the other hand, the GA pooling only fails at symmetric object that benefits relatively less from a equivariant
representation. In the extreme case, the attention layer could be reduced to an average pooling.
2.6.2 Shape Alignment on 3DMatch
Dataset. The 3DMatch dataset is a real-scan dataset consisting of 433 overlapping fragments from 8
indoor scenes for evaluation, and RGB-D data from 62 indoor scenes for training. The pose of each fragment is determined by the camera angle during capture, and any two fragments overlap at most partially.
Figure 2.4: Classification accuracy based on the attention confidence for each object category. The attention layer is trained on the rotated dataset to learn a canonical orientation for the given object.
          SHOT [112]   3DMatch [134]   CGF [56]   PPFNet [22]   PPFF [21]   3DSNet [36]   Li [71]   Ours
Kitchen      74.3          58.3          60.3        89.7          78.7         97.5        92.1     99.0
Home 1       80.1          72.4          71.1        55.8          76.3         96.2        91.0     99.4
Home 2       70.7          61.5          56.7        59.1          61.5         93.2        85.6     96.2
Hotel 1      77.4          54.9          57.1        58.0          68.1         97.4        95.1     99.6
Hotel 2      72.1          48.1          53.8        57.7          71.2         92.8        91.3     97.1
Hotel 3      85.2          61.1          83.3        61.1          94.4         98.2        96.3    100.0
Study        64.0          51.7          37.7        53.4          62.0         95.0        91.8     96.2
MIT Lab      62.3          50.7          45.5        63.6          62.3         94.1        84.4     93.5
Average      73.3          57.3          58.2        62.3          71.8         95.6        91.0     97.6

Table 2.4: Comparisons of average recall of keypoint correspondences on 3DMatch. All baseline results are tested on the official 3DMatch evaluation set without point normals.
Evaluating our model on this dataset is meaningful, as shape registration in such a setting benefits from descriptors that are invariant to rigid camera motion. Each test fragment is a densely sampled point cloud with 150,000 to 600,000 points. To be consistent with our baselines, we use an evaluation metric based on the average recall of keypoint correspondences without performing RANSAC, following [22]. We also
follow previous works [21, 22, 36] to set the matching threshold τ1 = 0.1m and the inlier ratio τ2 = 0.05.
Comparison with baselines. We designed a Siamese network for this task and trained our model with
the batch-hard triplet loss proposed in [36]. The input to the network is 1,024-point patches extracted locally from keypoints in a fragment, and the output is a 64-dim invariant descriptor per patch. Since a canonical ground
truth pose is not known in this setting, the attentive pooling module in our model is trained in an unsupervised manner. Our results are shown in Tab. 2.4. To provide a comprehensive comparison, we select
the state-of-the-art baselines using a variety of approaches: 1) convolutional networks without rotational invariance, e.g. [134, 22]; 2) handcrafted invariant features w/ and w/o deep learning, e.g. [56, 112, 21]; 3) features learned from LRF-aligned input [36]; and 4) multi-view networks [71]. We report the 64-dim
results of [36] to match the feature dimension of our model. Since the official 3DMatch test dataset does
not contain point normal information, we report two results of [71]: a result of their model trained and
tested without normal information (Li [71] in Tab. 2.4) and one that is trained and tested with the authors’
provided point normals (Li [71]♭ in Tab. 2.4). We evaluate our model with the interest points provided by
the authors of the dataset, which is consistent with the reported results of our baselines.
Overall, our model outperforms all of the baselines in average recall, without the need to precompute an invariant representation or a local reference frame. Compared to some baselines (e.g. [21, 71]) that require dense point input, our model can learn discriminative features from very sparsely sampled sets of 1,024 points. Our result is also better than that of the state-of-the-art method [71], even without needing normal information as input. In the official setting, where point normal information is not available, our model improves substantially over all baselines.
Qualitative analysis. We provide a T-SNE visualization of the features learned by our network in
Fig. 2.5. As different features are labeled with distinct colors, we can observe that the features learned
by our network can robustly generate correct geometry correspondences even when the point cloud is
incomplete, partially aligned, or significantly rotated. For instance, in the third column, the bottom scene is only partially aligned with the top one and is viewed at an entirely different angle; our network can still reliably label the corresponding points with similar features.
Figure 2.5: T-SNE visualization of features learned by our network. Each column contains a pair of fragments from the same scene. Regions in correspondence are automatically labeled with similar features.

2.6.3 Inference Speed
We compare the inference time of our network to baseline models that employ similar equivariant structures. Specifically, we evaluate the 20-anchor model to align with the settings in [32, 54]. Among the selected baselines, [32, 54] are multi-view image networks that are SO(3) equivariant, while TFN [111] is an example of a "non-separable" SE(3) equivariant network. Our network is faster than all of the selected baselines, and it is significantly faster than the SE(3) equivariant framework that is not separable.
Method   OURS-20   EMVN-20   RotationNet   TFN
Time     35.4 ms   35.9 ms   108.0 ms      302.9 ms
2.7 Discussion
This chapter has presented a framework that efficiently computes and leverages SE(3)-equivariant features for point cloud analysis. First, the SE(3) separable convolution is proposed to factorize the naive
SE(3) convolution into two concatenated operators performed in two subspaces. Second, we propose the
incorporation of an attention mechanism that leverages the expressiveness of SE(3)-equivariant features for 3D alignment tasks and can be used as a pooling layer that fuses the equivariant features into their more ready-to-use invariant counterparts. Our method has led to significant performance gains in a variety of challenging tasks.
The method, however, leaves much room for improvement: 1) Compared to a typical CNN model with only shift equivariance, the proposed SE(3) equivariant network still demands much higher computation and memory complexity. Our separable convolution approach, while matching the general theme of this thesis in spatial design, offers only one path of model compression. The costly SE(3) group representation is also a subject for improvement, as evidently shown by a follow-up work [146]. 2) As mentioned in the introduction, a spatially hierarchical structure is critical for the ability of CNNs to model longer-range spatial relationships. While our network is hierarchical in the Euclidean space, the icosahedron discretization of SO(3) makes it difficult to learn hierarchical rotational features. Furthermore, one may question the necessity of performing band-limited convolution on the 60-anchor rotation group, as a transformer architecture may offer more efficient ways of aggregating features. 3) We believe the application value of an SE(3) equivariant network is much wider than the tasks on which the model has been evaluated. Our implementation of scene registration still follows the typical keypoint matching pipeline [134] with invariant descriptors, though we speculate that the oriented equivariant descriptors may provide further benefit in keypoint matching, particularly in a task where the end goal is to estimate camera parameters. Furthermore, the method appears to be quite appropriate for 3D detection, which is an important application that we have not explored.
Chapter 3
Designing a Generative Implicit Network for Texture Synthesis.
In recent years, we have seen the rising popularity of the implicit neural network - a neural network that approximates a continuous, implicit function. With its primary applications in view synthesis, an implicit network Φ is a multi-layer perceptron (MLP) of the simple form:
\[
\Phi_\theta(c) \approx f(c), \tag{3.1}
\]
where the network takes as input the coordinates c to approximate an unknown continuous function f from which samples can be drawn and which, in application contexts, could be a light field with observed radiance or a 3D geometry represented by a signed distance function. Popularized by both its use in neural radiance fields [83] and in shape reconstruction [97], the implicit neural network has been shown to excel at compressing a continuous field of values that are otherwise difficult to represent, or at performing deep interpolation given partial samples of a 3D scene. In processing a grid of coordinates, the implicit network of MLPs is equivalent to a convolutional neural network operating over the grid with only 1x1 convolutions, although such a design greatly differentiates it from typical CNNs, as the implicit network operates on an infinitesimal receptive field and thus does not require processing of the entire view of the data during training. In the training of a neural radiance field, for instance, training is done in iterations over batches of rays that correspond to individual pixels from the multi-view images. During inference, the
target view can be rendered in a piecewise manner. These characteristics allow the typical neural implicit network to be much faster and more lightweight than other neural network structures, while retaining the capability of sampling values continuously from the learned implicit function once trained.
This chapter describes one of the earliest attempts to study the generative implicit network - the learning of a continuous function that represents stationary or directional patterns (e.g. textures, 3D structures) based on latent variables. Why would the implicit network be a decent choice for the synthesis of these patterns? Different from general image synthesis, the synthesis of texture patterns, be it a wooden texture for painting or a simulation of natural cave systems, focuses on the modeling of localized variations within some periodic global structures. Due to their importance to computer graphics applications, continuous and scalable textures are also highly desirable, given that they may be needed, in large volumes, for texture mapping onto various kinds of 3D surfaces. The continuous and computationally efficient nature of the implicit network thus appears to be a natural choice in such a setting.
In an application context, let us start by defining several characteristics that are desirable for an algorithm that generates visual patterns:
• Authenticity. Probably the most important criterion for synthesized visual patterns is their visual quality. When patterns are synthesized from an exemplar, the quality is determined by whether they
faithfully recreate the source pattern.
• Diversity. It would be undesirable for a synthesizer to only copy patterns from the source. Diversity
is thus an equally important measurement that evaluates whether the synthesized patterns vary
from the source and each other. We strive to achieve two different levels of diversity: the patterns
should be diversified both within a generated sample and across samples.
• Scalability. As patterns are usually demanded at different and potentially large scales for many
practical applications, we want a scalable synthesizer to be able to efficiently generate patterns of
arbitrary size. Scalability is particularly valuable when it comes to the synthesis of 3D models, as
the extra dimension translates to a much larger amount of computations.
Patterns that scale well to an infinitely large space, in general, possess a stationary property - a shift-invariant structure that can be expanded by tiling or stacking blocks of elements. We therefore focus our implicit network design on the fact that many types of natural and artistic patterns can be analyzed and recreated in a stationary framework. The goal of synthesizing an authentic and diverse stationary pattern from an exemplar, however, requires careful modeling that is compatible with the underlying structure of the pattern.
Generative adversarial networks (GANs) [37] are one of the most promising techniques so far for modeling a data distribution in an unsupervised manner, and they have been frequently adapted to convolutional models that synthesize visually authentic images [98, 51, 6, 144, 123, 101]. How would a GAN generator be leveraged to model stationary patterns? As all stationary patterns contain a repeating spatial structure with local variations that "vivify" their appearance, in an ideal situation a stationary pattern can be modeled by a
discrete random field, where each random variable is associated with a patch of the basic element, often
known as “textons” in past literature [147]. Thus a natural GAN formulation models image patches with
a spatially defined latent field [51, 6]. In a convolutional framework, however, problems arise when fake
samples generated from a discrete noise image are discriminated from randomly sampled patches from
a real image. The first problem is that the sampled patch does not necessarily agree with the scale of
the repeating structure. The second problem is that the sampled patch can be arbitrarily shifted from the
center of a stationary element. A typical deconvolutional network [24, 133] that upsamples from an evenly-spaced noise image may not sufficiently address the previously mentioned problems. To study their effects,
we designed a convolutional network following the DCGAN architecture [94] to synthesize a honeycomb
pattern from a 2 × 2 noise map, which is trained against random patches sampled from a source image.
The comparison between its result and that synthesized by our generator network trained with an identical discriminator is shown in Figure 3.1. We found that the noise map does not capture the honeycomb structure well, as seams and intercepting elements are visible in the synthesized image.

Figure 3.1: Comparison between synthesized honeycomb from a DCGAN convolution generator and the periodic MLP generator. Seams and intercepting patterns are visible in the former result due to the difficulty of the convolution generator in capturing the repeating structure.
Though various techniques have been proposed in the past to address the aforementioned issues
(e.g. [51, 6]), in this chapter we consider a more natural way to model stationary patterns with an implicit
periodic generator. The core of the formulation is to match the repeatable structure of a stationary pattern
to the period of a learnable continuous periodic function. Instead of modeling the pattern with a discrete
noise tensor, we define latent variables in a continuous space, where the extent of each latent factor is
learned to match with the extent of the repeating elements. The benefits of this design align well with
the desirable characteristics for visual pattern synthesis: 1) learned periodicity of the implicit field encourages latent factors to model the stationary variations observed from the exemplar pattern; 2) a continuous
representation provides flexibility during training to learn a distribution from randomly shifted patches
cropped from the exemplar; 3) a Fourier encoding scheme learns high-frequency details from the exemplar, which allows our model to synthesize visually authentic results; and 4) a multilayer perceptron (MLP) that implicitly takes coordinates as input scales well to the generation of 3D shapes when compared to 3D
convolution. Based on these design choices, we term our network the Implicit Periodic Field Network
(IPFN).
We validate our proposed design by showing various applications in texture image synthesis and 3D volume synthesis in Section 3.3. Specifically, besides synthesizing stationary patterns, we design a conditional formulation of our model to tackle the synthesis of directional patterns and to provide an application in controllable shape generation. An ablation study is also conducted to verify the effectiveness of our
design choices.
3.1 Related Work
Pattern Synthesis 2D visual patterns are generally referred to as “texture" due to their prevalent applications in computer-aided design. Well-known early attempts to synthesize texture derive patterns
from smoothly interpolated noise [87, 88, 118] and create aesthetically pleasing materials that display a
high level of randomness. [43, 89] are two recent works related to ours that utilize randomness from a continuously-defined Perlin noise [87] to synthesize exemplar-based textures in an implicit field. Their works demonstrate the advantage of a smooth noise field and an implicit formulation in efficiently and diversely generating a 3D texture field. However, just like their procedural predecessors, these methods have been shown to be limited in synthesizing patterns that are more complicated in structure (e.g. bricks, rocky surfaces) [89].
Later works in traditional image synthesis are generally divided into pixel-based methods (e.g. [27, 115]), patch-based methods (e.g. [59, 26, 61]) and optimization-based methods (e.g. [59, 60]), and they have shared with us important considerations for recreating new patterns from an exemplar. For instance, synthesis on a patch level encourages preservation of fine local details, and efforts are focused on "quilting" the discontinuity between patches [26, 61] and encouraging global similarity [59]. Early statistical models [27, 90] utilize a random field representation that captures variations in stationary patterns and synthesize a
variety of new patterns under the Julesz’ conjecture. Our work is inspired by key ideas from the traditional
modeling of stationary textures, while we strive to unify authenticity, diversity, and scalability with a
neural representation to overcome limitations that existed in the traditional approaches.
Compared to earlier parametric models, the artificial neural network is powerful in its generalizability in pattern recognition [42], and neural synthesis of texture images has thus marked recent advances in the field. How does a neural network learn the stylistic representation of a texture without simply reconstructing it? A milestone that unifies both synthesis quality and sampling power adversarially trains a generator network and a discriminator network to learn the mapping from a latent space to the texture distribution [37]. We are particularly interested in the generative adversarial networks (GANs) that model
the inner statistics of the exemplars. This is marked by a patch-based approach that represents an image
as a collection of smaller images: [66, 51] formalizes patch-based synthesis in the GAN setting with the
concept of Markovian and spatial GAN. [6] motivated us with its periodic design in latent codes, which
effectively captures stationary patterns from an input exemplar. [144] can be seen as a complement to our
design by focusing on addressing non-stationary texture synthesis by learning to expand an image patch.
In addition, [98, 101] present multi-scale, global-focused approaches to effectively recreate natural images.
While the aforementioned approaches all utilize convolutional designs, our work extends texture synthesis
to the continuous domain with an implicit representation of textures, as we argue that such representation
provides a more natural and efficient way to synthesize stationary patterns.
The synthesis of 3D shapes is of particular interest in computer graphics and thus has a rich history.
To name a few, this includes volumetric field design [136, 84, 11], procedural generation [49, 82] and 3D
texturing [143, 8]. However, very few works have considered the synthesis of 3D patterns with neural networks, with the exception of [43, 89], which explore the generation of 3D solid textures.
Implicit Network An implicit network refers to a multilayer perceptron that learns a continuous function in real coordinate space. In particular, the implicit network is mainly utilized in the reconstruction of 3D shapes [85, 97, 77, 103], where shapes are represented by a distance function, or of a radiance field [83, 129]. We are motivated by the signed distance representation of shapes and the Fourier encoding explored in [83, 109] in our design of the implicit networks, and our work adapts these features to a generative setting where novel 3D patterns are synthesized.
3.2 Method
Our method is best introduced by expanding from the Wasserstein GAN value function [39] constructed as the Earth-Mover distance between the real data distribution $P_{data}$ and the generator distribution $P_g$:
\[
\min_G \max_D \; \mathbb{E}_{x \sim P_{data}}[\log(D(x))] - \mathbb{E}_{z \sim P_Z}[\log(D(G(z)))]. \tag{3.2}
\]
Our first change to the above objective is to draw real-valued coordinates $c \in \mathbb{R}^k$ ($k = 2$ for images and $k = 3$ for volumes) from a distribution $\{sc \mid c \sim P_c\}$ as input to the generator, where s is a constant scalar. This underlies the implicit nature of the generative network, as G learns a mapping from the real coordinate space in a stochastic process. Instead of being sampled from a prior distribution $P_Z$, latent variables are drawn from a random field $f_z(c)$ defined on the real coordinate space. G is therefore a function of both the coordinates $c \in \mathbb{R}^k$ and the latent variables $f_z(c) \in \mathbb{R}^d$. This updates Eq. (3.2) to
\[
\min_G \max_D \; \mathbb{E}_{x \sim P_{data}}[\log(D(x))] - \mathbb{E}_{c \sim P_c,\, z \sim P_{f_z(c)}}[\log(D(G(z, c)))]. \tag{3.3}
\]
In the implementation, our goal is to synthesize color, distance functions, normal vectors, etc. that are defined in a grid-like structure $X : \mathbb{R}^{H \times W \times C}$ or $X : \mathbb{R}^{H \times W \times D \times C}$, and thus we also sample grid coordinate input $C : \mathbb{R}^{H \times W \times 2}$ or $C : \mathbb{R}^{H \times W \times D \times 3}$ whose center is drawn from a uniform distribution $U(-s, s)$. The randomness of the grid center position is critical in encouraging smooth and seamless transitions between blocks of patterns, as it models the distribution of randomly sampled patches from the input pattern.
In the following sections, key modules are discussed in detail: 1) a deformable periodic encoding of the
coordinates to model stationary patterns; 2) implementation of the latent random field; 3) a conditional
variant of our objective for the synthesis of directional exemplars and controllability of the synthesized patterns; and 4) the overall network structure that completes our pattern synthesis framework.
3.2.1 Periodic Encoding
As discussed in the introduction, it is critical to represent stationary patterns from the input by a repeated
structure that avoids simply reconstructing the original exemplar. Simply mapping spatial coordinates
to the visual pattern does not satisfy this requirement: since each coordinate is unique in the real-valued
space, the network would learn to overfit the coordinates to their associated positions in the exemplar and
therefore fail to capture a repeatable structure.
The benefits of a periodic encoding of the coordinates are two-fold. Firstly, it disentangles patch-level appearance from its specific position in the exemplar, which allows the pattern synthesized by the generator to be shift-invariant. Secondly, recent advances in implicit networks [83, 109, 103] have found a Fourier feature mapping with a set of sinusoids effective in learning high-frequency signals by addressing the "spectral bias" [4] inherent to MLPs. In our work, we use the following periodic mapping for the input coordinates:
\[
\gamma(c) = \big[\cos(2^0 \pi a c),\, \sin(2^0 \pi a c),\, \cdots,\, \cos(2^i \pi a c),\, \sin(2^i \pi a c)\big], \tag{3.4}
\]
where the learnable parameter $a \in \mathbb{R}^k$ determines the period of the encoding. This design allows the
network to learn to match the period of the encoding to the repeatable structure of the exemplar. It thus
provides robustness to the scale of the patches sampled in training as such scale no longer dictates the
period of the synthesized pattern.
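A minimal PyTorch sketch of this deformable periodic encoding, with the input dimensionality and the number of frequency bands as assumed hyperparameters:

```python
import math
import torch
import torch.nn as nn

class DeformablePeriodicEncoding(nn.Module):
    """Sketch of the learnable-period Fourier encoding in Eq. (3.4)."""
    def __init__(self, in_dim=2, num_bands=5):
        super().__init__()
        self.a = nn.Parameter(torch.ones(in_dim))                        # learnable period parameter a
        self.register_buffer('freqs', (2.0 ** torch.arange(num_bands)) * math.pi)

    def forward(self, c):                                                # c: (..., in_dim) coordinates
        x = (c * self.a).unsqueeze(-1) * self.freqs                      # (..., in_dim, num_bands)
        return torch.cat([torch.cos(x), torch.sin(x)], dim=-1).flatten(-2)
```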
3.2.2 Latent Random Field
In a generative framework, latent noise sampled from a prior distribution models the variation in observations. A noise function that is smoothly defined in the real coordinate space encourages smooth transitions between the synthesized patches. In a 2D example, we start with a discrete random field that maps a uniform grid of coordinates to random variables $\{f_z(c) \mid c : \mathbb{R}^{H \times W \times 2}\}$. The discrete random field is then smoothly interpolated to form a smooth latent field (see the visualization in Figure 3.2). In our implementation, we used an exponential interpolation:
\[
f(x) = \sum_{i=1}^{K} w_i(x)\, f_z(c_i), \qquad
w_i(x) = \frac{e^{-\|x - c_i\|^2 / \sigma}}{\sum_{j=1}^{K} e^{-\|x - c_j\|^2 / \sigma}}, \tag{3.5}
\]
where the latent code at spatial position x is interpolated from K = 4 latent vectors defined at the grid
corners. In implementation, the discrete grid used to define the random field has a spacing of 1. To match
the extent of a latent factor with the learned period a, we simply scale the uniform grid of the discrete
random field accordingly.
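A simplified PyTorch sketch of this interpolation follows; for clarity it softmax-weights all grid corners by negative squared distance rather than only the K = 4 nearest ones, and the grid layout and scale are illustrative assumptions.

```python
import torch

def sample_latent_field(coords, grid, sigma=1.0):
    """Smoothly interpolate a discrete latent random field (Eq. 3.5).
    coords: (M, 2) query coordinates; grid: (Gh, Gw, d) latent vectors at integer grid corners."""
    Gh, Gw, d = grid.shape
    ys, xs = torch.meshgrid(torch.arange(Gh, dtype=coords.dtype),
                            torch.arange(Gw, dtype=coords.dtype), indexing='ij')
    corners = torch.stack([ys, xs], dim=-1).reshape(-1, 2)       # (Gh*Gw, 2) corner locations c_i
    d2 = ((coords[:, None, :] - corners[None]) ** 2).sum(-1)     # squared distances ||x - c_i||^2
    w = torch.softmax(-d2 / sigma, dim=1)                        # exponential interpolation weights
    return w @ grid.reshape(-1, d)                               # (M, d) interpolated latent codes
```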
3.2.3 Conditional IPFN
Extending our GAN objective to be conditional enables many practical applications. Assuming each input patch is paired with a guidance factor g, the conditional objective is simply an extension:
\[
\min_G \max_D \; \mathbb{E}_{x \sim P_{data}}[\log(D(x \mid g))] - \mathbb{E}_{c \sim P_c,\, z \sim P_{f_z(c)}}[\log(D(G(z, c \mid g)))]. \tag{3.6}
\]
Here we outline two applications in pattern synthesis using the conditional formulation:
• Synthesis of directional pattern: Many natural patterns have a directional distribution that is
oftentimes considered non-stationary. A typical example is a leaf texture - a midrib defines the
major direction that separates the blade regions by half (see Figure 3.4). A conditional extension of
our model is able to model patch distribution along a specified direction. For simplicity, we present
a 2D formulation that can be easily extended to 3D. With a user-defined implicit 2D line equation
$ax + by + c = 0$, the guidance factor is defined as $g(x, y) = ax + by + c$. Pixel coordinates $(p_x, p_y)$ from an input texture image with width w and height h are transformed as $(x, y) = (\frac{2p_x}{w} - 1, \frac{2p_y}{h} - 1)$ so that they are normalized to the value range of [−1, 1]. In our experiments we have found it sufficient to condition
our model on the horizontal (y = 0) and vertical direction (x = 0) for the evaluated exemplars.
• Controlling synthesis of 3D shapes: In the modeling of geometric patterns, it is often desirable
for the synthesis algorithm to provide explicit control of certain geometric properties such as density
and orientation. These geometric properties can be calculated from the exemplar shape. Let g(x)
be a shape operator that defines the geometric property of interest, our conditional model trained
with the sample pair (x, g(x)) then learns a probabilistic mapping from a guidance vector field to
the target 3D shape. An intuitive example of this application can be found in Section 3.3.5.
3.2.4 Network Structure
The overall structure of IPFN is visualized in Figure 3.2. The Generator Network G is a 10-layer MLP with
ReLU activations between layers and a sigmoid function at the end. A grid of coordinates is sampled based
on a randomly shifted center. The coordinates are then passed to two separate branches: 1) the periodic
encoder, and 2) a projection on the latent field to obtain a 5-dim latent vector for each coordinate. The latent
codes and the periodically encoded coordinates are then concatenated as input to the generator MLP, which
outputs the fake sample. The discriminator D discriminates between the generated samples and randomly
cropped patches from the real input. We implement D by following the DCGAN architecture [94] with a
stride of 2. For discriminating 3D volumes, 2D convolution layers are replaced by 3D convolution.
Figure 3.2: Overview of our network architecture discussed in Section 3.2.4.
3.3 Experiments
We hypothesize that our approach is most suitable for synthesizing texture patterns and 3D shapes with
repeating structures and local variations in appearance. To this end, we demonstrate our main results by
applying IPFN to the synthesis of 2D texture images and 3D structured shapes. In addition, IPFN is adapted
to two applications in 3D texturing and shape manipulation. To evaluate the effectiveness of the proposed
techniques, we have also conducted an ablation study where several key designs are altered.
Evaluation metric While the quality of pattern synthesis is not easily quantifiable, human eyes usually provide a reasonable qualitative assessment of whether the synthesized patterns capture the aesthetics and structure of the exemplar. In our evaluation, we present comparisons of visual results that are self-manifesting, since the synthesized patterns bear obvious characteristics of the underlying designs of the synthesizer. In addition, we provide quantifiable metrics in terms of the Single Image Fréchet Inception Distance (SIFID) and inference time and memory.
Implementation details For all of our experiments, the network is optimized under WGAN loss with
gradient penalty [39]. Adam optimizer [57] is used with a learning rate of 1e−4 for both the discriminator D
and the generator G. In each iteration, both D and G are updated for 5 steps sequentially. Input images and
volumes are randomly cropped to a smaller-scale patch. For positional encoding, we choose a bandwidth
i = 5 as a wider kernel tends to produce sinusoidal artifacts, whereas a narrower kernel produces blurry
results. The input coordinates are randomly shifted by an offset in the range [−4, 4] to accommodate the chance that the network may learn an increased period for the periodic encoding. Accordingly, noises are interpolated from a 5 × 5 grid (5³ for 3D volumes) discrete random field, where the point locations range between [−5, 5]. A single-exemplar experiment is typically trained for 12,500 iterations with a batch size of 8 and runs on a single Nvidia GTX 1080 GPU, which takes about 6-9 hours to complete. Inference Time: IPFN requires only 24 milliseconds to generate a 1024 × 1024 image. 3D volumes are generated iteratively, and a large-scale 512³ volume takes only 22.9 seconds to generate.
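For reference, one critic update under the WGAN objective with gradient penalty can be sketched as follows. D, G, and the tensor shapes are placeholders, and this is a generic sketch of the loss described above rather than the exact IPFN training code.

```python
import torch

def critic_loss(D, G, real, coords, latent, gp_weight=10.0):
    """One WGAN-GP critic term: E[D(fake)] - E[D(real)] + gp_weight * gradient penalty."""
    fake = G(latent, coords).detach()
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)      # per-sample mixing coefficient
    mix = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(mix).sum(), mix, create_graph=True)[0]
    gp = ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    return D(fake).mean() - D(real).mean() + gp_weight * gp
```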
3.3.1 Texture Pattern Synthesis
Image sources selected from the Oxford Describable Textures Dataset (DTD) [13] are shown in Figure 3.4.
Specifically, we selected two exemplars with stationary patterns (top 4 rows in Figure 3.4) and two exemplars with directional patterns (bottom 4 rows in Figure 3.4) to demonstrate that IPFN synthesizes visually similar patterns in both cases. During training, images were randomly cropped into patches of size
128 × 128. During inference, the synthesized images were scaled up four times to a size of 512 × 512.
Our results are compared to the three most relevant baseline generative methods that synthesize texture
patterns from a single image:
• Henzler et al. [43]: A method that similarly utilizes implicit network and smooth noise field for
texture synthesis. The synthesized results were obtained from running the officially released code.
• Bergmann et al. [6]: A convolutional method that combines noise with periodic signals to synthesize
stationary pattern. The synthesized results were obtained from running the officially released code.
• Zhou et al. [144]: A convolutional image expansion approach targeted for non-stationary texture
synthesis. Since [144] expands from an image input deterministically and does not utilize latent
code, only one synthesized result is shown per row. The synthesized results were obtained from the
authors.
Visual inspection is sufficient to show that IPFN provides promising results. When compared to [43],
IPFN synthesizes results with clear structure, as noise is not directly mapped to the output. While [6]
synthesizes periodic samples that display diversity across samples and similarity to the stationary exemplars, their synthesized patterns lack variation within the image. In comparison, our synthesized patterns
show a higher level of local variations and adapt well to the directional cases. [144] has provided the most
visually authentic results among the baselines. However, in the stationary cases, radial distortion is noticeable near the boundaries of its synthesized images. Moreover, without requiring image input, IPFN
provides a more direct approach to synthesizing diversified samples from random noise.
Single Image Fréchet Inception Distance (SIFID) SIFID, introduced in [98], is a metric commonly
used to assess the realism of generated images. For the computation of SIFID, we have used a patch size
of 128 × 128 in all experiments, where the synthesized patterns have the same resolution as the original
exemplars. Table 3.1 shows the SIFID comparisons between ours and the baselines in various categories
of exemplars. For Zhou et al. [144], only the generated (expanded) portion of the images were used. The
results show that our method can generate results that better resemble the distribution of the real texture
in the stationary categories (honey, crosshatch, rock) as our generated patterns receive lower SIFID scores.
For the leaf category, a typical directional pattern, Zhou et al. [144] achieves the best performance as its
method specifically targets non-stationary expansions, while our method still performs better than other
baselines that have not taken into consideration the synthesis of non-stationary patterns.
Inference Time and Memory Table 3.2 measures the inference time and memory consumption of our
network compared to the baselines when generating image at different sizes. Our implicit formulation is
shown to be significantly more efficient in both time and space, without the need to rely on the computation
of pseudo-random noise ([43]) or convolutional operations ([6, 144]). This validates our claim on the
scalability of our design.
               honey     crosshatch   rock      leaf
Henzler [43]   332.66    310.49       351.23    225.11
Bergmann [6]    62.75    177.88       120.64    164.37
Zhou [144]      14.54    154.63       118.29     38.13
Ours            10.15    130.83       113.81    103.6

Table 3.1: SIFID scores between the exemplars and the generated patterns from ours and different baselines.
Time (ms) / memory (GB)   128²          256²          512²           1024²
Henzler [43]              218 / 1.38    278 / 1.62    328 / 2.72     458 / 6.45
Bergmann [6]                7 / 2.37     13 / 5.79     42 / 19.68    115 / 31.88
Zhou [144]                356 / 1.20    349 / 1.34    510 / 2.00     612 / 4.66
Ours                        8 / 0.76     11 / 0.85     15 / 1.23      24 / 2.81

Table 3.2: Comparisons of inference time and inference memory consumption, measured in milliseconds (ms) / gigabytes (GB), when patterns of increasing size (top row) are generated.
Figure 3.3: Synthesized honeycomb textures for the ablation study. The blue boxes represent the learned scale of periodic encoding in ours, whereas in w/o deformation the period defaults to 1, which does not match the repeating structure of the honeycomb pattern and results in visual artifacts.
3.3.2 Ablation Study
To validate our design choices, we have conducted an ablation study by removing two designs that we
consider critical for our network to be effective. The comparison results are shown in Figure 3.3. w/o deformation is a network model that encodes input coordinates without the learnable parameter a described in Section 3.2.1. w/o shift is a model that is trained without randomly shifting the input coordinates. The resulting patterns are indicative of the effects of these designs: when coordinates are encoded at a fixed scale, the w/o deformation model generates hexagons that are seemingly glued together, as the presumed scale does not match the actual period of the repeating structure. The w/o shift model synthesizes
distorted patterns as we speculate that, without the random sampling of the input coordinates, the network
faces difficulty in matching the patch-based priors of the image patches.
3.3.3 Volumetric Shape Synthesis
For the evaluation of volumetric shape synthesis, we have obtained two 3D models from turbosquid.com: a porous structure (Figure 3.6.a) and a foam structure (Figure 3.6.d). The 3D meshes are preprocessed
into signed distance fields by uniformly sampling points in a volumetric grid. For the porous structure,
we have sampled 256 × 256 × 256 points and extracted 64 × 64 × 64 patches during training. For the
foam structure, we have sampled 200 × 200 × 128 points and extracted 32 × 32 × 32 patches during
training. During inference, porous structures are synthesized at their original resolution, while we scale
the synthesized foam structures to be twice as large as the original shape in the XY direction. Figure 3.6.c
and Figure 3.6.f show the synthesized shapes. For the porous structures, both outer structures and interior
structures are learned (see Figure 3.6.b for zoom-in interior views) and the structures are diversified both
across and within samples. For the foam structures, we have shown different results by varying the extent
of the latent random field (see Figure 3.6.e). A larger-scale random field encourages the synthesizer to
generate globally varied structures, whereas a smaller scale produces locally anisotropic structures.
3.3.4 Application: Seamless 3D Texturing
Due to the periodic nature of the synthesized patterns, noise manipulation allows IPFN to create textures
that are mirror-symmetric. This property provides an immediate application to seamless 3D texture mapping: in Figure 3.7, the original 9-channel texture, composed of color, normal, and bump maps, is tiled
and directly mapped to a planar surface. Due to discrepancies on the edges, as the textures are wrapped to
create the tiled patterns, the mapped surfaces show visible seams. We recreate this texture through our network under a constant latent vector (Figure 3.7 B). When repeatedly mapped to the surface, the symmetric
texture is seamless while faithfully reflecting the appearance and structure of the original texture.
3.3.5 Application: 3D Foam with controllable density
The original foam shape used in our experiment contains holes of various sizes, which corresponds to the
density of the foam structure. This geometric property g can be easily approximated with an unsigned
distance field representation. For a patch X′ of size H′ × W′ × D′, we estimate its density by:
\[
g(X') = \frac{1}{m} \sum_{i=1}^{H'} \sum_{j=1}^{W'} \sum_{k=1}^{D'} \big|\mathrm{sdf}(X'_{ijk})\big|, \tag{3.7}
\]
where m is a normalization factor. Figure 3.5 shows the synthesized foam structures obtained by gradually increasing the density factor (Figure 3.5.a) and by a linearly interpolated density map (Figure 3.5.b).
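Over a sampled SDF volume, this guidance factor amounts to a mean of absolute distance values. A small sketch, taking the normalization factor m to be the number of voxels in the patch:

```python
import numpy as np

def density_guidance(sdf_patch):
    """Eq. (3.7): average absolute signed-distance value of a patch, used as the density guidance g."""
    return np.abs(sdf_patch).mean()

# A spatially varying guidance (as in Figure 3.5.b) can be built the same way, e.g. by
# linearly interpolating between a low and a high density value along one axis of the volume.
```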
Figure 3.4: Main results for 2D texture synthesis with comparisons to Henzler et al. [43], Bergmann et
al. [6], and Zhou et al. [144] on synthesizing two stationary patterns (top four rows) and two directional
patterns (bottom four rows).
Figure 3.5: Synthesized foam structures with controllable density. a. The grey scale bar controls the
synthesized structure from the highest density (white) to the lowest (black). b. Smooth interpolation of
the guidance factor allows us to synthesize a foam structure with smoothly changing densities.
Figure 3.6: Main results for 3D volume synthesis. a. Exemplar porous structure. b. Synthesized structure models interior tunnels. c. Global views of synthesized porous structures. d. Exemplar foam structure. e. Two scales of noise fields for the foam structure synthesis. f. Synthesized foam structures. A larger scale of the noise field leads to more isotropic foam structures.
Figure 3.7: IPFN learns multi-channel textures that are applicable to seamless 3D texturing. The original 3D texture in this example is not symmetric, and therefore visible seams can be found on the texture-mapped surface and in the closeup view (A in figure). As synthesized patterns learnt from this exemplar can be tiled in any direction, the mapped surface (B in figure) is seamless.
3.4 Discussions
Figure 3.8: Examples demonstrating the limitations of our network.
The main limitation of our method is its emphasis on modeling stationary patterns. While this is based
on our observation that a broad range of natural patterns is stationary or directional, our method does not
provide a natural way to address a class of patterns that are radial, which are exemplified by web structures
and spiral patterns (see the third column in Figure 3.8).
While our conditional formulation is in theory compatible with the synthesis of landscape images, experiments found that the quality of the synthesized landscapes is subpar: while the synthesized landscape appears globally similar to the exemplar, some local regions contain "fading-out" elements that are blended with the background (see the first two columns in Figure 3.8). We speculate that this phenomenon is due to an under-representation of these elements in the exemplar.
The above limitations have inspired us to consider many potential improvements of our methods in
future works. A multi-scale synthesis approach, as demonstrated in [98, 101], strikes a good balance between learning the distribution of the global structure and the local, high-frequency details of an image. Different geometric
encoding schemes may also extend our framework to synthesize beyond stationary patterns. We believe
there are still ample opportunities for the extension of our methods to a broader range of 3D applications.
Chapter 4
Restricting the Receptive Field for Generative Image Inpainting.
Image inpainting is the task of filling the missing pixels of a masked image with appropriate contents
that are coherent with its visible regions. As a long-studied topic in computer vision, image inpainting has evolved from a restoration technique solely relying on existing information from the input image (e.g. [7]) to data-driven generative methods (e.g. [140, 131, 127, 72, 107, 79]) that hallucinate detailed contents from
not only the observable pixels but also learned, rich image priors.
Pluralistic inpainting refers to the ability of a model to generate multiple plausible results that complete
a partial image. It offers a view of image inpainting as a generative method that models the smooth distributions of the complete images given the partial image as prior information [140]. However, modeling
such distributions is challenging in the typical encoder-decoder network structures. In order to synthesize missing contents that both respect the partial image and maintain sample diversity, the decoder in
this setting takes as input two types of information: 1) features propagated from the visible regions and
2) random noise vectors sampled from a prior distribution. If the training objective is to reconstruct a
ground-truth image from a partial image, the objective itself may discourage conditioning on the random
variable. Moreover, as the training dataset contains numerous examples that only require low-level information to complete an image (e.g. smoothly interpolating a wall texture), the model may choose to ignore
the latent priors when the available image cues are strong enough to provide an answer. This phenomenon
has been found in image translation networks [50], where adding noise to generate a conditional image
does little to create pluralistic results.
In this chapter, we investigate a unique approach to pluralistic image inpainting based on the common
theme of this thesis - design of neural operators that provide spatial reasoning, in this case, in incomplete
visual data. The overall intention in this application context is to move away from the “entangling” methods in the past and towards a paradigm of separation. The term “separation” here refers to the separation
of feature analysis and content synthesis: first extracting semantic information from the partial image by restricting the receptive field of a convolutional encoder to only the visible area, and then synthesizing
the missing areas based on the learned priors.
The resulting method is based on a branch of recently developed synthesis methods known as generative transformers [29, 95, 9]. These methods synthesize images by procedurally predicting latent codes, termed "tokens", that semantically encode information of an image, analogous to sentence generation
in natural language processing [114, 23]. As tokens drawn in each step are based on a learned posterior
distribution of the previous step, generative transformers offer fine-grained control over diversifying the
synthesized contents. To adapt the generative transformer framework to target the image inpainting task,
we design a three-stage pipeline to 1) encode a partial image into discrete latent codes, 2) predict the
missing tokens with a bidirectional transformer, and 3) couple the predicted tokens with the partial image
priors and decode them into a complete image. The resulting method design fundamentally differentiates
itself from typical inpainting approaches in the past: instead of modeling the complex interaction between
the missing and observed regions, our method barely looks at the missing regions during the encoding
and token prediction stage: the designed mask-aware encoder utilizes restrictive convolutions to only operate on the visible and near-visible regions, and the bidirectional transformer only attends to the visible
tokens to make a prediction. This separation design leads to a paradigm that divides feature reasoning and
generative modeling into two separate stages, which we found beneficial for large-mask pluralistic image
inpainting.
Experiments have validated that these design choices lead to a robust, high-quality, and pluralistic
solution to challenging image inpainting settings. Our method has achieved state-of-the-art performance
both in terms of visual quality and sample diversity on two public benchmarks, Places [142] and CelebA-HQ [55].
4.1 Related Works
The term “image inpainting” was originally introduced in [7] as a propagation algorithm that fills in the
missing regions following lines near the boundaries. Following works are methodologically divided into
two branches: the diffusion-based methods [33, 2, 100, 65, 106] that aim at interpolating information in
the hole areas under local or global optimization constraints, and the patch-based methods [63, 20, 3,
18, 25] that composite candidate patches from selected regions into the hole areas. The early works are
characterized by analyzing the inner statistics of the single image provided, many of which have shown
satisfactory results for filling small holes, restoring relatively simple salient structures, and reconstructing
textures.
Image inpainting has been advanced greatly by recent works that utilize artificial neural networks
(ANN), as they introduce deep image priors to the synthesis of missing contents in a more robust and
versatile way. Notably, the perceptual loss [53, 145] and the adversarial loss [37] have been found to drastically improve the visual quality achieved by the ANN methods. Many works have proposed designs
that incorporate both local coherence and global consistency through multi-scale discriminators [48], feature rearrangement [126, 104], spectral convolution [107], attention layers [132, 75, 141, 72, 99] and GAN
inversion [128]. Mask-aware operations have also been explored in various forms, notably with partial
convolution [74], gated convolution [131] and continuously masked transformer [58]. Another relevant
direction that shares similarity to our generative method is the progressive approach [41, 69, 137, 135],
where inpainting is done step-wise via various forms of residual networks to gradually update both the
mask and the feature maps of interest.
While the majority of inpainting works (e.g. [107, 141, 86, 132, 104, 75, 126, 48, 124, 74, 127, 131])
produce deterministic results given a fixed partial image input, various techniques have been proposed
for pluralistic inpainting. [140] first studies the coupling of smooth latent priors with pixels generated
for a masked image, where the authors suggest the undesirable side effect of reconstructive inpainting on
sample diversity and propose a dual-path pipeline to better condition the generated results on both the
observed regions and the sampled priors. [76] modulates a latent vector in the image space based on a
coarse inpainting prediction. More recent works [72, 79] have shown major improvements in the visual
quality of pluralistic inpainting. MAT [72] achieves state-of-the-art inpainting quality with a unifying
design of normalized transformer, shifting attention and style code modulation. RePaint [79] applies the denoising diffusion probabilistic model to image inpainting. The diffusion model can create realistic and diverse inpainting results by iteratively denoising resampled pixels over many steps. Its limitation is the very slow sampling speed, which may be prohibitive for some real-world applications.
Our method extends from a family of methods characterized by learning priors from discrete latent
codes that are obtained from a vector-quantized autoencoder [113]. Past research in this direction has
only focused on image synthesis, from directly predicting pixels as word tokens [10], to predicting tokens
encoding visual features of larger receptive fields [113, 29]. While the pioneering works infer latent codes
autoregressively, MaskGIT [9] finds it beneficial to synthesize an image in a scattered manner with a
bidirectional transformer: in every iteration, a number of new codes are predicted in parallel and inserted
into scattered locations of the code map until the entire grid is filled. While [9] has partially adapted its
bidirectional framework to the image inpainting setting, our method design addresses several unanswered
aspects of this adaptation: how partial images can be robustly masked into latent codes, and how the latent
codes should be decoded into synthesized pixels that respect the observable area.
4.2 Method
Our method is divided into three stages to complete an input partial image. The neural network model
takes as input a partial image X_M and a mask image M specifying the area to complete. The first stage
encodes the partial image into a set of discrete tokens, referred to as latent codes, at a lower resolution
and specifies the masked tokens that need to be predicted (Section 4.2.1); the second stage utilizes a bidirectional transformer to predict the missing tokens iteratively (Section 4.2.2); and the third stage couples
the predicted tokens with features from the partial image and decodes them into a completed image (Section 4.2.3). Figure 4.1 provides a visualization of the overall pipeline of our method.
4.2.1 Encoding with Restrictive Convolutions
Our latent codes are represented by a discrete codebook of learned tokens C = {z_k}_{k=1}^K, z_k ∈ R^{n_z}, where K is the number of tokens and n_z is the number of channels of each token feature. The tokens are given a set of labels Y = {y_k}_{k=1}^K ⊂ Z. We employ the setup of VQGAN [29] to learn the codebook by training
on full images, where the token grid is 1/16 of the resolution of the image. Ideally, encoding a partial
image amounts to extracting valid tokens for the observed parts and invalid tokens, which are given a
special [MASK] token, for the masked parts. However, the convolutional nature of the VQGAN network
leads individual tokens to encode not just local information, but also information of its proximity. Directly
encoding partial images with a VQGAN encoder thus leads to degradation of the tokens, as the masked
region inevitably affects how an encoder chooses to extract tokens. Another concern is the fact that pixel-level masking (e.g., on a 256×256 image) does not directly translate to token-level masking (on a 16×16 token grid): small regions of masked pixels may still contain rich information in their neighborhood. However, directly down-sampling the mask image may lead many observable pixels to be masked out at a lower resolution, therefore discarding a fair amount of useful information. A good approach is thus needed to determine when a token should be considered masked in the encoding step.

Figure 4.1: Overall pipeline of our method. E_rst denotes our proposed restrictive encoder that predicts partial tokens from the source image (see Section 4.2.1). The grey square space in the figure denotes missing tokens, which are iteratively predicted by a bidirectional transformer (see Section 4.2.2). E_prt denotes an encoder with partial convolution layers, which processes the source image into complementary features to the predicted tokens. The coupled features are decoded into a complete image by a generator G (see Section 4.2.3).
Separating Masked Pixels The principal idea behind our encoding method is to prevent the participation of large areas of masked pixels in each convolutional network layer, controlled by a hyperparameter ratio α. To simplify the explanations, let us consider an encoder network with only non-strided convolutions and down-sampling layers. The standard partial convolution [74] widely used in past image inpainting works is characterized by a scaling of the matrix multiplication in the convolution operation and an update rule for the mask:
    x′ = W^T (X ⊙ M) · 1/sum(M) + b,   if sum(M) > 0,
    x′ = 0,                            otherwise,        (4.1)

    M(x)′ = 1,   if sum(M) > 0,
    M(x)′ = 0,   otherwise,                              (4.2)
where X denotes an n × n neighborhood in the feature map under a convolution kernel size of n, and M denotes the n × n mask in that area. ⊙ denotes element-wise multiplication. Effectively, the standard
partial convolution mitigates the impact of masked pixels on the features’ signal strength with an adaptive
scaling and propagates new features into the masked pixels as long as there are visible pixels surrounding
them. In contrast, we choose to separate the image prior learning step and the synthesis step aggressively
in two different stages. In the encoding stage, we thus propose the restrictive partial convolution that
only considers regions surrounded by a certain proportion of visible pixels:
    x′ = W^T (X ⊙ M) · 1/sum(M) + b,   if sum(M)/sum(1) ≥ α,
    x′ = 0,                            otherwise,             (4.3)

and at each down-sampling layer, we update the mask by:

    M(x)′ = 1,   if sum(M)/sum(1) ≥ α,
    M(x)′ = 0,   otherwise,                                   (4.4)
where 1 denotes an n × n constant tensor of value 1. Different from the standard partial convolution, the algorithm does not update the mask at each restrictive partial convolution layer. The changes made here prevent feature propagation into densely masked regions, subject to the α value.
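To make the behavior of Equations 4.3-4.4 concrete, below is a minimal PyTorch sketch of a restrictive partial convolution layer and the threshold-based mask down-sampling. The class and function names, the average-pooling computation of the visibility ratio, and the bias handling are illustrative assumptions of this sketch rather than the exact released implementation.

    import torch
    import torch.nn.functional as F

    class RestrictiveConv2d(torch.nn.Module):
        # Sketch of Eq. 4.3: features are produced only where the visible-pixel
        # ratio inside the kernel window is at least alpha; unlike the standard
        # partial convolution, the mask itself is not updated by this layer.
        def __init__(self, c_in, c_out, k=3, alpha=0.5):
            super().__init__()
            self.conv = torch.nn.Conv2d(c_in, c_out, k, padding=k // 2)
            self.alpha = alpha
            self.register_buffer("ones", torch.ones(1, 1, k, k))

        def forward(self, x, mask):              # mask: (B, 1, H, W), 1 = visible
            visible = F.conv2d(mask, self.ones, padding=self.conv.padding[0])
            keep = (visible / self.ones.numel() >= self.alpha).float()
            out = self.conv(x * mask)            # W^T (X ⊙ M) + b
            bias = self.conv.bias.view(1, -1, 1, 1)
            out = (out - bias) / visible.clamp(min=1.0) + bias   # 1 / sum(M) scaling
            return out * keep                    # zero out under-observed locations

    def downsample_mask(mask, alpha=0.5, factor=2):
        # Mask update at each down-sampling layer (Eq. 4.4): a low-resolution
        # location stays visible only if its visible-pixel ratio is >= alpha.
        return (F.avg_pool2d(mask, factor) >= alpha).float()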
Besides separating the masked regions from the observed features, this particular convolution design
also addresses the inevitable mismatch between the input pixel-level mask and the updated mask seen in
the much-lower-resolution feature space. By using a small α value, the encoder is designated to fill in
tokens for small regions of unseen pixels while leaving the larger regions to be predicted in the next stage
(see Section 4.2.2). Figure 4.2 provides a visual illustration of this process: the smaller the α, the more
likely that the encoder would predict a token label for local areas that are partially masked (marked by the
red-colored grid locations in the figure).

Figure 4.2: A visualization of mask down-sampling, shown on a 16×16 grid in the third column, for different α values following Equation 4.4. Smaller α values (top two rows) lead the restrictive encoder to predict tokens for more small mask areas (marked by the red pixels). Larger α is undesirable (bottom two rows) as it unnecessarily discards useful information from the image, leading to more inconsistent inpainting results.

When α is set to be larger than 0.5, more observable pixels are
“blocked out” and left to be predicted in the next stage. We have empirically found that setting α = 0.5
produces the best inpainting results.
Encoder Design Let X denote an input image and M a mask of the same size. Given a pre-trained codebook C and a VQGAN encoder E_VQ of a dataset, our encoder E(X_M) learns to predict the probability of token labels in each visible region of a partial image X_M = X ⊙ M, supervised by the “ground-truth” token labels at those locations obtained from encoding the complete image with E_VQ(X). Our encoder is constructed from the restrictive partial convolutions and self-attention layers. It processes a partial image into probability estimations on the token labels Y_M̂ = {y_i}, given the down-sampled mask M̂. The training objective is hence minimizing the negative log-likelihood:

    L_encoder = −E_{y∈Y} [ Σ_{∀i∈M̂} log p(y_i | X_M) ],    (4.5)
where y are the target tokens, and X_M is the partial image.
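As an illustration of this objective, a minimal sketch of the masked negative log-likelihood is given below, assuming PyTorch and hypothetical tensor shapes: logits of size (B, K, h, w) from the restrictive encoder, target token indices of size (B, h, w) obtained from E_VQ on the full image, and a down-sampled mask of size (B, h, w).

    import torch.nn.functional as F

    def encoder_loss(logits, target_labels, mask_ds):
        # Cross-entropy over token labels, restricted to the locations kept by
        # the down-sampled mask (Eq. 4.5); masked-out locations contribute nothing.
        nll = F.cross_entropy(logits, target_labels, reduction="none")  # (B, h, w)
        return (nll * mask_ds).sum() / mask_ds.sum().clamp(min=1.0)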
4.2.2 Predicting the Latent Codes
The restrictive encoder has thus far encoded the input image into two distinctive regions of tokens: the
visible region labeled by valid tokens D = {y_i}_M̂, and the unseen region D̄ = {m}_{1−M̂} that contains a set of [MASK] tokens m (visualized by grey blocks in Figure 4.2). A bidirectional transformer based on the BERT model [23] is used to predict the token indices for each masked location in D̄ based on the
visible set of tokens D. The transformer retrieves visual features of the visible labels from the codebook,
augments them with positional encoding, and processes them with attention layers to make independent
label predictions on each masked location.
Training the transformer Training the generative transformer is done simply by maximum likelihood estimation - learning to predict the missing labels from the available ones:

    L_transformer = −E_{y∈Y} [ Σ_{∀i∈D̄} log p(y_i | C(D)) ],    (4.6)
where the only differences from Eq. 4.5 are that the labels are predicted for the set of unseen locations D̄ and that the input to the transformer is a flattened list of visual tokens retrieved from the codebook, C(D). We train the transformer with full images and only mask the down-sampled token map during training. Specifically, during training, full images are encoded by the VQGAN encoder E_VQ to obtain a complete list of tokens. The list of tokens is then randomly masked by a ratio between 15% and 75%.
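A minimal sketch of this training-time masking step is given below; the uniform sampling of the ratio and the function name are assumptions made for illustration.

    import torch

    def random_token_mask(n_tokens, lo=0.15, hi=0.75):
        # Pick a masking ratio between 15% and 75%, then mark that many randomly
        # chosen token positions as [MASK] targets for the transformer.
        ratio = lo + (hi - lo) * torch.rand(1).item()
        n_mask = int(round(ratio * n_tokens))
        masked = torch.zeros(n_tokens, dtype=torch.bool)
        masked[torch.randperm(n_tokens)[:n_mask]] = True
        return masked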
Sampling with the transformer During inference, the missing tokens are predicted iteratively through a parallel decoding algorithm [9]. Given a number of sampling steps k = 5 and a cosine scheduling function f, the algorithm predicts labels for all missing tokens D̄ at each step i, while only choosing to keep the n = f(i) predicted tokens with the top prediction scores given by the transformer. The cosine scheduling function is chosen to ensure that Σ_{i=0}^{k} f(i) = |D̄|. As pluralistic results are desired, we sample each token by drawing from its predicted probability distribution p(y_i | C(D)).
An important modification we add to the sampling algorithm from [9] is the inclusion of an adaptive
temperature t, which scales the logits prior to the softmax function with p_i = exp(t z_i) / Σ_n exp(t z_n). The temperature controls the confidence level of the sampler: the lower the temperature, the more likely that labels
with higher confidence scores would be sampled. Empirically, we found it beneficial to start with a high
temperature and gradually anneal the temperature in each step. This encourages the sampler to introduce
more diverse tokens early on and draw with more certainty when more evidence is present. In our model,
we use a starting temperature of 1, and set the annealing factor s = 0.9, which scales the temperature
value in each sampling step. The chosen starting temperature has substantial impacts on the visual quality
and diversity of the inpainting results (see the ablation study section in Section 4.4).
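To make the sampling procedure concrete, the following is a simplified, single-image PyTorch sketch of the parallel decoding loop with an annealed temperature. The exact cosine schedule of [9], batching, and the transformer interface (assumed here to return per-position logits of shape (N, K)) are simplifying assumptions of this sketch.

    import math
    import torch

    def sample_missing_tokens(transformer, tokens, missing, steps=5, t0=1.0, s=0.9):
        # tokens:  (N,) long tensor with a MASK id at missing positions
        # missing: (N,) boolean tensor marking tokens still to be predicted
        n_total = int(missing.sum())
        t = t0
        for i in range(steps):
            logits = transformer(tokens)                        # assumed shape (N, K)
            probs = torch.softmax(t * logits, dim=-1)
            draw = torch.multinomial(probs, 1).squeeze(-1)      # stochastic draw per position
            conf = probs.gather(-1, draw.unsqueeze(-1)).squeeze(-1)
            conf[~missing] = -1.0                               # never overwrite observed tokens
            # cumulative commit budget following a cosine-shaped schedule
            frac = 1.0 - math.cos(math.pi / 2 * (i + 1) / steps)
            budget = n_total if i == steps - 1 else int(frac * n_total)
            n_new = max(budget - (n_total - int(missing.sum())), 0)
            if n_new > 0:
                idx = conf.topk(n_new).indices                  # keep the most confident draws
                tokens[idx] = draw[idx]
                missing[idx] = False
            t = t * s                                           # anneal the temperature
        return tokens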
4.2.3 Decoding the Latent Codes
Due to the discrete, quantized nature of the codebook representation, the visual tokens learned to encode an
image usually do not fully recover the original image. Stochastically sampled tokens may further alter the
global appearance of the fill-in area if they are decoded into an image directly (see Figure 4.3.A). Therefore,
compositing the generated pixels with the partial image oftentimes results in noticeable discontinuities at
the mask boundaries (see Figure 4.3.B).
In order to synthesize pixels that coherently complete the partial images, we find it necessary to couple
the quantized latent codes with smooth image priors encoded from the input partial image. Since the
smooth features are responsible for locally bridging the synthesized contents and the existing contents,
the encoder, denoted as E_prt, utilizes layers of the standard partial convolution (Eq. 4.1) to extract local features propagated to the masked regions.

Figure 4.3: A visual comparison between the decoder designs. A. Directly decoding the predicted latent codes Z with the restrictive encoder E and transformer T, and B. its composition with the source image X_M. C. Our proposed decoding design, where partial image priors E_prt(X_M) are composed with Z through a composition function f described in Equations 4.7-4.8.

The features in the masked region are combined with the
latent codes via an averaging operation:
    h_1 = (1 − M̂)(Z + E_prt(X_M)) / 2,    (4.7)
where Z is a feature map of the tokens predicted by the transformer. The recomposed features in the empty
space are then combined with features extracted from the visible area and decoded by a convolutional
generator G:
    X′ = G(h_1 + M̂ E_prt(X_M)).    (4.8)
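A compact sketch of this coupling step is given below, with hypothetical tensor shapes and with E_prt and G treated as callables; it only restates Equations 4.7-4.8 and is not the exact implementation.

    def decode_with_partial_priors(Z, x_masked, mask, mask_ds, E_prt, G):
        # Z:        (B, C, h, w) feature map of the predicted tokens
        # x_masked: (B, 3, H, W) partial image; mask: (B, 1, H, W) pixel mask
        # mask_ds:  (B, 1, h, w) down-sampled mask aligned with the token grid
        feats = E_prt(x_masked, mask)                 # smooth partial-image priors
        h1 = (1.0 - mask_ds) * (Z + feats) / 2.0      # Eq. 4.7: average inside the hole
        return G(h1 + mask_ds * feats)                # Eq. 4.8: keep encoder features outside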
The design is visualized in Figure 4.3.C. During training, the network learns to recover the ground
truth image X given the partial image X_M and a set of chosen latent codes. As we train the network with a reconstruction objective, we use the set of latent codes obtained from encoding the ground truth image X, where Z = E_VQ(X). The encoder E_prt and the generator G are optimized by a combination of the adversarial loss [37] and a reconstruction loss. The adversarial loss L_adv is formulated as:
    L_G = −E_x̂[log D(x̂)],    (4.9)

    L_D = −E_x[log D(x)] − E_x̂[log(1 − D(x̂))],    (4.10)
where x and x̂ are a pair of real and fake samples, and G, D are the generator and the discriminator. We additionally use the R1 regularization [96] of the form R_1 = E_x ∥∇D(x)∥. The LPIPS reconstruction loss function [138] is formulated as
    L_P = Σ_l (1 / (H_l W_l)) Σ_{h,w} ∥ w_l ⊙ (ŷ^l_{hw} − ŷ^l_{0,hw}) ∥²_2,    (4.11)
which computes the L2 distance between the layer activations of a pretrained VGG network [102] at each layer l. The combined loss is L_decode = L_adv + 0.1 R_1 + 0.1 L_P.
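The sketch below illustrates how these terms could be assembled in PyTorch. The discriminator is assumed to output probabilities in (0, 1), and lpips_fn stands in for an external perceptual-distance module such as that of [138]; both are assumptions of this sketch.

    import torch

    def adversarial_terms(D, x_real, x_fake, eps=1e-8):
        # Eqs. 4.9-4.10 plus the R1 term; D is assumed to output values in (0, 1).
        loss_g = -torch.log(D(x_fake) + eps).mean()                       # Eq. 4.9
        x_real = x_real.detach().requires_grad_(True)
        d_real = D(x_real)
        loss_d = (-torch.log(d_real + eps).mean()
                  - torch.log(1.0 - D(x_fake.detach()) + eps).mean())     # Eq. 4.10
        grad, = torch.autograd.grad(d_real.sum(), x_real, create_graph=True)
        r1 = grad.flatten(1).norm(dim=1).mean()                           # R_1 = E_x ||∇D(x)||
        return loss_g, loss_d, r1

    # Combined decoder objective, adding the LPIPS term of Eq. 4.11:
    #   loss_decode = loss_g + 0.1 * r1 + 0.1 * lpips_fn(x_fake, x_real).mean()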
4.3 Implementation Details
The network structures and all hyper-parameter settings are identical for both the Places model and the
CelebA-HQ model. Our training pipeline is divided into three separate stages: the training of the encoder,
the transformer, and the decoder. The free-form random masks [72] are used during the training of the
encoder and the decoder. For the pre-trained VQGAN model, we use a quantization of N = 1024 tokens,
each with a 256-channel embedding. All models are trained on 3 NVidia V100 GPUs with a batch size of
8. The Places model is trained for 20 epochs, and the CelebA-HQ model is trained for 200 epochs. The inference time of our model is around 0.4 seconds for an image on a single NVidia V100 GPU, regardless of the mask size.

Figure 4.4: Detailed network structures for the encoder and decoder. Numbers within each feature map (e.g. (3,128)) denote the input and output channels. Numbers below each feature map (e.g. 256×256) denote the size of the tensor.
Figure 4.4 shows the detailed network structure designs for the encoders and decoder used in our
method. Given input channel c_in and output channel c_out, “ResBlk” is a ResNet block composed of two convolution layers and a skip connection. The two convolutions have weight matrices W_1 ∈ R^{3×3×c_in×c_out} and W_2 ∈ R^{3×3×c_out×c_out}, respectively. Layer normalization is applied before and after the first convolution. The restrictive encoder and partial encoder described in the method part (Section 4.2) replace all convolutions with the restrictive convolution and the partial convolution, respectively. In addition, self-attention layers (denoted by “Self-Att” in the figure) are added to process the features at 32×32 and 16×16 resolution.
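For illustration, a minimal sketch of such a residual block is given below; the GroupNorm stand-in for layer normalization, the ReLU activation, and the 1×1 skip projection when channel counts differ are assumptions not specified in the text above.

    import torch
    import torch.nn as nn

    class ResBlk(nn.Module):
        # Two 3x3 convolutions with a skip connection; normalization is applied
        # before and after the first convolution, as described above.
        def __init__(self, c_in, c_out):
            super().__init__()
            self.norm1 = nn.GroupNorm(1, c_in)      # stand-in for layer normalization
            self.conv1 = nn.Conv2d(c_in, c_out, 3, padding=1)
            self.norm2 = nn.GroupNorm(1, c_out)
            self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1)
            self.skip = nn.Conv2d(c_in, c_out, 1) if c_in != c_out else nn.Identity()

        def forward(self, x):
            h = self.conv1(self.norm1(x))
            h = self.conv2(torch.relu(self.norm2(h)))
            return h + self.skip(x)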
The transformer model described in Section 4.2.2 is designed based on the minGPT transformer model (see https://github.com/karpathy/minGPT), with token embeddings and positional embeddings of 1408 channels and 40 layers of 16-head attention. During training, we have applied attention dropout and embedding dropout, both with a 10% probability.
4.4 Experiments
Dataset Our experiments are conducted on the Places365-Standard [142] and the CelebA-HQ [55], two
benchmarks widely evaluated by past image inpainting methods. The Places365-Standard dataset contains
1.8 million images for training and 36.5 thousand images for evaluation across over 205 scene categories.
The CelebA-HQ dataset is split into 24,183 training images and 2,993 test images. For both datasets, we
use an image resolution of 256 × 256. We further set three different mask settings for our experiments: 1)
small random hole, 2) large random hole, and 3) large box hole. The first two settings are directly adopted
from MAT [72] (free-form holes with strokes and boxes). The third, challenging setting uses a very large
box mask centered in the image with its width and height equal to 80% of the image size. As this setting leaves a majority of the pixels empty, we find it suitable for the evaluation of pluralistic inpainting, as well as for testing whether the inpainting method extends to extreme cases.
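A tiny sketch of how such a centered box mask could be generated is given below; the function name and tensor layout are illustrative assumptions.

    import torch

    def large_box_mask(size=256, ratio=0.8):
        # Centered box covering `ratio` of the image's width and height;
        # returns a (1, 1, size, size) mask with 1 = visible, 0 = missing.
        mask = torch.ones(1, 1, size, size)
        box = int(size * ratio)
        off = (size - box) // 2
        mask[:, :, off:off + box, off:off + box] = 0.0
        return mask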
Evaluation metric Our main objectives are to evaluate both the visual quality and the sample diversity
of the inpainted image. To this end, we opt for the perceptual metric FID [44] and the LPIPS-based diversity
score [145]. Specifically, to compute the diversity score for each dataset, we use a smaller subset of both
Places-Standard and CelebA-HQ with 1000 images. For each image, 100 inpainting samples are drawn
under the large box setting, and the diversity score for each sample is computed as the average of the
pair-wise LPIPS distances between the drawn samples. The final score shows the average of the individual
scores and their standard deviation. In addition, we provide visual examples for a qualitative evaluation
(see Figures 4.5 and 4.10, and more in the Supplements).
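The diversity score described above can be summarized by the following sketch, where lpips_fn is assumed to be an LPIPS distance module as in [145] and samples a list of inpainted results drawn for one input image.

    def diversity_score(samples, lpips_fn):
        # Average pairwise LPIPS distance among the drawn inpainting samples.
        dists = []
        for i in range(len(samples)):
            for j in range(i + 1, len(samples)):
                dists.append(lpips_fn(samples[i], samples[j]).item())
        return sum(dists) / max(len(dists), 1)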
4.4.1 Comparisons to the State of the Art
Our results are compared to four recent baseline image inpainting methods. The baseline methods are
evaluated directly with the provided pre-trained models and their public source codes: 1) MAT [72] is
the current state-of-the-art method in image inpainting that is able to produce high-quality and pluralistic
Methods             | Places FID↓ (Small / Large / Box) | Places Diversity↑ (Box) | CelebA-HQ FID↓ (Small / Large / Box) | CelebA-HQ Diversity↑ (Box)
Ours                | 1.02 / 2.82 / 13.30               | 0.29±0.06               | 2.70 / 5.04 / 12.79                  | 0.28±0.05
MAT [72]            | 1.19 / 3.32 / 17.5                | 0.26±0.04               | 2.94 / 5.16 / 15.18                  | 0.10±0.02
LaMa [107]          | 1.22 / 3.78 / 19.48               | -                       | 3.98 / 8.75 / 23.24                  | -
MaskGIT [9]         | 19.84 / 36.38 / 52.71             | 0.39±0.05               | 19.68 / 40.76 / 30.87                | 0.25±0.04
Pluralistic [140]   | 4.83 / 16.26 / 86.57              | -                       | 9.7 / 28.89 / 43.08                  | 0.18±0.02

Table 4.1: Comparisons of FID and diversity scores to the baseline methods (all images 256×256). Bold text denotes the best, and blue text denotes the second. Since LaMa [107] does not generate pluralistic results, and Pluralistic [140] produces degenerate results in the Places Box setting, we omit their diversity scores in the table.
inpainting results in challenging settings. 2) LaMa [107] is another state-of-the-art method that is characterized by the use of Fourier convolution. However, it does not generate pluralistic results; 3) MaskGIT [9]
is a latent-code-based image synthesis method that has strongly motivated our work and has been shown
to be adaptable to image inpainting; and 4) Pluralistic [140] is a seminal work in exploring pluralistic
image inpainting by coupling smooth latent priors with the partial image features.
Table 4.1 lists the quantitative evaluation results: ours outperforms the baseline methods in all settings
in terms of FID and diversity score, except for the diversity score on Places, where we rank the second best.
While MaskGIT produces more diverse results on the Places dataset, it does so at the cost of visual quality,
as the method is not designed to coherently compose the synthesized contents with the existing contents.
Figure 4.5 shows visual examples of the inpainting results under the three different mask settings. While
our results can be seen slightly better in inpainting consistency compared to the baselines, the method
truly shines in the pluralistic comparisons (lower half of the figure): while our model synthesizes pluralistic
inpainting results that are high-quality and varied in both local details and global structures, the baseline
methods either only produce locally diversified results (MAT [72], see the 4th and 7th rows in Figure 4.5) or fail to generate visually coherent results in the challenging box setting (Pluralistic [140], see the 5th and 8th rows in
Figure 4.5).
Type               Model              FID↓
Full Model                            12.81
Temperature        t = 0.1            14.62
                   t = 0.5            13.61
                   t = 2              13.96
Restrictive Conv   α = 0.25           12.9
                   α = 0.75           14.12
                   α = 1.0            15.04
Network Design     Vanilla Encoder    15.32
                   Vanilla Decoder    18.50

Table 4.2: Quantitative ablation study. “Temperature” adjustments change the temperature value in the sampling procedure. “Restrictive Conv” adjustments change the mask update rule in the restrictive encoder. “Network Design” adjustments replace our designed network structures with the vanilla ones: for the “Vanilla Encoder” setting, an encoder network with regular convolution layers is used; for the “Vanilla Decoder” setting, the predicted latent codes are directly decoded into an image. In the “Full Model”, we set t = 1.0 and α = 0.5.
4.4.2 Ablation Study
We validate several design choices of our model with an ablation study on a smaller evaluation subset of
the Places dataset with 3000 images. Table 4.2 shows the quantitative results of the ablation study based
on the FID score metric.
Effectiveness of the Restrictive Design One main question we ask when designing the restrictive
encoder is how much our model needs to separate small mask regions. In the extreme case, a one-pixel
mask can be turned into a masked token in the down-sampled token grid, thus discarding 93.74% of information around that pixel. Obviously, this behavior is not desirable and we have experimentally found that
larger α values lead to decreased performance (see Table 4.2 and Figure 4.2). Our final choice of α = 0.5
leads our encoder to complete the inpainting by itself for local regions that are less than 50% masked. On
average, the restrictive encoder achieves a label classification accuracy of 23% for labels in small mask
regions and near mask boundaries in the large mask setting (tested on the Places dataset), whereas an
encoder with the regular convolution only achieves an accuracy of 9%. Furthermore, due to the inductive
bias learned in the training process, a “wrongly” classified label does not necessarily translate to poor inpainting results. The result could be a completed image that is different from the original one, while still
visually plausible.
Effectiveness of the Sampling Function As described in Section 4.2.2, we have found the adaptive
temperature a key factor in controlling the visual quality and diversity outcome of the inpainting. Figure 4.10 provides a visual comparison between different configurations of the temperature and annealing
factor. We have observed that a larger starting temperature creates more diverse output, though at the
risk of breaking the coherence of the inpainting. An overly small temperature, on the other hand, leads our
model to mainly interpolate patterns in the masked area, resulting in large areas of homogeneous textures
in the inpainting results. We further study the effectiveness of the restrictive encoder by comparing it to a
miracle encoder that has access to the original complete image. Specifically, while our restrictive encoder
takes as input the masked image and produces latent codes by z_0 = E(M ⊙ X), the miracle encoder encodes the complete image and masks the latent codes with a down-sampled mask, with z_1 = M̂ ⊙ E(X).
Figure 4.12 provides a visual comparison between the two encoders. For the specific example in the figure,
z_0 and z_1 only share 24.8% of the encoded latent codes, as the miracle encoder manages to encode the image differently with access to the complete image. The quality of the inpainting results, however, shows little
difference, although the restrictive encoder has to infer latent codes with far less information. Quantitative
evaluation in Table 4.3 provides a comparison between the performance of the restrictive encoder and the
miracle encoder when used in the full inpainting pipeline, where we found that the designed restrictive
encoder can be nearly as effective as one that has full access to the complete images. This indicates that
meaningful inductive bias has been learnt by the encoder in the training process.
Figure 4.5: Visual examples on inpainting with both the random masks (upper half) and the challenging large box mask (lower half), compared to the selected baseline methods.
Figure 4.6: Further visual examples of inpainting under the large mask setting, compared to the baseline methods.
Figure 4.7: Further visual examples of pluralistic inpainting on the Places Dataset [142], compared to the baseline methods.
Figure 4.8: Further visual examples of pluralistic inpainting on the Places Dataset [142], compared to the baseline methods.
Figure 4.9: Further visual examples of pluralistic inpainting on the CelebA-HQ Dataset [55], compared to the baseline methods.
Figure 4.10: Comparisons of inpainting results with regard to different sampling temperatures t and annealing factors s.
Figure 4.11: Further visual examples of pluralistic inpainting with respect to different sampling temperatures t and annealing factors s (panels: t = 1.0, s = 0.9; t = 0.1, s = 1.0; t = 5, s = 0.5).
Methods        Places (256×256)
               FID↓ Small Mask   FID↓ Large Mask   Diversity↑ Box
Restrictive    1.02              2.82              0.29±0.06
Miracle        0.93              2.71              0.29±0.06

Table 4.3: Comparisons of FID and diversity scores between the restrictive encoder and the miracle encoder.

Figure 4.12: Visual comparison between inpainting with the restrictive encoder and a miracle encoder.
Figure 4.13: Failure cases in our results.
4.4.3 Limitations of our model
Our model shares limitations with previous inpainting methods in its limited ability to complete semantically salient objects such as shops and furniture, as well as people and animals, when trained on the Places
dataset (see Figure 4.13). The inference speed of our method is also slower than that of end-to-end, single-pass inpainting methods, as the majority of the computation time is spent on the iterative sampling steps. We have also left some areas unexplored: for instance, whether the aforementioned synthesis problem can be mitigated by training with semantic labels, or how the model extends to higher-resolution input.
We believe that these extensions are well within reach as past latent code synthesis methods (e.g. [29, 9])
have demonstrated such capabilities.
4.5 Discussion
In this chapter, we present a pluralistic image inpainting method that first analyzes only the visible and
near-visible regions through latent code prediction, and synthesizes the missing contents through a versatile bidirectional transformer and a reconstruction network that composes the code prediction with partial
image priors. We have validated our design choices through comparative experiments on public benchmarks and an ablation study, in which our method achieves state-of-the-art performance in both visual quality
and sample diversity.
Chapter 5
Conclusion and Future Directions
5.1 Summary of Research
With the expanding roles of neural networks in visual applications, the design of neural operators has
become a key to addressing many practical problems that arise in appropriately handling various visual
representations. My thesis established that many of these neural operator designs in computer vision are
matters of spatial reasoning - how visual information defined in a spatial domain should be aggregated
and processed. Though no amount of work would be enough to fully cover the design space of
this complex topic, this thesis presented three works that shed light on the perspective of spatial reasoning
in three different directions.
In Chapter 2, spatial reasoning in the SE(3) space is explored. In designing an SE(3) equivariant network, I have found that the effectiveness of a spatially hierarchical CNN can be reproduced even in a higher dimensional space, where the obstacle of increased computational complexity is tackled by the proposed SE(3) Separable Convolution. I have also found, through extensive evaluations, that the proposed neural network models that provide high-dimensional spatial reasoning are highly valuable in several practical 3D applications, notably scene registration and shape alignment.
In Chapter 3, spatial reasoning in encoding continuous functions is explored, where the use of a
deformable periodic encoding has been shown to bring major improvements in the modeling power of an MLP network in stationary 2D texture and 3D pattern synthesis. Combined with the scalability of the MLP network, the method is shown to produce infinitely large textures very efficiently with great visual quality, and to diversify local details with a modeled probabilistic distribution.
In Chapter 4, spatial reasoning in incomplete data is explored, from the perspective of controlling where a neural network needs to look, in this case restricting the receptive field through a novel convolutional operator. The proposed image inpainting method thus reflects a paradigm that disentangles analysis and synthesis: analyzing only the visible contents first, then synthesizing only the missing contents later. In addition, we have also explored the effectiveness of the bidirectional generative transformer in this unique
image inpainting setting, where the missing contents are predicted one step at a time to allow greater controllability in the stochasticity of the synthesis process. The overall framework can be seen as a procedural
inpainting process that excels in high-quality and pluralistic image inpainting in the most challenging settings.
5.2 Open Questions and Future Work
Faster equivariant analysis in 3D In Chapter 2, we have established that equivariance to the SE(3)
space provides benefits for CNN-based 3D analysis. The burden to maintain a 6D representation of the
3D objects, however, is still a limiting factor for architectures like ours to be deployed in many target
applications that may run on integrated systems with limited memory, or those that require real-time performance. While the aim for a purely spatial approach does reproduce an analogy of a hierarchical CNN
in the SE(3) space, I have noted that discretized rotation groups are rarely considered in many practical
settings to model rotational signal, where, instead, the spherical harmonics are widely used, for instance,
to represent lighting in a 3D scene in computer graphics. The observation therefore leads to many questions on whether there is ample room to improve the time and space efficiency of our SE(3)-equivariant
models, how these improvements can be done without compromises on its analytical power, and how a
spectral representation of rotations can be coupled with a hierarchical spatial structure. I believe that these
questions pose meaningful directions for future work on 3D shape analysis.
Generalizing the design of spatial operators in neural network While this dissertation offers a
few separate views on how certain spatial operator designs benefit visual applications, there have not yet
been enough efforts to study the global design space of neural operators, or a comparative study of the
abundant variants of neural operators proposed in the past. As neural architecture search (NAS), which
builds upon a search space of a few basic operations, has become a widely used technique to optimize
ANN’s performance in consumer products, one may wonder if such search space can be greatly expanded,
and whether it will be meaningful to expand it. Intuitively, neural operator designs may be generalized
as modifications along certain directions concerning 1) the spatial extent of features that these operators attend to, and 2) how features and learnable weights interact. In addition, the success of both deformable
convolution [19] and deformable attention [122] has suggested the feasibility of this generalization. Future
work may study a set of potential logic that can guide the assembly of spatial-aware operators and therefore
offer new insights into the search of appropriate ways to model spatial reasoning in neural networks.
Large-scale synthesis of virtual environments In Chapters 3 and 4, we have discussed generative networks that aim for large-scale synthesis of patterns with either a continuous implicit network or a bidirectional transformer that generates contents procedurally. These designs have been motivated by a larger goal of generating 3D virtual environments, for example, a virtual city from priors learned from drone-captured images of real-world landscapes. While both of the generative networks proposed in Chapters 3
and 4 show potential solutions to scalable synthesis, it remains a challenge to tackle a task as complicated
as synthesis of large-scale 3D environments. For one, the lack of global constraints or conditional information in my proposed networks would limit their ability to ensure that the synthesized environments
have a global distribution, or arrangements, of structures (e.g. downtown area surrounded by rural neighborhoods) that approximate real-world locations. We would imagine that many additional efforts would
be needed to devise a systematic pipeline for virtual world generations, which would necessarily include
further semantic analysis of the existing data and improvements on our synthesis algorithms to make them
much more consistent and controllable in modeling complex structures. These leave ample opportunities
for future works.
Bibliography
[1] Aharon Azulay and Yair Weiss. “Why do deep convolutional networks generalize so poorly to
small image transformations?” In: arXiv preprint arXiv:1805.12177 (2018).
[2] Coloma Ballester, Marcelo Bertalmio, Vicent Caselles, Guillermo Sapiro, and Joan Verdera.
“Filling-in by joint interpolation of vector fields and gray levels”. In: IEEE transactions on image
processing 10.8 (2001), pp. 1200–1211.
[3] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. “PatchMatch: A
randomized correspondence algorithm for structural image editing”. In: ACM Trans. Graph. 28.3
(2009), p. 24.
[4] Ronen Basri, Meirav Galun, Amnon Geifman, David Jacobs, Yoni Kasten, and Shira Kritchman.
“Frequency bias in neural networks for input of non-uniform density”. In: International
Conference on Machine Learning. PMLR. 2020, pp. 685–694.
[5] Yizhak Ben-Shabat, Michael Lindenbaum, and Anath Fischer. “3dmfv: Three-dimensional point
cloud classification in real-time using convolutional neural networks”. In: IEEE Robotics and
Automation Letters 3.4 (2018), pp. 3145–3152.
[6] Urs Bergmann, Nikolay Jetchev, and Roland Vollgraf. “Learning texture manifolds with the
Periodic Spatial GAN”. In: Proceedings of the 34th International Conference on Machine
Learning-Volume 70. 2017, pp. 469–477.
[7] Marcelo Bertalmio, Guillermo Sapiro, Vincent Caselles, and Coloma Ballester. “Image inpainting”.
In: Proceedings of the 27th annual conference on Computer graphics and interactive techniques. 2000,
pp. 417–424.
[8] Pravin Bhat, Stephen Ingram, and Greg Turk. “Geometric texture synthesis by example”. In:
Proceedings of the 2004 Eurographics/ACM SIGGRAPH symposium on Geometry processing. 2004,
pp. 41–44.
[9] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. “Maskgit: Masked
generative image transformer”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. 2022, pp. 11315–11325.
[10] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever.
“Generative pretraining from pixels”. In: International conference on machine learning. PMLR.
2020, pp. 1691–1703.
[11] Weikai Chen, Xiaolong Zhang, Shiqing Xin, Yang Xia, Sylvain Lefebvre, and Wenping Wang.
“Synthesis of filigrees for digital fabrication”. In: ACM Transactions on Graphics (TOG) 35.4 (2016),
pp. 1–13.
[12] François Chollet. “Xception: Deep learning with depthwise separable convolutions”. In:
Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 1251–1258.
[13] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi.
“Describing textures in the wild”. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. 2014, pp. 3606–3613.
[14] Taco Cohen and Max Welling. “Group equivariant convolutional networks”. In: International
conference on machine learning. 2016, pp. 2990–2999.
[15] Taco S Cohen, Maurice Weiler, Berkay Kicanaoglu, and Max Welling. “Gauge equivariant
convolutional networks and the icosahedral cnn”. In: arXiv preprint arXiv:1902.04615 (2019).
[16] Taco S Cohen and Max Welling. “Steerable cnns”. In: arXiv preprint arXiv:1612.08498 (2016).
[17] Taco S. Cohen, Mario Geiger, Jonas Koehler, and Max Welling. “Spherical CNNs”. In: (2018).
Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018. url:
http://arxiv.org/abs/1801.10130.
[18] Antonio Criminisi, Patrick Perez, and Kentaro Toyama. “Object removal by exemplar-based
inpainting”. In: 2003 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, 2003. Proceedings. Vol. 2. IEEE. 2003, pp. II–II.
[19] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei.
“Deformable convolutional networks”. In: Proceedings of the IEEE international conference on
computer vision. 2017, pp. 764–773.
[20] Soheil Darabi, Eli Shechtman, Connelly Barnes, Dan B Goldman, and Pradeep Sen. “Image
melding: Combining inconsistent images using patch-based synthesis”. In: ACM Transactions on
graphics (TOG) 31.4 (2012), pp. 1–10.
[21] Haowen Deng, Tolga Birdal, and Slobodan Ilic. “Ppf-foldnet: Unsupervised learning of rotation
invariant 3d local descriptors”. In: Proceedings of the European Conference on Computer Vision
(ECCV). 2018, pp. 602–618.
[22] Haowen Deng, Tolga Birdal, and Slobodan Ilic. “Ppfnet: Global context aware local features for
robust 3d point matching”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 2018, pp. 195–205.
[23] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “Bert: Pre-training of deep
bidirectional transformers for language understanding”. In: arXiv preprint arXiv:1810.04805 (2018).
[24] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. “Image super-resolution using
deep convolutional networks”. In: IEEE transactions on pattern analysis and machine intelligence
38.2 (2015), pp. 295–307.
[25] Iddo Drori, Daniel Cohen-Or, and Hezy Yeshurun. “Fragment-based image completion”. In: ACM
SIGGRAPH 2003 Papers. 2003, pp. 303–312.
[26] Alexei A Efros and William T Freeman. “Image quilting for texture synthesis and transfer”. In:
Proceedings of the 28th annual conference on Computer graphics and interactive techniques. 2001,
pp. 341–346.
[27] Alexei A Efros and Thomas K Leung. “Texture synthesis by non-parametric sampling”. In:
Proceedings of the seventh IEEE international conference on computer vision. Vol. 2. IEEE. 1999,
pp. 1033–1038.
[28] Gil Elbaz, Tamar Avraham, and Anath Fischer. “3D point cloud registration for localization using
a deep neural network auto-encoder”. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition. 2017, pp. 4631–4640.
[29] Patrick Esser, Robin Rombach, and Bjorn Ommer. “Taming transformers for high-resolution
image synthesis”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition. 2021, pp. 12873–12883.
[30] Carlos Esteves, Christine Allen-Blanchette, Ameesh Makadia, and Kostas Daniilidis. “Learning so
(3) equivariant representations with spherical cnns”. In: Proceedings of the European Conference on
Computer Vision (ECCV). 2018, pp. 52–68.
[31] Carlos Esteves, Christine Allen-Blanchette, Xiaowei Zhou, and Kostas Daniilidis. “Polar
transformer networks”. In: arXiv preprint arXiv:1709.01889 (2017).
[32] Carlos Esteves, Yinshuang Xu, Christine Allen-Blanchette, and Kostas Daniilidis. “Equivariant
multi-view networks”. In: Proceedings of the IEEE International Conference on Computer Vision.
2019, pp. 1568–1577.
[33] Mohamed-Jalal Fadili, J-L Starck, and Fionn Murtagh. “Inpainting and zooming using sparse
representations”. In: The Computer Journal 52.1 (2009), pp. 64–79.
[34] Martin A Fischler and Robert C Bolles. “Random sample consensus: a paradigm for model fitting
with applications to image analysis and automated cartography”. In: Communications of the ACM
24.6 (1981), pp. 381–395.
[35] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. “Rich feature hierarchies for
accurate object detection and semantic segmentation”. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. 2014, pp. 580–587.
[36] Zan Gojcic, Caifa Zhou, Jan D Wegner, and Andreas Wieser. “The perfect match: 3d point cloud
matching with smoothed densities”. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. 2019, pp. 5545–5554.
[37] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville, and Yoshua Bengio. “Generative adversarial nets”. In: Advances in neural
information processing systems 27 (2014).
[38] Fabian Groh, Patrick Wieschollek, and Hendrik PA Lensch. “Flex-Convolution”. In: Asian
Conference on Computer Vision. Springer. 2018, pp. 105–122.
[39] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C Courville.
“Improved Training of Wasserstein GANs”. In: NIPS. 2017.
[40] Yulan Guo, Hanyun Wang, Qingyong Hu, Hao Liu, Li Liu, and Mohammed Bennamoun. “Deep
learning for 3d point clouds: A survey”. In: IEEE transactions on pattern analysis and machine
intelligence (2020).
[41] Zongyu Guo, Zhibo Chen, Tao Yu, Jiale Chen, and Sen Liu. “Progressive image inpainting with
full-resolution residual network”. In: Proceedings of the 27th acm international conference on
multimedia. 2019, pp. 2496–2504.
[42] Lars Kai Hansen and Peter Salamon. “Neural network ensembles”. In: IEEE transactions on pattern
analysis and machine intelligence 12.10 (1990), pp. 993–1001.
[43] Philipp Henzler, Niloy J Mitra, and Tobias Ritschel. “Learning a neural 3d texture space from 2d
exemplars”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 2020, pp. 8356–8364.
[44] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
“Gans trained by a two time-scale update rule converge to a local nash equilibrium”. In: Advances
in neural information processing systems 30 (2017).
[45] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang,
Tobias Weyand, Marco Andreetto, and Hartwig Adam. “Mobilenets: Efficient convolutional
neural networks for mobile vision applications”. In: arXiv preprint arXiv:1704.04861 (2017).
[46] Binh-Son Hua, Minh-Khoi Tran, and Sai-Kit Yeung. “Pointwise convolutional neural networks”.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018,
pp. 984–993.
[47] Haibin Huang, Evangelos Kalogerakis, Siddhartha Chaudhuri, Duygu Ceylan, Vladimir G Kim,
and Ersin Yumer. “Learning local shape descriptors from part correspondences with multiview
convolutional networks”. In: ACM Transactions on Graphics (TOG) 37.1 (2018), p. 6.
[48] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. “Globally and locally consistent image
completion”. In: ACM Transactions on Graphics (ToG) 36.4 (2017), pp. 1–14.
[49] Takashi Ijiri, Radomír Mech, Takeo Igarashi, and Gavin Miller. “An example-based procedural
system for element arrangement”. In: Computer Graphics Forum. Vol. 27. 2. Wiley Online Library.
2008, pp. 429–436.
[50] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. “Image-to-image translation with
conditional adversarial networks”. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. 2017, pp. 1125–1134.
[51] Nikolay Jetchev, Urs Bergmann, and Roland Vollgraf. “Texture synthesis with spatial generative
adversarial networks”. In: arXiv preprint arXiv:1611.08207 (2016).
[52] Chiyu Jiang, Jingwei Huang, Karthik Kashinath, Philip Marcus, Matthias Niessner, et al.
“Spherical cnns on unstructured grids”. In: arXiv preprint arXiv:1901.02039 (2019).
[53] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. “Perceptual losses for real-time style transfer and
super-resolution”. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The
Netherlands, October 11-14, 2016, Proceedings, Part II 14. Springer. 2016, pp. 694–711.
[54] Asako Kanezaki, Yasuyuki Matsushita, and Yoshifumi Nishida. “Rotationnet: Joint object
categorization and pose estimation using multiviews from unsupervised viewpoints”. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018,
pp. 5010–5019.
[55] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. “Progressive Growing of GANs for
Improved Quality, Stability, and Variation”. In: International Conference on Learning
Representations. 2018.
[56] Marc Khoury, Qian-Yi Zhou, and Vladlen Koltun. “Learning compact geometric features”. In:
Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 153–161.
[57] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv
preprint arXiv:1412.6980 (2014).
[58] Keunsoo Ko and Chang-Su Kim. “Continuously masked transformer for image inpainting”. In:
Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023, pp. 13169–13178.
[59] Johannes Kopf, Chi-Wing Fu, Daniel Cohen-Or, Oliver Deussen, Dani Lischinski, and
Tien-Tsin Wong. “Solid texture synthesis from 2d exemplars”. In: (2007), 2–es.
[60] Vivek Kwatra, Irfan Essa, Aaron Bobick, and Nipun Kwatra. “Texture optimization for
example-based synthesis”. In: ACM SIGGRAPH 2005 Papers. 2005, pp. 795–802.
[61] Vivek Kwatra, Arno Schödl, Irfan Essa, Greg Turk, and Aaron Bobick. “Graphcut textures: Image
and video synthesis using graph cuts”. In: Acm transactions on graphics (tog) 22.3 (2003),
pp. 277–286.
[62] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. “Gradient-based learning applied
to document recognition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324.
[63] Joo Ho Lee, Inchang Choi, and Min H Kim. “Laplacian patch-based image synthesis”. In:
Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 2727–2735.
[64] Jan Eric Lenssen, Matthias Fey, and Pascal Libuschewski. “Group equivariant capsule networks”.
In: Advances in Neural Information Processing Systems. 2018, pp. 8844–8853.
[65] Levin and Zomet. “Learning how to inpaint from global image statistics”. In: Proceedings Ninth
IEEE international conference on computer vision. IEEE. 2003, pp. 305–312.
[66] Chuan Li and Michael Wand. “Precomputed real-time texture synthesis with markovian
generative adversarial networks”. In: European conference on computer vision. Springer. 2016,
pp. 702–716.
[67] Jiaxin Li, Yingcai Bi, and Gim Hee Lee. “Discrete Rotation Equivariance for Point Cloud
Recognition”. In: arXiv preprint arXiv:1904.00319 (2019).
[68] Jiaxin Li, Ben M Chen, and Gim Hee Lee. “So-net: Self-organizing network for point cloud
analysis”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018,
pp. 9397–9406.
[69] Jingyuan Li, Ning Wang, Lefei Zhang, Bo Du, and Dacheng Tao. “Recurrent feature reasoning for
image inpainting”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition. 2020, pp. 7760–7768.
[70] Junying Li, Zichen Yang, Haifeng Liu, and Deng Cai. “Deep rotation equivariant network”. In:
Neurocomputing 290 (2018), pp. 26–33.
[71] Lei Li, Siyu Zhu, Hongbo Fu, Ping Tan, and Chiew-Lan Tai. “End-to-End Learning Local
Multi-view Descriptors for 3D Point Clouds”. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. 2020, pp. 1919–1928.
[72] Wenbo Li, Zhe Lin, Kun Zhou, Lu Qi, Yi Wang, and Jiaya Jia. “Mat: Mask-aware transformer for
large hole image inpainting”. In: Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition. 2022, pp. 10758–10768.
[73] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. “Sparse
convolutional neural networks”. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. 2015, pp. 806–814.
[74] Guilin Liu, Fitsum A Reda, Kevin J Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro.
“Image inpainting for irregular holes using partial convolutions”. In: Proceedings of the European
conference on computer vision (ECCV). 2018, pp. 85–100.
[75] Hongyu Liu, Bin Jiang, Yi Xiao, and Chao Yang. “Coherent semantic attention for image
inpainting”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019,
pp. 4170–4179.
[76] Hongyu Liu, Ziyu Wan, Wei Huang, Yibing Song, Xintong Han, and Jing Liao. “Pd-gan:
Probabilistic diverse gan for image inpainting”. In: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition. 2021, pp. 9371–9381.
[77] Shichen Liu, Shunsuke Saito, Weikai Chen, and Hao Li. “Learning to Infer Implicit Surfaces
without 3D Supervision”. In: Advances in Neural Information Processing Systems 32 (2019),
pp. 8295–8306.
[78] Xinhai Liu, Zhizhong Han, Yu-Shen Liu, and Matthias Zwicker. “Point2sequence: Learning the
shape representation of 3d point clouds with an attention-based sequence to sequence network”.
In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019, pp. 8778–8785.
[79] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool.
“Repaint: Inpainting using denoising diffusion probabilistic models”. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, pp. 11461–11471.
[80] Diego Marcos, Michele Volpi, Nikos Komodakis, and Devis Tuia. “Rotation equivariant vector
field networks”. In: Proceedings of the IEEE International Conference on Computer Vision. 2017,
pp. 5048–5057.
[81] Daniel Maturana and Sebastian Scherer. “Voxnet: A 3d convolutional neural network for
real-time object recognition”. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS). IEEE. 2015, pp. 922–928.
[82] Paul Merrell. “Example-based model synthesis”. In: Proceedings of the 2007 symposium on
Interactive 3D graphics and games. 2007, pp. 105–112.
[83] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and
Ren Ng. “Nerf: Representing scenes as neural radiance fields for view synthesis”. In: European
conference on computer vision. Springer. 2020, pp. 405–421.
[84] Jonathan Palacios, Chongyang Ma, Weikai Chen, Li-Yi Wei, and Eugene Zhang. “Tensor field
design in volumes”. In: SIGGRAPH ASIA 2016 Technical Briefs. 2016, pp. 1–4.
[85] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove.
“Deepsdf: Learning continuous signed distance functions for shape representation”. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019,
pp. 165–174.
[86] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. “Context
encoders: Feature learning by inpainting”. In: Proceedings of the IEEE conference on computer
vision and pattern recognition. 2016, pp. 2536–2544.
[87] Ken Perlin. “An image synthesizer”. In: ACM Siggraph Computer Graphics 19.3 (1985), pp. 287–296.
[88] Ken Perlin. “Improving noise”. In: Proceedings of the 29th annual conference on Computer graphics
and interactive techniques. 2002, pp. 681–682.
[89] Tiziano Portenier, Siavash Arjomand Bigdeli, and Orcun Goksel. “GramGAN: Deep 3D Texture
Synthesis From 2D Exemplars”. In: Advances in Neural Information Processing Systems 33 (2020).
[90] Javier Portilla and Eero P Simoncelli. “A parametric texture model based on joint statistics of
complex wavelet coefficients”. In: International journal of computer vision 40.1 (2000), pp. 49–70.
89
[91] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. “Pointnet: Deep learning on point sets
for 3d classification and segmentation”. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition. 2017, pp. 652–660.
[92] Charles R Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J Guibas.
“Volumetric and multi-view cnns for object classification on 3d data”. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. 2016, pp. 5648–5656.
[93] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. “Pointnet++: Deep hierarchical
feature learning on point sets in a metric space”. In: Advances in neural information processing
systems. 2017, pp. 5099–5108.
[94] Alec Radford, Luke Metz, and Soumith Chintala. “Unsupervised representation learning with
deep convolutional generative adversarial networks”. In: arXiv preprint arXiv:1511.06434 (2015).
[95] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. “Generating diverse high-fidelity images with
vq-vae-2”. In: Advances in neural information processing systems 32 (2019).
[96] Andrew Ross and Finale Doshi-Velez. “Improving the adversarial robustness and interpretability
of deep neural networks by regularizing their input gradients”. In: Proceedings of the AAAI
conference on artificial intelligence. Vol. 32. 1. 2018.
[97] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li.
“Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization”. In:
Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 2304–2314.
[98] Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. “Singan: Learning a generative model from
a single natural image”. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision. 2019, pp. 4570–4580.
[99] Pourya Shamsolmoali, Masoumeh Zareapoor, and Eric Granger. “Transinpaint:
Transformer-based image inpainting with context adaptation”. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2023, pp. 849–858.
[100] Jianhong Shen and Tony F Chan. “Mathematical models for local nontexture inpaintings”. In:
SIAM Journal on Applied Mathematics 62.3 (2002), pp. 1019–1043.
[101] Assaf Shocher, Shai Bagon, Phillip Isola, and Michal Irani. “Ingan: Capturing and retargeting the"
dna" of a natural image”. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision. 2019, pp. 4492–4501.
[102] Karen Simonyan and Andrew Zisserman. “Very deep convolutional networks for large-scale
image recognition”. In: arXiv preprint arXiv:1409.1556 (2014).
[103] Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein.
“Implicit neural representations with periodic activation functions”. In: Advances in Neural
Information Processing Systems 33 (2020).
90
[104] Yuhang Song, Chao Yang, Zhe Lin, Xiaofeng Liu, Qin Huang, Hao Li, and C-C Jay Kuo.
“Contextual-based image inpainting: Infer, match, and translate”. In: Proceedings of the European
conference on computer vision (ECCV). 2018, pp. 3–19.
[105] Riccardo Spezialetti, Samuele Salti, and Luigi Di Stefano. “Learning an Effective Equivariant 3D
Descriptor Without Supervision”. In: Proceedings of the IEEE International Conference on Computer
Vision. 2019, pp. 6401–6410.
[106] Jian Sun, Lu Yuan, Jiaya Jia, and Heung-Yeung Shum. “Image completion with structure
propagation”. In: ACM SIGGRAPH 2005 Papers. 2005, pp. 861–868.
[107] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha,
Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky.
“Resolution-robust Large Mask Inpainting with Fourier Convolutions”. In: arXiv preprint
arXiv:2109.07161 (2021).
[108] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. “Going deeper with convolutions”.
In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, pp. 1–9.
[109] Matthew Tancik, Pratul P Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan,
Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T Barron, and Ren Ng. “Fourier features let
networks learn high frequency functions in low dimensional domains”. In: arXiv preprint
arXiv:2006.10739 (2020).
[110] Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette,
and Leonidas J Guibas. “KPConv: Flexible and Deformable Convolution for Point Clouds”. In:
arXiv preprint arXiv:1904.08889 (2019).
[111] Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and
Patrick Riley. “Tensor field networks: Rotation-and translation-equivariant neural networks for
3d point clouds”. In: arXiv preprint arXiv:1802.08219 (2018).
[112] Federico Tombari, Samuele Salti, and Luigi Di Stefano. “Unique shape context for 3D data
description”. In: Proceedings of the ACM workshop on 3D object retrieval. ACM. 2010, pp. 57–62.
[113] Aaron Van Den Oord, Oriol Vinyals, et al. “Neural discrete representation learning”. In: Advances
in neural information processing systems 30 (2017).
[114] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need”. In: Advances in neural information
processing systems 30 (2017).
[115] Li-Yi Wei. “Texture synthesis from multiple sources”. In: ACM Siggraph 2003 Sketches &
Applications. 2003, pp. 1–1.
[116] Maurice Weiler, Mario Geiger, Max Welling, Wouter Boomsma, and Taco Cohen. “3d steerable
cnns: Learning rotationally equivariant features in volumetric data”. In: Advances in Neural
Information Processing Systems. 2018b, pp. 10381–10392.
91
[117] Maurice Weiler, Fred A Hamprecht, and Martin Storath. “Learning steerable filters for rotation
equivariant CNNs”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 2018, pp. 849–858.
[118] Steven Worley. “A cellular texture basis function”. In: Proceedings of the 23rd annual conference on
Computer graphics and interactive techniques. 1996, pp. 291–294.
[119] Daniel Worrall and Gabriel Brostow. “Cubenet: Equivariance to 3d rotation and translation”. In:
Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 567–584.
[120] Daniel E Worrall, Stephan J Garbin, Daniyar Turmukhambetov, and Gabriel J Brostow.
“Harmonic networks: Deep translation and rotation equivariance”. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 2017, pp. 5028–5037.
[121] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and
Jianxiong Xiao. “3d shapenets: A deep representation for volumetric shapes”. In: Proceedings of
the IEEE conference on computer vision and pattern recognition. 2015, pp. 1912–1920.
[122] Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, and Gao Huang. “Vision transformer with
deformable attention”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition. 2022, pp. 4794–4803.
[123] Wenqi Xian, Patsorn Sangkloy, Varun Agrawal, Amit Raj, Jingwan Lu, Chen Fang, Fisher Yu, and
James Hays. “Texturegan: Controlling deep image synthesis with texture patches”. In: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 8456–8465.
[124] Wei Xiong, Jiahui Yu, Zhe Lin, Jimei Yang, Xin Lu, Connelly Barnes, and Jiebo Luo.
“Foreground-aware image inpainting”. In: Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition. 2019, pp. 5840–5848.
[125] Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, and Yu Qiao. “Spidercnn: Deep learning on point
sets with parameterized convolutional filters”. In: Proceedings of the European Conference on
Computer Vision (ECCV). 2018, pp. 87–102.
[126] Zhaoyi Yan, Xiaoming Li, Mu Li, Wangmeng Zuo, and Shiguang Shan. “Shift-net: Image
inpainting via deep feature rearrangement”. In: Proceedings of the European conference on
computer vision (ECCV). 2018, pp. 1–17.
[127] Zili Yi, Qiang Tang, Shekoofeh Azizi, Daesik Jang, and Zhan Xu. “Contextual residual aggregation
for ultra high-resolution image inpainting”. In: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition. 2020, pp. 7508–7517.
[128] Ahmet Burak Yildirim, Hamza Pehlivan, Bahri Batuhan Bilecen, and Aysegul Dundar. “Diverse
inpainting and editing with gan inversion”. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision. 2023, pp. 23120–23130.
[129] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. “pixelnerf: Neural radiance fields
from one or few images”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. 2021, pp. 4578–4587.
92
[130] F Yu. “Multi-scale context aggregation by dilated convolutions”. In: arXiv preprint
arXiv:1511.07122 (2015).
[131] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. “Free-form image
inpainting with gated convolution”. In: Proceedings of the IEEE/CVF international conference on
computer vision. 2019, pp. 4471–4480.
[132] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. “Generative image
inpainting with contextual attention”. In: Proceedings of the IEEE conference on computer vision
and pattern recognition. 2018, pp. 5505–5514.
[133] Matthew D Zeiler, Dilip Krishnan, Graham W Taylor, and Rob Fergus. “Deconvolutional
networks”. In: 2010 IEEE Computer Society Conference on computer vision and pattern recognition.
IEEE. 2010, pp. 2528–2535.
[134] Andy Zeng, Shuran Song, Matthias Nießner, Matthew Fisher, Jianxiong Xiao, and
Thomas Funkhouser. “3DMatch: Learning Local Geometric Descriptors from RGB-D
Reconstructions”. In: CVPR. 2017.
[135] Yu Zeng, Zhe Lin, Jimei Yang, Jianming Zhang, Eli Shechtman, and Huchuan Lu. “High-resolution
image inpainting with iterative confidence feedback and guided upsampling”. In: Computer
Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part
XIX 16. Springer. 2020, pp. 1–17.
[136] Eugene Zhang, Konstantin Mischaikow, and Greg Turk. “Vector field design on surfaces”. In:
ACM Transactions on Graphics (ToG) 25.4 (2006), pp. 1294–1326.
[137] Haoran Zhang, Zhenzhen Hu, Changzhi Luo, Wangmeng Zuo, and Meng Wang. “Semantic image
inpainting with progressive generative networks”. In: Proceedings of the 26th ACM international
conference on Multimedia. 2018, pp. 1939–1947.
[138] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. “The unreasonable
effectiveness of deep features as a perceptual metric”. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. 2018, pp. 586–595.
[139] Yongheng Zhao, Tolga Birdal, Jan Eric Lenssen, Emanuele Menegatti, Leonidas Guibas, and
Federico Tombari. “Quaternion Equivariant Capsule Networks for 3D Point Clouds”. In: arXiv
preprint arXiv:1912.12098 (2019).
[140] Chuanxia Zheng, Tat-Jen Cham, and Jianfei Cai. “Pluralistic image completion”. In: Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 1438–1447.
[141] Chuanxia Zheng, Tat-Jen Cham, Jianfei Cai, and Dinh Phung. “Bridging global context
interactions for high-fidelity image completion”. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. 2022, pp. 11512–11522.
[142] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. “Places: A 10
million image database for scene recognition”. In: IEEE transactions on pattern analysis and
machine intelligence 40.6 (2017), pp. 1452–1464.
93
[143] Kun Zhou, Xin Huang, Xi Wang, Yiying Tong, Mathieu Desbrun, Baining Guo, and
Heung-Yeung Shum. “Mesh quilting for geometric texture synthesis”. In: ACM SIGGRAPH 2006
Papers. 2006, pp. 690–697.
[144] Yang Zhou, Zhen Zhu, Xiang Bai, Dani Lischinski, Daniel Cohen-Or, and Hui Huang.
“Non-stationary texture synthesis by adversarial expansion”. In: ACM Transactions on Graphics
(TOG) 37.4 (2018), pp. 1–13.
[145] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and
Eli Shechtman. “Toward multimodal image-to-image translation”. In: Advances in neural
information processing systems 30 (2017).
[146] Minghan Zhu, Maani Ghaffari, William A Clark, and Huei Peng. “E2pn: Efficient se
(3)-equivariant point network”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. 2023, pp. 1223–1232.
[147] Song-Chun Zhu, Cheng-En Guo, Yizhou Wang, and Zijian Xu. “What are textons?” In:
International Journal of Computer Vision 62 (2005), pp. 121–143.
94
Abstract
All visual data, from images to CAD models, live in a 2D or 3D spatial domain. To understand and model such data, spatial reasoning has always been fundamental to computer vision algorithms, and this practice extends naturally to the artificial neural networks built for visual analysis. The basic building blocks of a neural network - operators and representations - are the means by which spatial relationships are learned, and they are therefore built with spatial properties. In this thesis, we present novel designs of neural operators and representations in different application contexts, with a particular focus on how these design choices shape the spatial properties of the networks in ways that benefit the tasks at hand.
The first topic is the equivariance property, for which an SE(3)-equivariant convolutional network is designed for 3D pose estimation and scene registration. In this chapter, we show that the equivariance of a convolutional neural network can be practically extended to a higher-dimensional space and proves highly effective for applications that are sensitive not only to translation but also to 3D rotation.
The second topic is the learning of neural operators that approximate spatially continuous functions in the context of pattern synthesis. In this chapter, we explore the combination of a deformable periodic encoding and a continuous latent space, which enables an implicit network consisting of multilayer perceptrons to synthesize diverse, high-quality, and infinitely large 2D and 3D patterns. This formulation makes the generative model at least 10 times faster and more memory efficient than previous efforts, and marks one of the earliest attempts to adopt implicit networks in the generative setting.
The third topic is spatial awareness with regard to incomplete images, where a generative network for image inpainting is designed around an analysis-after-synthesis principle. In this model, a novel encoder restricts the receptive field in the analysis step, and the extracted features serve as priors to a bidirectional generative transformer that synthesizes latent codes step by step. This paradigm demonstrates the effectiveness of disentangling analysis and synthesis in challenging inpainting scenarios: the resulting model achieves state-of-the-art performance in both diversity and quality when completing partial images with free-form holes covering as much as 70% of the image.
I believe that the topics covered contribute to a better understanding of neural operator and representation designs for both discriminative and generative learning in computer vision, from the perspective of identifying effective forms of spatial reasoning for the targeted visual applications.
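As a concrete illustration of the second topic, the sketch below shows how a periodic coordinate encoding can feed a small multilayer perceptron that maps arbitrary (x, y) coordinates and a latent code to a color value, so that a pattern can be queried over an unbounded domain. This is a minimal sketch in PyTorch under assumed design choices: the class names (PeriodicEncoding, ImplicitPatternMLP), layer sizes, frequency schedule, and latent conditioning are illustrative only and do not reproduce the deformable periodic encoding or the exact architecture developed in this thesis.

# Illustrative sketch only: a periodic coordinate encoding feeding a small MLP,
# in the spirit of the implicit pattern synthesizer summarized in the abstract.
# Layer sizes, frequency choices, and latent conditioning are assumptions.
import math
import torch
import torch.nn as nn


class PeriodicEncoding(nn.Module):
    """Map raw (x, y) coordinates to sin/cos features at several frequencies."""

    def __init__(self, num_frequencies: int = 6):
        super().__init__()
        # Geometrically spaced frequencies; the periodicity is what lets the
        # network evaluate the pattern over an unbounded coordinate domain.
        freqs = (2.0 ** torch.arange(num_frequencies, dtype=torch.float32)) * math.pi
        self.register_buffer("freqs", freqs)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, 2) -> features: (N, 2 * 2 * num_frequencies)
        angles = coords.unsqueeze(-1) * self.freqs                # (N, 2, F)
        feats = torch.cat([angles.sin(), angles.cos()], dim=-1)   # (N, 2, 2F)
        return feats.flatten(start_dim=1)


class ImplicitPatternMLP(nn.Module):
    """MLP mapping encoded coordinates plus a latent code to an RGB value."""

    def __init__(self, latent_dim: int = 64, num_frequencies: int = 6, hidden: int = 128):
        super().__init__()
        self.encoding = PeriodicEncoding(num_frequencies)
        in_dim = 2 * 2 * num_frequencies + latent_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),                   # RGB in [0, 1]
        )

    def forward(self, coords: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([self.encoding(coords), latent], dim=-1))


if __name__ == "__main__":
    model = ImplicitPatternMLP()
    # Query an arbitrary 32x32 window of the (unbounded) coordinate domain.
    ys, xs = torch.meshgrid(torch.linspace(0.0, 4.0, 32),
                            torch.linspace(0.0, 4.0, 32), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)         # (1024, 2)
    latent = torch.randn(1, 64).expand(coords.shape[0], -1)       # shared pattern code
    rgb = model(coords, latent)                                   # (1024, 3)
    print(rgb.shape)

The design point mirrored here is that the generator is a continuous function of coordinates: synthesis cost scales with the number of queried points rather than with a fixed output resolution, which is what makes arbitrarily large patterns and the reported speed and memory gains possible.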