GREEN IMAGE GENERATION AND LABEL TRANSFER TECHNIQUES
by
Xuejing Lei
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
May 2024
Copyright 2024 Xuejing Lei
Table of Contents
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Significance of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Explainable, Efficient and Lightweight Texture Generation . . . . . . . . . . . . . . 4
1.2.2 Image Generative Modeling via Successive Subspace Learning . . . . . . . . . . . . 5
1.2.3 Green Image Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.4 Green Image Label Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Chapter 2: Background Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 Image Generative Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Texture Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Early-stage Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Deep-Learning-based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Natural Image Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Adversarial Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.2 Representative Non-Adversarial Models . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.3 Autoregressive Models and Transformer-based Models . . . . . . . . . . . . . . . . 20
2.3.4 Denoising Diffusion Probability Models . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Image Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4.1 Traditional Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4.2 Deep-learning-based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Successive Subspace Learning and Green Learning . . . . . . . . . . . . . . . . . . . . . . . 22
Chapter 3: TGHop: An Explainable, Efficient and Lightweight Method for Texture Generation . . . 25
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Successive Subspace Analysis and Generation . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 TGHop Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.2 Fine-to-Coarse Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.3 Core Sample Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.4 Coarse-to-Fine Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.2 An Example: Brick Wall Generation . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.3 Performance Benchmarking with DL-based Methods . . . . . . . . . . . . . . . . . 38
3.4.3.1 Visual Quality Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4.3.2 Comparison of Generation Time . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.4 Comparison of Model Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Chapter 4: GENHOP: An Image Generation Method Based on Successive Subspace Learning . . . . 49
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Review of Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.1 DL-based Generative Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.2 Successive Subspace Learning (SSL) . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Proposed GenHop Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3.1 Module 1: High-to-Low Dimension Reduction . . . . . . . . . . . . . . . . . . . . . 52
4.3.2 Module 2: Seed Image Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3.3 Module 3: Low-to-High Dimension Expansion . . . . . . . . . . . . . . . . . . . . . 55
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4.2 Performance Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4.3 Generated Exemplary Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4.4 Visualization of Intermediate Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4.5 Effect of Training Sample Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4.6 Model Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Chapter 5: Green Image Generation (GIG): Methodologies and Preliminary Results . . . . . . . . . 68
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2 Review of Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2.1 Deep-Learning based Generative Modeling . . . . . . . . . . . . . . . . . . . . . . 71
5.2.2 Green-learning-based generative models . . . . . . . . . . . . . . . . . . . . . . . . 73
5.3 Proposed GIG Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3.2 Forward Decomposition Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.2.1 Whitening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.2.2 Seed Distribution Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.3 Reverse Generation Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3.3.1 Seed Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3.3.2 Coarse-to-Fine Image Generation . . . . . . . . . . . . . . . . . . . . . . 80
5.3.3.3 Image Refinement in Fine Grids . . . . . . . . . . . . . . . . . . . . . . . 81
5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4.1 Performance Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4.2 Visualization of Image Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.4.3 Model Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Chapter 6: Green Image Label Transfer (GILT): Methodologies and Preliminary Results . . . . . . . 88
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2.1 Traditional Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2.2 Deep-learning based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.3 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.3.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.3.2 Joint Discriminant Subspace Learning . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.3.2.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.3.2.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.3.2.3 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.3.3 Source-to-Target Label Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.3.4 Weakly Supervised Learning in the Target Domain . . . . . . . . . . . . . . . . . . 100
6.4 Preliminary Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.4.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.4.2 FLOPs and Model Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Chapter 7: Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.1 Summary of the Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.2 Future Research Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.2.1 Style Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.2.2 Unpaired Image-to-Image Translation . . . . . . . . . . . . . . . . . . . . . . . . . 109
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
List of Tables
3.1 The settings of four generation models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Comparison of time needed to generate one texture image. . . . . . . . . . . . . . . . . . . 39
3.3 The time of three processes in our method. . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4 The number of parameters of TGHop, under the setting of γ = 0.01, N = 50, K1 = 9, K2 = 22, Dr = 909, F = 2,518 and W = 200. . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 The reduced dimension, Dr, and the model size as a function of threshold γ used in SDR. . 42
4.1 Comparison of FID scores of the GenHop model and representative adversarial and
non-adversarial models. The lowest FID scores are shown in bold, while the second-lowest
FID scores are underlined. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2 The number of parameters of GenHop for MNIST, Fashion-MNIST and CelebA. . . . . . . 62
5.1 Comparison of FID scores of our GIG model and representative GL-based and DL-based
models. The lowest FID scores are shown in bold, while the second lowest FID scores are
underlined. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2 Comparison of FLOPs and model size of our proposed method and two deep-learning-based
methods for MNIST-to-USPS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.1 The class labeling accuracy for the 63.4% of data points in the target domain that are labeled after source-to-target label transfer. The source-only column represents the accuracy when we train only with source data and apply the model to the target domain, which labels 23.7% of the target data in label transfer. The results are from MNIST-to-USPS transfer. . . . . . . . . . . . 99
6.2 Accuracy on digit classification for MNIST-to-USPS transfer. . . . . . . . . . . . . . . . . . 101
6.3 FLOPs and model size of our proposed method for MNIST-to-USPS. . . . . . . . . . . . . . 102
6.4 Comparison of FLOPs and model size of our proposed method and two deep-learning-based
methods for MNIST-to-USPS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
List of Figures
1.1 Illustration of discriminative models and generative models. Discriminative models depict
the likelihood that a class label is assigned to a variable, while generative models directly
describe the likelihood of data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 The comparison of (a) texture generation, (b) image generation and (c) image label
transfer. Texture generation takes an exemplary texture as the input and generates
texture images. Natural image generation takes a set of images, resembles its data
distribution, and generates samples from the distribution. Image label transfer attempts to
transfer knowledge from one domain to another domain, which is also known as domain
adaptation for image classification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 The categories of generative models in terms of density function used by a model. . . . . . 11
2.2 An illustration of the image quilting method [27]. An overview of the method is shown in
(a). The results obtained by the method are given in (b). The defect on an apple occurring
many times in the generated images indicates the limited diversity of the method. . . . . . 12
2.3 An overview of the method in [33], which applies the Gram matrix as the statistical
measurement and iteratively optimizes an initial white-noise input image through
back-propagation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Results of using different feature maps on texture synthesis in [117]. It indicates that
textures of comparable quality can be generated without using a pre-trained VGG network. 14
2.5 Result of using statistics of shallow layers, conv1_1, pool_1 and pool_2, in [33]. Generating
texture only with shallow layers does not give good results. . . . . . . . . . . . . . . . . . 15
2.6 GAN [36] trains a generator and discriminator in turns until the generator can generate
high-quality synthetic images so that the discriminator cannot distinguish them from real
ones, yielding the same distribution between data examples and generated samples. . . . . 17
2.7 An example of linear interpolation and linear arithmetic properties of GANs given in [93]. 18
2.8 An overview of GLANN [46]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.9 An illustration of c/w Saab transform in [16]. . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Illustration of successive subspace analysis and generation, where a sequence of subspaces S1, . . . , Sn is constructed from the source space S0 through a successive process indicated by blue arrows, while red arrows indicate the successive subspace generation process. . . . . . . . . . . . 27
3.2 An overview of the proposed TGHop method. A number of patches are collected from the exemplary texture image, forming source space S0. Subspaces S1 and S2 are constructed through analysis models E^1_0 and E^2_1. Input filter window sizes to Hop-1 and Hop-2 are denoted as I0 and I1. Selected channel numbers of Hop-1 and Hop-2 are denoted as K1 and K2. A block of size Ii × Ii of Ki channels in space/subspace Si is converted to the same spatial location of Ki+1 channels in subspace Si+1. Red arrows indicate the generation process, beginning from core sample generation followed by coarse-to-fine generation. The model for core sample generation is denoted as G2, and the models for coarse-to-fine generation are denoted as G^1_2 and G^0_1. . . . . . . . . . . . 30
3.3 Generated grass texture image with and without spatial dimension reduction (SDR). . . . . 32
3.4 Illustration of the interval representation, where the length of a segment in the unit interval represents the probability of a cluster, pi. A random number is generated in the unit interval to indicate the cluster index. . . . . . . . . . . . 34
3.5 Illustration of the generation process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6 Examples of generated DC maps in core S2 (first column), generated samples in subspace
S1 (second column), and the ultimate generated textures in source S0 (third column). . . . 36
3.7 Examples of generated brick_wall texture patches and stitched images of larger sizes,
where the image in the bottom-left corner is the exemplary texture image, and the patches
highlighted in red squared boxes are unseen patterns. . . . . . . . . . . . . . . . . . . . . . 45
3.8 Generated patches using different settings, where the numbers below the figure indicate
the dimensions of S0, S1 and S2, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.9 Comparison of texture images generated by two DL-based methods and TGHop (from
left to right): exemplary texture images, texture images generated by [33], by [117], two
examples by TGHop without spatial dimension reduction (SDR) and one example by
TGHop with SDR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.10 More texture images generated by TGHop. . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.11 Generated brick_wall patches using different cluster numbers in independent component
histogram matching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.12 Generated brick_wall patches using different threshold γ values in SDR. . . . . . . . . . . 48
4.1 An overview of the GenHop method. A sequence of high-to-low dimensional subspaces
is constructed from source image space with two PixelHop++ units. GenHop contains
three modules: 1) High-to-Low Dimension Reduction, 2) Seed Image Generation, and 3)
Low-to-High Dimension Expansion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 Illustration of seed image generation in the lowest-dimensional subspace. . . . . . . . . . 53
4.3 Various representative samples from (a) MNIST and (b) Fashion-MNIST dataset. . . . . . . 54
4.4 Illustration of fine-tuning low-frequency and generating high-frequency. . . . . . . . . . . 56
4.5 The detailed procedure of fine-tuning low-frequency (LF) and generating high-frequency
(HF) using locally linear embedding (LLE). . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.6 An illustration of PQR channels obtained by pixel-wise PCA. Figures from left to right
are the original RGB image, RGB image reconstructed from P and Q channels, P channel
whose variance is 0.226, Q channel with variance 0.014 and R channel with variance 0.002. 57
4.7 Exemplary images generated by GenHop with training samples from the MNIST dataset. . 63
4.8 Exemplary images generated by GenHop with training samples from the Fashion-MNIST
dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.9 Exemplary images generated by GenHop with training samples from the CelebA dataset. . 65
4.10 Illustration of the image generation process from core S4 (the first row) to subspace S2
(the second row), to source S0 before local detail generation (the third row) and after
LLE-based fine-tuning (the fourth row) for (a) MNIST, (b) Fashion-MNIST, and (c) CelebA
(P channel). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.11 Comparison of FID scores as a function of training percentages between GenHop, WGAN,
GLANN, and GLO for (a) MNIST and (b) Fashion-MNIST. . . . . . . . . . . . . . . . . . . . 67
5.1 The generation pipelines of GenHop and GIG. . . . . . . . . . . . . . . . . . . . . . . . 68
5.2 Forward Decomposition Process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3 Inverse Generation Process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4 Cluster-wise whitening and coloring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.5 Seed Generation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.6 Conditional Detail Generation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.7 Some example images generated by GIG for (a) MNIST and (b) Fashion-MNIST. . . . . . . 87
6.1 An overview of our proposed GILT method. . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.2 An illustration of the significance of the preprocessing step at the image pixel level: (a) images in the MNIST dataset, (b) images in the USPS dataset, and (c) aligned images in the USPS dataset. . . . . . . . . . . . 95
6.3 The pairwise cosine similarity among features before and after feature selection. . . . . . 97
6.4 Examples of entropy change as the radius of a cluster increases. . . . . . . . . . . . . . . . 98
6.5 An example histogram of the source and target distributions. The figures from left to right
represent the source histogram, target histogram, and matched target histogram of the
first dimension in 8-D feature vector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.1 Examples of (a) style transfer in [70] and (b) general purpose image-to-image translation
in [51]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.2 Overview of the training process in DualGAN [127]. . . . . . . . . . . . . . . . . . . . . . . 110
Abstract
Image generative modeling has been a long-standing problem that has received increasing attention. Generative models try to learn an estimate of an unknown distribution and generate new instances by sampling from it. There is a resurgence of interest in generative modeling due to the performance breakthrough achieved by deep learning (DL) in the last decade. Various DL models, including Generative Adversarial Networks (GANs), Convolutional Neural Networks (CNNs), transformers, and denoising diffusion models, yield visually pleasing results in image generation and texture generation. These models optimize the parameters in a network architecture through back-propagation. Nevertheless, these models are usually large in size, intractable to explain in theory, and computationally expensive in training or generation. Developing an efficient generation method that offers a small model size, mathematical transparency, and efficiency in training and generation has attracted more and more attention.
In this dissertation, we design generative modeling solutions based on the green learning concept for texture synthesis, image generation, and image label transfer. We first present an explainable, efficient, and lightweight solution for texture synthesis. Then, we propose a novel generative pipeline based on the successive subspace learning methodology and examine it on a natural image generation task; the method is named GenHop. We then reformulate GenHop to improve its efficiency and reduce its model size, and propose Green Image Generation (GIG). To demonstrate the generalization ability of our generative modeling concept, we finally adapt it to an image label transfer task and propose a method called Green Image Label Transfer (GILT) for unsupervised domain adaptation.
Texture generation takes a texture image as an example and produces diverse texture images. Existing methods can generate texture images of good quality but with limited diversity; they are also large in model size, difficult to explain in theory, and computationally expensive to train. We propose an explainable, efficient, and lightweight method for texture generation, called TGHop (an acronym for Texture Generation PixelHop), to avoid these drawbacks.
Different from texture generation, which includes only one image as the input, natural image generation takes a set of diverse images as the input and generates samples from the learned distribution. We propose a novel image generative modeling pipeline that contains two procedures: fine-to-coarse dimension reduction and coarse-to-fine image generation. Our solution, called GenHop, can generate visually pleasant images whose FID scores are comparable with those of DL-based generative models on the MNIST, Fashion-MNIST, and CelebA datasets. GenHop has a small model size and can learn stable sample distributions with few training samples. Building upon GenHop, we propose a new method for green image generation called GIG, which uses a Mixture of Factor Analyzers (MFA) and XGBoost regressors to reduce the model size and avoid memory waste.
We further investigate our generative paradigm on transfer learning. For unsupervised domain adaptation (DA), which adapts a model trained on a source domain to work on a related target domain, we address the limitations of deep-learning-based DA models, which can be complex and prone to overfitting, especially when the amount of target domain data is limited. We introduce a new method for unsupervised domain adaptation that emphasizes model efficiency and explainability, called Green Image Label Transfer (GILT).
Chapter 1
Introduction
1.1 Significance of the Research
Generative models learn the probability distribution of data and can generate new instances. Generative modeling is generally more challenging than discriminative modeling. Discriminative models depict the likelihood that a class label is assigned to a variable, while generative models directly describe the likelihood of data. For example, given an image where a boat is on the sea, a discriminative model only needs to distinguish "boat" from "not boat", while a generative model needs to capture correlations such as "things that look like boats are likely to appear near things that look like water". An illustration of their difference is given in Fig. 1.1. Generative models take a training set that is drawn from an unknown distribution pdata and learn to represent an estimate of that distribution in a certain way, yielding a probability distribution pmodel. Some generative models define an explicit density function to represent pmodel, such as Gaussian Mixture Models (GMMs), and some can generate samples as well, such as the variational auto-encoder (VAE) [56, 95]. Some generative models do not possess an explicit density function but are able to resemble pdata and generate samples from pmodel. One example is the Generative Adversarial Network (GAN) [36].
Research on image generative modeling has attracted much attention in the machine learning community for decades. It is of significance for the following reasons. First, training and sampling from generative
models is beneficial to analyzing and manipulating high-dimensional data, which is ubiquitous in various
Figure 1.1: Illustration of discriminative models and generative models. Discriminative models depict
the likelihood that a class label is assigned to a variable, while generative models directly describe the
likelihood of data.
domains. Second, generative models provide an effective way to deal with unlabeled data. Generative
modeling on these unlabeled data alleviates the burden of acquiring numerous labels and improves the
generalization of a method. Third, generative models can handle the case that one input corresponds to
multiple acceptable outputs. One example is to predict the next frame in video prediction. Last but not least,
generating good samples is intrinsically required in some tasks, such as image and video super-resolution.
This thesis focuses on image generative modeling for two image generation tasks, texture generation and
natural image generation, and one image label transfer task, unsupervised domain adaptation. Texture
generation takes a texture image as an example and produces diverse texture images [33, 116, 74, 117] or
textures on 3D objects [120, 88] as shown in Fig. 1.2 (a). Different from texture generation, which has only
one image as the input, natural image generation takes a set of images, resembles the data distribution
pdata with a model probability distribution pmodel, and generates samples from the model distribution, as
shown in Fig. 1.2 (b). Image label transfer aims to transfer knowledge from one domain to another, as
shown in Fig. 1.2 (c).
Traditional methods can generate visually pleasing textures but are limited in generating realistic natural images. Early works on texture generation generate textures in pixel space [20, 27, 123, 26, 76, 18, 64] or in a feature space [40, 92]. They can generate texture images of good quality but with limited diversity. Few attempts were made at natural image generation in the early stage. However, some generative models were proposed for density estimation, such as Gaussian Mixture Models [94], or for face recognition, such as Eigenfaces [113]. There is a resurgence of interest in generative modeling due to the performance breakthrough achieved by deep learning (DL) in the last decade. Various DL models, including Generative Adversarial Networks (GANs) and Convolutional Neural Networks (CNNs), yield visually pleasing results in image generation and texture generation. These models optimize the parameters in a network through backpropagation. Nevertheless, these models are usually large in model size, intractable to explain in theory, computationally expensive in training, and sensitive to random initialization, network architecture, and hyper-parameters. It is desirable to develop a new generation method that is small in model size, mathematically transparent, efficient in training and inference, and able to provide comparable results at the same time.
(a) Texture Generation (b) Natural Image Generation
(c) Image Label Transfer (Domain Adaptation for Image Classification)
Figure 1.2: The comparison of (a) texture generation, (b) image generation, and (c) image label transfer. Texture generation takes an exemplary texture as the input and generates texture images. Natural image generation takes a set of images, resembles its data distribution, and generates samples from the distribution. Image label transfer attempts to transfer knowledge from one domain to another domain, which is also known as domain adaptation for image classification.
A mathematical theory named successive subspace learning (SSL) targets finding a mathematically transparent solution with low complexity. Kuo et al. [59, 58, 60, 62] proposed two affine transforms that determine a sequence of joint spatial-spectral representations with different spatial/spectral trade-offs to characterize the global structure and local details of images at the same time. They are the Saak transform [60] and the Saab transform [62]. PixelHop [16] consists of multi-stage Saab transforms in cascade. As a variant of the Saab transform, the channel-wise (c/w) Saab transform [16] exploits weak correlations among spectral channels and applies the Saab transform to each channel individually to reduce the model size without sacrificing performance, which is known as PixelHop++ [16]. SSL has been further developed into Green Learning [61] and successfully applied to many application domains. Examples include [12, 130, 52, 80, 132, 133, 131, 53, 84, 111, 99, 98]. In this thesis, we design generative modeling solutions based on the green learning concept for texture generation, image generation, and image label transfer. We first present an explainable, efficient, and lightweight solution for texture generation and then propose a novel image generative modeling pipeline, named GenHop, and examine it on a natural image generation task. We reformulate GenHop to improve its efficiency and reduce its model size, and propose Green Image Generation (GIG). To demonstrate the generalization ability of our generative modeling solution, we finally adapt it to the image label transfer task and propose a method called Green Image Label Transfer (GILT).
1.2 Contributions of the Research
1.2.1 Explainable, Efficient and Lightweight Texture Generation
Automatic generation of visually pleasant texture that resembles exemplary texture is of theoretical interest in texture analysis and modeling and of practical interest for a wide variety of applications. Although
the synthesis of visually pleasant texture can be achieved by deep neural networks, the associated models
are large in size, difficult to explain in theory, and computationally expensive in training. We propose a
new method called NITES and its extension called TGHop to tackle these defects.
• We propose a successive subspace analysis and generation pipeline for generation tasks.
• Given an exemplary texture, numerous sample patches are cropped out of it to form a collection
of sample patches called the source. Pixel statistics of samples from the source are analyzed, and a
sequence of fine-to-coarse subspaces for these patches is obtained using the PixelHop++ architecture [16].
• To generate realistic texture patches, samples are generated in the coarsest subspace, which is called the core, by matching the distributions of real and generated samples. Spatial pixels are then generated from the spectral coefficients, proceeding from coarse to fine subspaces. Texture patches are finally stitched to form texture images of a large size (see the sketch after this list).
• Extensive experiments are conducted to show that TGHop can generate texture images of superior
quality with small model sizes and at a fast speed.
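The sketch below illustrates the analysis-then-generation flow outlined above under heavy simplifications: plain PCA stands in for the PixelHop++/Saab transforms, and a Gaussian mixture stands in for the core distribution matching. It is not the TGHop implementation, only a minimal, runnable illustration of the fine-to-coarse / coarse-to-fine idea; all names and sizes are illustrative.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def extract_patches(texture, patch=32, stride=8):
    # Crop overlapping patches from one exemplary texture; they form the source space S0.
    H, W, C = texture.shape
    rows = []
    for r in range(0, H - patch + 1, stride):
        for c in range(0, W - patch + 1, stride):
            rows.append(texture[r:r + patch, c:c + patch].reshape(-1))
    return np.stack(rows)

patches = extract_patches(np.random.rand(256, 256, 3))       # placeholder exemplary texture
hop1 = PCA(n_components=256).fit(patches)                    # fine-to-coarse: S0 -> S1
s1 = hop1.transform(patches)
hop2 = PCA(n_components=64).fit(s1)                          # fine-to-coarse: S1 -> S2 (the core)
core = hop2.transform(s1)

core_model = GaussianMixture(n_components=10, random_state=0).fit(core)   # match the core distribution
core_new, _ = core_model.sample(16)                          # core sample generation
s1_new = hop2.inverse_transform(core_new)                    # coarse-to-fine: S2 -> S1
patches_new = hop1.inverse_transform(s1_new).reshape(-1, 32, 32, 3)       # S1 -> S0 texture patches

In the actual method, the Saab/PixelHop++ transforms replace PCA, the core distribution is matched more carefully (e.g., with independent-component histogram matching), and the generated patches are stitched into larger texture images.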
1.2.2 Image Generative Modeling via Successive Subspace Learning
Image generative modeling has been a long-standing problem that has received increasing attention. It
estimates the density of data and generates new samples. Deep-learning (DL) based models like generative
adversarial networks (GANs) offer impressive solutions to natural image generation. GANs suffer from
training difficulty, mode collapse problems, and mathematical intractability. We propose a new image
generative modeling pipeline called GenHop to avoid this problem, learn the data distribution, and generate
samples from the model distribution.
• For image analysis, we construct fine-to-coarse spatial-spectral subspaces from the input space using the PixelHop++ architecture [16]. The dimensions of these subspaces are non-increasing, yielding
the smallest dimension in the coarsest subspace. The samples in the coarsest subspace preserve the
global structure of source images. We obtain sample distributions in each subspace.
• By modeling the sample distribution in the coarsest subspace, new samples are drawn from the
distribution that resembles the global structure of source images in the generation step.
• For coarse-to-fine generation, we apply locally linear embedding (LLE) to fine-tune the generated samples and then generate fine details (see the sketch after this list).
• GenHop can generate visually pleasant images whose FID scores are comparable with those of deep-learning-based generative models on the MNIST, Fashion-MNIST, and CelebA datasets. GenHop has a small model size and can learn stable sample distributions with a small number of training samples.
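A minimal sketch of the LLE-style fine-tuning/detail step mentioned in the coarse-to-fine bullet above is given below. Each generated coarse code is expressed as a locally linear combination of its nearest training coarse codes, and the same weights are applied to the paired fine-detail vectors. The variable names (coarse_gen, coarse_train, fine_train) are illustrative assumptions, not the thesis implementation.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def lle_details(coarse_gen, coarse_train, fine_train, k=8, reg=1e-3):
    # coarse_gen: generated coarse codes; coarse_train/fine_train: paired training data.
    nbrs = NearestNeighbors(n_neighbors=k).fit(coarse_train)
    _, idx = nbrs.kneighbors(coarse_gen)
    out = np.zeros((len(coarse_gen), fine_train.shape[1]))
    for i, q in enumerate(coarse_gen):
        Z = coarse_train[idx[i]] - q                   # neighbors centered at the query
        C = Z @ Z.T                                    # local k x k Gram matrix
        C += reg * (np.trace(C) + 1e-12) * np.eye(k)   # regularize for numerical stability
        w = np.linalg.solve(C, np.ones(k))
        w /= w.sum()                                   # LLE reconstruction weights
        out[i] = w @ fine_train[idx[i]]                # transfer the weights to fine details
    return out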
1.2.3 Green Image Generation
We address two limitations of the previous methods and propose a new green image generation method
named GIG. Similar to previous methods, this method transforms data to a lower-dimensional white noise
(i.e., latent space) through an analysis process while generating samples through a reverse generation
process.
• To further reduce the dimension of the seed and thus reduce the model size, we employ a Mixture of Factor Analyzers (MFA) for seed learning and generation.
• To avoid using locally linear embedding (LLE) for detail generation, which results in memory waste and slow generation speed, we propose replacing LLE with an XGBoost regressor (see the sketch after this list).
• GIG offers an explainable, efficient, and high-performance solution for image generative modeling.
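The sketch below illustrates the two ideas above with loudly labeled stand-ins: since scikit-learn ships no Mixture of Factor Analyzers, a full-covariance Gaussian mixture approximates the MFA seed model, and a multi-output XGBoost regressor maps low-dimensional seeds to fine details in place of LLE. All array shapes and names are illustrative, not the GIG implementation.

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
seeds_train = rng.standard_normal((1000, 16))        # low-dimensional seeds from the forward process
details_train = rng.standard_normal((1000, 64))      # paired fine-detail targets

seed_model = GaussianMixture(n_components=5, random_state=0).fit(seeds_train)    # seed distribution learning
detail_model = MultiOutputRegressor(
    XGBRegressor(n_estimators=50, max_depth=4)).fit(seeds_train, details_train)  # regressor-based detail model

new_seeds, _ = seed_model.sample(8)                  # seed generation
new_details = detail_model.predict(new_seeds)        # detail generation without a stored training bank

A regressor trained once replaces the nearest-neighbor search over stored training samples that LLE requires, which is the memory and speed advantage the bullets above refer to.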
1.2.4 Green Image Label Transfer
We focus on unsupervised domain adaptation (DA), which adapts a model trained on a source domain to
work on a related target domain. We address the limitations in deep learning-based DA models, which
can be complex and prone to overfitting, especially when the amount of target domain data is limited. We
introduce a new method called Green Image Label Transfer (GILT) for unsupervised domain adaptation
that emphasizes model efficiency and explainability.
• GILT does not rely on neural networks to learn domain-invariant features, which increases the explainability of the method.
• GILT has a module that can learn a joint discriminant subspace to identify and choose features that
are invariant in both the source and target domains.
• GILT can transfer class labels from the source domain to the target domain.
• GILT can learn the labels for unlabeled data from the target domain in a weakly supervised manner.
• Experiments on MNIST-to-USPS transfer demonstrate the potential of our proposed method in terms
of comparable accuracy on digit classification, small model sizes, and efficient inference speed.
1.3 Organization
The rest of the thesis is organized as follows. In Chapter 2, we review the background, including image generative modeling, texture synthesis, natural image generation, image label transfer, and successive
subspace learning. In Chapter 3, we propose an explainable, efficient, and lightweight method for texture
generation, which can generate texture images of superior quality with small model sizes and at a fast
speed. In Chapter 4, we propose a novel image generative modeling pipeline that can generate visually
pleasant images and learn stable sample distributions with a small number of training samples. In Chapter 5, we propose the Green Image Generation (GIG) method, which improves efficiency and reduces model size. In Chapter 6, we generalize our green image generative model to image label transfer and propose
the Green Image Label Transfer (GILT) method. In Chapter 7, we summarize the research and provide
some future topics.
Chapter 2
Background Review
In this chapter, we give a background introduction related to our research. First, we introduce different types of image generative modeling. Then, we discuss methods in texture generation, including early-stage work and recent deep-learning-based methods. Furthermore, we introduce the generative models examined on the natural image generation task, including adversarial models (i.e., Generative Adversarial Networks (GANs)) and non-adversarial models. We also review traditional and deep-learning-based methods for image transfer learning. Finally, we summarize the research on successive subspace learning and green learning.
2.1 Image Generative Modeling
Image generative modeling aims to represent the distribution of image data, denoted as p(x), with a model.
Samples can be generated from the model distribution. Unconditional generative models directly find p(x), while conditional generative models capture p(x, y) or p(x|y), where y can be class labels or attribute labels. Early works approximate an unknown probability distribution p(x) by a mixture of continuous distributions such as Gaussian distributions. Examples include Gaussian Mixture Modeling (GMM) and Parzen window density estimation. These models demand explicit probability density functions as the prior and estimate the model parameters by maximizing the likelihood function. However, the Gaussian distribution assumption does not always hold for image data. As a result, image samples drawn from the distribution
have limited resolution and quality. Since natural image generation was challenging in the early days, various methods were proposed and adopted for texture generation. One example is sampling a patch based on its neighborhood via non-parametric density function estimation [76]. Non-parametric models [27, 29, 39] often search in a database of existing images and have no parameters to optimize. In contrast, parametric models [40, 92] estimate parameters based on existing image data. With the development of neural networks and deep learning, parametric models based on neural networks have gradually become prevalent.
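As a small, concrete example of the explicit-density models mentioned above (an illustration, not code from this thesis), the snippet below fits a Gaussian mixture and a Parzen-window (kernel) density estimator to toy 2-D data, evaluates log-likelihoods, and draws new samples from the mixture.

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KernelDensity

x = np.random.rand(500, 2)                           # toy 2-D data standing in for image features
gmm = GaussianMixture(n_components=3, random_state=0).fit(x)   # explicit parametric density
kde = KernelDensity(bandwidth=0.1).fit(x)            # Parzen-window (non-parametric) estimate

print(gmm.score_samples(x[:3]))                      # log p_model(x) under the GMM
print(kde.score_samples(x[:3]))                      # log p_model(x) under the KDE
samples, _ = gmm.sample(5)                           # new samples drawn from the learned density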
In terms of the density function used by a model, generative models can be categorized into two groups:
explicit density models and implicit density models, as illustrated in Fig. 2.1. Explicit density models define an explicit density function, which can be a tractable density [30] or an approximate density. Fully
visible belief networks (FVBN) [30] and non-linear independent components analysis (Nonlinear ICA) [21,
23] are two popular approaches for modeling tractable density. FVBN [30] decomposes a probability distribution over a vector into a product of one-dimensional probability distributions with the chain rule of
probability. WaveNet [89], PixelRNN [119], and PixelCNN [91, 103] are built upon this idea. These models suffer from computational inefficiency. Nonlinear ICA [21, 23] converts the probability distribution of the
source space into that of a vector of latent variables through continuous, nonlinear transformations. The
invertible restriction on the transformations requires the two spaces to be of the same dimension. To avoid
the drawbacks of tractable density models, the approximation of the explicit density function is used in
some methods, which can be categorized into variational approximations [56, 95] and Markov chain approximations [43, 42]. The most popular approach using variational approximations is the variational auto-encoder (VAE) [56]. It has an encoder-decoder structure with an objective to maximize the evidence lower bound of the marginal likelihood among all data samples. However, a too-weak approximate
posterior distribution or a too-weak prior distribution results in poor performance. Models using Markov
chain approximations such as Boltzmann machines [43, 42] suffer from slow convergence and higher computational costs. Implicit density models do not explicitly define a density function. They can be trained
through adversarial learning [36] or Markov chains [6]. Markov-chain-based models perform unsatisfactorily in high-dimensional spaces and demand high computational costs. GANs avoid these problems and achieve state-of-the-art performance. The details of GANs are discussed in Sec. 2.3.
Figure 2.1: The categories of generative models in terms of density function used by a model.
2.2 Texture Generation
2.2.1 Early-stage Methods
Texture generation (or synthesis) has been a long-standing problem of great interest. The methods for it
can be categorized into two types. The first type generates one pixel or one patch at a time and grows
synthesized texture from small to large regions. The pixel-based method synthesizes a center pixel conditioned on its neighboring pixels. Efros and Leung [27] proposed to synthesize a pixel by randomly choosing
from the pixels that have a similar neighborhood as the query pixel. Patch-based methods [26, 76, 18, 64]
usually achieve higher quality than pixel-based methods [20, 27, 123]. The image quilting method [27] randomly picks one texture patch among all candidates that satisfy the overlap constraints and finds the minimum-error
boundary cut with dynamic programming. An illustration of this method is given in Fig. 2.2. These methods suffer from two problems. First, searching the whole space to find a matched patch is slow [27, 76].
Second, the methods [26, 18, 63] that stitch small patches to form a larger image suffer from limited diversity in the generated patches, though they are capable of producing high-quality textures at a fast speed. A certain
pattern may repeat several times in these generated textures without sufficient variations due to a lack of
understanding of the perceptual properties of texture images. The second type addresses this problem by
analyzing textures in feature spaces rather than pixel space. A texture image is first transformed into a
feature space with kernels. Then, statistics in the feature space, such as histograms [40] and handcrafted summaries [92], are analyzed and exploited for texture generation. For the transform, some pre-defined
filters such as Gabor filters [40] or steerable pyramid filter banks [92] were adopted in the early days. The
design of these filters, however, heavily relies on human expertise and lacks adaptivity. With the recent
advances of deep neural networks, filters from a pre-trained network such as VGG provide a powerful
transformation for analyzing texture images and their statistics [33, 74].
(a) Overview (b) Results
Figure 2.2: An illustration of the image quilting method [27]. An overview of the method is shown in (a).
The results obtained by the method are given in (b). The defect on an apple occurring many times in the
generated images indicates the limited diversity of the method.
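For concreteness, the snippet below sketches only the patch-selection step of quilting-style synthesis described above: each candidate patch is scored by the sum of squared differences (SSD) on the overlap region, and one patch is picked at random among those close to the best match. The minimum-error boundary cut found by dynamic programming is omitted, and the function name and tolerance are illustrative assumptions.

import numpy as np

def pick_patch(candidates, left_overlap, tol=1.1):
    # candidates: (N, h, w, c) texture patches; left_overlap: (h, ov, c) strip already placed.
    ov = left_overlap.shape[1]
    ssd = ((candidates[:, :, :ov, :] - left_overlap) ** 2).sum(axis=(1, 2, 3))
    good = np.where(ssd <= tol * ssd.min())[0]        # all patches within tolerance of the best match
    return candidates[np.random.choice(good)]         # random choice preserves some diversity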
2.2.2 Deep-Learning-based Methods
Figure 2.3: An overview of the method in [33], which applies the Gram matrix as the statistical measurement and iteratively optimizes an initial white-noise input image through back-propagation.
DL-based methods often employ a texture loss function that computes the statistics of the features.
Fixing the weights of a pre-trained network, the method in [33] applies the Gram matrix as the statistical
measurement and iteratively optimizes an initial white-noise input image through back-propagation, an
overview of which is given in Fig. 2.3. The method in [74] computes the feature covariances of the white-noise image and the texture image and matches them through whitening and coloring. The method in [116] trains a generator network using a loss based on the same statistics as [33] inside a pre-trained descriptor network. Its training is sensitive to hyper-parameter choices, which vary with different textures. These three
methods utilize a VGG-19 network pre-trained on ImageNet to extract features. The method in [117]
abandons the deep VGG network but adopts only one convolutional layer with random filter weights.
Extensive experiments are conducted to analyze the influence of the feature maps on texture synthesis
performance. The results in Fig. 2.4 indicate that textures of comparable quality can be generated without using a pre-trained VGG network. Although these methods can generate visually pleasant textures,
the iterative optimization process (i.e., backpropagation) is computationally expensive. There are a lot of
follow-ups to [33] that use the Gram matrix as the statistical measurement, such as incorporating other optimization terms [78, 97] and improving inference speed [73, 105]. However, there is a price to pay. The
former aggravates the computational burden while the latter increases the training time.
Figure 2.4: Results of using different feature maps on texture synthesis in [117]. It indicates that textures
of comparable quality can be generated without using a pre-trained VGG network.
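The Gram-matrix statistic that [33] computes on the feature maps can be written in a few lines; the NumPy sketch below (an illustration, not code from the cited work) builds the C × C matrix of inner products between channel responses and compares the synthesized and exemplar images with a squared Frobenius distance.

import numpy as np

def gram_matrix(feat):
    # feat: (C, H, W) feature map from one layer of the descriptor network.
    C, H, W = feat.shape
    F = feat.reshape(C, H * W)
    return F @ F.T / (H * W)          # (C, C) channel co-activation statistics

def texture_loss(feat_gen, feat_ref):
    G, A = gram_matrix(feat_gen), gram_matrix(feat_ref)
    return ((G - A) ** 2).mean()      # squared Frobenius distance (up to a normalization constant)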
Another problem with these methods lies in the difficulty of explaining the usage of a pre-trained network. The methods in [33, 74] build upon a VGG-19 network pre-trained on the ImageNet dataset. The ImageNet dataset is designed for understanding the semantic meaning of a large number of natural
images. Textures, however, mainly contain low-level image characteristics. Although shallow layers (such
as conv_1) of VGG are known to capture low-level characteristics of images, generating texture only with
shallow layers does not give good results in [33], as shown in Fig. 2.5. It is hard to justify whether the
VGG feature contains redundancy for textures or ignores some texture-specific information. Lack of explainability also raises the challenge of inspecting the methods when unexpected generation results occur.
There are some advances in explainable deep learning research. Visual Analytics systems are utilized for
in-depth model understanding and diagnoses [17]. The algorithm unrolling technique was developed to
help connect neural networks and iterative algorithms [87]. Although they are beneficial tools to alleviate
the issue, more work needs to be done to understand the mechanism of networks thoroughly. Thus, these
drawbacks motivate us to design a method that is efficient, lightweight, and dedicated to texture.
Figure 2.5: Result of using statistics of shallow layers, conv1_1, pool_1 and pool_2, in [33]. Generating
texture only with shallow layers does not give good results.
2.3 Natural Image Generation
Deep generative modeling has received increasing attention due to the resurgence of neural networks.
These deep-learning-based generative models learn a mapping between an input and its desired output.
For the image generation task, which is the most fundamental task to examine a generative model, the
mapping can be formulated as
x = Fθ(z), (2.1)
where z is a random vector of a known distribution (say, Gaussian or uniform distribution), x denotes a
class of images to be generated and Fθ is a mapping function whose parameters, θ, are estimated through
backpropagation. Based on the estimation scheme, we categorize deep-learning-based generative models
into two families: adversarial and non-adversarial models.
2.3.1 Adversarial Models
Adversarial models are derived using the generative adversarial network (GAN) [36]. They contain a generator to generate synthetic images and a discriminator to distinguish real and synthetic images. They
are trained in turns until the generator can generate high-quality synthetic images so that the discriminator cannot distinguish them from real ones, yielding the same distribution between data examples and
generated samples as shown in Fig. 2.6. Two loss functions are proposed in the original GAN [36]. The
non-saturating loss (NS GAN) resolves the vanishing gradient problem of the minmax loss (MM GAN)
by maximizing the probability of generated samples being real rather than minimizing the probability of
being fake.
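For reference, the two generator objectives mentioned above can be written in their standard form (textbook notation, not copied from this thesis). With discriminator D and generator G, the discriminator maximizes
E_x[log D(x)] + E_z[log(1 − D(G(z)))],
the MM GAN generator minimizes E_z[log(1 − D(G(z)))], and the NS GAN generator instead maximizes E_z[log D(G(z))], which provides stronger gradients early in training when D(G(z)) is close to zero.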
A lot of progress has been witnessed on adversarial models in recent years. First, efforts were made
to improve training stability and alleviate mode collapse. DCGAN [93] proposed to adopt a convolutional
neural network with the adversarial training protocol, achieving promising generation results of better
quality.
Figure 2.6: GAN [36] trains a generator and a discriminator in turns until the generator can generate high-quality synthetic images so that the discriminator cannot distinguish them from real ones, yielding the same distribution between data examples and generated samples.
Wasserstein GAN (WGAN) [3] allows a real number as the output of the discriminator and removes
the sigmoid function in the MM GAN loss, which improves the stability of training and alleviates the mode
collapse problem. WGAN-GP [38] further improves the training stability by penalizing the norm of the
gradient. DRAGAN [57] adopts this gradient norm penalty to NS GAN to encourage fast and stable training
with fewer mode collapses. LSGAN [85] proposed a least-square loss for the discriminator, which saturates
slower than the loss of MM GAN. BEGAN [7] applies an autoencoder as the discriminator and a variant of the Wasserstein distance as the loss. Extensive experiments are performed to compare these methods in [83].
On the other hand, the generalization ability of GANs is examined in many works. Linear interpolation
or arithmetic in the z domain can have a semantic meaning in the x domain as shown in Fig. 2.7, which was
attributed to the inductive bias of deep convolutional neural networks (CNNs) [8]. Progressive GAN [54]
proposed a mechanism to generate high-quality images from low-resolution to high-resolution gradually.
High-quality and high-fidelity images were synthesized in [55, 9]. COCO-GAN [77] proposes a generator that generates an image by parts conditioned on their spatial coordinates and a discriminator that distinguishes among multiple assembled patches while considering global coherence, local appearance, and edge-crossing
continuity. This "generate by parts" design allows the generator to generate larger images than training
images and reduces the memory requirement in training and inference. SinGAN [104] learns a generative
model from a single natural image, which captures the internal distribution of patches within the image and achieves good results in texture and natural image generation as well as a wide range of image manipulation tasks.
(a) Linear Interpolation
(b) Linear Arithmetic
Figure 2.7: An example of linear interpolation and linear arithmetic properties of GANs given in [93].
2.3.2 Representative Non-Adversarial Models
Research on alternatives to GANs is ongoing. One famous example of non-adversarial generative
models is the variational auto-encoder (VAE) [56, 95]. It has an encoder-decoder structure with an objective
to maximize the evidence lower bound of the marginal likelihood among all data samples.
Figure 2.8: An overview of GLANN [46].
Instead of maximizing the likelihood function directly, the decoder models Fθ between a latent vector z and the
source data x whose parameters are estimated using a reconstruction loss and a latent loss to constrain
the latent space. Another example is the generative latent optimization (GLO) network [8], which has a
generator to model Fθ only. It computes the Laplacian pyramid loss between generated images and training
images in the image space and jointly estimates the parameters of the generator and the latent space. The
third example is the implicit maximum likelihood estimation (IMLE) network. It learns a generator to
model Fθ whose parameters are estimated by minimizing the distance between the data point in the image
space and its nearest generated sample. Images generated by VAE, GLO and IMLE are not as sharp as those
generated by GANs. VAE and GLO are limited by the constraints added to the latent space, while IMLE is
limited by the distance metrics used for finding the nearest samples. More recently, GLANN was proposed
in [46] based on the observation that IMLE with the Euclidean distance can offer good results when one
performs it in the feature space obtained by GLO. It cascades GLO and IMLE, uses GLO to learn a mapping
from latent image embedding space to pixel space, and adopts IMLE to learn a mapping from noise latent
space to latent image embedding space as shown in Fig. 2.8. However, finding the nearest neighbor in
GLANN is still time-consuming. Other non-adversarial models include autoregressive models [90, 118],
flow-based generative models [44], and energy-based generative models [25].
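For completeness, the evidence lower bound (ELBO) maximized by the VAE mentioned above takes the standard form
log p(x) ≥ E_{q(z|x)}[log p(x|z)] − KL(q(z|x) || p(z)),
where the first term corresponds to the reconstruction loss and the second to the latent-space constraint referred to earlier; this is the textbook formula rather than notation specific to this thesis.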
Although the deep-learning-based models exhibit certain generalization capabilities and produce visually impressive images, they are black-box solutions, difficult to train, and costly to deploy for edge
computing. This encourages the research on developing mathematically transparent models with low
time and space complexity.
2.3.3 Autoregressive Models and Transformer-based Models
Autoregressive models form another category of generative models, which differ from GANs and VAEs in
that they generate future data points based on previous observations. This statistical modeling paradigm
has seen significant advancements with the integration of deep learning techniques. Models such as PixelRNN and PixelCNN [119], introduced by van den Oord et al., have marked a breakthrough in image
generation by predicting pixel values in a sequential manner. Autoregressive models explicitly model data
dependencies, resulting in superior performance for sequential tasks. However, this comes at the cost of
computational efficiency for long sequences.
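Concretely, PixelRNN/PixelCNN-style models factorize the image likelihood with the chain rule,
p(x) = p(x_1) p(x_2 | x_1) · · · p(x_n | x_1, . . . , x_{n−1}),
so the density is explicit and tractable, but sampling must proceed pixel by pixel, which is the source of the computational cost for long sequences noted above (standard formulation, not notation from this thesis).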
The extension of autoregressive models has been exemplified by the Transformer architecture. The
Vision Transformer (ViT) [24] by Dosovitskiy et al. demonstrates that transformers could be effectively
adapted for image-related tasks. These models capitalize on the self-attention mechanism to capture
global dependencies within images, enabling a more holistic understanding of visual data compared to
traditional convolutional approaches. The key strength of these models lies in their ability to learn rich,
high-dimensional data representations, facilitating the generation of highly detailed and diverse images.
Transformer-based image generative models, while powerful, often suffer from high computational demands, scalability issues, and a need for large-scale datasets, which can lead to challenges in training
efficiency and model generalization.
2.3.4 Denoising Diffusion Probability Models
Diffusion models have recently emerged as a groundbreaking development in the field of generative modeling, offering a novel approach that contrasts with the traditional frameworks. Rooted in the concept of
gradually learning to reverse a diffusion process, these models transform data into a Gaussian distribution
and then learn to reverse this process to generate new data. The work by Sohl-Dickstein et al. [107] laid the
foundational theory, which was later expanded upon by Ho et al. [45] in their introduction of Denoising
Diffusion Probabilistic Models (DDPMs). These models have demonstrated exceptional performance in
generating high-quality images, rivaling and, in some aspects, surpassing the capabilities of GANs, particularly in terms of image diversity and the avoidance of mode collapse. Moreover, their inherent stochastic
nature allows for a more controlled and varied generation process. Recent advancements have further
optimized these models regarding sampling efficiency and training stability, making them more practical
for a broader range of applications.
2.4 Image Transfer Learning
2.4.1 Traditional Methods
Traditional methods focused on designing simple techniques to align the features in source and target
domains. [19] introduces a remarkably simple adaptation technique based on data augmentation for the
case with sufficient target data. [28] introduces a novel algorithm where data from both source and target
domains are mapped to a domain-invariant feature space through eigen-decomposition; it is computationally fast because it admits a closed-form solution. [108] introduces CORrelation ALignment (CORAL), an unsupervised approach that aligns the second-order statistics of the source and target distributions to minimize domain shift. Despite its simple implementation, CORAL demonstrates remarkable performance on benchmark datasets.
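The essence of CORAL is a whitening-and-recoloring of the source features. The sketch below is a minimal, generic version of this idea (variable names and the regularization constant are ours, not taken from [108]).

```python
import numpy as np
from scipy import linalg

def coral(source, target, eps=1.0):
    """Align second-order statistics of source features with the target domain.

    source, target: arrays of shape (n_samples, n_features).
    eps adds a small identity term to each covariance for numerical stability.
    """
    cs = np.cov(source, rowvar=False) + eps * np.eye(source.shape[1])
    ct = np.cov(target, rowvar=False) + eps * np.eye(target.shape[1])
    whiten = np.real(linalg.fractional_matrix_power(cs, -0.5))   # whiten with source covariance
    recolor = np.real(linalg.fractional_matrix_power(ct, 0.5))   # re-color with target covariance
    return source @ whiten @ recolor
```

A classifier trained on the re-colored source features can then be applied to the target domain without further adaptation.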
2.4.2 Deep-learning-based Methods
Deep-learning-based methods elevate the domain adaptation to a new level. [32] presents a novel representation learning method for domain adaptation, leveraging features that are agnostic to the differences
between domains. It introduces a new gradient reversal layer for easy implementation with the existing DL
frameworks. [79] introduces the Coupled Generative Adversarial Network (CoGAN) designed for learning
joint distributions of images across multiple domains by incorporating a weight-sharing constraint. [115]
presents a unified framework for adversarial adaptation and introduces Adversarial Discriminative Domain Adaptation (ADDA), which combines discriminative modeling, untied weight sharing,
and GAN loss. [75] introduces the Source HypOthesis Transfer (SHOT) framework, which employs information maximization and self-supervised pseudo-labeling to align target domain representations with the
source hypothesis. The versatility of SHOT is demonstrated through its application to various adaptation
scenarios, including closed-set, partial-set, and open-set domain adaptation, and its ability to protect data privacy.
2.5 Successive Subspace Learning and Green Learning
A mathematical theory named successive subspace learning (SSL) aims to find a mathematically transparent solution with low complexity. It is developed upon traditional spectral analysis. Traditional spectral analysis, such as the Fourier transform and the principal component analysis (PCA), attempts to capture the global structure but sacrifices the local detail (e.g., object boundaries) of images. In contrast, local detail can be well described in the spatial domain, yet a pure spatial representation cannot capture the global information. To overcome these shortcomings, Kuo et al. [59, 58, 60, 62] proposed two affine transforms that determine a sequence of joint spatial-spectral representations with different spatial/spectral trade-offs to characterize the global structure and local detail of images at the same time. They are the Saak
transform [60] and the Saab transform [62]. As a variant of the Saab transform, the channel-wise (c/w)
Saab transform [16] exploits weak correlations among spectral channels and applies the Saab transform to
each channel individually to reduce the model size without sacrificing the performance. An illustration of
c/w Saab transform is given in Fig. 2.9.
Figure 2.9: An illustration of c/w Saab transform in [16].
PixelHop [15] and PixelHop++ [16] are two architectures developed to implement SSL. PixelHop consists of multi-stage Saab transforms in cascade. PixelHop++ is an improved version of PixelHop that replaces the Saab transform with the channel-wise (c/w) Saab transform. It performs the Saab transform at
each spectral channel separately, exploiting the weak correlations among channels. SSL has been successfully applied to many application domains. Examples include [12, 130, 52, 80, 132, 133, 131, 53, 84, 111, 99,
98].
Successive Subspace Learning has been gradually developed into Green learning (GL), which was proposed as an alternative paradigm to address the transparency and efficiency concerns in DL-based models.
Two representative works are the IPHop learning system [126], which generates more expressive representations, and the discriminant and relevant feature tests [126], which select discriminant features based on labels.
There are preliminary efforts in the development of GL-based generative models, e.g., NITES [67], TGHop [68], GenHop [66], and Pager [4]. GL-based generative models generally contain two modules: Fine-to-Coarse Image Analysis and Coarse-to-Fine Image Synthesis. The Fine-to-Coarse Image Analysis module reduces the dimension of the image and learns the distribution of the meaningful representations. The Coarse-to-Fine Image Synthesis module first draws samples from the lowest-dimensional space and then generates details.
Chapter 3
TGHop: An Explainable, Efficient and Lightweight Method for Texture
Generation
3.1 Introduction
Automatic generation of visually pleasant texture that resembles exemplary texture has been studied for
several decades since it is of theoretical interest in texture analysis and modeling. Research in texture
generation benefits texture analysis and modeling research [112, 11, 2, 135, 120, 128, 129] by providing
a perspective to understand the regularity and randomness of textures. Texture generation finds broad
applications in computer graphics and computer vision, including rendering textures for 3D objects [120,
88], image or video super-resolution [47], etc.
Early works of texture generation generate textures in pixel space. Based on exemplary input, texture
can be generated pixel-by-pixel [20, 27, 123] or patch-by-patch [26, 76, 18, 64], starting from a small unit
and gradually growing to a larger image. These methods, however, suffer from slow generation time [27,
76] or limited diversity of generated textures [26, 18, 63]. Later works transform texture images to a
feature space with kernels and exploit the statistical correlation of features for texture generation. Commonly used kernels include the Gabor filters [40] and the steerable pyramid filter banks [92]. This idea
is still being actively studied with the resurgence of neural networks. Various deep learning (DL) models,
including Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs), yield visually pleasing results in texture generation. Compared to traditional methods, DL-based methods [33, 78,
97, 73, 74, 117, 105] learn weights and biases through end-to-end optimization. Nevertheless, these models
are usually large in model size, difficult to explain in theory, and computationally expensive in training.
It is desired to develop a new generation method that is small in model size, mathematically transparent,
efficient in training and inference, and is able to offer high-quality textures at the same time. Along this
line, we propose the TGHop (Texture Generation PixelHop) method in this work.
TGHop consists of four steps. First, given an exemplary texture, TGHop crops numerous sample
patches out of it to form a collection of sample patches called the source. Second, it analyzes pixel statistics of samples from the source and obtains a sequence of fine-to-coarse subspaces for these patches by
using the PixelHop++ framework [16]. Third, to generate realistic texture patches, it begins with generating samples in the coarsest subspace, which is called the core, by matching the distribution of real and
generated samples, and attempts to generate spatial pixels given spectral coefficients from coarse to fine
subspaces. Last, texture patches are stitched to form texture images of a larger size. Extensive experiments
are conducted to show that TGHop can generate texture images of superior quality with small model sizes,
at a fast speed, and in an explainable way.
It is worthwhile to point out that this work is an extended version of our previous work in [67], where
a method called NITES was presented. The two works share the same core idea, but this work provides a more
systematic study of texture synthesis tasks. In particular, a spatial Principal Component Analysis (PCA)
transform is included in TGHop. This addition improves the quality of generated textures and reduces
the model size of TGHop as compared with NITES. Furthermore, more experimental results are given to
support our claims of efficiency (i.e., a faster computational speed) and lightweight design (i.e., a smaller model size).
Figure 3.1: Illustration of successive subspace analysis and generation, where a sequence of subspaces S1, . . . , Sn is constructed from the source space, S0, through a successive process indicated by blue arrows, while red arrows indicate the successive subspace generation process.
The rest of the paper is organized as follows. A high-level idea of successive subspace analysis and
generation is described in Sec. 3.2. The TGHop method is detailed in Sec. 3.3. Experimental results are
shown in Sec. 3.4. Finally, concluding remarks and future research directions are given in Sec. 3.5.
3.2 Successive Subspace Analysis and Generation
In this section, we explain the main idea behind the TGHop method, successive subspace analysis, and
generation, as illustrated in Fig. 3.1. Consider an input signal space denoted by S̃0 and a sequence of subspaces denoted by S̃1, · · · , S̃n. Their dimensions are denoted by D̃0, D̃1, · · · , D̃n. They are related to each other by the constraint that any element in S̃i+1 is formed by an affine combination of elements in S̃i, where i = 0, · · · , n − 1.
An affine transform can be converted to a linear transform by augmenting vector ã in S̃i via a = (ã^T, 1)^T. We use Si to denote the augmented space of S̃i and Di = D̃i + 1. Then, we have the following
relationship
Sn ⊂ Sn−1 ⊂ · · · ⊂ S1 ⊂ S0, (3.1)
and
Dn < Dn−1 < · · · < D1 < D0. (3.2)
We use texture analysis and generation as an example to explain this pipeline. To generate homogeneous texture, we collect a number of texture patches cropped out of exemplary texture as the input set.
Suppose that each texture patch has three RGB color channels and a spatial resolution of P × P. The input set then has a dimension of 3P^2, and its augmented space S0 has a dimension of D0 = 3P^2 + 1. If P = 32, we have D0 = 3073, which is too high to find an effective generation model directly.
To address this challenge, we build a sequence of subspaces S0, S1, · · · , Sn with decreasing dimensions. We call S0 and Sn the "source" space and the "core" subspace, respectively. We need to find an
effective subspace Si+1 from Si, and such an analysis model is denoted by E_i^{i+1}. Proper subspace analysis
is important since it determines how to decompose an input space into the direct sum of two subspaces in
the forward analysis path. Although we choose one of the two for further processing and discard the other
one, we need to record the relationship of the two decomposed subspaces so that they are well-separated
in the reverse generation path. This forward process is called fine-to-coarse analysis.
In the reverse path, we begin with the generation of samples in Sn by studying its own statistics.
This is accomplished by the generation model Gn. The process is called core sample generation. Then,
conditioned on a generated sample in Si+1, we generate a new sample in Si through a generation model
denoted by G_{i+1}^i. This process is called coarse-to-fine generation. In Fig. 3.1, we use blue and red arrows
to indicate analysis and generation. This idea can be implemented as a non-parametric method since we
can choose subspaces S1, · · · , Sn flexibly in a feedforward manner. One specific design is elaborated in
the next section.
3.3 TGHop Method
The TGHop method is proposed in this section. An overview of the TGHop method is given in Sec. 3.3.1.
Next, the forward fine-to-coarse analysis based on the two-stage c/w Saab transforms is discussed in
Sec. 3.3.2. Afterwards, sample generation in the core is elaborated in Sec. 3.3.3. Finally, the reverse coarse-to-fine pipeline is detailed in Sec. 3.3.4.
3.3.1 System Overview
An overview of the TGHop method is given in Fig. 3.2. The exemplary color texture image has a spatial
resolution of 256 × 256 and three RGB channels. We would like to generate multiple texture images that
are visually similar to the exemplary one. By randomly cropping patches of size 32 × 32 out of the source
image, we obtain a collection of texture patches serving as the input to TGHop. The dimension of these
patches is 32 × 32 × 3 = 3072. Their augmented vectors form source space S0. The TGHop system is
designed to generate texture patches of the same size that are visually similar to samples in S0. This is
feasible if we can capture both global and local patterns of these samples. There are two paths in Fig. 3.2.
The blue arrows go from left to right, denoting the fine-to-coarse analysis process. The red arrows go from
right to left, denoting the coarse-to-fine generation process. We can generate as many texture patches as
desired using this procedure. In order to generate a texture image of a larger size, we perform image
quilting [26] based on synthesized patches.
3.3.2 Fine-to-Coarse Analysis
The global structure of an image (or an image patch) can be well characterized by spectral analysis, yet it is
limited in capturing local details such as boundaries between regions. Joint spatial-spectral representations
offer an ideal solution to the description of both global shape and local detail information. Analysis model
E_0^1 finds a proper subspace, S1, in S0, while analysis model E_1^2 finds a proper subspace, S2, in S1. As shown in Fig. 3.2, TGHop applies two-stage transforms. They correspond to E_0^1 and E_1^2, respectively. Specifically,
we can apply the c/w Saab transform in each stage to conduct the analysis. In the following, we provide a
brief review of the Saab transform [62] and the c/w Saab transform [16].
Figure 3.2: An overview of the proposed TGHop method. A number of patches are collected from the exemplary texture image, forming source space S0. Subspaces S1 and S2 are constructed through analysis models E_0^1 and E_1^2. Input filter window sizes to Hop-1 and Hop-2 are denoted as I0 and I1. Selected channel numbers of Hop-1 and Hop-2 are denoted as K1 and K2. A block of size Ii × Ii of Ki channels in space/subspace Si is converted to the same spatial location of Ki+1 channels in subspace Si+1. Red arrows indicate the generation process, beginning with core sample generation followed by coarse-to-fine generation. The model for core sample generation is denoted as G2, and the models for coarse-to-fine generation are denoted as G_2^1 and G_1^0.
We partition each input patch into non-overlapping blocks, each of which has a spatial resolution of
I0 × I0 with K0 channels. We flatten 3D tensors into 1D vectors and decompose each vector into the sum
of one Direct Current (DC) and multiple Alternating Current (AC) spectral components. The DC filter is
an all-ones filter weighted by a constant. AC filters are obtained by applying the principal component
analysis (PCA) to the DC-removed residual tensor. By setting I0 = 2 and K0 = 3, we have a tensor
block of dimension 2 × 2 × 3 = 12. Filter responses of PCA can be positive or negative. There is a sign
confusion problem [59, 58] if both of them are allowed to enter the transform in the next stage. To avoid
sign confusion, a constant bias term is added to all filter responses to ensure that all responses become
positive, leading to the name "subspace approximation with adjusted bias (Saab)" transform. The Saab
transform is a data-driven transform, which is significantly different from traditional transforms (e.g.,
Fourier and wavelet transform), which are data-independent. We partition AC channels into two low- and
high-frequency bands. The energy of high-frequency channels (shaded by gray color in Fig. 3.2) is low,
and they are discarded for dimension reduction without affecting the performance much. The energy of
low-frequency channels (shaded by blue color in Fig. 3.2) is higher. For a tensor of dimension 12, we have
one DC and 11 AC components. Typically, we select K1 = 6 to 10 leading AC components and discard the
rest. Thus, after E_0^1, one 12-D tensor becomes a K1-D vector, which is illustrated by dots in subspace S1.
The K1-D response vectors are fed into the next stage for another transform.
The channel-wise (c/w) Saab transform [16] exploits the weak correlation property between channels
so that the Saab transform can be applied to each channel separately (see the middle part of Fig. 3.2). The
c/w Saab transform offers an improved version of the standard Saab transform with a smaller model size.
One typical setting used in our experiments is shown below.
• Dimension of the input patch (D̃0): 32 × 32 × 3 = 3072;
• Dimension of subspace S̃1 (D̃1): 16 × 16 × 10 = 2560 (by keeping ten channels in Hop-1);
• Dimension of subspace S̃2 (D̃2): 8 × 8 × 27 = 1728 (by keeping 27 channels in Hop-2).
Note that the ratio between D̃1 and D̃0 is 83.3% while that between D̃2 and D̃1 is 67.5%. We are able
to reduce the dimension of the source space to that of the core subspace by a factor of 56.3%. In the
reverse path indicated by red arrows, we need to develop a multi-stage generation process. It should also
be emphasized that users can flexibly choose channel numbers in Hop-1 and Hop-2. Thus, TGHop is a
non-parametric method.
The first-stage Saab transform provides the spectral information on the nearest neighborhood, which is
the first hop of the center pixel. By generalizing from one to multiple hops, we can capture the information
in the short-, mid-, and long-range neighborhoods. This is analogous to increasingly larger receptive fields
in deeper layers of CNNs. However, filter weights in CNNs are learned from end-to-end optimization via
backpropagation, while the weights of the Saab filters in different hops are determined by a sequence of PCAs in a feedforward, unsupervised manner.
Figure 3.3: Generated grass texture image (a) without and (b) with spatial dimension reduction (SDR).
3.3.3 Core Sample Generation
In the generation path, we begin with sample generation in core Sn which is denoted by Gn. In the
current design, n = 2. We first characterize the sample statistics in the core, S2. After two-stage c/w Saab
transforms, the sample dimension in S2 is less than 2000. Each sample contains K2 channels of spatial
dimension 8 × 8. Since correlations between spatial responses in each channel exist, PCA is adopted for
further spatial dimension reduction (SDR). We discard PCA components whose variances are lower than
the threshold γ. The same threshold applies to all channels. SDR can help reduce the model size and
improve the quality of generated textures. For example, we compare a generated grass texture with and
without SDR in Fig. 3.3. The quality with SDR significantly improves.
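Since SDR is simply one PCA per channel with a variance threshold γ, it can be sketched in a few lines. The scikit-learn-based code below is illustrative; the array layout and function names are ours.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_sdr(core_samples, gamma=0.01):
    """core_samples: (n, H, W, K2) Saab coefficients in the core subspace S2.

    Fits one PCA per channel and keeps components whose variance exceeds gamma."""
    n, h, w, k2 = core_samples.shape
    pcas = []
    for c in range(k2):
        x = core_samples[:, :, :, c].reshape(n, h * w)      # flatten the 8x8 spatial grid
        full = PCA().fit(x)
        keep = max(1, int(np.sum(full.explained_variance_ > gamma)))
        pcas.append(PCA(n_components=keep).fit(x))
    return pcas

def apply_sdr(core_samples, pcas):
    """Concatenate per-channel PCA responses into one 1D vector z per sample."""
    n = core_samples.shape[0]
    parts = [p.transform(core_samples[:, :, :, c].reshape(n, -1))
             for c, p in enumerate(pcas)]
    return np.concatenate(parts, axis=1)
```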
After SDR, we flatten the PCA responses of each channel and concatenate them into a 1D vector denoted by z. It is a sample in S2. To simplify the distribution characterization of a high-dimensional random
vector, we group training samples into clusters and transform random vectors in each cluster into a set
of independent random variables. We adopt the K-Means clustering algorithm to cluster training samples
into N clusters, which are denoted by {Ci}, i = 0, · · · , N − 1. Rather than modeling probability P(z)
directly, we model the conditional probability P(z | z ∈ Ci) with a fixed cluster index. The probability, P(z), can be written as
P(z) = Σ_{i=0}^{N−1} P(z | z ∈ Ci) · P(z ∈ Ci), (3.3)
where P(z ∈ Ci) is the percentage of data points in cluster Ci. It is abbreviated as pi, i = 0, . . . , N − 1 (see the right part of Fig. 3.2).
Typically, a set of independent Gaussian random variables is used for image generation. To do the
same, we convert a collection of correlated random vectors into a set of independent Gaussian random
variables. To achieve this objective, we transform random vector z in cluster Ci
into a set of independent
random variables through independent component analysis (ICA), where non-Gaussianity serves as an
indicator of statistical independence. ICA finds applications in noise reduction [49], face recognition [5],
and image fusion [86]. Our implementation is detailed below.
1. Apply PCA to z in cluster Ci for dimension reduction and data whitening.
2. Apply FastICA [50], which is conceptually simple, computationally efficient, and robust to outliers,
to the PCA output.
3. Compute the cumulative density function (CDF) of each ICA component of random vector z in each
cluster based on its histogram of training samples.
4. Match the CDF in Step 3 with the CDF of a Gaussian random variable (see the right part of Fig. 3.2),
where the inverse CDF is obtained by resampling between bins with linear interpolation. To reduce
the model size, we quantize N-dimensional CDFs, which have N bins, with vector quantization (VQ)
and store the codebook of quantized CDFs.
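The four training steps above can be sketched as follows using scikit-learn. The vector quantization of the CDF codebook is omitted and each CDF is simply stored as a table of empirical quantiles; all names are ours, and each cluster is assumed to contain enough samples.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA, FastICA

def fit_ichm(z, n_clusters=100, n_bins=256):
    """z: (n_samples, Dr) core vectors after SDR. Returns per-cluster models and priors."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(z)
    models = []
    for i in range(n_clusters):
        zi = z[km.labels_ == i]
        pca = PCA(whiten=True).fit(zi)                        # Step 1: reduction and whitening
        ica = FastICA(max_iter=1000).fit(pca.transform(zi))   # Step 2: FastICA
        s = ica.transform(pca.transform(zi))                  # independent components
        # Steps 3-4: store each component's empirical inverse CDF (quantile table),
        # to be matched against a standard Gaussian during generation.
        qs = np.linspace(0.0, 1.0, n_bins)
        inv_cdfs = np.quantile(s, qs, axis=0)                 # shape (n_bins, n_components)
        models.append({"pca": pca, "ica": ica, "inv_cdfs": inv_cdfs})
    priors = np.bincount(km.labels_, minlength=n_clusters) / len(z)
    return models, priors
```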
We encode pi in Eq. (3.3) using the length of a segment in [0, 1]. All segments are concatenated in
order to build the unit interval. The segment index is the cluster index. These segments are called the
interval representation, as shown in Fig. 3.4. To draw a sample from subspace S2, we use the uniform
random number generator to select a random number from interval [0, 1]. This random number indicates
the cluster index on the interval representation.
Figure 3.4: Illustration of the interval representation, where the length of a segment in the unit interval represents the probability of a cluster, pi. A random number is generated in the unit interval to indicate the cluster index.
To generate a new sample in S2, we perform the following steps:
1. Select a random number from the uniform random number generator to determine the cluster index.
2. Draw a set of samples independently from the Gaussian distribution.
3. Match histograms of the generated Gaussian samples with the inverse CDFs in the chosen cluster.
4. Repeat Steps 1-3 if the generated sample of Step 3 has no value larger than a pre-set threshold.
5. Perform the inverse transform of ICA and the inverse transform of PCA.
6. Reshape the 1D vector into a 3D tensor and this tensor is the generated sample in S2.
The above procedure is named Independent Components Histogram Matching (ICHM). To conclude,
there are two main modules in the core sample generation: spatial dimension reduction and independent
components histogram matching as shown in Fig. 3.2.
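The generation direction (Steps 1-6 above) can be sketched in the same style. The rejection test of Step 4 is omitted, and the inverse CDF is read by nearest-bin lookup instead of linear interpolation, so this is a simplification of the actual procedure.

```python
import numpy as np
from scipy.stats import norm

def sample_ichm(models, priors, rng=None):
    """Draw one core sample z from the models returned by fit_ichm above."""
    if rng is None:
        rng = np.random.default_rng()
    # Step 1: a uniform number in [0, 1] selects the cluster via the interval representation.
    i = min(np.searchsorted(np.cumsum(priors), rng.uniform()), len(priors) - 1)
    m = models[i]
    n_bins, n_comp = m["inv_cdfs"].shape
    g = rng.standard_normal(n_comp)                     # Step 2: independent Gaussian seed
    u = norm.cdf(g)                                     # Step 3: Gaussian CDF gives ranks in [0, 1]
    idx = np.clip((u * (n_bins - 1)).astype(int), 0, n_bins - 1)
    s = m["inv_cdfs"][idx, np.arange(n_comp)]           # match ranks to empirical inverse CDFs
    # Step 5: inverse ICA followed by inverse PCA returns to the SDR space.
    z = m["pca"].inverse_transform(m["ica"].inverse_transform(s[None, :]))
    return z[0]                                         # Step 6 reshapes this vector to a 3D tensor
```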
Figure 3.5: Illustration of the generation process.
3.3.4 Coarse-to-Fine Generation
In this section, we examine the generation model G_{i+1}^i, whose role is to generate a sample in Si given a sample in Si+1. Analysis model E_i^{i+1} transforms Si to Si+1 through the c/w Saab transform in the forward path. In the reverse path, we perform the inverse c/w Saab transform on generated samples in Si+1 to obtain samples in Si. We take the generation model G_2^1 as an example to explain the generation process from S2 to S1. A generated sample in S2 can be partitioned into K1 groups, as shown in the left part of Fig. 3.5. Each group of channels is composed of one DC channel and several low-frequency AC channels. The kth group of channels in S2, whose number is denoted by K_2^(k), is derived from the kth channel in S1. We apply the inverse c/w Saab transform to each group individually. The inverse c/w Saab transform converts the tensor at the same spatial location across the K_2^(k) channels (represented by white dots in Fig. 3.5) in S2 into a block of size Ii × Ii (represented by the white parallelogram in Fig. 3.5) in S1, using the DC and AC components obtained in the fine-to-coarse analysis. After the inverse c/w Saab transform, the Saab coefficients in S1 form a generated sample in S1. The same procedure is repeated between S1 and S0.
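Because the Saab filters are orthonormal, the inverse c/w Saab transform essentially removes the bias, applies the transposed analysis filters, and re-tiles the resulting blocks; discarded high-frequency channels contribute nothing. The sketch below is consistent with the single-stage Saab sketch given in Sec. 3.3.2 and uses our own names.

```python
import numpy as np

def inverse_saab(coeffs, filters, bias):
    """Approximate inverse of saab_transform.

    coeffs: (n, K) Saab coefficients (DC plus kept AC channels).
    filters: (K, d) orthonormal analysis filters, d = I*I*K_prev.
    Discarded AC channels are implicitly treated as zero."""
    return (coeffs - bias) @ filters

def blocks_to_map(blocks, grid_h, grid_w, block, channels):
    """Re-tile non-overlapping flattened blocks into a (grid_h*block, grid_w*block, channels) map."""
    x = blocks.reshape(grid_h, grid_w, block, block, channels)
    return x.transpose(0, 2, 1, 3, 4).reshape(grid_h * block, grid_w * block, channels)
```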
Examples of several generated textures in core S2, intermediate subspace S1, and source S0 are shown
in Fig. 3.6. The DC channels generated in the core offer gray-scale low-resolution patterns of a generated
sample. More local details are added gradually from S2 to S1 and from S1 to S0.
Figure 3.6: Examples of generated DC maps in core S2 (first column), generated samples in subspace S1 (second column), and the ultimate generated textures in source S0 (third column).
3.4 Experiments
3.4.1 Experimental Setup
The following hyperparameters (see Fig. 3.2) are used in our experiments.
• Input filter window size to Hop-1: I0 = 2,
• Input filter window size to Hop-2: I1 = 2,
• Selected channel numbers in Hop-1 (K1): 6 ∼ 10,
• Selected channel numbers in Hop-2 (K2): 20 ∼ 30.
The window size of the analysis filter is the same as the generation window size. All windows are non-overlapping with each other. The actual channel numbers K1 and K2 are texture-dependent. That is,
we examine the energy distribution based on the PCA eigenvalues and choose the knee point where the
energy becomes flat.
3.4.2 An Example: Brick Wall Generation
We show generated brick_wall patches of size 32 × 32 and 64 × 64 in Figs. 3.7(a) and (c). We performed
two-stage c/w Saab transforms on 32×32 patches and three-stage c/w Saab transforms on 64×64 patches,
whose core subspace dimensions are 1728 and 4032, respectively. Patches in these figures were synthesized
by running the TGHop method in one hundred rounds. Randomness in each round primarily comes from
two factors: 1) random cluster selection and 2) random seed vector generation.
Generated patches retain the basic shape of bricks and the diversity of brick texture. We observe some
unseen patterns generated by TGHop, which are highlighted by red squared boxes in Fig. 3.7 (a) and (c).
As compared with generated 32 × 32 patches, generated 64 × 64 patches were sometimes blurry (e.g., the
one in the upper right corner) due to a higher source dimension.
As a non-parametric model, TGHop can choose multiple settings under the same pipeline. For example,
it can select different channel numbers in S̃1 and S̃2 to derive different generation results. Four settings
are listed in Table 3.1. The corresponding generation results are shown in Fig. 3.8. Dimensions decrease
faster from (a) to (d). The quality of generated results becomes poorer due to smaller dimensions of the
core subspace, S̃2, and the intermediate subspace, S̃1.
Table 3.1: The settings of four generation models.
Setting    D̃0     D̃1     D̃2
a          3072    2560    2048
b          3072    1536    768
c          3072    1280    512
d          3072    768     192
To generate larger texture images, we first generate 5,000 texture patches and perform image quilting [26] with them. The results after quilting are shown in Figs. 3.7 (b) and (d). All eight images are of the
same size, i.e., 256 × 256. They are obtained using different initial patches in the image quilting process.
By comparing the two sets of stitched images, the global structure of the brick wall is better preserved
using larger patches (i.e., of size 64 × 64) while its local detail is a little bit blurry sometimes.
3.4.3 Performance Benchmarking with DL-based Methods
3.4.3.1 Visual Quality Comparison
The quality of the generated texture is usually evaluated by human eyes. Subjective user study is not a
common choice for texture synthesis because different people have different standards to judge the quality
of generated texture images. Evaluation metrics such as Inception Score [102] or Fréchet Inception Distance [41] are proposed for natural image generation. These metrics, however, demand a sufficient number
of samples to evaluate the distributions of generated images and real images, which are not suitable for
texture synthesis containing only one image for reference. The value of the loss function was used to
measure texture quality or diversity for CNN-based methods [105, 73]. A lower loss, however, does not
guarantee better generation quality. Since TGHop does not have a loss function, we show the generated
results of two DL-based methods and TGHop side by side in Fig. 3.9 for 10 input texture images collected
from [33, 117, 92] or the Internet. The benchmarking DL methods were proposed by Gatys et al. [33] and
Ustyuzhaninov et al. [117]. By running their codes, we show their results in the second and third columns
of Fig. 3.9, respectively, for comparison. These results are obtained by default iteration numbers; namely,
2000 in [33] and 4000 in [117]. The results of TGHop are shown in the last three columns. The two left
columns are obtained without spatial dimension reduction (SDR) in two different runs, while the last column is obtained with SDR. There is little quality degradation after dimension reduction of S2 with SDR.
For meshed and cloud textures, the brown fog artifact in [33, 117] is apparent. In contrast, it does not exist
in TGHop. More generated images using TGHop are given in Fig. 3.10. As shown in Figs. 3.9 and 3.10,
TGHop can generate high-quality and visually pleasant texture images.
3.4.3.2 Comparison of Generation Time
We compare the generation time of different methods in Table 3.2. All experiments were conducted on
the same machine composed of 12 CPUs (Intel Core i7-5930K CPU at 3.50GHz) and 1 GPU (GeForce GTX
TITAN X). GPU was needed in two DL-based methods but not in TGHop. We set the iteration number
to 1000 for [33] and 100 for [117]. TGHop generated 10K 32 × 32 patches for texture quilting. For all
three methods, we show the time needed to generate one image of size 256 × 256 in Table 3.2. TGHop
generates one texture image in 291.25 seconds, while Gatys’ method and Ustyuzhaninov’s method demand
513.98 and 949.64 seconds, respectively. TGHop is significantly faster.
Table 3.2: Comparison of time needed to generate one texture image.
Methods Time (seconds) Factor
Ustyuzhaninov et al. [117] 949.64 4.62x
Gatys et al. [33] 513.98 2.50x
TGHop with analysis overhead 291.25 1.42x
TGHop w/o analysis overhead 205.50 1x
We break down the generation time of TGHop into three parts: 1) successive subspace analysis (i.e.,
the forward path), 2) core and successive subspace generation (i.e., the reverse path) and 3) the quilting
process. The time required for each part is shown in Table 3.3. They demand 85.75, 197.42 and 8.08 seconds,
respectively. To generate multiple images from the same exemplary texture, we run the first part only once,
which will be shared by all generated texture images, and the second and third parts multiple times (i.e.,
one run for a new image). As a result, we can view the first part as a common overhead and count the
last two parts as the time for single-texture image generation. This is equal to 205.5 seconds. The two DL
benchmarks do not have such a breakdown and need to go through the whole pipeline to generate one
new texture image.
Table 3.3: The time of three processes in our method.
Processes Time (seconds)
Analysis (Forward Path) 85.75
Generation (Reverse Path) 197.42
Quilting 8.08
3.4.4 Comparison of Model Sizes
The model size is measured by the number of parameters. The size of TGHop is calculated below.
• Two-stage c/w Saab Transforms
The forward analysis path and the reverse generation path share the same two-stage c/w Saab transforms. For an input RGB patch, the input tensor of size 2 × 2 × 3 = 12 is transformed into a K1-D
tensor in the first-stage transform, leading to a filter size of 12K1 plus one shared bias. For each of
K1 channels, the input tensor of size 2 × 2 is transformed into a tensor in the second stage transform. The sum of the output dimensions is K2. The total parameter number for all K1 channels
is 4K1K2 plus K1 biases. Thus, the total number of parameters in the two-stage transforms is
13K1 + 4K1K2 + 1.
• Core Sample Generation
Sample generation in the core contains two modules: spatial dimension reduction (SDR) and independent components histogram matching (ICHM). For the first module, SDR is implemented by K2 PCA transforms, where the input is of size 8 × 8 = 64 and the output is a K_{r_i}-dimensional vector, yielding a PCA transformation matrix of size 64 × K_{r_i} for each channel. The total number of parameters is 64 × Σ_{i=1}^{K2} K_{r_i} = 64Dr, where Dr is the dimension of the concatenated output vector after SDR.
The second module has three components:
1. Interval representation p0, . . . , pN−1
N parameters are needed for N clusters.
2. Transform matrices of FastICA
If the input vector is of dimension Dr and the output dimension of FastICA is K_{c_i} for the ith cluster, i = 1, · · · , N, the total parameter number of all transform matrices is DrF, where F = Σ_{i=1}^{N} K_{c_i} is the number of CDFs.
3. Codebook of quantized CDFs
The codebook contains the index, the maximum, and the minimum values for each CDF. Furthermore, we have W clusters of CDFs, where all CDFs in each cluster share the same bin structure of 256 bins. As a result, the total parameter number is 3F + 256W.
By adding all of the above together, the total parameter number in core sample generation is 64Dr + N + (Dr + 3)F + 256W.
The above equations are summarized, and an example is given in Table 3.4 under the experimental setting of N = 50, K1 = 9, K2 = 22, Dr = 909, F = 2,518 and W = 200. The model size of TGHop is 2.4M.
Table 3.4: The number of parameters of TGHop under the setting of γ = 0.01, N = 50, K1 = 9, K2 = 22, Dr = 909, F = 2,518 and W = 200.
Module                 Equation        Num. of Param.
Transform - stage 1    12K1 + 1        109
Transform - stage 2    4K1K2 + K1      801
Core - SDR             64Dr            58,176
Core - ICHM(i)         N               50
Core - ICHM(ii)        DrF             2,288,862
Core - ICHM(iii)       3F + 256W       58,754
Total                                  2,406,752
For comparison, the model sizes of [33] and [117] are 0.852M and 2.055M, respectively. A great majority of
TGHop model parameters come from ICHM(ii). Further model size reduction without sacrificing generated
texture quality is an interesting extension of this work.
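The totals reported in Table 3.4 follow directly from the formulas above. The short script below reproduces them for the setting γ = 0.01, N = 50, K1 = 9, K2 = 22, Dr = 909, F = 2,518, and W = 200.

```python
K1, K2, N, Dr, F, W = 9, 22, 50, 909, 2518, 200

stage1 = 12 * K1 + 1            # first-stage Saab filters plus one shared bias  -> 109
stage2 = 4 * K1 * K2 + K1       # second-stage c/w Saab filters plus biases      -> 801
sdr    = 64 * Dr                # per-channel spatial PCA matrices               -> 58,176
ichm1  = N                      # interval representation                        -> 50
ichm2  = Dr * F                 # FastICA transform matrices                     -> 2,288,862
ichm3  = 3 * F + 256 * W        # quantized CDF codebook                         -> 58,754

total = stage1 + stage2 + sdr + ichm1 + ichm2 + ichm3
print(total)                    # 2,406,752 parameters, i.e., about 2.4M
```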
Table 3.5: The reduced dimension, Dr, and the model size as a function of threshold γ used in SDR.
γ Dr Number of Parameters
0 1408 3.72M
0.0005 1226 3.26M
0.005 1030 2.74M
0.01 909 2.41M
0.02 718 1.88M
0.03 553 1.43M
0.04 399 1.00M
0.05 289 0.69M
0.1 102 0.19M
As compared with [67], SDR is a new module introduced in this work. It helps remove correlations
of spatial responses to reduce the model size. We examined the impact of using different threshold γ in
SDR on texture generation quality and model size with brick_wall texture. The same threshold is adopted
for all channels to select PCA components. The dimension of reduced space, Dr, and the cluster number,
N, are both controlled by the threshold γ used in SDR. γ = 0 means that all 64 PCA components are kept in SDR. We can vary the value of γ to get a different cluster number and the associated model size. The larger the value of γ, the smaller Dr and N and, thus, the smaller the model size, as shown in Table 3.5. The computation given in Table 3.4 is under the setting of γ = 0.01.
A proper cluster number is important since too many clusters lead to larger model sizes, while too few
clusters result in bad generation quality. To give an example, consider the brick_wall texture image of size
256 × 256, where the dimension of S2 is 8 × 8 × 22 = 1408 with K2 = 22. We extract 12,769 patches
of size 32 × 32 (with stride 2) from this image. We conduct experiments with N =50, 80, 110, and 200
clusters and show generated patches in Fig. 3.11. As shown in (a), 50 clusters were too few, and we see the
artifact of over-saturation in generated patches. By increasing N from 50 to 80, the artifact still exists but
is less apparent in (b). The quality further improves when N = 110, as shown in (c). We see little quality improvement when N goes from 110 to 200. Furthermore, patches generated using different thresholds
γ are shown in Fig. 3.12. We see little quality degradation from (a) to (f) while the dimension is reduced
from 1408 to 553. Image blur shows up from (g) to (i), indicating that some details were discarded along
with the corresponding PCA components.
3.5 Conclusion
An explainable, efficient, and lightweight texture generation method, called TGHop, was proposed in this
work. Texture can be effectively analyzed using the multi-stage c/w Saab transforms and expressed in the
form of joint spatial-spectral representations. The distribution of sample texture patches was carefully
studied so that we could generate samples in the core. Based on the generated core samples, we can go
through the reverse path to increase its spatial dimension. Finally, patches can be stitched to form texture
images of a larger size. It was demonstrated by experimental results that TGHop can generate texture
images of superior quality with a small model size and at a fast speed.
Future research can be extended in several directions. Controlling the growth of dimensions of intermediate subspaces in the generation process appears to be important. Is it beneficial to introduce more
intermediate subspaces between the source and the core? Can we apply the same model for the generation
of other images, such as human faces, digits, scenes and objects? Is it possible to generalize the framework to image inpainting? How does our generation model compare to GANs? These are all open and
interesting questions for further investigation.
(a) Synthesized 32 × 32 Patches (b) Stitched Images with 32 × 32 Patches
(c) Synthesized 64 × 64 Patches (d) Stitched Images with 64 × 64 Patches
Figure 3.7: Examples of generated brick_wall texture patches and stitched images of larger sizes, where
the image in the bottom-left corner is the exemplary texture image, and the patches highlighted in red
squared boxes are unseen patterns.
Figure 3.8: Generated patches using different settings, where the numbers indicate the dimensions of S0, S1 and S2, respectively: (a) 3072 → 2560 → 2048, (b) 3072 → 1536 → 768, (c) 3072 → 1280 → 512, (d) 3072 → 768 → 192.
Figure 3.9: Comparison of texture images generated by two DL-based methods and TGHop (from left to
right): exemplary texture images, texture images generated by [33], by [117], two examples by TGHop
without spatial dimension reduction (SDR) and one example by TGHop with SDR.
Figure 3.10: More texture images generated by TGHop.
(a) 50 clusters (b) 80 clusters (c) 110 clusters (d) 200 clusters
Figure 3.11: Generated brick_wall patches using different cluster numbers in independent component histogram matching.
(a) γ = 0 (b) γ = 0.0005 (c) γ = 0.005
(d) γ = 0.01 (e) γ = 0.02 (f) γ = 0.03
(g) γ = 0.04 (h) γ = 0.05 (i) γ = 0.1
Figure 3.12: Generated brick_wall patches using different threshold γ values in SDR.
Chapter 4
GENHOP: An Image Generation Method Based on Successive Subspace
Learning
4.1 Introduction
Unconditional image generation has received increasing attention recently due to impressive results offered by deep-learning (DL) based methods such as generative adversarial networks (GANs), variational
auto-encoders (VAEs), and flow-based methods. Yet, DL-based methods are black-box tools. The end-to-end optimization of networks is a non-convex optimization problem, which is mathematically intractable.
Being motivated by the design of other generative models that allow mathematical interpretation, a new
image generative model is proposed in this work. Our method is developed based on the successive subspace learning (SSL) principle [59, 58, 60, 62] and built upon the foundation of the PixelHop++ architecture [16]. Thus, it is called Generative PixelHop (or GenHop in short). Its high-level idea is sketched
below.
Since high-dimensional input images have complicated statistical correlations among pixel values, generating images directly in the pixel domain is difficult. To address this problem, GenHop contains three
modules: 1) high-to-low dimension reduction, 2) seed image generation, and 3) low-to-high dimension expansion. The first module builds a sequence of high-to-low dimensional subspaces through a sequence of
Figure 4.1: An overview of the GenHop method. A sequence of high-to-low dimensional subspaces is
constructed from source image space with two PixelHop++ units. GenHop contains three modules: 1)
High-to-Low Dimension Reduction, 2) Seed Image Generation, and 3) Low-to-High Dimension Expansion.
whitening processes called the channel-wise Saab transform, where high-frequency components are discarded to lower the dimension. In the second module, called seed image generation, the sample distribution
in the lowest-dimensional subspace is analyzed so that new samples can be generated from white Gaussian noise. In the third
module, GenHop attempts to find the corresponding source image of a seed image through dimension expansion and a coloring mechanism. For dimension expansion, discarded high-frequency components are
recovered via locally linear embedding (LLE). The coloring process is the inverse of the whitening process,
which is achieved by the inverse Saab transform. Experiments are conducted on three datasets, MNIST, Fashion-MNIST and CelebA, to demonstrate that GenHop can generate visually pleasant images whose FID scores are comparable with (or even better than) those of DL-based generative models.
4.2 Review of Related Work
4.2.1 DL-based Generative Models
An image generative model learns the distribution of image samples from a certain domain and then
generates new images that follow the learned distribution. Generally speaking, the design of image generative models involves two pipelines: analysis and generation. The former analyzes the properties of
training image samples, while the latter generates new images after the training is completed. So far, the
best-performing image generative models are all DL-based. DL-based generative methods fall into two categories: adversarial and non-adversarial models. For the adversarial category, generative
adversarial networks (GANs) [36] demand that distributions of training and generated images are indistinguishable by training a generator/discriminator pair through end-to-end optimization. GANs exhibit good
generalization capability and yield visually impressive images. For the non-adversarial category, examples
include Variational Auto-Encoders (VAEs) [56], flow-based methods [23] and GLANN [46]. VAEs learn an
approximation of the density function with an encoder/decoder structure. Flow-based methods transform
the Gaussian distribution into a complex distribution by applying a sequence of invertible transformation functions. GLANN [46] maps images to a feature space obtained by GLO [8] and maps the feature
space to the noise space via IMLE [69]. It achieves the state-of-the-art performance among non-adversarial
methods.
4.2.2 Successive Subspace Learning (SSL)
Traditional spectral analysis, such as the Fourier transform and the principal component analysis (PCA), attempts to capture the global structure but sacrifices images’ local details (e.g., object boundaries). In contrast, local detail can be well described in the spatial domain, yet the pure spatial representation cannot
capture the global information well. To overcome these shortcomings, Kuo et al. [59, 58, 60, 62] proposed
two affine transforms that determine a sequence of joint spatial-spectral representations with different spatial/spectral trade-offs to characterize the global structure and local detail of images at the same time. They
are the Saak transform [60] and the Saab transform [62]. As a variant of the Saab transform, the channel-wise (c/w) Saab transform [16] exploits weak correlations among spectral channels and applies the Saab
transform to each channel individually to reduce the model size without sacrificing the performance. The
mathematical theory is called successive-subspace-learning (SSL). PixelHop [15] and PixelHop++ [16] are
two architectures developed to implement SSL. PixelHop consists of multi-stage Saab transforms in cascade. PixelHop++ is an improved version of PixelHop that replaces the Saab transform with the c/w Saab
transform. SSL has been successfully applied to many application domains. Examples include [12, 130, 52,
80, 132, 133, 131, 53, 84, 111, 99, 98]. It is worthwhile to mention that SSL-based texture synthesis was
studied in [67, 68]. Here, we examine SSL-based image generation that goes beyond texture and demands
a few extensions, such as improved fine detail generation, quality enhancement via local detail generation,
etc.
4.3 Proposed GenHop Method
An overview of the proposed GenHop method is shown in Fig. 4.1, which contains three modules as
elaborated below.
4.3.1 Module 1: High-to-Low Dimension Reduction
A sequence of high-to-low dimensional subspaces is constructed from the source image space through
PixelHop++ [16] as shown in the figure. Each PixelHop++ unit behaves like a whitening operation. It
decouples a local neighborhood (i.e., a block) into DC and AC parts and conducts the principal component
analysis (PCA) on the AC part. This is named the Saab transform. The reason to remove the DC first
is that the ensemble mean of AC part can be well approximated by zero so that the PCA can be applied
without the need to estimate the ensemble mean. The PCA is essentially a whitening process. It removes
the correlation between AC components among pixels in the same block.
To give an example, for an input gray-scale image of size 28x28, we apply the Saab transform to 2x2
non-overlapping blocks, which offer one DC and three AC channels per block, in the first PixelHop unit.
The output is a joint-spatial-spectral representation of dimension 14x14x4, which forms the first subspace.
By setting an energy threshold, we can partition spectral channels into low- and high-frequency channels
whose numbers are denoted by K1,l and K1,h, respectively. Low-frequency channels have larger energy,
representing the main structure of an image, while high-frequency channels have lower energy representing image details. Only low-frequency channels proceed to the next stage. In other words, high-frequency
channels are discarded to lower the dimension and will be estimated via LLE as discussed in Sec. 4.3.3.
High-frequency channels with sufficiently small energy will not be estimated, for example, on the MNIST
dataset. This leads to the sum of K1,l and K1,h being less than 4. The cascade of several PixelHop++ units
yields several subspaces. For images of small spatial resolutions, we adopt two PixelHop++ units as shown
in Fig. 4.1 to ensure a proper spatial resolution in the subspace, which has the lowest joint spatial-spectral
dimension to capture the global structure of an image.
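The low/high-frequency split is a simple thresholding of per-channel energies. A minimal sketch is given below; the threshold value is illustrative and not the one used in our experiments.

```python
import numpy as np

def split_channels(energies, threshold=0.02):
    """energies: per-channel energies (e.g., Saab/PCA eigenvalues) in descending order.

    Channels whose energy fraction reaches the threshold are kept as low-frequency
    channels; the remaining ones are treated as high-frequency channels and either
    discarded or recovered later via LLE."""
    frac = np.asarray(energies, dtype=float) / np.sum(energies)
    low = np.flatnonzero(frac >= threshold)
    high = np.flatnonzero(frac < threshold)
    return low, high

# Example with one DC and three AC channels of a 2x2 block on a gray-scale image.
low, high = split_channels([0.91, 0.05, 0.03, 0.01])
print(low, high)   # [0 1 2] [3] for this toy energy distribution
```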
4.3.2 Module 2: Seed Image Generation
In the training phase, we conduct the following four steps to learn the sample distributions in the lowest
dimensional subspace, as illustrated in Fig. 4.2.
Figure 4.2: Illustration of seed image generation in the lowest-dimensional subspace.
1. Spatial PCA. There exist correlations between spatial pixels in the lowest dimension. They can be
removed by applying PCA to the spatial dimension of each channel, called spatial PCA. Components
with eigenvalues less than a threshold, γ, are discarded. After a sequence of whitening operations,
elements of these vector samples are uncorrelated. However, they may still be dependent. Furthermore, they are not Gaussian distributed.
2. Sample Clustering. We perform k-means clustering on them to generate multiple clusters so that
the sample distribution of each cluster can be simplified. This is especially essential for multi-modal
sample distributions. Examples are given by different columns and rows in Fig. 4.3.
Figure 4.3: Various representative samples from (a) the MNIST and (b) the Fashion-MNIST datasets.
3. Independent Component Analysis (ICA). We perform ICA in each cluster to ensure elements of
vector samples are independent.
4. Cumulative Histogram Matching. We would like to match the cumulative histogram of each
independent component in a cluster with that of a Gaussian random variable of zero mean and unit
variance [67, 68].
In the generation phase, we conduct the following steps, which are the inverse of the operations as
described above.
1. Cluster Selection. The probability of selecting a cluster is defined by the ratio of the number of
samples in that cluster and the total number of samples. We randomly select a cluster based on its
probability.
2. Sample Generation. We generate a random variable using the Gaussian density of zero mean and
unit variance and map it to the corresponding value of the sample distribution in the cluster via
inverse cumulative histogram matching.
3. Inverse ICA. It rebuilds dependency among elements of random vectors.
4. Inverse Spatial PCA. It rebuilds spatial correlations among pixels of each channel.
4.3.3 Module 3: Low-to-High Dimension Expansion
Recovering Discarded Details via LLE. The generated sample in the lowest dimension contains only
low-frequency (LF) components since high-frequency (HF) components are discarded to simplify the seed
generation procedure. HF responses should be generated along the reverse direction to enhance details.
We adopt LLE [100] to achieve this task. LLE is a commonly used technique to build the correspondence between the manifolds of low- and high-resolution images in image super-resolution [10] or restoration [48],
assuming two manifolds have similar local geometries. Here, LLE is used to ensure two things, as illustrated in Fig. 4.4. First, generated LF samples are located on the manifold of training LF samples. Second,
we determine the correspondence between samples of LF components and samples of HF components.
LLE is implemented in small regions of spatial resolutions 2x2 or 3x3 to reduce complexity, as shown in
Fig. 4.5.
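A minimal sketch of this LLE step: for each generated LF patch, find its k nearest LF training patches, solve for the locally linear reconstruction weights, and apply the same weights to the paired HF training patches. The regularization constant and all names are ours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lle_recover_hf(gen_lf, train_lf, train_hf, k=3, reg=1e-3):
    """gen_lf: (m, d_lf) generated low-frequency patches.
    train_lf, train_hf: (n, d_lf) and (n, d_hf) paired training patches.
    Returns (m, d_hf) estimated high-frequency details."""
    nn = NearestNeighbors(n_neighbors=k).fit(train_lf)
    _, idx = nn.kneighbors(gen_lf)
    hf = np.zeros((gen_lf.shape[0], train_hf.shape[1]))
    for i, nbrs in enumerate(idx):
        d = train_lf[nbrs] - gen_lf[i]        # neighbors in local coordinates
        g = d @ d.T + reg * np.eye(k)         # regularized local Gram matrix
        w = np.linalg.solve(g, np.ones(k))
        w /= w.sum()                          # reconstruction weights sum to one
        hf[i] = w @ train_hf[nbrs]            # transfer the weights to the HF manifold
    return hf
```

The same weighted combination of the LF training patches can also be used to fine-tune the generated LF component, which is the effect illustrated in Fig. 4.4.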
Figure 4.4: Illustration of fine-tuning low-frequency and generating high-frequency.
Neighborhood Coloring via inverse Saab Transform. Finally, we build the correlations among
spatial pixels via the inverse Saab transform, which can be interpreted as a coloring process. The Saab
transform parameters are determined by the PCA of AC components of a local neighborhood. The parameters of the inverse transform can be derived accordingly.
4.4 Experiments
4.4.1 Experimental Setup
We conduct experiments on three datasets: MNIST, Fashion-MNIST and CelebA. They are often used for
unconditional image generation. MNIST and Fashion-MNIST contain gray-scale images (i.e. K0 = 1),
while CelebA contains RGB color images. To remove the correlation among the R, G, and B color channels,
we perform pixel-wise PCA to decouple them, yielding three uncorrelated channels denoted by P, Q and
R, as shown in Fig. 4.6. We discard the R channel that has the smallest eigenvalue to reduce the dimension.
To recover the RGB channels, we apply LLE conditioned on generated P and Q channels. As a result,
K0 = 2 for CelebA. Hyper-parameters (K1,l, K1,h, K2,l, K2,h) are set to (2, 1, 4, 3), (2, 2, 4, 4) and (3, 1, 4,
4) for MNIST, Fashion-MNIST and CelebA, respectively. They are chosen to ensure the gradual dimension
transition between two successive subspaces. The eigenvalue threshold, γ, is set to 0.01, 0.01, and 0.03 for MNIST, Fashion-MNIST and CelebA, respectively. The number of nearest neighbors in LLE is adaptively chosen and upper bounded by 3. For CelebA, since the number of training samples of LLE is high, we perform LLE at one location at a time.
Figure 4.5: The detailed procedure of fine-tuning low-frequency (LF) and generating high-frequency (HF) using locally linear embedding (LLE).
Figure 4.6: An illustration of PQR channels obtained by pixel-wise PCA. Figures from left to right are the
original RGB image, RGB image reconstructed from P and Q channels, P channel whose variance is 0.226,
Q channel with variance 0.014 and R channel with variance 0.002.
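The pixel-wise PCA that produces the P and Q channels is an ordinary PCA over RGB triplets; the sketch below is illustrative (names are ours). Keeping two components corresponds to a 2 × 3 projection matrix, i.e., the 6 parameters counted in Sec. 4.4.6.

```python
import numpy as np
from sklearn.decomposition import PCA

def rgb_to_pq(images):
    """images: (n, H, W, 3) RGB images. Returns (n, H, W, 2) P and Q channels and the fitted PCA."""
    n, h, w, _ = images.shape
    pixels = images.reshape(-1, 3).astype(float)
    pca = PCA(n_components=2).fit(pixels)          # 2x3 projection: the 6 stored parameters
    pq = pca.transform(pixels).reshape(n, h, w, 2)
    return pq, pca
```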
4.4.2 Performance Comparison
We compare the performance of GenHop with several representative DL-based generative models in Table
4.1. The performance metric is the Fréchet Inception Distance (FID) score. It is commonly used since
Table 4.1: Comparison of FID scores of the GenHop model and representative adversarial and non-adversarial models. The lowest FID scores are shown in bold, while the second-lowest FID scores are
underlined.
MNIST Fashion CelebA
MM GAN [36] 9.8 29.6 65.6
NS GAN [36] 6.8 26.5 55.0
LSGAN [85] 7.8 30.7 53.9
WGAN [3] 6.7 21.5 41.3
WGAN-GP [38] 20.3 24.5 30.0
DRAGAN [57] 7.6 27.7 42.3
BEGAN [7] 13.1 22.9 38.9
VAE [56] 23.8 58.7 85.7
GLO [8] 49.6 57.7 52.4
GLANN [46] 8.6 13.0 46.3
Ours (GenHop) 5.1 18.1 40.3
both the diversity and fidelity of generated images are considered. By following the procedure described
in [83], we extract the embedding of 10K generated and 10K real images from the test set obtained by the
Inception network and fit them into two multivariate Gaussians, respectively. The difference between the
two Gaussians is measured by the Fréchet distance with their mean vectors and covariance matrices. A
smaller FID score means better performance. The FID scores of representative GAN-based models (listed
in the first section of Table 4.1) are collected from [83] while those of non-adversarial models (the second
section) are taken from [46]. We see from the table that GenHop has the best FID score, 5.1, for MNIST
and the second best 18.1 for Fashion-MNIST, falling behind GLANN, and 40.3 for CelebA, falling behind
WGAN-GP.
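For completeness, FID itself is the closed-form Fréchet distance between two Gaussians fitted to Inception embeddings. A minimal sketch, with the embeddings assumed to be precomputed, is given below.

```python
import numpy as np
from scipy import linalg

def fid(real_feats, fake_feats):
    """real_feats, fake_feats: (n, d) Inception embeddings of real and generated images."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):                  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```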
4.4.3 Generated Exemplary Images
Some exemplary images generated by GenHop are shown in Fig. 4.7, Fig. 4.8 and Fig. 4.9 for visual quality
inspection. For MNIST, the structure of digits is well captured by GenHop with sufficient diversity. For
Fashion-MNIST, GenHop generates diverse examples for different classes. Fine details such as texture on
shoes and printing on T-shirts can be synthesized naturally. For CelebA, the great majority of samples
generated by GenHop are semantically meaningful and realistic. The color of some generated objects is
not natural. This failure is attributed to the lack of global information since LLE is only performed in a
small region of an image.
4.4.4 Visualization of Intermediate Steps
We illustrate the image generation process from 7x7 resolution (the first row) to 14x14 resolution (the
second row) and to 28x28 resolution (the third and the fourth row) for MNIST, Fashion-MNIST, and CelebA
in Figs. 4.10(a), (b), and (c). The third and fourth rows present the generated images before and after LLE-based fine-tuning to demonstrate the effect of locally linear embedding. At 7x7 and 14x14 resolution,
only the DC channel is visualized for ease of visualization. As shown in the figure, the seed determines
the global shape of a generated sample. Further details are added when the resolution increases. Finally,
images are fine-tuned with enhanced smoothness and brightness. The visualization capability of GenHop
shows the power of the approach, which allows one to inspect generated samples from coarse to fine.
4.4.5 Effect of Training Sample Numbers
To show the robustness of the learned generative models, we vary the number of training samples and
compare the FID scores of four generative methods, including GenHop, WGAN, GLANN, and GLO, in
Fig. 4.11, where 20%, 40%, 60%, 80% and 100% from 60,000 training samples are selected for MNIST and
Fashion-MNIST. For MNIST, the FID scores of GLO increase as the training sample size becomes larger,
which reveals its weakness in scalability. The FID scores of both WGAN and GLANN fluctuate slightly
and achieve the best FID scores when the number of training samples reaches 100%. The FID scores of
GenHop are the most stable among the four. It shows that GenHop can capture the sample distribution
well even with 20% of the training samples, and the learned distribution remains stable as the sample size grows, which matches the intuition that additional data should not degrade a well-learned distribution. For Fashion-MNIST, the FID scores of WGAN rise
as the number of training samples increases, revealing its weakness in scalability. The FID scores of GLO
and GLANN decrease as the number of training samples increases. Again, GenHop provides the most
stable FID scores over a wide range of training sample numbers. Compared with the other generative models, we conclude that GenHop can learn stable sample distributions with far fewer training samples while providing the same level of FID scores.
4.4.6 Model Size
The model size is measured by the number of parameters in a model. The size of GenHop is calculated as
below.
• Two-stage c/w Saab Transforms
The forward analysis path and the reverse generation path share the same two-stage c/w Saab transforms. For an input patch, the input tensor of size $2 \times 2 \times K_0 = 4K_0$ is transformed into a $K_1$-dimensional tensor in the first-stage transform, leading to a filter size of $4K_0K_1$ plus $K_0$ shared biases. For each of the $K_{1,\mathrm{low}}$ low-frequency channels, the input vector of size $2 \times 2$ is transformed into a vector in the second-stage transform, and the output dimensions sum to $K_2$. The total parameter number for all $K_{1,\mathrm{low}}$ channels is $4K_2K_{1,\mathrm{low}}$ plus $K_{1,\mathrm{low}}$ biases. Thus, the total number of parameters in the two-stage transforms is $4K_0K_1 + K_0 + 4K_2K_{1,\mathrm{low}} + K_{1,\mathrm{low}}$.
• Seed Analysis and Generation
Sample generation in S4 contains two modules: spatial PCA (SPCA) and cluster-wise independent
components histogram matching (CICHM). For the first module, SPCA is implemented by K2 PCA
transforms, where the input is of size $I_2 \times I_2$ and the output is a $K_{r_i}$-dimensional vector, yielding a PCA transformation matrix of size $I_2^2 K_{r_i}$ for the $i$-th transform. The total number of parameters is $I_2^2 \sum_{i=1}^{K_2} K_{r_i} = I_2^2 \tilde{D}_4$, where $\tilde{D}_4$ is the dimension of the concatenated output vector after SPCA.
For the second module, it has three components:
1. Interval representation $p_0, \ldots, p_{L-1}$
$L$ parameters are needed for $L$ clusters.
2. Transform matrices of FastICA
If the input vector is of dimension $\tilde{D}_4$ and the output dimension of FastICA is $K_{c_i}$ for the $i$-th cluster, $i = 1, \cdots, L$, the total parameter number of all transform matrices is $\tilde{D}_4 F$, where $F = \sum_{i=1}^{L} K_{c_i}$ is the number of CDFs.
3. Codebook size of quantized CDFs
The codebook contains the index, the maximum, and the minimum values for each CDF. Furthermore, we quantize the $F$ CDFs into $W$ clusters, where all CDFs in each cluster share the same bin structure of 100 bins. As a result, the total parameter number is $3F + 100W$.
By adding all of the above together, the total parameter number in seed generation is $I_2^2 \tilde{D}_4 + L + \tilde{D}_4 F + 3F + 100W$.
• Locally Linear Embedding
The only hyper-parameter in Locally Linear Embedding (LLE) is the number of neighbors; LLE itself is a non-parametric method. The input of each LLE module is either the input images or Saab coefficients, and the Saab coefficients can be derived quickly from the input images using the parameters of the above-mentioned c/w Saab transforms.
The model sizes for MNIST, Fashion-MNIST and CelebA in our experiments are summarized in Table 4.2. W = 1000 is fixed for all three datasets. For MNIST, L = 2,680 and F = 21,341. For Fashion-MNIST, L = 2,914 and F = 17,651. For CelebA, L = 9,684 and F = 89,164; its parameters also include a transformation matrix of a pixel-wise PCA that converts RGB channels to PQ channels, yielding 6 parameters.
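To make the bookkeeping above concrete, the short Python sketch below evaluates the closed-form counts derived in this section for the MNIST configuration. The values of K0, K1, K1,low, K2, I2 and D̃4 are not all stated explicitly in the text; they are inferred so that the formulas reproduce the MNIST column of Table 4.2 and should be read as illustrative rather than definitive.
```python
def genhop_param_count(K0, K1, K1_low, K2, I2, D4, L, F, W):
    """Evaluate the closed-form parameter counts derived above."""
    transform_stage1 = 4 * K0 * K1 + K0              # first-stage c/w Saab transform
    transform_stage2 = 4 * K2 * K1_low + K1_low      # second-stage c/w Saab transform
    spca = I2 ** 2 * D4                              # spatial PCA (SPCA)
    cichm = L + D4 * F + 3 * F + 100 * W             # CICHM: intervals + FastICA + codebook
    return transform_stage1 + transform_stage2 + spca + cichm

# MNIST configuration: L, F and W are given in the text; the remaining values
# are inferred from Table 4.2 and are only illustrative.
total = genhop_param_count(K0=1, K1=3, K1_low=2, K2=7, I2=7, D4=126,
                           L=2680, F=21341, W=1000)
print(total)  # 2861914, matching the MNIST total in Table 4.2
```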
Table 4.2: The number of parameters of GenHop for MNIST, Fashion-MNIST and CelebA.
Module MNIST Fashion CelebA
RGB2PQ - - 6
Transform - stage 1 13 17 34
Transform - stage 2 58 66 99
Seed - SPCA 6174 8624 13248
Seed - CICHM(i) 2680 2914 9684
Seed - CICHM(ii) 2,688,966 3,106,576 18,456,948
Seed - CICHM(iii) 164,023 152,953 367,492
Total 2,861,914 3,271,150 18,847,511
4.5 Conclusion and Future Work
A non-DL-based image generation method, called GenHop, was proposed in this work. To summarize,
GenHop conducted the following tasks: 1) removing correlations among pixels in a local neighborhood
via the Saab transform, 2) discarding high-frequency components for dimension reduction, 3) generating
seed images in the lowest-dimensional space using white Gaussian noise, 4) adding back discarded high-frequency components for dimension expansion based on LLE, and 5) recovering correlations of pixels via
the inverse Saab transform. All tasks except seed generation are performed in multiple stages to control the dimension change. GenHop achieved state-of-the-art FID scores on three datasets: MNIST, Fashion-MNIST, and CelebA.
As a future extension, it is desirable to use GenHop to generate images of higher
resolution and more complicated content. It is also interesting to apply GenHop to the context of transfer
learning (e.g., transfer between horses and zebras) and image inpainting.
62
Figure 4.7: Exemplary images generated by GenHop with training samples from the MNIST dataset.
Figure 4.8: Exemplary images generated by GenHop with training samples from the Fashion-MNIST
dataset.
Figure 4.9: Exemplary images generated by GenHop with training samples from the CelebA dataset.
(a) MNIST
(b) Fashion-MNIST
(c) CelebA
Figure 4.10: Illustration of the image generation process from core S4 (the first row) to subspace S2 (the
second row), to source S0 before local detail generation (the third row) and after LLE-based fine-tuning
(the fourth row) for (a) MNIST and (b) Fashion-MNIST (c) CelebA (P channel).
(a)
(b)
Figure 4.11: Comparison of FID scores as a function of training percentages between GenHop, WGAN,
GLANN, and GLO for (a) MNIST and (b) Fashion-MNIST.
Chapter 5
Green Image Generation (GIG): Methodologies and Preliminary Results
5.1 Introduction
Generative models can learn the data distribution and generate new instances that resemble the original
data. Research on image generative modeling is challenging due to the high dimensions of images, the
requirement of large datasets and extensive training time, and the need to find an effective and expressive
representation. Generative models not only need to identify the appearance or location of objects but also
need to understand relationships among objects in an image. This makes generative tasks even more difficult than discriminative tasks such as classification or localization.
Recent breakthroughs in deep learning have sparked a new wave of enthusiasm for generative modeling. Two types of generative models, Generative Adversarial Networks (GANs) [36] and Variational Auto-Encoders (VAEs) [56, 95], have led the charge since 2014.
Figure 5.1: The generation pipeline of GenHop and GIG.
Both VAEs and GANs map the data to a
latent space, but they differ in training procedures and implementation.
VAEs use an encoder to analyze the statistical properties of the source data and a decoder to generate
new images. The encoder is trained to learn a compressed representation of the data, called the latent
space, which is a lower-dimensional representation of the input data. The decoder is then trained to
generate new data points from this latent space. Both modules are trained using variational methods that
allow the generation of new data points similar to the input data.
In contrast, GANs employ a generator to produce images and a discriminator to evaluate the quality
of the generated images. The generator is trained to produce images that are similar to the input data by
learning a mapping between the latent space and the data space. The discriminator is trained to distinguish
between real and fake images, and its feedback is used to improve the generator’s output. The training
process of GANs involves the generator and discriminator competing against each other to improve the
quality of the generated images.
As computational resources have advanced, more and more models have been developed. Among these
models, transformer-based image generative models are a significant breakthrough that uses transformer
architectures commonly employed in natural language processing. Another generative model type that
has recently gained attention is the diffusion model. These models follow a unique approach based on
gradually learning to reverse a diffusion process. They first transform data into a Gaussian distribution
by adding noise and then learn to reverse this process by predicting the added noise to generate new
data step by step. The quality of image generation has been greatly improved with the use of language
guidance.
Despite their ability to generate highly realistic images, deep-learning-based models are computationally expensive. Transformers require significant computational resources due to their self-attention mechanism, and these models often require large amounts of training data to perform well. Diffusion models run an iterative denoising process, leading to slow generation, and they also require considerable computational resources for both training and inference. Another major concern with deep-learning-based generative models is their lack of explainability: because of their multi-layer end-to-end optimization, deep learning is essentially a non-convex optimization problem that is mathematically intractable. These limitations motivate us to explore alternative generative models.
Several generative modeling approaches have been proposed based on successive subspace learning.
NITES [67] and TGHop [68] are two examples of such methods that are used for texture synthesis. GenHop [66] and Pager [4] are two other generative modeling approaches that are used for unconditional and
conditional image generation. These generative models generally contain two modules: Fine-to-Coarse
Image Analysis and Coarse-to-Fine Image Synthesis. The fine-to-coarse image analysis module reduces
the dimension of the image and learns the distribution of the meaningful representations. Coarse-to-Fine
Image Synthesis module first draws samples from the lowest-dimensional space and then generates details.
An illustration is given in Fig. 5.1. There are two limitations of the previous methods. First, the dimensions
of the seed are not sufficiently low, resulting in a larger model size. We noticed a significant decrease in
generation performance when using very low seed dimensions. The second limitation is that the use of
locally linear embedding (LLE) for detail generation results in memory waste and slow generation speed.
LLE is an example-based method that requires storing a database for searching neighborhoods, and in
generation, searching is needed for every query sample. To overcome these limitations, we reformulate
the generative model proposed in GenHop to improve efficiency further.
Similar to previous methods, this method transforms data to a lower-dimensional white noise (i.e., latent space) through an analysis process while generating samples through a reverse generation process.
As the input space has high dimensions and the perceptual properties of source images are poorly understood, analyzing the statistics of image data in the input space is generally difficult. The analysis process
reduces the dimension of the high-dimensional input image space to a set of lower-dimensional spaces
by removing the correlations among pixels. The samples in the lowest-dimensional space preserve the
global structure of source images, and their sample distribution is easier to analyze than in the input space
owing to their lower dimensions and better perceptual understanding. The distribution of samples in a
lower-dimensional space can be modeled into a seed distribution. The generation process is the reversed
version of the forward process. It first generates the global structure of an image and then adds details
to it. The reverse generation process involves generating seeds by sampling from the seed distribution,
followed by conditional detail generation to generate spectral channels and a coloring process to increase
spatial dimension.
The rest of the chapter is organized as follows. The related work is reviewed in Sec. 5.2. The GIG model
is proposed in Sec. 5.3. Experimental results are shown in Sec. 5.4. Finally, concluding remarks and future
extensions are provided in Sec. 5.5.
5.2 Review of Related Work
5.2.1 Deep-Learning based Generative Modeling
Deep learning (DL)-based generative modeling has emerged as a transformative approach in the field of
artificial intelligence. It enables the creation of highly realistic and diverse data representations. Generative Adversarial Networks (GANs) [36] and Variational Autoencoders (VAEs) [56, 95] are central to this
evolution and have revolutionized how machines understand and generate complex data patterns. GANs,
introduced by Goodfellow et al. in 2014, operate through a competitive framework of a generator and a
discriminator. This leads to the generation of high-fidelity images. On the other hand, VAEs, conceptualized by Kingma and Welling in 2013, offer a probabilistic twist to the traditional autoencoder architecture.
They excel in tasks where modeling the distribution of data is crucial. GANs (Generative Adversarial
Networks) often face challenges such as training instability, the occurrence of mode collapse where the
model generates limited varieties of outputs, and difficulty in ensuring the convergence of the generator
and discriminator. VAEs (Variational Autoencoders), on the other hand, can struggle with producing less
sharp and detailed outputs compared to other models, and their probabilistic nature can lead to a trade-off
between sample quality and model complexity.
Autoregressive models form another category of generative models, which differ from GANs and VAEs
in that they generate future data points based on previous observations. This statistical modeling paradigm
has seen significant advancements with the integration of deep learning techniques. Models such as PixelRNN and PixelCNN [119], introduced by van den Oord et al., have marked a breakthrough in image
generation by predicting pixel values in a sequential manner. Autoregressive models explicitly model data
dependencies, resulting in superior performance for sequential tasks. However, this comes at the cost of
computational efficiency for long sequences.
The extension of autoregressive models has been exemplified by the Transformer architecture. The
Vision Transformer (ViT) [24] by Dosovitskiy et al. demonstrates that transformers can be effectively
adapted for image-related tasks. These models capitalize on the self-attention mechanism to capture
global dependencies within images, enabling a more holistic understanding of visual data compared to
traditional convolutional approaches. The key strength of these models lies in their ability to learn rich,
high-dimensional data representations, facilitating the generation of highly detailed and diverse images.
Transformer-based image generative models, while powerful, often suffer from high computational demands, scalability issues, and a need for large-scale datasets, which can lead to challenges in training
efficiency and model generalization.
Diffusion models have recently emerged as a groundbreaking development in the field of generative
modeling, offering a novel approach that contrasts with the traditional frameworks. Rooted in the concept
of gradually learning to reverse a diffusion process, these models transform data into a Gaussian distribution and then learn to reverse this process to generate new data. The work by Sohl-Dickstein et al. [107]
laid the foundational theory, which was later expanded upon by Ho et al. [45] in their introduction of
Denoising Diffusion Probabilistic Models (DDPMs). These models have demonstrated exceptional performance in generating high-quality images, rivaling and, in some aspects, surpassing the capabilities of
GANs, particularly in terms of image diversity and the avoidance of mode collapse. Moreover, their inherent stochastic nature allows for a more controlled and varied generation process. Recent advancements
have further optimized these models in terms of sampling efficiency and training stability, making them
more practical for a broader range of applications.
Deep-learning-based models have the ability to create visually appealing images and display some level
of generalization capability. However, these models have certain limitations. They are generally black-box
solutions that are expensive to deploy and difficult to train. Due to these limitations, there is a growing
need for the development of mathematically transparent models that are efficient in terms of both time
and space complexity.
5.2.2 Green-learning-based generative models
Green learning (GL) has been proposed as an alternative paradigm to address the transparency and efficiency concerns in DL-based models. New and powerful tools have been developed in the last several
years, e.g., the Saak [60] and Saab [62] transforms as the foundation for obtaining image representations, the
PixelHop [15], PixelHop++ [16] and IPHop [125] learning systems for generating more expressive representations, and the discriminant and relevant feature tests [126] for selecting discriminant features based on labels.
There are preliminary efforts in the development of GL generative models, e.g., NITES [67], TGHop [68],
GenHop [66], Pager [4]. GL-based generative models generally contain two modules: Fine-to-Coarse Image Analysis and Coarse-to-Fine Image Synthesis. The fine-to-coarse image analysis module reduces the
dimension of the image and learns the distribution of the meaningful representations. Coarse-to-Fine Image Synthesis module first draws samples from the lowest-dimensional space and then generates details.
In addition to the green-learning-based generative models, there are several other research papers
that focus on proposing efficient and effective solutions for image generation. A recent paper by Granot
et al. [37] introduces a single-image generative solution, which can be applied in various areas such as
retargeting, conditional inpainting, etc. Another paper by Richardson et al. [96] compares the Gaussian
Mixture Model (GMM) and GANs. GMMs can generate realistic samples, though less sharp than those from GANs. However, GMMs can capture the full data distribution, which GANs fail to do. Additionally, GMMs allow
efficient inference and explicit representation of the underlying statistical structure.
In the following sections, we investigate previous GL-based image generation solutions and propose
a few necessary improvements. These include dimension control in seed generation, fine detail generation using a model-based method instead of an example-based method, and a new model-based image
refinement scheme.
5.3 Proposed GIG Method
5.3.1 Overview
The proposed generative model consists of a forward decomposition process and a reverse generation
process. An illustration of two processes is given in Fig. 5.2 and Fig. 5.3.
The forward decomposition process serves two purposes. One is to reduce the dimension of the high-dimensional input image space to a set of lower-dimensional spaces. The reason is that, due to the complicated statistical correlations among pixels, the probability distribution of the high-dimensional image space is often more intricate to model than that of a low-dimensional space. The other is to find joint spatial-spectral representations that can capture both an image’s global structure and its explicit details. The forward decomposition process applies a series of whitening operations to decompose an image into joint spatial-spectral
representations. As whitening processes take place, the spatial dimension of these joint spatial-spectral
representations decreases. The distribution of the lowest-dimensional representation, or seed, is learned
by a statistical model. The seed represents the global structure of an image. The detail of an image is represented by the representation of middle stages, and its distribution will be modeled through conditional
detail learning.
The generation process is the reversed version of the forward decomposition process. It first generates
the global structure of an image and then adds details to it. The reverse generation process involves
generating seeds by sampling from the seed distribution, followed by conditional detail generation to
generate spectral channels and a series of coloring processes to increase spatial dimension.
Figure 5.2: Forward Decomposition Process.
Figure 5.3: Inverse Generation Process.
5.3.2 Forward Decomposition Process
The forward decomposition process has two components: 1) a series of whitening processes and 2) seed distribution learning. The whitening process transforms a spatially local neighborhood into spectral channels.
A joint spatial-spectral representation with a lower spatial dimension and more spectral channels than the
original image is obtained by maintaining the same spatial arrangement of the local neighborhoods as
in the original image. After several stages of whitening, an image is converted into a representation of
the lowest spatial dimension, named seed. We apply a mixture of factor analyzers to model the sample
distribution of seed and then sample from the distribution in the reverse generation process.
5.3.2.1 Whitening
Modeling the probability distribution of high-dimensional images is difficult due to the complicated statistical correlations among pixels. To alleviate the difficulty of analyzing the distribution of image data,
we perform a series of whitening operations to reduce dimension and simplify the distribution analysis. We
use the channel-wise Saab transform [16] for whitening following our previous work [66]. It exploits the
correlation among pixels in a local neighborhood (i.e., block) and transforms a spatial local neighborhood
into DC and AC parts. The DC kernel is a normalized all-ones vector and plays the role of averaging pixels
in a block. AC kernels are obtained by the principal component analysis (PCA), which is essentially a
whitening process. This is known as the Saab transform [62].
An example is given in Fig. 5.2. Given an input gray-scale image of size 28x28, we perform the Saab
transform on 2x2 non-overlapping blocks, yielding one DC and three AC channels in the first whitening
unit. After rearranging blocks based on their original locations, the output is a joint-spatial-spectral representation of spatial dimension 14x14 with four channels. We only allow the DC channel to proceed to
the next stage as the input of the next whitening unit. Another three AC channels are discarded to lower
the dimension and will be estimated in the conditional detail learning unit, as stated in Sec. 5.3.3.2. The second whitening unit performs the same as the first one, obtaining a joint spatial-spectral representation of
spatial dimension 7x7 with one DC and three AC channels. The third whitening unit gets the DC channel
of size 7x7x1 and performs Saab transform with kernel 7x7 on it to transform all spatial correlations into
spectral, yielding one DC value and K AC values. We will estimate DC and ACs in a different way, as
discussed in the next section.
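A minimal Python sketch of one such whitening unit is given below. It applies a Saab-style transform to 2x2 non-overlapping blocks using a normalized all-ones DC kernel and PCA-derived AC kernels. It is only an illustration of the idea rather than the exact channel-wise Saab implementation used in this work; for instance, it relies on scikit-learn's PCA, which centers the data.
```python
import numpy as np
from sklearn.decomposition import PCA

def whitening_unit(images):
    """One Saab-style whitening unit on 2x2 non-overlapping blocks.

    images: (N, H, W) array with even H and W. Returns the DC channel of shape
    (N, H//2, W//2), the AC coefficients of shape (N, H//2, W//2, 3), and the
    fitted PCA, which is kept for the inverse "coloring" step.
    """
    N, H, W = images.shape
    blocks = images.reshape(N, H // 2, 2, W // 2, 2).transpose(0, 1, 3, 2, 4)
    blocks = blocks.reshape(-1, 4)                    # every 2x2 block as a 4-D vector
    dc_kernel = np.ones(4) / np.sqrt(4)               # normalized all-ones DC kernel
    dc = blocks @ dc_kernel                           # DC coefficients (block averages)
    residual = blocks - np.outer(dc, dc_kernel)       # remove the DC component
    pca = PCA(n_components=3)                         # AC kernels via PCA (whitening)
    ac = pca.fit_transform(residual)
    return dc.reshape(N, H // 2, W // 2), ac.reshape(N, H // 2, W // 2, 3), pca
```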
To address the diversity and heterogeneity of samples, we conduct an adaptive Saab transform in each
whitening unit. We cluster 2x2 blocks into several groups based on their neighborhoods in the preceding
stage. In the example given in Fig. 5.4, a 2x2 block at 14x14 resolution (colored in green) is transformed
to a DC coefficient at 7x7 resolution (colored in blue) via the Saab transform. We adopt the neighborhood
region of the DC coefficient to decide how this Saab transform should be performed, expecting that DC
coefficients with similar neighborhoods should be obtained by the same Saab transform. Note that this
neighborhood could not be collected at 14x14 resolution because, in the generation process, we will first
generate samples at 7x7 resolution and then generate samples at 14x14 resolution conditioned on the 7x7
one. This will be further discussed in Sec. 5.3.3.2. We perform clustering with the neighborhood regions.
Each 2x2 block is then assigned to a cluster. Finally, we learn an adaptive Saab transform in each cluster.
5.3.2.2 Seed Distribution Learning
As discussed in the previous section, we have reduced an image to a low-dimensional vector that consists of one DC value
and the AC values. As the DC and AC parts are inherently different,
we estimate them in two different ways. We first estimate the joint distribution of AC values through a
mixture of factor analyzers (MFA) [35]. It serves two purposes: to reduce the dimension further and to map
the sample distribution to a Gaussian distribution. We then estimate the DC value through an XGBoost
regressor, taking AC values as the input.
Figure 5.4: Cluster-wise whitening and coloring.
One possible statistical model for learning ACs’ distribution is the Gaussian Mixture Model (GMM).
However, its parameters in the covariance matrix and the number of Gaussians required to construct the
densities grow explosively with the dimension of the input data. In order to address the computational
concern, we utilize a mixture of factor analyzers to learn the distribution of ACs. This method combines
clustering and dimensionality reduction, allowing different low-dimensional approximations to model distinct regions of the data space. Each factor analyzer reduces a large number of variables to a smaller number
of factors by extracting the maximum common variance from all variables. The model for a single factor
analyzer with $x \in \mathbb{R}^p$ is
$$x = Wz + \mu + \epsilon, \qquad (5.1)$$
where $z \sim N(0, I)$ is a latent variable called a factor whose dimension is $k$ with $k \ll p$, $\mu$ is the $p$-dimensional mean vector, and $\epsilon \sim N(0, D)$ is white noise with a $p \times p$ diagonal covariance matrix $D$. This results in the Gaussian distribution $x \sim N(\mu, WW^T + D)$. The free parameters are the $p$-dimensional mean vector, the $p \times k$ loading matrix $W$, and the noise covariance $D$, yielding $p(k + 2)$ parameters.
One drawback is that factor analysis assumes variables are linear combinations of factors, limiting
its ability to model nonlinear distributions. MFA alleviates this issue by using the linear combination of
multiple factor analyzers. Like Gaussian Mixture Models (GMM), MFA can model a complicated distribution P(x) with a linear combination of several simple distributions,
$$P(x) = \sum_{i=1}^{M} \int P(x \mid z, \omega_i)\, P(z \mid \omega_i)\, P(\omega_i)\, dz, \qquad (5.2)$$
where $\pi_i = P(\omega_i)$ are the mixing coefficients, $P(x \mid z, \omega_i) \sim N(\mu_i + W_i z, D)$ and $P(z \mid \omega_i) \sim N(0, I)$.
An MFA model with M components requires M[p(k + 2) + 1] free parameters, including the parameters
of M factor analyzers and M mixing coefficients. This is much less than the M[p(p + 1) + 1] parameters
in a GMM with M components when k ≪ p. MFA can reduce the memory and complexity growth with
dimensions from quadratic, as in standard GMM, to linear. To address computational concerns for large
datasets, instead of using the expectation-maximization (EM) algorithm to estimate these parameters, we
optimize the log-likelihood using stochastic gradient descent on a GPU, as in [96].
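The following sketch illustrates Eq. (5.1) by drawing samples from a single factor analyzer with random placeholder parameters and checking numerically that their covariance approaches $WW^T + D$. It is a demonstration of the model, not the MFA training code used in this work.
```python
import numpy as np

rng = np.random.default_rng(0)
p, k, n = 8, 2, 200_000                       # observed dim, factor dim (k << p), samples

W = rng.standard_normal((p, k))               # loading matrix (random placeholder)
mu = rng.standard_normal(p)                   # mean vector
d = rng.uniform(0.05, 0.2, size=p)            # diagonal of the noise covariance D

z = rng.standard_normal((n, k))               # z ~ N(0, I)
eps = rng.standard_normal((n, p)) * np.sqrt(d)
x = z @ W.T + mu + eps                        # Eq. (5.1): x = Wz + mu + eps

empirical = np.cov(x, rowvar=False)
theoretical = W @ W.T + np.diag(d)            # x ~ N(mu, WW^T + D)
print(np.abs(empirical - theoretical).max())  # small for large n
```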
Figure 5.5: Seed Generation.
5.3.3 Reverse Generation Process
The generation process is the reversed version of the forward decomposition process. It consists of three
components: 1) seed generation, 2) coarse-to-fine image generation, and 3) image refinement in fine grids.
Seed Generation generates seeds by sampling from the seed distribution to decide the global structure of an
image. Coarse-to-fine image generation consists of a series of conditional detail generation and coloring.
Conditional detail generation generates three AC channels based on the generated DC channel. The DC
and AC channels are transformed to increase the spatial resolution through coloring, i.e., the inverse Saab
transform. To further boost the quality of the generated images, we perform image refinement in the local
area of the generated images at each stage.
5.3.3.1 Seed Generation
We first generate the AC part of the seed and then generate the DC part. We can easily draw samples
from the learned distribution after we model the ACs’ distribution using MFA. Firstly, we determine the
number of samples to be produced in each Gaussian component based on the mixing coefficients $\pi_i$ learned during the modeling process. For each component, we draw $z$ from the Gaussian distribution $P(z \mid \omega_i) \sim N(0, I)$ and use Eq. 5.1 to construct the corresponding generated sample $x$, i.e., the ACs of the generated seed. Then we adopt an XGBoost regressor to predict the corresponding DC given the generated ACs. By concatenating the DC with the ACs, we obtain the generated seed samples.
In the following section, we will discuss how to enhance the resolution of the seed and generate the fine
details.
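A sketch of the seed generation step is given below. It assumes the MFA parameters (mixing coefficients, per-component means, loading matrices, and a shared diagonal noise covariance) and a fitted DC regressor are available; the helper names and the toy regressor at the bottom are illustrative, not part of the actual implementation.
```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)

def sample_seeds(n, mix, means, loadings, noise_diag, dc_regressor):
    """Draw n seeds: sample AC vectors from the MFA, then predict the DC with XGBoost.

    mix: (M,) mixing coefficients; means[i]: (p,); loadings[i]: (p, k);
    noise_diag: (p,) diagonal of D; dc_regressor: fitted regressor mapping ACs -> DC.
    """
    counts = rng.multinomial(n, mix)                        # samples per mixture component
    acs = []
    for i, c in enumerate(counts):
        z = rng.standard_normal((c, loadings[i].shape[1]))            # z ~ N(0, I)
        eps = rng.standard_normal((c, len(means[i]))) * np.sqrt(noise_diag)
        acs.append(z @ loadings[i].T + means[i] + eps)                # Eq. (5.1)
    acs = np.concatenate(acs)
    dc = dc_regressor.predict(acs)                          # DC conditioned on the ACs
    return np.column_stack([dc, acs])                       # seed = [DC, ACs]

# illustrative usage with random MFA parameters and a toy DC regressor
p, k, M = 3, 2, 4
mix = np.ones(M) / M
means = [rng.standard_normal(p) for _ in range(M)]
loadings = [rng.standard_normal((p, k)) for _ in range(M)]
noise_diag = np.full(p, 0.01)
toy_ac = rng.standard_normal((200, p))
dc_reg = XGBRegressor(n_estimators=50).fit(toy_ac, toy_ac.sum(axis=1))
seeds = sample_seeds(100, mix, means, loadings, noise_diag, dc_reg)
```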
5.3.3.2 Coarse-to-Fine Image Generation
Conditional Detail Generation. In the forward decomposition process, we only allow the DC channel to
proceed to the next stage to reduce the dimension. However, discarding AC channels directly will lead to
poor generation quality in the reverse generation process as these channels depict the details of an image.
Thus, we must generate three AC channels conditioned on the DC channel in the reverse generation process. Specifically, we generate one AC value at a single location at a time conditioned on its corresponding
DC’s neighborhood. An example is given in Fig. 5.6. We collect the neighborhood of each location on
the DC channel, extract features from the neighborhood, use the resulting coefficients as the input, and
output the AC value at the center location. We utilize three XGBoost regressors to fulfill this task for three
AC channels. During the forward process, we collect the neighborhood patch on the DC channel and its
center values on the AC channels of training images to train XGBoost regressors; during the generation
process, we predict the center AC values given the generated DC channel. To further improve the generation quality, we also train another XGBoost regressor to predict the DC value from its original DC
neighborhood patch and the three generated AC values.
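The sketch below illustrates this conditional detail generation with XGBoost regressors: neighborhood patches of the DC channel serve as input features, and one regressor per AC channel predicts the center AC value. The 3x3 neighborhood size and zero padding are illustrative assumptions, not settings taken from the thesis.
```python
import numpy as np
from xgboost import XGBRegressor

def extract_patches(dc, size=3):
    """Collect size x size neighborhoods around every DC location (zero padding at borders)."""
    pad = size // 2
    padded = np.pad(dc, ((0, 0), (pad, pad), (pad, pad)))
    N, H, W = dc.shape
    feats = np.stack([padded[:, i:i + H, j:j + W]
                      for i in range(size) for j in range(size)], axis=-1)
    return feats.reshape(-1, size * size)

def fit_detail_regressors(dc, ac):
    """Forward pass: one regressor per AC channel, DC neighborhood -> center AC value."""
    X = extract_patches(dc)
    return [XGBRegressor(n_estimators=100).fit(X, ac[..., c].reshape(-1))
            for c in range(ac.shape[-1])]

def generate_details(dc, regressors):
    """Generation pass: predict the AC channels for a generated DC channel."""
    X = extract_patches(dc)
    N, H, W = dc.shape
    return np.stack([r.predict(X).reshape(N, H, W) for r in regressors], axis=-1)
```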
Coloring. With the generated DC and three AC channels, we perform the coloring to enhance the
image’s resolution. We utilize the inverse Saab transform for coloring, which builds the correlations among
spatial pixels. The parameters of the Saab transform are determined by PCA in the forward whitening
process. We can easily derive the parameters of the inverse transform by transposing the transformation
matrix.
By repeating the conditional detail generation and coloring steps one after the other, we can finally
generate an image from coarse to fine, as shown in Fig. 5.3.
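A sketch of the coloring step is given below. It inverts the whitening unit sketched in Sec. 5.3.2.1 by applying the inverse PCA (i.e., the transposed transformation matrix) to the AC coefficients and adding the DC component back before rearranging 2x2 blocks into a higher-resolution image. The `pca` argument refers to the PCA object fitted in that earlier sketch, so this is an illustration under those assumptions.
```python
import numpy as np

def coloring_unit(dc, ac, pca):
    """Inverse of the whitening unit sketched earlier: rebuild 2x2 blocks from the
    DC channel (N, h, w) and AC coefficients (N, h, w, 3) using the transposed PCA."""
    N, h, w = dc.shape
    dc_kernel = np.ones(4) / np.sqrt(4)
    residual = pca.inverse_transform(ac.reshape(-1, 3))      # inverse (transposed) AC kernels
    blocks = residual + np.outer(dc.reshape(-1), dc_kernel)  # add the DC component back
    blocks = blocks.reshape(N, h, w, 2, 2).transpose(0, 1, 3, 2, 4)
    return blocks.reshape(N, 2 * h, 2 * w)
```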
5.3.3.3 Image Refinement in Fine Grids
We further improve the quality by refining the generated image in the local area, which can add texture
details and sharpen the edges. We collect the local neighborhood of each pixel on the training images,
extract the features as the input, and learn XGBoost regressors to predict the value of the center pixel. We
apply the regressors on the generated images in fine grids after the coloring step during the generation to
refine the quality of the generated images.
To enhance the regression power and reduce the computational burden, we divide the samples into
several groups and have one XGBoost regressor for each group. We first cluster the neighborhood patches
into several clusters. We then calculate the Fréchet distance (FD) of the mean vectors and covariance
matrices between the training and the generated samples in each cluster. We finally divide the clusters
into several groups according to their FD scores. For the neighborhood samples in the group whose clusters
have similar FD scores, we adopt an XGBoost regressor to predict the refined value of the center pixel.
5.4 Experimental Results
5.4.1 Performance Benchmarking
We measure the performance of generative methods by assessing the diversity and quality of the generated images. We adopt the commonly used Fréchet Inception Distance (FID) [41] as the measure. FID first
extracts the features for generated images and real images with Inception Net, fits with two multi-variate
Gaussians, and computes the Wasserstein-2 distance between the means and covariance matrices. It is
adept at detecting high-quality generated samples, agrees with human perceptual judgments and human
rankings of models, and has low sample and computational complexity. It also has moderate sensitivity
to overfitting, mode collapse, and mode drop, and a moderate ability to penalize trivial models such as
the memory GAN, which simply stores all training samples. FID is sensitive to image distortions and
transformations and has the ability to detect intra-class mode dropping.
The embeddings of generated and real images are first extracted by an Inception network. Then, we
fit each set into a multivariate Gaussian. Given mean vectors $m_g, m_t$ and covariance matrices $\Sigma_g, \Sigma_t$, the difference between the two Gaussians can be measured by the Wasserstein-2 distance as
$$FID = \|m_g - m_t\|_2^2 + \mathrm{Tr}\big(\Sigma_g + \Sigma_t - 2(\Sigma_g \Sigma_t)^{1/2}\big). \qquad (5.3)$$
A smaller FID score means better performance. By following the procedure given in [83], we compute the
FID score using 10K images generated by a model and 10K images from the test set. Training images in a
dataset are only used to train the model, and are not used in testing.
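A minimal implementation of Eq. (5.3) from precomputed statistics is shown below; extracting the Inception embeddings and estimating the means and covariances are assumed to be done separately.
```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(m_g, sigma_g, m_t, sigma_t):
    """Eq. (5.3): Wasserstein-2 distance between two Gaussians fitted to the
    Inception embeddings of generated and real images."""
    covmean = sqrtm(sigma_g @ sigma_t)
    if np.iscomplexobj(covmean):            # discard tiny numerical imaginary parts
        covmean = covmean.real
    diff = m_g - m_t
    return float(diff @ diff + np.trace(sigma_g + sigma_t - 2.0 * covmean))
```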
A large-scale study was conducted in [83] to evaluate the performance of GAN-based models. It
fixed the architecture of benchmarking generative models as InfoGAN and searched for their best hyperparameters. The performance of several non-adversarial generative models was studied in [46]. They
utilized the same generator architecture as that in [83] but introduced an additional network for sampling
and the VGG perceptual loss. We compare the FID scores of our GIG method with representative GAN-based (adversarial) models in [83], non-adversarial models in [46], whose FID scores are reported by [83]
and [46], and two green-learning (GL)-based methods, GenHop [66] and PAGER [4], in Table 5.1.
Table 5.1: Comparison of FID scores of our GIG model and representative GL-based and DL-based models.
The lowest FID scores are shown in bold, while the second lowest FID scores are underlined.
MNIST Fashion
MM GAN [36] 9.8 29.6
NS GAN [36] 6.8 26.5
LSGAN [85] 7.8 30.7
WGAN [3] 6.7 21.5
WGAN-GP [38] 20.3 24.5
DRAGAN [57] 7.6 27.7
BEGAN [7] 13.1 22.9
VAE [56] 23.8 58.7
GLO [8] 49.6 57.7
GLANN [46] 8.6 13.0
GenHop [66] 5.1 18.1
PAGER [4] 9.5 19.3
Ours 8.4 23.1
5.4.2 Visualization of Image Generation
Some example images generated by GIG are shown in Fig. 5.7 for visual quality inspection. GIG generates well-captured structures and diverse examples for both the MNIST and Fashion-MNIST datasets. The model can naturally synthesize fine details such as texture on shoes and printing on T-shirts.
5.4.3 Model Size
The model size is measured by the number of parameters in a model. The FLOPs (floating-point operations)
needed for the generation process measure the generation speed. We compare the FLOPs and model size
of our GIG method with GANs and GenHop [66] in Table 5.2. The GANs listed in Table 5.1 are tested using
the network architecture of InfoGAN [13]. Thus, we only list the FLOPs and model size of InfoGAN in
Table 5.2. We also count the dictionary needed by LLE in GenHop in the model size calculation. GenHop
requires a huge number of FLOPs due to searching for the nearest neighbors based on distances. Our GIG
method requires seven times fewer FLOPs and a 14 times smaller model size than InfoGAN.
Modules FLOPs Model size
InfoGAN [13] 92.34M (7X) 13.32M (14X)
GenHop [66] 1.187B (86X) 4.01M (4X)
Ours 13.74M (1X) 0.95M (1X)
Table 5.2: Comparison of FLOPs and model size of our proposed GIG method, InfoGAN, and GenHop.
5.5 Conclusion and Future Work
We addressed two limitations of the previous methods and proposed a new green image generation method
named GIG. To further reduce the dimension of the seed and thus the model size, we adopt an MFA
for seed learning and generation. To avoid using locally linear embedding (LLE) for detail generation,
which wastes memory and slows down generation, we proposed replacing LLE with XGBoost regressors. Similar to previous methods, this method transforms data to a lower-dimensional white noise (i.e.,
latent space) through an analysis process while generating samples through a reverse generation process.
The analysis process reduces the dimension of the high-dimensional input image space to a set of lower-dimensional spaces by removing the correlations among pixels. The samples in the lowest-dimensional
space preserve the global structure of source images, and their sample distribution is easier to analyze than
in the input space owing to their lower dimensions and better perceptual understanding. The distribution
of samples in a lower-dimensional space can be modeled into a seed distribution. The generation process
is the reversed version of the forward process. It first generates the global structure of an image and then
adds details to it. The reverse generation process involves generating seeds by sampling from the seed distribution, followed by conditional detail generation to generate spectral channels and a coloring process
to increase spatial dimension.
GIG offered an explainable, efficient, and high-performance solution for image generative modeling
whose FID scores are comparable to those of deep-learning-based generative models on the MNIST and Fashion-MNIST datasets. It has a small model size and is fast in training and inference. This alternative image
generation methodology is attractive since it has a closed-form solution in theory, is easy to implement,
and has a great potential for further generalization. We will investigate its power in the context of transfer
learning in the next chapter.
Figure 5.6: Conditional Detail Generation.
(a) MNIST (b) Fashion-MNIST
Figure 5.7: Some example images generated by GIG for (a) MNIST and (b) Fashion-MNIST.
Chapter 6
Green Image Label Transfer (GILT): Methodologies and Preliminary
Results
6.1 Introduction
Transfer learning has emerged as a transformative strategy in the field of machine learning, offering a
pragmatic approach to leveraging pre-existing knowledge for new but related tasks. Traditionally, machine
learning models are trained from scratch on a specific task, requiring extensive data and computational
resources. Transfer learning, however, deviates from this norm by utilizing a model trained on one task
(source task) as the starting point for learning a second related task (target task). This shift is particularly
useful when data is scarce, or training is computationally prohibitive. The core premise of transfer learning
lies in its ability to transfer knowledge from one domain to another. This idea is based on the observation
that some features or learned representations can be useful in different but related tasks. For instance, a
neural network that was trained to recognize objects in cartoon animations can use its learned features to
help identify objects in photographs. This adaptability helps speed up the training process and improves
the model’s performance on the target task, especially when there is limited data available.
The conceptual roots of transfer learning can be traced back to cognitive science and the understanding
of how humans apply knowledge from past experiences to solve new problems. In machine learning,
this concept has evolved with the development of advanced algorithms and deep learning techniques. Its
relevance has grown in parallel with the increasing availability of large pre-trained models, especially in
domains like computer vision and natural language processing. Despite its advantages, transfer learning
is not without challenges. Issues such as negative transfer, where irrelevant knowledge from the source
task hampers the learning of the target task, and domain discrepancy, where the difference between source
and target domains is substantial, pose significant hurdles. Moreover, determining the extent and method
of transferring knowledge, including which parts of a model to fine-tune and how to adjust them for the
target task, requires careful consideration. As the field of machine learning continues to evolve, transfer
learning stands as a cornerstone, offering a pathway to more efficient, effective, and accessible machine
learning models.
In this chapter, we focus on the domain adaptation (DA) task, which involves adapting a model trained
on one domain (source domain) to work effectively on a different but related domain (target domain). This
task is crucial in scenarios where labeled data is abundant in one domain but scarce in another. Domain
adaptation has become increasingly important with the realization that models often fail to generalize well
across different data distributions. The early approaches tackled the DA problem through feature engineering, where features were manually crafted to be invariant across domains. This approach was heavily
dependent on expert knowledge and was not scalable. Statistical alignment methods focused on reducing
the distributional discrepancies between the source and target domains. Techniques like Maximum Mean
Discrepancy (MMD) and Correlation Alignment (CORAL) [108] were used to align the statistical properties of the source and target datasets. The advent of deep learning then revolutionized domain adaptation.
Deep models can learn more transferable features, and techniques like domain-adversarial training were
developed. This involves training a model to learn domain-invariant features from which the source and target domains cannot be distinguished. Recent advances focus on unsupervised domain adaptation, where
no labeled data is available in the target domain, and semi-supervised approaches, where only a small
amount of labeled data is available. However, deep learning-based DA models can be complex and prone
to overfitting, especially when the amount of target domain data is limited.
In this chapter, we propose a new paradigm for the unsupervised image domain adaptation task,
taking the model efficiency and explainability into consideration. Our approach is named Green Image
Label Transfer (GILT), which can be categorized into statistical alignment methods. It does not rely on
neural networks to learn domain-invariant features. It first learns a joint discriminant subspace to extract
and select features that are invariant across the source and target domains. Then, we conduct source-to-target label transfer to transfer the labels from the source domain to the target domain. Finally, a weakly
supervised learning method is performed in the target domain.
The rest of the chapter is organized as follows. The related work is reviewed in Sec. 6.2. The GILT
method is described in Sec. 6.3. Experimental results are shown in Sec. 6.4. Finally, concluding remarks
and future extensions are provided in Sec. 6.5.
6.2 Related Work
6.2.1 Traditional Methods
Traditional methods focused on designing simple techniques to align the features in source and target
domains. [19] introduces a remarkably simple adaptation technique based on data augmentation for the
case with sufficient target data. [28] introduces a novel algorithm where data from both source and target
domains are mapped to a domain-invariant feature space through eigen-decomposition. It is fast because it admits a closed-form solution. [108] introduces CORrelation ALignment (CORAL),
an unsupervised approach that aligns the second-order statistics of the source and target distributions to
minimize domain shift. Despite its simplicity in implementation, CORAL demonstrates remarkable performance on benchmark datasets.
6.2.2 Deep-learning based Methods
Deep-learning-based methods elevate the domain adaptation to a new level. [114] proposes to add a domain classifier, i.e., a fully connected layer. It predicts the binary domain label of the inputs and designs a
domain confusion loss to encourage its prediction to be as close as possible to a uniform distribution over
binary labels. The gradient reversal algorithm (ReverseGrad) proposed in [31] also treats domain invariance as a binary classification problem, but directly maximizes the loss of the domain classifier by reversing
its gradients. [32] presents a novel representation learning method for domain adaptation, leveraging features that are agnostic to the differences between domains. It introduces a new gradient reversal layer for
easy implementation with the existing DL frameworks. [79] introduces the Coupled Generative Adversarial Network (CoGAN) designed for learning joint distributions of images across multiple domains by
incorporating a weight-sharing constraint. [115] presents a unified framework for adversarial adaptation
and introduces a novel approach - Adversarial Discriminative Domain Adaptation (ADDA), which combines discriminative modeling, untied weight sharing, and GAN loss. [109] introduces an enhancement
of "Deep CORAL" with custom CORAL loss, building upon the foundation of CORAL from [108]. [75] introduces the Source HypOthesis Transfer (SHOT) framework, which employs information maximization
and self-supervised pseudo-labeling to align target domain representations with the source hypothesis.
The versatility of SHOT is demonstrated through its application to various adaptation scenarios, including
closed-set, partial-set, and open-set domain adaptation, as well as settings where the source data must remain private. [81] proposes a Deep Adaptation Network (DAN), where hidden representations in task-specific layers are placed
in a reproducing kernel Hilbert space, allowing for explicit matching of mean embeddings from different
domains, and optimal multi-kernel selection method is adopted for further reducing domain discrepancy.
[82] introduces conditional adversarial domain adaptation (CDAN), which conditions adversarial models on the discriminative details inherent in classifier outcomes by utilizing multilinear conditioning and
entropy conditioning. [22] introduces the Cluster Alignment with a Teacher (CAT) approach for unsupervised domain adaptation, which utilizes an implicit ensembling teacher model to identify the class-conditional structure in the feature space for the unlabeled target domain and allows for forcing features
from both domains to form distinctive and aligned clusters. [124] introduces the Adaptive Feature Norm
method, which introduces no additional parameters. The innovative aspect of this approach lies in its capacity to
adaptively adjust the feature norms from both the source and target domains across a broad range of values,
which leads to significant transferability. [14] introduces Batch Spectral Penalization (BSP), which boosts the otherwise overshadowed eigenvectors to enhance feature discriminability while maintaining transferability. [121] introduces Transferable Normalization (TransNorm), a substitute for conventional normalization
methods, which can be effortlessly integrated into existing methods without additional hyper-parameters
or learnable parameters. [101] introduces the Adversarial Dropout Regularization (ADR) method, which
emphasizes the generation of more discriminative features. Unlike traditional methods, ADR employs
a critic designed to detect non-discriminative features by applying dropout to the classifier network. [65] introduces the sliced Wasserstein discrepancy (SWD), which offers a geometrically intuitive way to identify target
samples far from the source's support, facilitating efficient distribution alignment in an end-to-end
trainable manner.
6.3 Proposed Method
6.3.1 System Overview
The task of image unsupervised domain adaptation aims to effectively adapt a model trained on a source
domain to a different target domain. We apply our transfer paradigm to the digit image classification
task to elaborate on its effectiveness. The objective becomes finding the accurate class labels in the target
domain, given the images of both domains and the class labels of the source domain. During the training process, no labels from the target domain are used.
Figure 6.1: An overview of our proposed GILT method.
Our proposed transfer paradigm contains three phases: (1) Joint Discriminant Subspace Learning, (2)
Source-to-Target Label Transfer, and (3) Weakly Supervised Learning in the Target Domain. An overview
is shown in Fig. 6.1. In the first phase, we extract features that are consistent for both the source and target
domains. Then, we select the most distinguishable features using the class labels of the source domain. In
the second phase, we aim to transfer the class labels from the source domain to the partial data points in
the target domain. To achieve this, we build the source label space and transfer the labels from the source
to the target. Finally, a weakly supervised learning method is performed in the target domain. We train a
model in the target domain with the labeled data obtained in phase 2 and label the remaining unlabeled
data points.
6.3.2 Joint Discriminant Subspace Learning
6.3.2.1 Preprocessing
We first align images from the source to those in the target domain at the image pixel level. This is crucial
because images from the source and target may be of different scales and centered at different locations.
We pad the images at the boundaries so that the digits have a uniform scale. One example is given in Fig. 6.2.
The aligned images of the USPS dataset look similar to those of the MNIST dataset.
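A sketch of this alignment step is given below. The padding width and target size are illustrative choices (they are not specified in the text); the sketch simply pads 16x16 USPS digits with a zero border and rescales them to 28x28 with bilinear interpolation.
```python
import numpy as np
from scipy.ndimage import zoom

def align_usps_to_mnist(images, pad=4, out_size=28):
    """Pad 16x16 USPS digits with a zero border so the digit occupies a similar
    fraction of the frame as in MNIST, then rescale to out_size x out_size.
    The padding width and target size are illustrative choices."""
    padded = np.pad(images, ((0, 0), (pad, pad), (pad, pad)))   # zero border
    scale = out_size / padded.shape[1]
    return np.stack([zoom(img, scale, order=1) for img in padded])
```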
6.3.2.2 Feature Extraction
We apply PixelHop++ [16] to the aligned images from both the source and target domains to extract raw
features. Unlike the original PixelHop++, we replace the max-pooling operation with the absolute max-pooling proposed in IPHop [125] to achieve better classification performance. We utilize three
PixelHop++ units with a 3x3 kernel and take the representations from the last Hop as the raw features.
In addition to the raw features, we also generate highly discriminant complementary features with
the least-squares normal transform (LNT) [122] based on the raw features. LNT converts a C-class classification problem into several binary classification problems and learns an affine transform via least-squares
regression. We utilize the “1 vs. (C-1)” grouping scheme to convert the ten-class classification problem into
ten binary classification problems, each of which learns a normal vector, yielding ten transformed features.
To reduce the number of features and thus reduce the complexity, we conduct the discriminant feature
test (DFT) [126] in the source domain to select the most discriminant features from the raw features. We
rank these features based on their entropy scores from the DFT analysis and keep the top 20 features.
We combine the selected raw features with the complementary features as our preliminary feature space
(30-D) and normalize each dimension individually.
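The sketch below illustrates LNT-style complementary features under the "1 vs. (C-1)" grouping: for each class, a binary target is regressed on the raw features by least squares, and the resulting normal vector produces one new feature. The exact formulation of LNT in [122] may differ; this is only an approximation under the assumptions just stated.
```python
import numpy as np

def lnt_features(X, y, n_classes=10):
    """Complementary features from per-class least-squares normal vectors."""
    X_aug = np.hstack([X, np.ones((len(X), 1))])       # affine transform: append a bias column
    feats = []
    for c in range(n_classes):
        t = np.where(y == c, 1.0, -1.0)                # '1 vs. (C-1)' binary target
        w, *_ = np.linalg.lstsq(X_aug, t, rcond=None)  # normal vector via least squares
        feats.append(X_aug @ w)
    return np.stack(feats, axis=1)                     # (N, n_classes) new features
```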
6.3.2.3 Feature Selection
To further ensure the distinctiveness of the features, we compare the cosine similarity among features.
We convert all data points in one feature dimension into a vector and then calculate the cosine similarity
between any two feature dimensions. We only retain features that have low cosine similarity to all other
features, ensuring the uniqueness of the features.
(a) MNIST (b) USPS (c) Aligned USPS
Figure 6.2: An illustration of the significance of the preprocessing step at the image pixel level: (a) images in the MNIST dataset, (b) images in the USPS dataset, and (c) aligned images in the USPS dataset.
As a result, we select eight feature dimensions. The pairwise
cosine similarity of the 30 features and eight selected features is shown in (a) and (b) of Fig. 6.3. n features
create $\binom{n}{2} = \frac{n!}{2!(n-2)!} = \frac{1}{2}n(n-1)$ combinations for calculating pairwise cosine similarity. The selected
features generally have cosine similarity scores lower than 0.2.
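A simple greedy version of this selection is sketched below: a feature dimension is kept only if its cosine similarity with every already-kept dimension stays below a threshold. The 0.2 threshold mirrors the observation above, while the greedy visiting order is an illustrative choice rather than the exact procedure used here.
```python
import numpy as np

def select_low_similarity_features(F, threshold=0.2):
    """Keep feature dimensions of F (N, d) that are mutually dissimilar in cosine similarity."""
    normed = F / np.linalg.norm(F, axis=0, keepdims=True)
    sim = np.abs(normed.T @ normed)                    # pairwise cosine similarities (d, d)
    kept = []
    for j in range(F.shape[1]):
        if all(sim[j, k] < threshold for k in kept):   # dissimilar to everything kept so far
            kept.append(j)
    return kept
```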
6.3.3 Source-to-Target Label Transfer
We first build the source label space by forming clusters for each class and then assign class labels to target
data by finding the closest cluster. This aims to assign labels to partial target data points and prepare for
the final weakly supervised learning phase.
Initially, we divide the data points into ten classes based on the class labels in the source domain.
Considering the intra-class variability, we form K clusters in each class through KMeans clustering. Since
the number of modes is uncertain for each class, we consider multiple choices of K, i.e., K = 1, 2, 4, 8. For
example, K = 4 means we cluster samples in each class into 4 clusters, resulting in a total of 40 clusters
for ten classes. We calculate the centroid of each cluster by averaging all the data points assigned to the
cluster.
Note that a data point may be assigned to a different cluster if we assign cluster labels by minimizing the distance between the data point and the centroids. This is because these centroids are obtained
with the guidance of class labels to ensure that each cluster has a certain class label. These centroids are
different from those obtained by iteratively updating centroids and assigning cluster labels as in the ordinary KMeans algorithm. This may lead to incorrect labeling when we assign labels to the target data
points based on the distances between a data point and the centroids.
To alleviate this issue, we decrease the radius of the coverage of each cluster for an increased purity
of class labels in each cluster. We use entropy as the criterion to measure the purity of class labels in a
cluster.
(a) Before Feature Selection (30 features) (b) After Feature Selection (8 features)
Figure 6.3: The pairwise cosine similarity among features before and after feature selection.
(a) Entropy change of cluster 0 for class 0 (b) Entropy change of cluster 1 for class 0
Figure 6.4: Examples of entropy change as the radius of a cluster increases.
For each cluster, we first collect all the data points that are within the coverage of radius r; that is, the distance between a data point and its assigned cluster centroid is less than r. We then calculate
the entropy using their class labels. An example plot of the entropy versus radius is given in Fig. 6.4. We
observe that there exists a region on the plot that has very low entropy, which means the purity of class
labels is relatively high. We choose the largest radius corresponding to zero entropy as the boundary of
each cluster to maximize the class label purity of each cluster. The largest radius generally decreases as K
increases. For example, the largest radius is set to 25, 12, 6, and 3 for K = 1, 2, 4, 8 on the MNIST dataset.
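The sketch below builds the source label space as described: K KMeans clusters per class, each assigned the largest radius over a candidate grid for which the enclosed source labels remain pure (zero entropy). The radius grid and the use of scikit-learn's KMeans are illustrative assumptions.
```python
import numpy as np
from sklearn.cluster import KMeans

def build_source_label_space(X_src, y_src, K, radii):
    """K clusters per class; each keeps the largest radius (from an ascending
    candidate grid) whose enclosed source labels are still pure (zero entropy)."""
    clusters = []
    for c in np.unique(y_src):
        centroids = KMeans(n_clusters=K, n_init=10).fit(X_src[y_src == c]).cluster_centers_
        for mu in centroids:
            dist = np.linalg.norm(X_src - mu, axis=1)    # distances of all source points
            best_r = 0.0
            for r in radii:                              # radii assumed sorted ascending
                labels = y_src[dist < r]
                if labels.size and not np.all(labels == c):
                    break                                # entropy would become non-zero
                best_r = r
            clusters.append((mu, c, best_r))
    return clusters                                      # (centroid, class label, radius)
```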
After defining the label space of the source domain with the clusters of a certain radius, we transfer
the labels to the target by assigning target data points to the clusters based on feature distances. Since
the target and source domains inherently have different distributions, directly comparing the distances
between source and target features results in the well-known domain shift problem. An example is
given in Fig. 6.5. To address this issue, we conduct histogram matching on each dimension of the source
and target features. The target features are matched with the source feature space in order to calculate the
feature distances.
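A per-dimension histogram (quantile) matching sketch is given below: each target feature dimension is remapped onto the empirical quantiles of the corresponding source dimension so that distances to the source centroids become comparable. This is one plausible realization of the matching step, not necessarily the exact procedure used here.
```python
import numpy as np

def match_target_to_source(F_src, F_tgt):
    """Per-dimension quantile matching of target features onto the source distribution."""
    matched = np.empty(F_tgt.shape, dtype=float)
    for d in range(F_src.shape[1]):
        ranks = np.argsort(np.argsort(F_tgt[:, d]))            # rank of each target value
        quantiles = (ranks + 0.5) / len(F_tgt)
        matched[:, d] = np.quantile(F_src[:, d], quantiles)    # map onto source quantiles
    return matched
```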
Since each cluster has a unique class label, we can easily obtain the class labels after assigning cluster
labels. Starting from K = 1, we assign the class label to the target data points located within the cluster’s
radius. For those target data points that are not assigned in the K = 1 scheme, we continue to assign
Figure 6.5: An example histogram of the source and target distributions. The figures from left to right
represent the source histogram, target histogram, and matched target histogram of the first dimension in
8-D feature vector.
them with the K = 2 scheme. We repeat this assignment process from K = 1 to K = 8 to achieve better
assignment accuracy. This step labels 63.4% of the target data and achieves 97.4% accuracy in the target
domain, as shown in Table 6.1.
Table 6.1: The class labeling accuracy of the 63.4% of target-domain data points labeled after source-to-target label transfer. The source-only column reports the accuracy when we train with source data only and apply the model to the target domain, which labels 23.7% of the target data in label transfer. The results are from MNIST-to-USPS transfer.
Class MNIST-to-USPS source-only
0 99.8% 95.0%
1 99.7% 99.6%
2 97.7% 96.5%
3 99.2% 97.5%
4 97.9% 90.9%
5 98.9% 47.2%
6 99.4% 93.1%
7 98.6% 69.1%
8 82.7% 60.2%
9 100% 66.7%
Total Accuracy 97.4% 85.6%
Labeled Percentage 63.4% 23.7%
6.3.4 Weakly Supervised Learning in the Target Domain
We train an IPHOP system with the labeled data from the target domain and predict the labels for the
remaining unlabeled data, in a weakly supervised manner. The IPHOP system contains three parts: (1)
representation learning, (2) feature learning, and (3) decision learning. We repeat the process of training
and inference iteratively. First, we train the IPHOP system with labeled data and predict the class labels for
the remaining unlabeled data. We accept the predicted labels of the data points that have a high probability
score and add them to the training set. The data points that have low probability scores are put back into
the unlabeled set and will be predicted in the next round. We then train the IPHOP system with the new training set and run inference on the unlabeled set. We repeat this process until all data points in the testing set achieve the desired confidence.
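A minimal sketch of this iterative self-training loop is given below, assuming a classifier object with scikit-learn-style fit/predict_proba/predict methods as a stand-in for the IPHOP system; the confidence threshold `tau` and all names are illustrative.

```python
import numpy as np

def self_train(model, X_lab, y_lab, X_unlab, tau=0.95, max_rounds=10):
    """Iteratively add confidently pseudo-labeled target points to the training set."""
    X_train, y_train, X_pool = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(max_rounds):
        if X_pool.shape[0] == 0:
            break
        model.fit(X_train, y_train)
        conf = model.predict_proba(X_pool).max(axis=1)
        accept = conf >= tau                     # keep only high-confidence predictions
        if not accept.any():
            break                                # nothing confident enough; stop early
        X_train = np.vstack([X_train, X_pool[accept]])
        y_train = np.concatenate([y_train, model.predict(X_pool[accept])])
        X_pool = X_pool[~accept]                 # the rest goes back to the unlabeled pool
    model.fit(X_train, y_train)
    return model
```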
6.4 Preliminary Experimental Results
We evaluate our proposed GILT method on digit classification. In all experiments, we assume the target
domain is unlabeled. We set the source domain as the MNIST dataset and the target domain as the USPS
dataset. Both datasets are gray-scale images. The MNIST dataset contains 60,000 training images and
10,000 testing images. The USPS dataset contains 9228 gray-scale images. Some example images from
both datasets are shown in (a) and (b) of Fig. 6.2. The digits in the USPS dataset are generally larger in
scale than those in the MNIST dataset. Due to the limited number of images in the USPS dataset, training a deep-learning model directly on it is more difficult than on the MNIST dataset. Transfer learning is therefore particularly valuable for the USPS dataset.
6.4.1 Accuracy
We compare our method with several deep-learning (DL)-based methods on digit recognition from MNIST to USPS. The classification accuracy of these DL-based methods is reported in [115]. On the MNIST-to-USPS adaptation, our method exceeds ReverseGrad [31] and DomainConfusion [114] and falls slightly behind ADDA [115] and CoGAN [79], as shown in Table 6.2.
Methods MNIST-to-USPS
ReverseGrad [31] 77.1%
DomainConfusion [114] 79.1%
CoGAN [79] 91.2%
ADDA [115] 89.4%
Source-only 78.0%
Ours 89.0%
Table 6.2: Accuracy on digit classification for MNIST-to-USPS transfer.
6.4.2 FLOPs and Model Size
We briefly explain how FLOPs and model size are calculated for the typical operations in our pipeline. FLOPs denote the number of floating-point operations, e.g., additions and multiplications, needed to go through the entire pipeline once. They relate to computational efficiency and memory consumption and reflect the theoretical lower bound, i.e., under an optimal implementation, of latency and power consumption. Model size is the number of parameters that need to be stored.
Pixel-wise operation. A single pixel-wise operation (addition or multiplication) on a set of N images (or patches) with height H, width W, and depth (number of channels) C requires H × W × C × N FLOPs. No parameters are needed for this operation.
Matrix Multiplication. The matrix multiplication between a K × K transform matrix and a K × N data matrix requires (2 × K − 1) × K × N FLOPs and K × K parameters.
Convolution or Filtering. A convolution with a 3-D kernel of size K × K × Ci that outputs a feature map of size H × W × Co requires (2 × Ci × K × K − 1) × H × W × Co FLOPs and (Ci × K × K) × Co parameters. If a bias term is included, the total FLOPs become (2 × Ci × K × K) × H × W × Co, and the model size is (Ci × K × K + 1) × Co.
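The three counting rules above can be written as a small helper sketch; this is an illustration of the rules rather than the exact script used to produce Table 6.3.

```python
def pixelwise_flops(h, w, c, n):
    """One pixel-wise add or multiply over N images of size H x W x C."""
    return h * w * c * n, 0                      # (FLOPs, parameters)

def matmul_flops(k, n):
    """K x K transform applied to a K x N data matrix."""
    return (2 * k - 1) * k * n, k * k

def conv_flops(k, c_in, c_out, h_out, w_out, bias=False):
    """3-D convolution with a K x K x C_in kernel and an H x W x C_out output map."""
    per_output = 2 * c_in * k * k - (0 if bias else 1)
    flops = per_output * h_out * w_out * c_out
    params = (c_in * k * k + (1 if bias else 0)) * c_out
    return flops, params

# Example: a 5x5 convolution from 1 to 8 channels producing a 28x28 output map.
print(conv_flops(5, 1, 8, 28, 28))
```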
Table 6.3 lists the FLOPs and model size of our method. Module 1 is Joint Discriminant Subspace Learning, which contains a padding operation in preprocessing, three PixelHop++ units, an LNT unit, and feature selection operations. Module 2 is Source-to-Target Label Transfer, where pairwise distances are calculated in the clustering operations. Module 3 is Weakly Supervised Learning in the Target Domain, which contains an IPHOP system.
Modules FLOPs Model size
Module 1 41,110 3,553
Module 2 10,800 1,200
Module 3 101,400 98,805
Total 153,310 103,558
Table 6.3: FLOPs and model size of our proposed method for MNIST-to-USPS.
We compare the model sizes and FLOPs with two deep-learning-based methods in Table 6.4. Our method achieves much better classification accuracy with slightly more FLOPs and a slightly larger model size on MNIST-to-USPS than the ReverseGrad [31] method. Compared to ADDA [115], our method achieves comparable accuracy on MNIST-to-USPS with about ten times fewer FLOPs and about seven times smaller model size.
Methods FLOPs Model size
ReverseGrad [31] 157.2K (1.03X) 78.7K (1X)
ADDA [115] 1578.8K (10X) 704.1K (9X)
Ours 153.3K (1X) 103.5K (1.3X)
Table 6.4: Comparison of FLOPs and model size of our proposed method and two deep-learning-based
methods for MNIST-to-USPS.
6.5 Conclusion
We focused on unsupervised domain adaptation (DA), which adapts a model trained on a source domain to
work on a related target domain. We addressed the limitations in deep learning-based DA models, which
can be complex and prone to overfitting, especially when the amount of target domain data is limited.
We introduced a new method for unsupervised domain adaptation that emphasizes model efficiency and
explainability. Our approach is called Green Image Label Transfer (GILT). Unlike other methods, it does not rely on neural networks to learn domain-invariant features. Instead, it first learns a joint discriminant
subspace to identify and choose features that are invariant in both the source and target domains. It
then transfers labels from the source domain to the target domain, followed by using a weakly supervised
learning method in the target domain. Experiments on MNIST-to-USPS transfer demonstrated the potential
of our proposed method in terms of comparable accuracy on digit classification, a small model size, and fast inference.
Chapter 7
Conclusion and Future Work
7.1 Summary of the Research
In this thesis, we focused on two tasks, texture generation and image generation, to examine our novel image generative modeling pipeline, and we further demonstrated the generalization ability of our green generative models on image label transfer learning.
An explainable, efficient, and lightweight texture generation method, called TGHop, was proposed
for texture synthesis. Texture can be effectively analyzed using the multi-stage c/w Saab transforms and
expressed in the form of joint spatial-spectral representations. The distribution of sample texture patches
was carefully studied so that we could generate new samples in the core. Based on the generated core samples, we go through the reverse path to increase their spatial dimensions. Finally, patches can be stitched to
form texture images of a larger size. Experimental results demonstrated that TGHop can generate texture
images of superior quality with small model sizes and at a fast speed.
We extended our SSL-based image generative modeling pipeline to the natural image generation task. GenHop attempts to capture sample distributions in various joint spatial-spectral subspaces explicitly: it analyzes the sample distributions in each subspace in a fine-to-coarse analysis procedure. A new sample, called a seed, is first generated in the coarsest subspace to determine the global structure of an image. To enhance the quality of the generated sample, locally linear embedding is adopted to fine-tune it and generate fine details in the coarse-to-fine procedure. GenHop offers an explainable and high-performance solution for image generative modeling, with FID scores comparable to those of deep-learning-based generative models on the MNIST, Fashion-MNIST, and CelebA datasets. It remains robust when trained with fewer training samples. The proposed alternative image generation methodology is attractive since it has a closed-form solution in theory, is easy to implement, and has great potential for further generalization.
To address two limitations of the previous methods, we proposed a new green image generation method named GILT. To reduce the model size, we use a Mixture of Factor Analyzers (MFA) for seed learning and generation, and an XGBoost regressor instead of locally linear embedding (LLE) for detail generation. The method transforms data into a lower-dimensional space in which the samples preserve the structure of the source images and their distribution can be modeled as a seed distribution. The reverse generation process draws seeds from this seed distribution to produce the overall structure of an image and then adds details to it. GILT offers an explainable, efficient, and high-performance solution for image generative modeling, with FID scores comparable to those of deep-learning-based generative models on the MNIST and Fashion-MNIST datasets. It has a small model size and is fast in training and inference.
We further investigated our generative paradigm on transfer learning. In particular, we have explored
unsupervised domain adaptation (DA), which involves adapting a model that was trained on a source
domain to work on a related target domain. However, deep learning-based DA models can be complex
and prone to overfitting, especially when there is limited target domain data available. To address these
limitations, we have developed a new method for unsupervised domain adaptation called Green Image
Label Transfer (GILT). Our approach emphasizes model efficiency and explainability. GILT does not rely on
neural networks to learn domain-invariant features. Instead, it first learns a joint discriminant subspace to
identify and select features that are invariant in both the source and target domains. It then transfers labels
from the source domain to the target domain, followed by using a weakly supervised learning method in
the target domain.
7.2 Future Research Topics
The proposed green image generative modeling pipeline has advantages in terms of having a closed-form
solution in theory, being easy to implement, and having great potential for further generalization. We
are interested in generalizing it to conditional generative modeling. We bring up the following research
problems:
• Style transfer. Given a random pair of two images, called style image and content image, a style
transfer method extracts the feel from the style image and generates an output that contains the
look of the content image and the feel of the style image as shown in Fig. 7.1 (a).
• Unpaired image-to-image translation. Given unlabeled image data from two domains without
any data characterizing correct translations, an unsupervised learning framework can convert images from one domain to images from the other. One example is shown in Fig. 7.1 (b).
7.2.1 Style Transfer
Transferring the style from one image to another image is an appealing yet challenging problem. Many
efforts have been made to develop automatic style transfer methods [1, 26, 27, 106, 63]. Early-stage methods
use simple measurements, such as the luminance used in [26], to build the correspondence between two images. Recent DL-based methods extract the style of artistic images and transfer it to a content image using Convolutional Neural Networks (CNNs). Various statistical measures are adopted to represent styles, such as the Gram matrix [34, 116] and the covariance matrix [74] of feature maps, or the second-order statistics of Batch Normalization layers [72]. A trainable transformation module is designed in [70], enabling fast image and video style transfer.
Figure 7.1: Examples of (a) style transfer in [70] and (b) general-purpose image-to-image translation in [51].
It is proved in [71] that matching Gram matrices in neural style transfer
can be seen as a Maximum Mean Discrepancy (MMD) process with a second-order polynomial kernel. It
is observed in [71] that the style of an image can be intrinsically represented by feature distributions in
different layers of a CNN, and the style transfer can be seen as a distribution alignment process from the
content image to the style image. Thus, the style transfer problem can be decomposed into two parts:
1. representing the style of an image by feature distributions, and
2. conducting distribution alignment from the content image to the style image.
We review two works [34, 71] as follows. The method in [34] iteratively optimizes a content loss and a style loss:
\[
\mathcal{L} = \alpha \mathcal{L}_{\text{content}} + \beta \mathcal{L}_{\text{style}}, \qquad (7.1)
\]
where $\alpha$ and $\beta$ are two weights that control the degree of transferring. $\mathcal{L}_{\text{content}}$ is calculated as the squared error between the feature maps of a specific layer $l$:
\[
\mathcal{L}_{\text{content}} = \frac{1}{2} \sum_{i=1}^{N_l} \sum_{j=1}^{M_l} \left( F^l_{ij} - P^l_{ij} \right)^2, \qquad (7.2)
\]
where $F^l_{ij}$ is the value at the $j$th position of the $i$th feature map of the synthesized image in layer $l$, while $P^l_{ij}$ is that of the content image. $N_l$ is the total number of feature maps, and $M_l$ is the height times the width of a feature map. $\mathcal{L}_{\text{style}}$ is computed as the weighted sum of the style losses of different layers:
\[
\mathcal{L}_{\text{style}} = \sum_{l=1}^{L} w_l E_l, \qquad (7.3)
\]
where $E_l$ is the difference between the feature correlations, expressed by the Gram matrices, of the synthesized image and the style image:
\[
E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i=1}^{N_l} \sum_{j=1}^{N_l} \left( G^l_{ij} - A^l_{ij} \right)^2, \qquad (7.4)
\]
where $G^l \in \mathbb{R}^{N_l \times N_l}$ is the Gram matrix of the synthesized image, computed as the inner product between the vectorized feature maps,
\[
G^l_{ij} = \sum_{k=1}^{M_l} F^l_{ik} F^l_{jk}, \qquad (7.5)
\]
and $A^l$ is the Gram matrix of the style image. $E_l$ can be reformulated in terms of an MMD process with a second-order polynomial kernel, or with other kernels such as a linear or Gaussian kernel, as in [71], to achieve diverse style transfer results.
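For concreteness, the losses in Eqs. (7.2)-(7.5) can be sketched in NumPy as follows, assuming each layer's feature maps are flattened into an array of shape (N_l, M_l); the names, the example weights, and the choice of content layer are illustrative.

```python
import numpy as np

def gram(F):
    """Gram matrix G^l of feature maps F with shape (N_l, M_l); Eq. (7.5)."""
    return F @ F.T

def content_loss(F, P):
    """Squared-error content loss between synthesized (F) and content (P) features; Eq. (7.2)."""
    return 0.5 * np.sum((F - P) ** 2)

def layer_style_loss(F, S):
    """Style loss E_l between synthesized (F) and style (S) features; Eq. (7.4)."""
    N_l, M_l = F.shape
    G, A = gram(F), gram(S)
    return np.sum((G - A) ** 2) / (4.0 * N_l**2 * M_l**2)

def total_loss(feats_syn, feats_content, feats_style, layer_weights,
               alpha=1.0, beta=1e3, content_layer=2):
    """Weighted combination of the content and style terms; Eqs. (7.1) and (7.3)."""
    Lc = content_loss(feats_syn[content_layer], feats_content[content_layer])
    Ls = sum(w * layer_style_loss(F, S)
             for w, F, S in zip(layer_weights, feats_syn, feats_style))
    return alpha * Lc + beta * Ls
```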
7.2.2 Unpaired Image-to-Image Translation
Many image processing and computer vision tasks, e.g., image segmentation, stylization, and abstraction,
can be posed as image-to-image translation problems as shown in Fig. 7.1 (b), which convert one visual
representation of an object or scene into another. Some image-conditional models are developed for specific applications such as super-resolution, texture synthesis, style transfer from normal maps to images, and video prediction, whereas others aim at general-purpose processing [51, 110, 127, 134]. Here,
we aim to develop an unsupervised learning framework for general-purpose unpaired image-to-image
translation, which only relies on unlabeled and unpaired image data. This is because human labeling is
expensive and even impractical for large quantities of data, and paired training data will not be available
for many tasks.
The challenge of the task is how to train a translator without any data characterizing correct translations. DualGAN [127] and CycleGAN [134] propose the same idea for unpaired image-to-image translation.
Both of them contain two GANs: one learns to translate images from domain A to those in domain B, and
the other learns to translate from domain B to domain A. An overview of the training process is given
in Fig. 7.2. In addition to the adversarial loss in two GANs, two reconstruction losses, called cycle consistency losses, are adopted to capture the intuition that if we translate from one domain to the other and
back again, we should arrive where we started. This also forces the translated samples to obey the domain
distribution.
This bidirectional training protocol can be incorporated into our SSL-based image generative modeling
pipeline. Since distribution alignment is adopted in each subspace, we can train two simple transforms,
one from domain A to domain B and the other from domain B to domain A, using the reconstruction loss
and adversarial loss to avoid end-to-end optimization and thus reduce computational costs.
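A minimal sketch of the cycle-consistency idea with two generic mappings is given below; the linear toy transforms are illustrative placeholders for the per-subspace transforms discussed above, not an actual design.

```python
import numpy as np

def cycle_consistency_loss(x_a, x_b, g_ab, g_ba):
    """L1 cycle-consistency loss: translate to the other domain and back,
    then compare against the starting samples.

    g_ab and g_ba are any callable mappings between domains A and B."""
    recon_a = g_ba(g_ab(x_a))               # A -> B -> A
    recon_b = g_ab(g_ba(x_b))               # B -> A -> B
    return np.mean(np.abs(recon_a - x_a)) + np.mean(np.abs(recon_b - x_b))

# Toy usage with linear transforms on flattened samples.
rng = np.random.default_rng(0)
W_ab, W_ba = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
x_a, x_b = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))
loss = cycle_consistency_loss(x_a, x_b, lambda x: x @ W_ab, lambda x: x @ W_ba)
```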
Figure 7.2: Overview of the training process in DualGAN [127].
Bibliography
[1] Hertzmann Aaron, Charles E Jacobs, Nuria Oliver, Brian Curless, and David H Salesin. “Image
analogies”. In: Proceedings of ACM SIGGRAPH. ACM Press, New York (2001).
[2] S Arivazhagan and Lakshmanan Ganesan. “Texture classification using wavelet transform”. In:
Pattern recognition letters 24.9-10 (2003), pp. 1513–1521.
[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. “Wasserstein generative adversarial
networks”. In: International Conference on Machine Learning. 2017, pp. 214–223.
[4] Zohreh Azizi, C-C Jay Kuo, et al. “PAGER: Progressive attribute-guided extendable robust image
generation”. In: APSIPA Transactions on Signal and Information Processing 11.1 (2022).
[5] Marian Stewart Bartlett, Javier R Movellan, and Terrence J Sejnowski. “Face recognition by
independent component analysis”. In: IEEE Transactions on neural networks 13.6 (2002),
pp. 1450–1464.
[6] Yoshua Bengio, Eric Laufer, Guillaume Alain, and Jason Yosinski. “Deep generative stochastic
networks trainable by backprop”. In: International Conference on Machine Learning. PMLR. 2014,
pp. 226–234.
[7] David Berthelot, Thomas Schumm, and Luke Metz. “Began: Boundary equilibrium generative
adversarial networks”. In: arXiv preprint arXiv:1703.10717 (2017).
[8] Piotr Bojanowski, Armand Joulin, David Lopez-Pas, and Arthur Szlam. “Optimizing the Latent
Space of Generative Networks”. In: International Conference on Machine Learning. 2018,
pp. 600–609.
[9] Andrew Brock, Jeff Donahue, and Karen Simonyan. “Large scale gan training for high fidelity
natural image synthesis”. In: arXiv preprint arXiv:1809.11096 (2018).
[10] Hong Chang, Dit-Yan Yeung, and Yimin Xiong. “Super-resolution through neighbor embedding”.
In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, 2004. CVPR 2004. Vol. 1. IEEE. 2004, pp. I–I.
[11] Tianhorng Chang and C-C Jay Kuo. “Texture analysis and classification with tree-structured
wavelet transform”. In: IEEE Transactions on image processing 2.4 (1993), pp. 429–441.
[12] Hong-Shuo Chen, Mozhdeh Rouhsedaghat, Hamza Ghani, Shuowen Hu, Suya You, and
C-C Jay Kuo. “DefakeHop: A Light-Weight High-Performance Deepfake Detector”. In: 2021 IEEE
International Conference on Multimedia and Expo (ICME). IEEE. 2021, pp. 1–6.
[13] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. “Infogan:
Interpretable representation learning by information maximizing generative adversarial nets”. In:
Advances in neural information processing systems. 2016, pp. 2172–2180.
[14] Xinyang Chen, Sinan Wang, Mingsheng Long, and Jianmin Wang. “Transferability vs.
discriminability: Batch spectral penalization for adversarial domain adaptation”. In: International
conference on machine learning. PMLR. 2019, pp. 1081–1090.
[15] Yueru Chen and C-C Jay Kuo. “PixelHop: A Successive Subspace Learning (SSL) Method for
Object Recognition”. In: Journal of Visual Communication and Image Representation (2020),
p. 102749.
[16] Yueru Chen, Mozhdeh Rouhsedaghat, Suya You, Raghuveer Rao, and C-C Jay Kuo. “PixelHop++:
A Small Successive-Subspace-Learning-Based (SSL-based) Model for Image Classification”. In:
arXiv preprint arXiv:2002.03141 (2020).
[17] Jaegul Choo and Shixia Liu. “Visual analytics for explainable deep learning”. In: IEEE computer
graphics and applications 38.4 (2018), pp. 84–92.
[18] Michael F Cohen, Jonathan Shade, Stefan Hiller, and Oliver Deussen. “Wang tiles for image and
texture generation”. In: ACM Transactions on Graphics (TOG) 22.3 (2003), pp. 287–294.
[19] Hal Daumé III. “Frustratingly easy domain adaptation”. In: arXiv preprint arXiv:0907.1815 (2009).
[20] Jeremy S De Bonet. “Multiresolution sampling procedure for analysis and synthesis of texture
images”. In: Proceedings of the 24th annual conference on Computer graphics and interactive
techniques. 1997, pp. 361–368.
[21] Gustavo Deco and Wilfried Brauer. “Nonlinear higher-order statistical decorrelation by
volume-conserving neural architectures”. In: Neural Networks 8.4 (1995), pp. 525–535.
[22] Zhijie Deng, Yucen Luo, and Jun Zhu. “Cluster alignment with a teacher for unsupervised domain
adaptation”. In: Proceedings of the IEEE/CVF international conference on computer vision. 2019,
pp. 9944–9953.
[23] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. “Density estimation using real nvp”. In:
arXiv preprint arXiv:1605.08803 (2016).
[24] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al.
“An image is worth 16x16 words: Transformers for image recognition at scale”. In: arXiv preprint
arXiv:2010.11929 (2020).
[25] Yilun Du and Igor Mordatch. “Implicit Generation and Modeling with Energy Based Models”. In:
Advances in Neural Information Processing Systems. 2019, pp. 3603–3613.
[26] Alexei A Efros and William T Freeman. “Image quilting for texture synthesis and transfer”. In:
Proceedings of the 28th annual conference on Computer graphics and interactive techniques. 2001,
pp. 341–346.
[27] Alexei A Efros and Thomas K Leung. “Texture synthesis by non-parametric sampling”. In:
Proceedings of the seventh IEEE international conference on computer vision. Vol. 2. IEEE. 1999,
pp. 1033–1038.
[28] Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. “Subspace alignment for
domain adaptation”. In: arXiv preprint arXiv:1409.5241 (2014).
[29] William T Freeman, Thouis R Jones, and Egon C Pasztor. “Example-based super-resolution”. In:
IEEE Computer graphics and Applications 22.2 (2002), pp. 56–65.
[30] Brendan J Frey. Graphical models for machine learning and
digital communication. MIT press, 1998.
[31] Yaroslav Ganin and Victor Lempitsky. “Unsupervised domain adaptation by backpropagation”. In:
International conference on machine learning. PMLR. 2015, pp. 1180–1189.
[32] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle,
François Laviolette, Mario Marchand, and Victor Lempitsky. “Domain-adversarial training of
neural networks”. In: The journal of machine learning research 17.1 (2016), pp. 2096–2030.
[33] Leon Gatys, Alexander S Ecker, and Matthias Bethge. “Texture synthesis using convolutional
neural networks”. In: Advances in neural information processing systems. 2015, pp. 262–270.
[34] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. “Image style transfer using convolutional
neural networks”. In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2016, pp. 2414–2423.
[35] Zoubin Ghahramani, Geoffrey E Hinton, et al. The EM algorithm for mixtures of factor analyzers.
Tech. rep. Technical Report CRG-TR-96-1, University of Toronto, 1996.
[36] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville, and Yoshua Bengio. “Generative adversarial nets”. In: Advances in neural
information processing systems. 2014, pp. 2672–2680.
[37] Niv Granot, Ben Feinstein, Assaf Shocher, Shai Bagon, and Michal Irani. “Drop the gan: In
defense of patches nearest neighbors as single image generative models”. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, pp. 13460–13469.
[38] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C Courville.
“Improved Training of Wasserstein GANs”. In: NIPS. 2017.
[39] James Hays and Alexei A Efros. “Scene completion using millions of photographs”. In: ACM
Transactions on Graphics (ToG) 26.3 (2007), 4–es.
[40] David J Heeger and James R Bergen. “Pyramid-based texture analysis/synthesis”. In: Proceedings
of the 22nd annual conference on Computer graphics and interactive techniques. 1995, pp. 229–238.
[41] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
“Gans trained by a two time-scale update rule converge to a local nash equilibrium”. In: Advances
in Neural Information Processing Systems. 2017, pp. 6626–6637.
[42] Geoffrey E Hinton. “Learning multiple layers of representation”. In: Trends in cognitive sciences
11.10 (2007), pp. 428–434.
[43] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. “A fast learning algorithm for deep belief
nets”. In: Neural computation 18.7 (2006), pp. 1527–1554.
[44] Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. “Flow++: Improving
flow-based generative models with variational dequantization and architecture design”. In: arXiv
preprint arXiv:1902.00275 (2019).
[45] Jonathan Ho, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models”. In:
Advances in neural information processing systems 33 (2020), pp. 6840–6851.
[46] Yedid Hoshen, Ke Li, and Jitendra Malik. “Non-adversarial image synthesis with generative latent
nearest neighbors”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 2019, pp. 5811–5819.
[47] Chih-Chung Hsu, Li-Wei Kang, and Chia-Wen Lin. “Temporally coherent superresolution of
textured video via dynamic texture synthesis”. In: IEEE Transactions on Image Processing 24.3
(2015), pp. 919–931.
[48] Chun-Ting Huang, Zhengning Wang, and C-C Jay Kuo. “Visible-light and near-infrared face
recognition at a distance”. In: Journal of Visual Communication and Image Representation 41
(2016), pp. 140–153.
[49] Aapo Hyvärinen, Patrik O Hoyer, and Erkki Oja. “Sparse code shrinkage: Denoising by nonlinear
maximum likelihood estimation”. In: Advances in Neural Information Processing Systems. 1999,
pp. 473–479.
[50] Aapo Hyvärinen and Erkki Oja. “Independent component analysis: algorithms and applications”.
In: Neural networks 13.4-5 (2000), pp. 411–430.
[51] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. “Image-to-image translation with
conditional adversarial networks”. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. 2017, pp. 1125–1134.
[52] Pranav Kadam, Min Zhang, Shan Liu, and C-C Jay Kuo. “R-PointHop: A Green, Accurate and
Unsupervised Point Cloud Registration Method”. In: arXiv preprint arXiv:2103.08129 (2021).
[53] Pranav Kadam, Min Zhang, Shan Liu, and C-C Jay Kuo. “Unsupervised Point Cloud Registration
via Salient Points Analysis (SPA)”. In: 2020 IEEE International Conference on Visual
Communications and Image Processing (VCIP). IEEE. 2020, pp. 5–8.
[54] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. “Progressive Growing of GANs for
Improved Quality, Stability, and Variation”. In: International Conference on Learning
Representations. 2018.
[55] Tero Karras, Samuli Laine, and Timo Aila. “A style-based generator architecture for generative
adversarial networks”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 2019, pp. 4401–4410.
[56] Diederik P. Kingma and Max Welling. “Auto-Encoding Variational Bayes”. In: 2nd International
Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014,
Conference Track Proceedings. 2014. arXiv: http://arxiv.org/abs/1312.6114v10 [stat.ML].
[57] Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. “On convergence and stability of
gans”. In: arXiv preprint arXiv:1705.07215 (2017).
[58] C-C Jay Kuo. “The cnn as a guided multilayer recos transform [lecture notes]”. In: IEEE signal
processing magazine 34.3 (2017), pp. 81–89.
[59] C-C Jay Kuo. “Understanding convolutional neural networks with a mathematical model”. In:
Journal of Visual Communication and Image Representation 41 (2016), pp. 406–413.
[60] C-C Jay Kuo and Yueru Chen. “On data-driven saak transform”. In: Journal of Visual
Communication and Image Representation 50 (2018), pp. 237–246.
[61] C-C Jay Kuo and Azad M Madni. “Green learning: Introduction, examples and outlook”. In:
Journal of Visual Communication and Image Representation 90 (2023), p. 103685.
[62] C-C Jay Kuo, Min Zhang, Siyang Li, Jiali Duan, and Yueru Chen. “Interpretable convolutional
neural networks via feedforward design”. In: Journal of Visual Communication and Image
Representation (2019).
[63] Vivek Kwatra, Irfan Essa, Aaron Bobick, and Nipun Kwatra. “Texture optimization for
example-based synthesis”. In: ACM SIGGRAPH 2005 Papers. Association for Computing
Machinery, New York, NY, United States, 2005, pp. 795–802.
[64] Vivek Kwatra, Arno Schödl, Irfan Essa, Greg Turk, and Aaron Bobick. “Graphcut textures: image
and video synthesis using graph cuts”. In: ACM Transactions on Graphics (ToG) 22.3 (2003),
pp. 277–286.
[65] Chen-Yu Lee, Tanmay Batra, Mohammad Haris Baig, and Daniel Ulbricht. “Sliced wasserstein
discrepancy for unsupervised domain adaptation”. In: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition. 2019, pp. 10285–10295.
[66] Xuejing Lei, Wei Wang, and C-C Jay Kuo. “GENHOP: An image generation method based on
successive subspace learning”. In: 2022 IEEE International Symposium on Circuits and Systems
(ISCAS). IEEE. 2022, pp. 3314–3318.
[67] Xuejing Lei, Ganning Zhao, and C-C Jay Kuo. “NITES: A Non-Parametric Interpretable Texture
Synthesis Method”. In: 2020 Asia-Pacific Signal and Information Processing Association Annual
Summit and Conference (APSIPA ASC). IEEE. 2020, pp. 1698–1706.
[68] Xuejing Lei, Ganning Zhao, Kaitai Zhang, and C-C Jay Kuo. “TGHop: An Explainable, Efficient
and Lightweight Method for Texture Generation”. In: arXiv preprint arXiv:2107.04020 (2021).
[69] Ke Li and Jitendra Malik. “Implicit maximum likelihood estimation”. In: arXiv preprint
arXiv:1809.09087 (2018).
[70] Xueting Li, Sifei Liu, Jan Kautz, and Ming-Hsuan Yang. “Learning linear transformations for fast
image and video style transfer”. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition. 2019, pp. 3809–3817.
[71] Yanghao Li, Naiyan Wang, Jiaying Liu, and Xiaodi Hou. “Demystifying neural style transfer”. In:
arXiv preprint arXiv:1701.01036 (2017).
[72] Yanghao Li, Naiyan Wang, Jianping Shi, Jiaying Liu, and Xiaodi Hou. “Revisiting batch
normalization for practical domain adaptation”. In: arXiv preprint arXiv:1603.04779 (2016).
[73] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. “Diversified
texture synthesis with feed-forward networks”. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. 2017, pp. 3920–3928.
[74] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. “Universal style
transfer via feature transforms”. In: Advances in neural information processing systems. 2017,
pp. 386–396.
[75] Jian Liang, Dapeng Hu, and Jiashi Feng. “Do we really need to access the source data? source
hypothesis transfer for unsupervised domain adaptation”. In: International conference on machine
learning. PMLR. 2020, pp. 6028–6039.
[76] Lin Liang, Ce Liu, Ying-Qing Xu, Baining Guo, and Heung-Yeung Shum. “Real-time texture
synthesis by patch-based sampling”. In: ACM Transactions on Graphics (ToG) 20.3 (2001),
pp. 127–150.
[77] Chieh Hubert Lin, Chia-Che Chang, Yu-Sheng Chen, Da-Cheng Juan, Wei Wei, and
Hwann-Tzong Chen. “Coco-gan: Generation by parts via conditional coordinating”. In:
Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 4512–4521.
[78] Gang Liu, Yann Gousseau, and Gui-Song Xia. “Texture synthesis through convolutional neural
networks and spectrum constraints”. In: 2016 23rd International Conference on Pattern Recognition
(ICPR). IEEE. 2016, pp. 3234–3239.
[79] Ming-Yu Liu and Oncel Tuzel. “Coupled generative adversarial networks”. In: Advances in neural
information processing systems 29 (2016).
[80] Xiaofeng Liu, Fangxu Xing, Chao Yang, C-C Jay Kuo, Suma Babu, Georges El Fakhri,
Thomas Jenkins, and Jonghye Woo. “VoxelHop: Successive Subspace Learning for ALS Disease
Classification Using Structural MRI”. In: arXiv preprint arXiv:2101.05131 (2021).
[81] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. “Learning transferable features
with deep adaptation networks”. In: International conference on machine learning. PMLR. 2015,
pp. 97–105.
[82] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. “Conditional adversarial
domain adaptation”. In: Advances in neural information processing systems 31 (2018).
[83] Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. “Are gans
created equal? a large-scale study”. In: Advances in neural information processing systems. 2018,
pp. 700–709.
[84] Abinaya Manimaran, Thiyagarajan Ramanathan, Suya You, and C-C Jay Kuo. “Visualization,
Discriminability and Applications of Interpretable Saak Features”. In: Journal of Visual
Communication and Image Representation 66 (2020), p. 102699.
[85] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley.
“Least squares generative adversarial networks”. In: Proceedings of the IEEE international
conference on computer vision. 2017, pp. 2794–2802.
[86] Nikolaos Mitianoudis and Tania Stathaki. “Pixel-based and region-based image fusion schemes
using ICA bases”. In: Information fusion 8.2 (2007), pp. 131–142.
[87] Vishal Monga, Yuelong Li, and Yonina C Eldar. “Algorithm unrolling: Interpretable, efficient deep
learning for signal and image processing”. In: IEEE Signal Processing Magazine 38.2 (2021),
pp. 18–44.
[88] Michael Oechsle, Lars Mescheder, Michael Niemeyer, Thilo Strauss, and Andreas Geiger. “Texture
fields: Learning texture representations in function space”. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2019, pp. 4531–4540.
[89] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves,
Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. “Wavenet: A generative model for
raw audio”. In: arXiv preprint arXiv:1609.03499 (2016).
[90] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. “Pixel recurrent neural
networks”. In: arXiv preprint arXiv:1601.06759 (2016).
[91] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and
Koray Kavukcuoglu. “Conditional image generation with pixelcnn decoders”. In: arXiv preprint
arXiv:1606.05328 (2016).
[92] Javier Portilla and Eero P Simoncelli. “A parametric texture model based on joint statistics of
complex wavelet coefficients”. In: International journal of computer vision 40.1 (2000), pp. 49–70.
[93] Alec Radford, Luke Metz, and Soumith Chintala. “Unsupervised representation learning with
deep convolutional generative adversarial networks”. In: arXiv preprint arXiv:1511.06434 (2015).
[94] Douglas A Reynolds. “Gaussian mixture models.” In: Encyclopedia of biometrics 741 (2009),
pp. 659–663.
[95] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. “Stochastic backpropagation and
approximate inference in deep generative models”. In: International conference on machine
learning. PMLR. 2014, pp. 1278–1286.
[96] Eitan Richardson and Yair Weiss. “On gans and gmms”. In: Advances in neural information
processing systems 31 (2018).
[97] Eric Risser, Pierre Wilmot, and Connelly Barnes. “Stable and controllable neural texture synthesis
and style transfer using histogram losses”. In: arXiv preprint arXiv:1701.08893 (2017).
[98] Mozhdeh Rouhsedaghat, Masoud Monajatipoor, Zohreh Azizi, and C-C Jay Kuo. “Successive
Subspace Learning: An Overview”. In: arXiv preprint arXiv:2103.00121 (2021).
[99] Mozhdeh Rouhsedaghat, Yifan Wang, Xiou Ge, Shuowen Hu, Suya You, and C-C Jay Kuo.
“Facehop: A light-weight low-resolution face gender classification method”. In: arXiv preprint
arXiv:2007.09510 (2020).
[100] Sam T Roweis and Lawrence K Saul. “Nonlinear dimensionality reduction by locally linear
embedding”. In: science 290.5500 (2000), pp. 2323–2326.
[101] Kuniaki Saito, Yoshitaka Ushiku, Tatsuya Harada, and Kate Saenko. “Adversarial dropout
regularization”. In: arXiv preprint arXiv:1711.01575 (2017).
[102] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.
“Improved techniques for training gans”. In: Advances in neural information processing systems.
2016, pp. 2234–2242.
[103] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. “Pixelcnn++: Improving the
pixelcnn with discretized logistic mixture likelihood and other modifications”. In: arXiv preprint
arXiv:1701.05517 (2017).
[104] Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. “Singan: Learning a generative model from
a single natural image”. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision. 2019, pp. 4570–4580.
[105] Wu Shi and Yu Qiao. “Fast Texture Synthesis via Pseudo Optimizer”. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 5498–5507.
[106] YiChang Shih, Sylvain Paris, Connelly Barnes, William T. Freeman, and Frédo Durand. “Style
Transfer for Headshot Portraits”. In: ACM Trans. Graph. 33.4 (July 2014). issn: 0730-0301. doi:
10.1145/2601097.2601137.
[107] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. “Deep
Unsupervised Learning using Nonequilibrium Thermodynamics”. In: Proceedings of the 32nd
International Conference on Machine Learning. Ed. by Francis Bach and David Blei. Vol. 37.
Proceedings of Machine Learning Research. Lille, France: PMLR, July 2015, pp. 2256–2265. url:
https://proceedings.mlr.press/v37/sohl-dickstein15.html.
[108] Baochen Sun, Jiashi Feng, and Kate Saenko. “Return of frustratingly easy domain adaptation”. In:
Proceedings of the AAAI conference on artificial intelligence. Vol. 30. 2016.
[109] Baochen Sun and Kate Saenko. “Deep coral: Correlation alignment for deep domain adaptation”.
In: Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16,
2016, Proceedings, Part III 14. Springer. 2016, pp. 443–450.
[110] Yaniv Taigman, Adam Polyak, and Lior Wolf. “Unsupervised cross-domain image generation”. In:
arXiv preprint arXiv:1611.02200 (2016).
[111] Tzu-Wei Tseng, Kai-Jiun Yang, C-C Jay Kuo, and Shang-Ho Tsai. “An interpretable compression
and classification system: Theory and applications”. In: IEEE Access 8 (2020), pp. 143962–143974.
[112] Mihran Tuceryan and Anil K Jain. “Texture analysis”. In: Handbook of pattern recognition and
computer vision (1993), pp. 235–276.
[113] Matthew A Turk and Alex P Pentland. “Face recognition using eigenfaces”. In: Proceedings. 1991
IEEE computer society conference on computer vision and pattern recognition. IEEE Computer
Society. 1991, pp. 586–587.
[114] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. “Simultaneous deep transfer across
domains and tasks”. In: Proceedings of the IEEE international conference on computer vision. 2015,
pp. 4068–4076.
[115] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. “Adversarial discriminative domain
adaptation”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017,
pp. 7167–7176.
[116] Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor S Lempitsky. “Texture Networks:
Feed-forward Synthesis of Textures and Stylized Images”. In: Proceedings of The 33rd International
Conference on Machine Learning. Ed. by Maria Florina Balcan and Kilian Q. Weinberger. Vol. 48.
Proceedings of Machine Learning Research. New York, New York, USA: PMLR, 20–22 Jun 2016,
pp. 1349–1357. url: https://proceedings.mlr.press/v48/ulyanov16.html.
[117] Ivan Ustyuzhaninov, Wieland Brendel, Leon A Gatys, and Matthias Bethge. “What does it take to
generate natural textures?” In: ICLR (Poster). 2017.
[118] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al.
“Conditional image generation with pixelcnn decoders”. In: Advances in neural information
processing systems. 2016, pp. 4790–4798.
[119] Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. “Pixel recurrent neural networks”.
In: International Conference on Machine Learning. PMLR. 2016, pp. 1747–1756.
[120] Jian Wang, Yunshan Zhong, Yachun Li, Chi Zhang, and Yichen Wei. “Re-identification supervised
texture generation”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 2019, pp. 11846–11856.
[121] Ximei Wang, Ying Jin, Mingsheng Long, Jianmin Wang, and Michael I Jordan. “Transferable
normalization: Towards improving transferability of deep neural networks”. In: Advances in
neural information processing systems 32 (2019).
[122] Xinyu Wang, Vinod K Mishra, and C-C Jay Kuo. “Enhancing Edge Intelligence with Highly
Discriminant LNT Features”. In: arXiv preprint arXiv:2312.14968 (2023).
[123] Li-Yi Wei and Marc Levoy. “Fast texture synthesis using tree-structured vector quantization”. In:
Proceedings of the 27th annual conference on Computer graphics and interactive techniques. 2000,
pp. 479–488.
[124] Ruijia Xu, Guanbin Li, Jihan Yang, and Liang Lin. “Larger norm more transferable: An adaptive
feature norm approach for unsupervised domain adaptation”. In: Proceedings of the IEEE/CVF
international conference on computer vision. 2019, pp. 1426–1435.
[125] Yijing Yang, Hongyu Fu, and C-C Jay Kuo. “Design of supervision-scalable learning systems:
Methodology and performance benchmarking”. In: arXiv preprint arXiv:2206.09061 (2022).
[126] Yijing Yang, Wei Wang, Hongyu Fu, C-C Jay Kuo, et al. “On supervised feature selection from
high dimensional feature spaces”. In: APSIPA Transactions on Signal and Information Processing
11.1 (2022).
[127] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. “Dualgan: Unsupervised dual learning for
image-to-image translation”. In: Proceedings of the IEEE international conference on computer
vision. 2017, pp. 2849–2857.
[128] Kaitai Zhang, Hong-Shuo Chen, Ye Wang, Xiangyang Ji, and C-C Jay Kuo. “Texture analysis via
hierarchical spatial-spectral correlation (hssc)”. In: 2019 IEEE International Conference on Image
Processing (ICIP). IEEE. 2019, pp. 4419–4423.
[129] Kaitai Zhang, Hong-Shuo Chen, Xinfeng Zhang, Ye Wang, and C-C Jay Kuo. “A data-centric
approach to unsupervised texture segmentation using principle representative patterns”. In:
ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE. 2019, pp. 1912–1916.
[130] Kaitai Zhang, Bin Wang, Wei Wang, Fahad Sohrab, Moncef Gabbouj, and C-C Jay Kuo.
“AnomalyHop: An SSL-based Image Anomaly Localization Method”. In: arXiv preprint
arXiv:2105.03797 (2021).
[131] Min Zhang, Pranav Kadam, Shan Liu, and C-C Jay Kuo. “Unsupervised Feedforward Feature
(UFF) Learning for Point Cloud Classification and Segmentation”. In: 2020 IEEE International
Conference on Visual Communications and Image Processing (VCIP). IEEE. 2020, pp. 144–147.
[132] Min Zhang, Yifan Wang, Pranav Kadam, Shan Liu, and C-C Jay Kuo. “Pointhop++: A lightweight
learning model on point sets for 3d classification”. In: 2020 IEEE International Conference on Image
Processing (ICIP). IEEE. 2020, pp. 3319–3323.
[133] Min Zhang, Haoxuan You, Pranav Kadam, Shan Liu, and C-C Jay Kuo. “PointHop: An Explainable
Machine Learning Method for Point Cloud Classification”. In: IEEE Transactions on Multimedia
(2020).
[134] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. “Unpaired image-to-image
translation using cycle-consistent adversarial networks”. In: Proceedings of the IEEE international
conference on computer vision. 2017, pp. 2223–2232.
[135] Song Chun Zhu, Yingnian Wu, and David Mumford. “Filters, random fields and maximum
entropy (FRAME): Towards a unified theory for texture modeling”. In: International Journal of
Computer Vision 27.2 (1998), pp. 107–126.
Abstract
Image generative modeling has been a long-standing problem that has received increasing attention. Generative models try to learn an estimate of an unknown distribution and generate new instances by sampling from it. There is a resurgence of interest in generative modeling due to the performance breakthrough achieved by deep learning in the last decade. Nevertheless, these models are usually large in size, intractable to explain in theory, and computationally expensive in training or generation. Developing an efficient generation method that addresses small model sizes, mathematical transparency, and efficiency in training and generation attracts more and more attention. In this dissertation, we design several generative modeling solutions for texture synthesis, image generation, and image label transfer. Unlike deep-learning-based methods, our developed generative methods address small model sizes, mathematical transparency, and efficiency in training and inference. We first present an efficient and lightweight solution for texture synthesis, which can generate diverse texture images given one exemplary texture image. Then, we propose a novel generative pipeline named GenHop and reformulate it to improve its efficiency and model sizes, yielding our final Green Image generation method. To demonstrate the generalization ability of our generative modeling concepts, we finally adapt it to an image label transfer task and propose a method called Green Image Label Transfer for unsupervised domain adaptation.