Advanced Techniques for Green Image Coding via Hierarchical Vector
Quantization
by
Yifan Wang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
May 2024
Copyright 2024 Yifan Wang
Table of Contents
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 Significance of the Research
1.2 Background of the Research
1.3 Contribution of the Research
1.4 Organization of the Dissertation
Chapter 2: Background
2.1 Traditional Standardized Codecs
2.2 Deep-Learning-Based Codec
2.3 Vector Quantization
Chapter 3: DCST: A Data-Driven Color/Spatial Transform
3.1 PQR Color Space
3.2 Two-Directional Two-Dimensional PCA Block Transform
3.3 Quantization Based on Human Visual System
3.4 Machine Learning Inverse PCA Transform
3.5 Experiment
Chapter 4: GIC-v1: Green Image Codec With Multi-Grid Representation And Multi-Block-Size Vector Quantization
4.1 Multi-Grid (MG) Representation
4.2 Multi-Block-Size Vector Quantization
4.3 Adaptive Codebook Selection
4.4 Large Block Size Wrapping
4.5 Experiments
Chapter 5: GIC-v2: Green Image Codec With Channel-Wise Transform And Dimension Progressive Vector Quantization
5.1 Introduction
5.2 Channel-Wise Transform
5.3 Dimension Progressive Vector Quantization
5.4 Skip Scheme
5.5 Content Adaptive Codebook
5.6 Color Channel Processing
5.7 Entropy Coding
5.8 Rate Distortion Optimization
5.9 Experiment
Chapter 6: GIC-v3: Green Image Codec With Rate Control
6.1 Introduction
6.2 Rate Control Tools of Vector Quantization
6.2.1 Skip and Variable Bitrate
6.2.2 Adaptive Codebook Selection
6.2.3 Global and Local Vector Quantization
6.2.4 Entropy Coding
6.3 Joint Rate Control over Multi-Grids
6.3.1 Problem Formulation
6.3.2 Search of Lagrangian Multipliers
6.3.3 Determination of Target Bit Rate
6.4 Embedded Rate Control
6.4.1 Inter-Grid Embedded Rate Control
6.5 Experiments
Chapter 7: Conclusion and Future Work
7.1 Conclusion
7.2 Further Improvement On Green Image Codec (GIC)
7.3 Extension to Green Video Code (GVC)
References
List of Tables
3.1 HVS-based 8 × 8 quantization matrix
3.2 Performance of optimal inverse kernel
3.3 BD rate for proposed DCST
4.1 MGBVQ parameter setting
4.2 GIC-v1's bitrate allocation among different grids
5.1 Energy distribution after spatial-to-spectral transform
5.2 Vanilla VQ's complexity
5.3 Residual's STD change vs. #codewords for vanilla VQ
5.4 Performance of dimension progressive VQ
5.5 Energy threshold vs. #components sent to next level in c/w transform
5.6 GIC-v2's model complexity
6.1 Performance of context adaptive entropy coding
6.2 Dimension progressive VQ's effect on standard deviation
6.3 Energy threshold vs. #components sent to next level for a 16 × 16 block
6.4 GIC-v3 model size and complexity
List of Figures
2.1 VQ visualization
3.1 DCST framework
3.2 Correlation after DCT/PCA
3.3 L2 distance between learned kernels
3.4 Performance of DCST
3.5 DCST result images on Kodak
3.6 DCST result images on DIV2K
3.7 DCST result images on MBT
3.8 Performance of proposed IDCT kernel
3.9 Performance of proposed IDCT kernel on mismatched test dataset
4.1 Forward and backward framework
4.2 Multi-grid representation
4.3 Long-range correlation on luma
4.4 RD curves in different grids
4.5 Framework of MBVQ
4.6 RD curve for single grid
4.7 Overall RD curve
4.8 Adaptive codebook selection
4.9 Large block size wrapping
4.10 MGBVQ's performance on Lena 256 × 256
4.11 MGBVQ's performance on Lena 512 × 512
4.12 MGBVQ's performance curve
5.1 GIC-v2 framework
5.2 Channel-wise transform
5.3 RD curve for different block sizes
5.4 Dimension progressive VQ
5.5 MSE saturation curve
5.6 Context model design for VQ skip flag
5.7 Bitrate estimation based on zero percentage
5.8 GIC-v2 performance
5.9 Performance comparison for single image
6.1 Performance of adaptive codebook selection
6.2 Global and local VQ
6.3 Performance of global and local VQ
6.4 Context model design for VQ skip flag coding
6.5 Entropy with/without context model
6.6 Lambda vector interpolation
6.7 Curve fitting results for R-D and TH-R relationship
6.8 Grid-8's R-D curve shape
6.9 Performance of our GIC-v3 on Kodak dataset
Abstract
Image compression techniques play an increasingly important role in our daily lives, as we can easily take
pictures with our cameras or phones and view millions of images online. Compression techniques are required to store more photos on a single device. Lossy and lossless compression are the two solutions. In lossless
compression, no information is lost during encoding, which results in perfect reconstruction. However,
lossless algorithms can typically only reduce the file size to about one-half of the original. Conversely, lossy compression discards
some information to yield a much larger compression ratio. The discarded information is chosen to be
perceptually unimportant, so the viewer may not even notice it is missing.
In the traditional compression field, many well-designed codecs, like JPEG, JPEG2000, WebP, BPG,
etc., have been proposed to compress the image efficiently to a small size. They are standardized and
widely used in our lives. Many improvements, like intra-prediction, coding tree unit partitioning, etc., have been
adopted into the standards to improve the rate-distortion gain. These improvements increase the burden of
rate-distortion optimization: the coding complexity increases exponentially relative to the additional RD gain.
VVC intra recently reported state-of-the-art performance with unprecedented complexity.
A newly emerging compression approach is the deep-learning-based codec. It achieves astonishing performance compared with traditional codecs. The variational auto-encoder-based framework is the most popular
deep learning-based compression algorithm. It transforms the input images into some latent representation.
Non-linear transform, end-to-end optimization, attention mechanism, etc., are introduced to enhance coding
results further. On the other hand, their complexity is a huge issue. Millions of operations are required to
decode one single pixel.
To reduce the complexity and model size while maintaining performance, we propose the Green
Image Codec (GIC) in this dissertation. It attempts to merge the advantages of traditional codecs and
deep-learning-based codecs. The multi-grid representation, regarded as a critical factor in the success of
deep-learning-based codecs, serves as the foundation of GIC. We realize it through Lanczos interpolation
and a channel-wise transform. Based on this, we propose a series of coding tools such as dimension-progressive
vector quantization, the PQR color space, etc. Rate-distortion optimization is redesigned for our multi-grid
framework, and a VQ-specialized RDO through lambda vector interpolation is proposed. We also adopt tools
from traditional coding standards, like CABAC, Huffman coding, and the RDO used in scalable video coding. This new
framework results in a low-complexity, small-decoder-size codec with moderate performance compared
with others.
Chapter 1
Introduction
1.1 Significance of the Research
Images and videos contribute more than half of the internet traffic. Compression plays an essential role
in reducing file size and improving transmission efficiency. High quality and high efficiency are two
critical requirements for real-time encoding and decoding of good-quality images. Starting
from JPEG [73, 128], many efforts have been spent on image compression standards for better RD gain and
different user scenarios like progressive decoding and high-resolution images. Many of them have been widely
researched, used in production, and are well-supported by both hardware and software like JPEG2000 [90],
WebP [44], H.264/AVC Intra [141], HEIC (H.265/HEVC intra) [71, 118, 152], BPG [16], AV1 intra [20], and
H.266/VVC intra [24,131]. With the improvement of the codecs, the compressed file size continues to shrink
while the same image quality is retained [46].
VVC intra recently reported the state-of-the-art performance that can reduce the size of files by more
than half when compared with JPEG. However, the majority of these improvements come at the cost of
increased complexity.
Many traditional codecs follow the transform coding scheme that started with JPEG. Thousands of people
have contributed to increasing its performance by proposing different modules/modes that handle exceptional
cases or specific content, each typically yielding a 1-2% rate-distortion gain. A new
standard accumulates many such small gains into an impressive overall RD gain.
This causes a problem: almost everyone in the traditional compression field takes this framework
for granted and tries to keep increasing its performance by adding more functions. As a result, we end up with a
much more extensive codec that needs far more time than JPEG to encode or decode an image.
According to the leaderboard of the "Challenge on Learned Image Compression" (CLIC) at the Computer Vision and
Pattern Recognition conference (CVPR), VVC's decoder size is nearly four times larger than HEVC's
decoder size, VVC's decoder size is more than 26,000 times larger than JPEG's decoder size, and its decoding
time is 15 times slower. This large decoder size and low decoding speed make it difficult to use in production
unless we have many powerful, energy-efficient processors, especially for mobile devices, which are sensitive
to battery usage.
Only recently have companies used HEVC in small devices like cell phones, almost
ten years after it was standardized. One primary reason is that it is so complex that only powerful PCs and
servers can accept the power consumption. Even though chip manufacturers will enter the 3 nm era in
the next several years and produce much more powerful processors, the complexity increase of the
coding standards is far ahead of the growth of the computational power of consumer devices. It
may be several more years before we see VVC used in edge devices.
The above concern led to a new standardization effort called the "Alliance for Open Media (AOMedia)."
It aims to develop high-efficiency, low-complexity, and royalty-free codecs. Efforts focus on speeding up the
RD optimization, yielding an excellent decoding speed but a larger decoder size than HEVC.
The same coding scheme dominated the compression field for four decades until the recent emergence of
deep-learning-based codecs. The VAE-, conditional GAN-, and RNN-based codecs enrich the model zoo and
bring new viewpoints on compression. Furthermore, these learning-based schemes introduce the attention
module that helps allocate the bit rate better. It is helpful but takes time to realize in traditional codecs.
The deep-learning-based codec has a fatal drawback: massive energy consumption because of its high complexity, which remains
to be addressed by researchers. In most cases, even the encoder or decoder alone requires
a GPU to speed up the process. At the same time, their model size is much larger than that of traditional codecs
because the deep network has millions of trainable parameters. Besides, the deep-learning-based codec is
vulnerable to adversarial attacks, which can generate nonsensical images, and it may fail when it meets a
specific condition.
Recent development in coding standards and learning-based codecs results in an impressive RD gain
while the coding complexity increases simultaneously. Besides improving the RD gain, reducing the coding
complexity has become one of the main tasks for image compression. Much effort has been spent on the
hardware encoder/decoder with improved encoding/decoding speed. High complexity will prevent the codec
from being used in production, especially for small devices like cell phones.
We aim to build a new codec that combines the advantages of traditional and learning-based codecs. It is
low in complexity, efficient in decoding, and mathematically transparent. It is called the Green Image Codec
(GIC) due to its low power consumption. To achieve this goal, we analyze previous coding methods to better understand their
advantages and disadvantages. By comparing different coding solutions, we can identify the key ingredients
of a successful codec.
Finding the most suitable coding framework requires carefully designing each module and verifying its
feasibility. The difficulty comes when we attempt to incorporate the advantages of deep-learning-based codecs,
which may come from the non-linear transform and back-propagation, into our framework. We need
to find solutions that are both competitive and transparent. We may discover reusable modules and integrate
them into one framework, which is technically challenging and time-consuming; for example, we need to modify
or rewrite the code for these modules. New modules are also needed to obtain the desired properties, and they
require extensive testing and verification.
1.2 Background of the Research
After carefully reviewing prior art in traditional codecs and deep-learning-based codecs, we identified some
key points that lead to the excellent performance of learning-based codecs: (1) inter-image correlation, (2)
multi-scale representation, and (3) kernel weight learning.
Inter-Image Correlation:
The traditional image codec is based on a single image: every decision is made based on the same input
image to be encoded, and it does not exploit information from other images. In contrast, the learning-based
codec uses a large image training set to summarize the information shared by numerous photos.
When encoding the target image, it has prior knowledge that helps to reduce redundancy
and compress the image further. As deep/machine learning has dramatically changed our daily lives, the
learning-based codec is a trend that all future codecs will follow.
Multi-Scale Representation:
In traditional image codecs, images are partitioned into blocks for encoding. Even though current
standards support large blocks up to 128 × 128, they fail to capture the long-range correlations (especially
the low frequencies) that extend beyond the coding block size. The deep-learning-based codec uses efficient pooling
and convolutional layers to reduce the image to a series of progressively lower resolutions, down to a small
spatial resolution. The information is compacted during this process, and the long-range correlations not
leveraged by the traditional codec are captured as well.
Kernel Weight Learning:
The traditional codec has fixed transform kernels. Even though there are many pre-defined candidates from which the best one can be selected through RD optimization, its performance could be further enhanced.
Take the color space transformation as an example. All coding standards widely use the YUV color
space. It works well for a wide range of diversified images. However, we can find correlations among different
color channels for a single image, leading to tools such as predicting chroma from luma. These tools can be
removed if the three color channels are de-correlated, which helps reduce the complexity.
On the other hand, the deep-learning-based framework has millions of learnable parameters. They are
determined in the training process and can adapt to various input data well. They are jointly optimized with
the RD loss function. This differs from traditional codecs, where the RD is only used to find the optimal
coding modes.
After observing these key points contributing to the deep-learning-based codec’s success, we propose the
GIC. It is built upon vector quantization and multi-scale representation. It is mathematically transparent
and low in complexity.
We develop various tools in GIC, such as the PQR color space, the HVS-based quantization table,
learning-based optimal inverse transform kernel design, etc. They have been verified in the JPEG framework.
Coding tools such as CABAC, Huffman, etc., are also adapted to the GIC framework.
The GIC framework has matured after three years of intensive effort. It is a multi-scale, multi-block-size
vector-quantization-based framework with moderate performance. To reduce the coding complexity
of GIC, we adopt 1) the Saab transform kernel, 2) content-adaptive codebooks, and 3) dimension-progressive vector
quantization. This framework offers good performance with very low encoding/decoding complexity.
1.3 Contribution of the Research
In Chapter 3, we proposed a machine-learning-based inverse DCT transform to address the altered coefficient distribution resulting from quantization. Instead of adding a
post-processing stage to compensate for the quantization error, the proposed approach models the quantization error
within the inverse transform kernel itself, leading to much better reconstruction and lower complexity. The
learned kernel is image-dependent and is signaled along with the bitstream.
In Chapter 4, we proposed our first green image coding framework based on the fundamental idea of multi-grid representation. We employed Lanczos interpolation to derive the multi-grid representations, which
retains more energy in the down-sampled representation than traditional approaches for obtaining the
multi-grid representation. Then, we encoded the image/residual starting from the smallest grid. We point out
that the small grids help capture the long-range correlations (low frequencies), and the large grids capture
the short-range correlations (high frequencies). The bits spent on small grids are more efficient in RD
performance, and with more energy sent to the small grids through Lanczos interpolation, we can encode
more energy in the early stage and achieve better RD gain.
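For concreteness, the following is a minimal sketch (Python with NumPy and Pillow; not the dissertation's actual implementation) of building a Lanczos-interpolated multi-grid representation and per-grid residuals. In GIC the residual at each grid would be taken against the reconstruction of the coarser grid, whereas this sketch simply uses the original coarser grid:

import numpy as np
from PIL import Image

def build_multigrid(img, num_grids=4):
    """img: uint8 array (H, W) or (H, W, 3). Returns [G0, G1, ...] where G0 is the
    original image and each later grid halves the resolution via Lanczos."""
    grids = [img]
    cur = Image.fromarray(img)
    for _ in range(num_grids - 1):
        cur = cur.resize((cur.width // 2, cur.height // 2), resample=Image.LANCZOS)
        grids.append(np.asarray(cur))
    return grids

def multigrid_residuals(grids):
    """Coarsest grid is coded directly; every finer grid codes the difference to the
    Lanczos-upsampled next-coarser grid."""
    residuals = [grids[-1].astype(np.float32)]
    for fine, coarse in zip(grids[-2::-1], grids[::-1]):
        up = Image.fromarray(coarse).resize(fine.shape[1::-1], resample=Image.LANCZOS)
        residuals.append(fine.astype(np.float32) - np.asarray(up, dtype=np.float32))
    return residuals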
Spatial vector quantization was directly used to encode the residuals, yielding good performance. We
perform different block partitions on the input and perform VQ starting from the largest block size with the
motivation to capture the low frequencies first and multiple times. However, the complexity of spatial vector
quantization increases dramatically as the number of codewords or VQ block size increases.
In Chapter 5, we addressed the complexity issue of VQ by utilizing the channel-wise transform and the
proposed dimension progressive vector quantization. The channel-wise transform helps decompose an input
block into different levels, with each level having much lower dimensions than performing spatial VQ directly.
The channel-wise transform and VQ combination helps encode the image with very low complexity.
The multi-grid representation partitions the input image into many grids, and the channel-wise transform splits each grid into multiple levels. We can thus divide the input image into much finer subbands/levels (far more than traditional Laplacian-pyramid-based coding). Each level is encoded
by a VQ, which gives us more flexibility to perform rate-distortion optimization for each VQ and maximize the
overall performance.
Dimension-progressive VQ splits a single VQ with many codewords into several cascaded
VQs with far fewer codewords. Unlike traditional multistage VQ [27], we perform the early-stage VQ
on a subspace of the entire vector rather than performing VQ over the whole vector several times. With our approach,
the model size is reduced further and the performance degrades only slightly. Other coding tools, such as
adaptive codebook selection, were proposed to achieve better RD gain.
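As a rough illustration of this idea (an assumed form, not the exact GIC-v2 algorithm), the sketch below trains one small codebook per dimension range so that each cascaded stage quantizes only a subspace of the vector; the dimension splits and codebook size are illustrative:

import numpy as np
from sklearn.cluster import KMeans

def train_dim_progressive_vq(train_vecs, dim_splits=(4, 12, 48), n_codewords=64):
    """train_vecs: (N, D) training vectors with D equal to the last split boundary.
    Each stage covers dimensions [prev_split, split)."""
    codebooks, start = [], 0
    for end in dim_splits:
        sub = train_vecs[:, start:end]
        km = KMeans(n_clusters=n_codewords, n_init=4, random_state=0).fit(sub)
        codebooks.append((start, end, km.cluster_centers_))
        start = end
    return codebooks

def encode(vec, codebooks):
    """Return one nearest-codeword index per stage for a single vector."""
    indices = []
    for start, end, cb in codebooks:
        sub = vec[start:end]
        indices.append(int(np.argmin(np.sum((cb - sub) ** 2, axis=1))))
    return indices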
In traditional VQ, coding efficiency tends to be improved by finding better codewords, and the bitrate is
controlled through the codebook size, which lacks a systematic rate-distortion optimization and
cannot achieve a continuous bitrate. With the skip scheme, we can control the bitrate by skipping certain VQ stages.
A simple rate-distortion optimization method based on selecting the minimum coding cost at a given lambda
can be used to find the optimal coding parameters, which makes the codec easier to operate.
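The skip decision itself reduces to a per-block Lagrangian comparison. A minimal sketch of this style of decision, with assumed flag and index bit costs, is:

import numpy as np

def rd_decide_skip(residual_block, codebook, lam, skip_flag_bits=1.0):
    """Compare the cost of skipping (leaving the residual uncoded) against the cost of
    coding it with the best codeword, using J = D + lambda * R."""
    vec = residual_block.ravel().astype(np.float64)
    skip_cost = np.sum(vec ** 2) + lam * skip_flag_bits          # distortion of coding nothing
    dists = np.sum((codebook - vec) ** 2, axis=1)
    idx = int(np.argmin(dists))
    index_bits = np.log2(len(codebook))                           # fixed-length index, as an assumption
    vq_cost = dists[idx] + lam * (skip_flag_bits + index_bits)
    return (vq_cost < skip_cost), idx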
Chapter 6 adds two new rate-distortion optimization methods to enhance the green image codec and
provides the VQ-based coding framework with a systematic RDO.
A two-pass RDO method, utilizing a lambda vector interpolation specialized for the GIC framework, was
proposed to obtain a continuous bitrate from the existing parameter settings. The lambda vector interpolation,
combined with a standard RD curve mapping, helps to achieve the target bitrate and a continuous RD curve.
A three-pass RDO, inspired by the RDO in scalable video coding, is also designed and implemented. It
simply models the RD curve in each grid and helps to compute the suitable parameters to achieve the target
bitrate.
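A classic way to realize such a per-grid RD model is to fit a simple parametric curve to each grid and equalize the R-D slopes. The sketch below assumes an exponential model D_i(R_i) = a_i exp(-b_i R_i) per grid and a bisection over lambda; it is only an illustration of this style of allocation, not the exact three-pass RDO used in GIC-v3:

import numpy as np

def allocate_rates(a, b, target_rate, iters=60):
    """a, b: fitted per-grid model parameters; returns per-grid rates R_i >= 0 whose sum
    is approximately target_rate, with equal R-D slope (lambda) across grids."""
    a, b = np.asarray(a, float), np.asarray(b, float)

    def rates_for(lam):
        # Setting -dD_i/dR_i = lam gives R_i = ln(a_i * b_i / lam) / b_i, clipped at 0.
        return np.maximum(np.log(a * b / lam) / b, 0.0)

    lo, hi = 1e-9, float(np.max(a * b))      # lambda search range
    for _ in range(iters):
        mid = np.sqrt(lo * hi)               # geometric bisection on lambda
        if rates_for(mid).sum() > target_rate:
            lo = mid                         # too many bits -> raise lambda
        else:
            hi = mid
    return rates_for(np.sqrt(lo * hi))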
We further enhanced adaptive codebook selection by adding codeword filtering to generate the candidate codebooks, and we minimize their storage by saving only the parent codebook and the sorted index.
Through codeword filtering, we can focus only on the subspace that provides good RD gain,
which differs from most traditional VQ-based methods, where the codebook covers the entire
vector space.
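The following rough sketch (assumed details) shows one way to realize this: sort the parent codebook once by training usage, and take prefixes of the sorted codebook as the candidate codebooks, so only the parent codebook and the sorted order need to be kept:

import numpy as np

def sort_parent_by_usage(parent_codebook, train_vecs):
    """Sort codewords so the most frequently used ones come first."""
    d2 = ((train_vecs[:, None, :] - parent_codebook[None, :, :]) ** 2).sum(-1)
    counts = np.bincount(d2.argmin(axis=1), minlength=len(parent_codebook))
    return parent_codebook[np.argsort(-counts)]

def candidate_codebooks(sorted_parent, sizes=(8, 16, 32, 64)):
    """Each candidate codebook is simply the first `size` codewords of the sorted parent."""
    return [sorted_parent[:s] for s in sizes]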
Global and local VQ have also been added to further improve the content-adaptive VQ. With this
framework, we can achieve reasonable performance when the bitrate is lower than 0.25 bpp. At higher
bitrates, VQ's sparsity makes it inefficient.
Besides, after developing the image codec, we will further extend our work to green video coding based
on the proposed GIC framework and more new coding tools to achieve low complexity and efficient video
coding.
We aim to demonstrate the advantages of the proposed framework in future coding standards and attract
other researchers to contribute to its development. The combination of our framework with a deep learning
network is particularly intriguing due to the widespread use of VQ in this domain. Utilizing our framework
can alleviate the encoding complexity of the DNN and enhance fidelity.
1.4 Organization of the Dissertation
The rest of the dissertation is organized as follows. Some background coding knowledge (e.g., traditional
coding standards, the deep-learning-based coding methods, vector quantization) is reviewed in Chapter 2.
The data-driven color/spatial transform (DCST) is introduced in Chapter 3. The first and the second
versions of the green image codec, denoted by GIC-v1 and GIC-v2, are presented in Chapter 4 and Chapter 5,
respectively. The latest version GIC-v3 will be introduced in Chapter 6. Concluding remarks and ongoing
research work are described in Chapter 7.
Chapter 2
Background
This chapter reviews the traditional codec and its evolution in Section 2.1. Then, popular deep-learning-based codecs are reviewed in Section 2.2, where we classify them into end-to-end and hybrid codecs; we analyze the state-of-the-art models and explore the possibilities and limitations
of incorporating them into standards. Section 2.3 reviews recent developments and applications of vector
quantization and analyzes why it is not widely used in image coding.
2.1 Traditional Standardized Codecs
JPEG 2000 uses the discrete wavelet transform (DWT) to transform the images into different frequency
bands before using a complex entropy model to encode the coefficients. Due to the multi-band partition,
it can progressively decode the image, making it ideal to use as a web image format. However, it was less
successful than expected because of its relatively high coding complexity (at that time) and the popularity
of JPEG.
Most image codecs are based on a structured framework of block transform and scalar quantization.
Take JPEG as an example. Images are first partitioned into non-overlapping blocks of size 8 × 8. Next,
the discrete cosine transform (DCT) [9] is applied to each block for spatial-domain de-correlation and
energy compaction. Then, the DCT coefficients are quantized based on a well-designed quantization table
scaled by the desired quality factor. Finally, the quantized DCT coefficients are zig-zag scanned and entropy coded to
yield the output bit stream. Besides, special codec variants, such as progressive JPEG and H.264/SVC, have
been developed for progressive coding.
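A minimal sketch of this block-transform pipeline is given below; the quantization table is passed in as a parameter rather than being one of the actual JPEG tables:

import numpy as np
from scipy.fft import dctn

def encode_block(block8x8, qtable):
    """block8x8: (8, 8) pixel block; qtable: (8, 8) quantization step sizes."""
    coeffs = dctn(block8x8.astype(np.float64) - 128.0, norm="ortho")  # level shift + 2-D DCT
    quantized = np.round(coeffs / qtable).astype(np.int32)            # scalar quantization
    return zigzag(quantized)                                          # 1-D sequence for entropy coding

def zigzag(mat):
    """Scan an 8x8 matrix along anti-diagonals, alternating direction."""
    h, w = mat.shape
    out = []
    for s in range(h + w - 1):
        idx = [(i, s - i) for i in range(h) if 0 <= s - i < w]
        out.extend(mat[i, j] for (i, j) in (idx if s % 2 else idx[::-1]))
    return np.array(out)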
Later standards follow the same skeleton and keep extending the framework. More flexibility is being introduced to the codec to achieve better RD gain. For example, the number of intra-prediction angles/modes increased from 9 (luma 4 × 4 block in H.264/AVC) to 35 (H.265/HEVC) and recently became
67 (H.266/VVC); more transform types (DCT-2, DCT-7, DST, ADST, etc.) are introduced to allow better
energy compaction on the different kinds of blocks, more coding unit partitions, more prediction/transform
types, more taps in post-processing filters, etc. This flexibility helps to improve the rate-distortion performance by a significant margin against the previously fixed block size and single transform kernel coding
framework.
HEVC intra allows a maximum coding block size of 64 × 64 pixels, with a depth level ranging from 0 to 3,
to handle regions with different textures. It has 18 modes for 4 × 4 prediction unit (PU); 35 modes for 8 × 8,
16×16, and 32×32 PUs; and four modes for 64×64 PUs for intra-prediction. The RD cost of each prediction
mode belonging to this subset is computed, and the mode with the best RD cost is selected to (intra) encode
the PU. However, with so many candidates available, choosing the best coding modes from all these possible
candidates is not a trivial task. Exhaustive searches would not be practical since they require much time to
compare all the candidates (including entropy coding). To speed up the RD search, rate-distortion estimation
is used to predict the bit rate and distortion as a function of quantization step size or percentage of zero
coefficients. The encoder estimates the RD costs for many possible candidates and dramatically reduces the
search space. Even though estimation can significantly reduce the coding complexity, the encoder still needs
to perform several actual encodings before making the final decision. Even with powerful CPUs and
GPUs, this complexity remains a considerable burden, especially for mobile devices like cell phones. In
addition, with the improvement of CMOS sensors, 8K videos and images with more than 100 million pixels can
be captured with consumer-level equipment, which raises the pressure on efficient compression techniques.
To further speed up the codec, many efforts have been spent on optimizing the encoder and decoder,
increasing the accuracy of RD estimation, and developing the hardware encoder/decoder/accelerator [3, 31,
68,116]. With the help of hardware devices [11], one can achieve real-time encoding and decoding for HEVC.
Some hardware/chips, like the real-time encoder MB86M31 from Socionext and Quick Sync Video technology from
Intel [60], are already used in production. However, many hardware implementations still have
a long path to maturity and mass production because of cost and energy efficiency.
Besides, much effort is spent on in-loop filters (for video) and post-processing filters to reduce the coding
artifacts like blurring, ringing, etc., and improve the reconstruction quality. Examples include the sample adaptive
offset (SAO) [39] and the de-blocking filter [98] in HEVC, and the constrained directional enhancement filter (CDEF) [93]
in AV1. These filters are becoming more complex and may only filter several frequency bands or histogram bins.
Additional taps or more directional adaptive modes are added to improve compression artifact removal
results and to retain edge information.
The YCbCr color space has been used in most compression standards because of its highly de-correlating
characteristic over the statistics [56]. However, it is not necessarily optimal for an individual image, leading to techniques like
predicting chroma from luma [124, 153] to further reduce the correlation between the three color channels.
Efforts on PCA color transforms [26] [79] have shown better de-correlation performance to further reduce the bit
rate.
Quantization is another essential part of lossy compression. JPEG uses a quantization matrix
based on the human visual system [129], while the most recent compression standards use uniform quantization for the sake of low complexity. Many researchers have investigated the human visual system
(HVS), and the first breakthrough in incorporating the HVS in image coding is [89], which represents the HVS as a
non-linear point transform followed by the modulation transfer function (MTF).
Research on inverse DCT (IDCT) (or other inverse transforms) has focused on complexity reduction
via new computational algorithms and efficient software/hardware implementation. An adaptive algorithm
is proposed in [101] to reduce IDCT’s complexity. A zero-coefficient-aware butterfly IDCT algorithm is
investigated in [104] for faster decoding. As to practical implementations, a fast multiplier implementation
is studied in [12], where the modified Loeffler algorithm is implemented on FPGA to optimize the speed and
the area.
More work needs to be done on analyzing the quantization effect on coefficients. In image/video coding,
DCT coefficients undergo quantization and de-quantization before inverse transform. When the coding bit
rate is high, quantized-and-then-dequantized DCT coefficients are very close to input DCT coefficients.
Consequently, IDCT can reconstruct input image patches reasonably well. However, the quantization effect
is not negligible when the coding bit rate is low. The distribution of quantized DCT coefficients deviates
from that of the original DCT coefficients. The traditional IDCT kernel, derived from the forward DCT kernel,
is then no longer optimal.
Entropy coding further reduces the number of bits required to store the coefficients. In
JPEG, a well-designed Huffman table [55] is employed to encode the AC coefficients after run-length coding.
With the introduction of the QM-coder, the binary arithmetic coder (BAC) and the context-adaptive binary
arithmetic coder (CABAC) have become widely used. They have moderate complexity compared with a multi-symbol
arithmetic coder and use neighbor information/correlation to build many context models that further improve
the compression ratio. CABAC became the main entropy coder in later standards. Many mapping
functions, typically direct mapping, Huffman mapping, etc., map different values to binary strings as input to CABAC [91].
Many quality assessment tools have been proposed to better align image quality with human opinion and
guide RD optimization. SSIM [51,137] and MS-SSIM [139] have been widely adopted in recent years
because they give more weight to distortions that the human visual system cares more about.
Learning-based quality assessment tools like VMAF [110] and the Apple Advanced Video Quality Tool (AVQT)
have been used extensively in video quality assessment. Increasingly more metrics are being proposed to
evaluate compressed image/video quality in a way that matches human subjective tests, which consider properties of
the display, including the resolution, display size, viewing distance, etc. However, PSNR and MSE remain
widely used in coding standard development and RD search because of their simplicity, effectiveness, and
the difficulty of finding widely accepted alternatives [136], despite the weakness that they do not align
well with the human eye.
2.2 Deep-Learning-Based Codec
Deep learning (DL) [74] has performed phenomenally on various image processing tasks like classification,
segmentation, object detection, etc. Compared with traditional approaches, we no longer need
to handcraft the features as in conventional computer vision research. Through back-propagation and non-linearity, it is possible to derive much more powerful features after millions of iterations.
For example, SIFT-based [97] object classification requires careful descriptor and matching-criteria design,
whereas DL-based classification only requires designing the model structure and the loss functions;
the GPU then takes over all the training work. This can easily double the prediction accuracy for some
classification tasks. Many DL models, such as face recognition and object detection, are already
applied in daily life.
The development of deep learning algorithms motivated the development of DL-based compression. It
has attracted much attention from computer vision, machine learning, and traditional compression fields. It
is trying to break the monopolization of traditional codecs. There have been many models and architectures
published for image/video compression. They have achieved astonishing RD gain compared to traditional
codecs, especially on the subjective human test.
We can generally classify DL-based image coding into two categories: (1) end-to-end codecs and (2)
hybrid codecs. In end-to-end codecs, the encoder and decoder are trained end-to-end jointly, guided by some
specific loss function.
Among all the frameworks, autoencoder (AE) [121] and variational autoencoder (VAE) based codecs
[14,65] attract the majority of attention. In [15], performance close to BPG was reported. Later, with
deeper models and more specific designs, variants like the VAE with hyper-prior [15]
and the deep convolutional VAE [22] were proposed. They transform the input images into a latent
representation, which is quantized and decoded by the decoder network. The hyper-prior models the latent
coefficients' distribution to help encode them better. VAE with adversarial training [7,80,146] has performed
well against existing codecs but failed to show a clear advantage of the adversarial loss. Commonly, most
methods add layers and make the networks deeper to improve compression results, which
substantially increases model complexity and computational cost.
The GAN-based codec, which gives impressive visual results at extremely low bit rates by transmitting the structure information in the bit stream and synthesizing the texture at the decoder/generator
side, reduces blurring and coding artifacts. [7] uses a conditional GAN based on MSE/SSIM and reports
very good performance that significantly outperforms BPG. A selective generative compression idea is proposed to let the network know what needs to be compressed and what needs to be generated, using a pixel-wise
distortion mask and random instances. According to their ablation study, many backgrounds are generated
rather than compressed, while the objects that attract attention are compressed.
The recurrent neural network (RNN) based work [122, 123] using GRU/LSTM and a binary network can
perform predictive coding and results in a progressive bitstream. It is a simple solution to achieve a single
model for multiple-bit rates, but we have yet to see much work in high-resolution image compression. Only
a few works are proposed within this scope, and many of them are limited to small block sizes like 32 × 32.
The transformer-based model [86] has been proposed for compression in the last several years and shows
excellent performance. It enriches the model zoo of DL-based compression. However, considering its data-hungry
nature and large model size, more effort is needed to improve these frameworks.
In most cases, deep networks transform images into a tight latent representation. The network model
can capture some common information shared by many training images. With the help of the bottleneck
layer, the model can reduce the amount of data transmitted in the bitstream with a learned entropy coder.
This bitstream contains a small number of latent variables in the interface of the deep encoder and the
deep decoder. The decoder will recover the image based on the latent variables. It is highly optimized
since it is jointly trained with the encoder, which generates the latent representation. The RD optimization
is performed on all the encoder parts, including the transform and entropy coding. This enables it to
outperform the traditional codec, where RDO is used only to select the optimal modes. Besides, the efforts
for complex entropy coder and transform design are reduced.
Furthermore, specialized layers like the generalized divisive normalization (GDN) [13] perform
better normalization for pixel-wise prediction/reconstruction than batch normalization (BN) and the
rectified linear unit (ReLU), because BN and ReLU are regarded as adding noise, which is not optimal for
pixel-wise prediction tasks like compression or segmentation. A non-local layer realized through 1 × 1
convolutions is also employed to better exploit long-range correlations.
To overcome the non-differentiable quantization layer, uniform noise is added during the training stage to
let the gradient pass through. Furthermore, to eliminate the mismatch between the training and testing phases
in quantization, universal quantization [5] is introduced in recent works.
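A minimal, framework-agnostic sketch of this training/testing behavior (shown in NumPy purely for illustration) is:

import numpy as np

def quantize(latent, training, rng=np.random.default_rng()):
    if training:
        # "Soft" quantization: y + u with u ~ U(-0.5, 0.5), so gradients can pass through y.
        return latent + rng.uniform(-0.5, 0.5, size=latent.shape)
    # Hard quantization at inference time.
    return np.round(latent)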
Entropy coding plays a vital role in DL-based codecs, especially in the VAE-based framework. Apart from
accurately coding the latent representation, it also estimates rates during rate-distortion optimization. A
probabilistic model is proposed in [15], which uses a hyper-prior to capture the dependencies in the latent
space. Auto-regressive models [76, 92, 94] establish the context relationship between already-coded elements and
the element to be coded in the latent spatial space to improve the accuracy of probability estimation. In [49],
a 3D auto-regressive model is proposed to further combine spatial and channel relationships to capture the
elements' correlations. However, there is still a gap between the estimated and the actual marginal distribution
of the latent representation.
Many traditional quality assessment metrics like PSNR/MSE/SSIM, which have a simple analytic derivative, are widely used as loss functions and can provide excellent performance against
traditional codecs. Complex evaluation tools like VMAF or [107], however, are not differentiable
because a random forest regressor or support vector regressor (SVR) is used to predict the final quality score.
Even though these metrics can provide better visual quality, they cannot be used directly as loss functions for
DL models. [19] proposes proxyIQA to approximate image quality assessment tools like VMAF using a DL
model. Such works bring these non-differentiable IQA metrics into DL-based models and provide
better performance in subjective tests.
Besides, DL-based codecs try to add more advanced loss functions that are not widely used in traditional coding. These loss functions are designed to exploit the human visual system [37] or an attention
mechanism [10, 155] to provide better perceptual quality than the standards. The attention mechanism, in particular, is difficult to add to a traditional codec because saliency
detection is hard to perform accurately and efficiently with traditional methods. With the help of DL, we can quickly
obtain high-accuracy masks and perform RD optimization efficiently. These advanced losses extend the scope of compression
and may be used in future standard designs.
To address the lack of multiple bit rates from one DL model, [48, 122, 123] use recurrent neural networks
[147] and a binary network to create a progressive bitstream that can be truncated at any point. Interpolating
the gain vector through Lagrange relaxation is another alternative, but it can only achieve RD points around
the trained one; for points farther away, several models are still needed.
The performance of DL-based models is impressive and can easily outperform HEVC intra by a large margin.
However, these end-to-end trained codecs have some disadvantages that hinder their practical application:
1. Large model size and complexity: Network training, image encoding, and decoding all demand a
large amount of time and energy. The training process requires a large amount of training data and usually takes
millions of iterations to train a model. Even though some of them claim to achieve
real-time encoding/decoding, GPUs are required in most cases.
2. Lack of mathematical transparency, vulnerability to adversarial attacks, and domain shifting: Once
the input image has a different distribution than the training samples, the performance may degrade
significantly. It is difficult to predict when the model will fail, so it cannot be a one-size-fits-all solution.
On the other hand, the hybrid codec is the approach that is more likely to reach production in the near future.
The idea is to use both the traditional codec and the DL-based models to form a powerful codec. There are
many works related to this scope as well. It can be further classified into two categories based on whether
the traditional codec has been modified.
The first category leaves the traditional codec untouched/unmodified: DL is used to pre/post-process the image/video in a sandwich structure. The super-resolution-based framework [59, 77] is a typical
representative. In this framework, the input image is down-sampled to a smaller resolution using a DL model;
then, the traditional codec is used to encode/decode this small-scale representation. At
decoding time, the up-sampling DL model performs super-resolution on it. DL-based down-sampling
and up-sampling perform much better than traditional bi-linear or bi-cubic interpolation. Besides, other
work tries to generate a compact representation [47] by projecting all information (including color) into
one gray-scale representation/image and encoding it using the traditional codec; with DL, such a non-linear
projection/transform can easily be achieved to make the representation more compact and informative.
In another category, DL replaces part of the functions in traditional codecs. We can find some applications related to this category in standards like AV1. DL models in this category primarily aim to increase
the RD gain or reduce the encoding/decoding complexity/time. With the data-driven property, block partition, RD search, and the massive effort of designing in-loop filters can be significantly simplified and sped
up. When the DL model prediction fails, there is also an auto-correction mechanism to avoid destructive
performance/adversarial attacks by switching back to the original modules. Typically, a shallow DL model
with two to three layers, which is easy to train and use, will be used to prevent massive computation.
Examples like fast coding tree unit (CTU) partitioning [81, 84] help to reduce the time of the coding tree
unit partition search and avoid massive RD-cost computation. Works that either directly predict the
CU depth or use a series of binary predictors to decide whether to split can provide an impressive reduction in RD search
time. Similarly, transform type search [117] limits the number of candidates on which
RD cost estimation is performed when coding the residual blocks, because different transforms have
different energy compaction effects on different residual blocks; it helps to eliminate low-probability
transforms based on the residual content.
CNN-based in-loop filters [32, 105] and post-processing filters [78, 149] reduce coding artifacts
and distortions much more than traditional handcrafted filters. They also reduce the effort of filter
design, especially as new standards tend to use more directional filters with more taps to smooth
block boundaries while retaining edges.
2.3 Vector Quantization
Vector quantization (VQ) [40, 45] was intensively studied for lossy image coding in the 1980s and 1990s
[41, 96, 109]. Many variants have been proposed, like entropy-constrained VQ [25, 154], tree-structured
VQ [64], etc. Transform-domain weighted interleave vector quantization (TwinVQ) has been successfully
applied in audio coding standards [50, 100]. Additionally, VQ for data hiding [36, 144] provides better data
and communication protection by encoding information into a media carrier to form a media file or an
unrecognizable code stream; it has been widely researched in recent years.
Figure 2.1: t-SNE visualization of VQ class assignments for Gaussian data.
VQ for image compression was heavily researched many years ago. Multistage VQ [27] was used to reduce
the complexity, although the performance starts to degrade as N increases.
Yet, almost all image/video compression standards use scalar quantization (SQ) instead of VQ to
quantize prediction residuals. This is attributed to the fact that SQ has lower complexity and a smaller
parameter size than VQ, making it much easier to fine-tune during RD optimization. VQ, in contrast, needs to
train a codebook, and both the encoder and decoder need to store the codebooks; this memory overhead
was a real problem in the old days. Although VQ is proven to be a universal coding scheme, its codebook
training is a challenging task.
Typically, the VQ codebook is trained through the LBG algorithm [106] or the k-means algorithm [8], an NP-hard problem. We can only reach a local minimum rather than the optimal solution within a limited time.
The final codebook is also sensitive to initialization, especially when the vector dimension is large.
Although multiple initializations partially alleviate the problem, the run time increases with the number of
initializations. Another problem relates codebook size to quantization quality. As
Figure 2.1 shows, each color represents a group that is quantized with the same codeword. To achieve a
high bitrate, we must have many codewords and training samples.
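A minimal sketch of this codebook-training and encoding step, using scikit-learn's k-means in place of LBG (the two follow the same iterative centroid-update idea), is:

import numpy as np
from sklearn.cluster import KMeans

def train_codebook(blocks, n_codewords=256, n_init=4):
    """blocks: (N, block_h * block_w) training vectors -> (n_codewords, dim) codebook."""
    km = KMeans(n_clusters=n_codewords, n_init=n_init, random_state=0).fit(blocks)
    return km.cluster_centers_

def vq_encode(blocks, codebook):
    """Nearest-codeword index for each input vector (these indices are then entropy coded)."""
    d2 = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def vq_decode(indices, codebook):
    return codebook[indices]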
There are several recent developments in VQ-related work: (1) intra/inter prediction, (2) gain-shape
VQ [75], as used by the Daala codec [125], and (3) DL-based works like VQ-VAE [111] and soft-to-hard
vector quantization [4]. Intra-prediction is based on pre-defined modes and the pixel values of pre-decoded
neighboring blocks (the reference), and inter-prediction is based on the motion vector (MV), which points to
pre-decoded blocks in reference frames. The saved modes/MVs are patterns instantiated with different reference pixel values.
As a result, one single mode can yield thousands of codewords, making it much more powerful than a traditional
VQ codebook.
In gain-shape VQ, a block is split into a gain scalar and a shape vector with unit norm. The gain is encoded
by scalar quantization. The number of impulses in the normalized block coefficients determines the quantization
quality of the shape vector. Codewords are generated on the unit sphere [126], which is easy to compute,
so we no longer need to store the codebook explicitly. The codeword contents/impulse vectors (rather than their
indices) are written directly into the bitstream instead of signaling a VQ codebook index (obviously, it is
inefficient to signal the index of a codebook with millions of codewords). Different numbers of impulses are allowed for
different blocks depending on their content, and overlapped blocks ensure better quality at block boundaries.
Daala offers better perceptual quality because of this gain-shape VQ.
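A toy sketch of the gain-shape split (a simplified impulse placement, not Daala's actual PVQ codeword search) is:

import numpy as np

def gain_shape_quantize(block, k_impulses=8, gain_step=4.0):
    """Split a block into gain and unit-norm shape; scalar-quantize the gain and
    approximate the shape with k signed integer impulses on the unit sphere."""
    x = block.astype(np.float64).ravel()
    gain = np.linalg.norm(x)
    if gain == 0.0:
        return 0.0, np.zeros_like(block, dtype=np.float64)
    shape = x / gain                                   # unit-norm shape vector
    gain_q = gain_step * np.round(gain / gain_step)    # uniform scalar quantizer for the gain

    # Place k unit impulses roughly proportionally to |shape|, then renormalize.
    mag = np.abs(shape)
    y = np.floor(k_impulses * mag / mag.sum())
    while y.sum() < k_impulses:                        # hand out leftover impulses
        y[np.argmax(k_impulses * mag / mag.sum() - y)] += 1
    y *= np.sign(shape)
    shape_q = y / np.linalg.norm(y)
    return gain_q, (gain_q * shape_q).reshape(block.shape)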
In DL, vector quantization is used to generate discrete latent representations. VQ-VAEs can model
very long-term dependencies through their compressed discrete latent space, which works well for tasks like
image generation. Because they can use the latent space efficiently and successfully model important features that
usually span many dimensions in data space, the jointly trained codebook can generate images with better
fidelity. VQ-VAE has been successfully used in video generation as well: given a sequence of frames, the model can
yield good results without visible degradation in quality while keeping the local geometry correct.
Soft-to-hard vector quantization [4] proposes a substitute for vector quantization for better performance.
Chapter 3
DCST: A Data-Driven Color/Spatial Transform
In this chapter, we illustrate the DCST framework shown in Figure 3.1 (which is based on JPEG and can
be adapted to other coding standards). The data-driven color transform is described in Section 3.1.
The two-directional two-dimensional PCA is presented in Section 3.2. A new quantization matrix designed
based on the HVS is demonstrated in Section 3.3. The inverse transform kernel that accounts for quantization error
is proposed in Section 3.4. Experimental results are provided in Section 3.5.
3.1 PQR Color Space
The input RGB image is de-correlated using PCA into a new color space, which we call PQR. The PCA color transform is performed for every single image during encoding, which gives better energy compaction than YCbCr. The P channel accounts for more than 90% of the total energy and is quite similar to the Y channel, while the R channel accounts for only about 2% of the energy or even less, which is quite different from the V (Cr) channel. Similarly to how JPEG compresses the down-sampled Cb and Cr channels, we down-sample the Q and R channels by two to match the JPEG 4:2:0 framework.
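A minimal numpy sketch of this per-image PCA color transform is given below; the function names rgb_to_pqr/pqr_to_rgb are illustrative (not the actual implementation), and the 4:2:0-style down-sampling of the Q and R channels is omitted.

```python
import numpy as np

def rgb_to_pqr(img):
    """Per-image PCA color transform (sketch).

    img: H x W x 3 float array in RGB order.
    Returns the de-correlated PQR image, the 3x3 kernel, and the channel means;
    the kernel and means are side information needed by the decoder."""
    pixels = img.reshape(-1, 3)
    mean = pixels.mean(axis=0)
    centered = pixels - mean
    cov = centered.T @ centered / centered.shape[0]
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1]          # sort eigenvectors by decreasing energy
    kernel = eigvec[:, order]                 # columns: P, Q, R directions
    pqr = centered @ kernel                   # P carries most of the energy
    return pqr.reshape(img.shape), kernel, mean

def pqr_to_rgb(pqr, kernel, mean):
    """Inverse color transform using the stored orthonormal kernel and means."""
    pixels = pqr.reshape(-1, 3) @ kernel.T + mean
    return pixels.reshape(pqr.shape)
```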
3.2 Two-Directional Two-Dimensional PCA Block Transform
When using DCT as the transform kernel, some weak correlations remain. As Figure 3.2 shows, the first several AC coefficients remain highly correlated with others, while single-image-based PCA can de-correlate the image quite well.
Figure 3.1: Proposed DCST image coding framework. Encoding: the input image is split into 8 × 8 blocks x(i, j), transformed by (2D)^2 PCA into coefficients X(u, v), quantized by a uniform scalar quantizer with weights H(u, v) and a constant step size, and entropy coded as in JPEG to form the bitstream. Decoding: inverse JPEG entropy coding, uniform scalar dequantization, and the machine-learning inverse PCA (with kernels X, Z) produce the reconstructed image.
We propose using (2D)^2 PCA [148] to perform the block transform; it was first used to extract features for image representation. Compared with 2D PCA [145], it can achieve the same or even higher recognition accuracy with much-reduced weight requirements: an N × N block requires N^4 kernel weights for 2D PCA, while (2D)^2 PCA only requires 2N^2 kernel weights.
(2D)^2 PCA consists of a 2D PCA in the horizontal direction and an alternative 2D PCA in the vertical direction. Consider an m × n random block A. Let X ∈ R^{n×d} be a matrix with orthonormal columns; projecting A onto X yields an m × d matrix Y = AX. Let Z ∈ R^{m×q} be a matrix with orthonormal columns; projecting A onto Z yields a q × n matrix B = Z^T A. If there are M m × n training matrices A_k (k = 1, 2, ..., M), then the average matrix Ā can be computed as:

Ā = (1/M) Σ_{k=1}^{M} A_k    (3.1)
For 2D PCA, let A_k^(i) and Ā^(i) denote the i-th row vectors of A_k and Ā, respectively, so that

A_k = [(A_k^(1))^T, (A_k^(2))^T, ..., (A_k^(m))^T]^T

and

Ā = [(Ā^(1))^T, (Ā^(2))^T, ..., (Ā^(m))^T]^T.

Figure 3.2: Correlation among the first 16 DC/AC components of (a) DCT and (b) PCA in an 8 × 8 block.
Then, the covariance matrix G_horizontal can be computed as:

G_horizontal = (1/M) Σ_{k=1}^{M} Σ_{i=1}^{m} (A_k^(i) − Ā^(i))^T (A_k^(i) − Ā^(i))    (3.2)
For the alternative 2D PCA, let A_k^(j) and Ā^(j) denote the j-th column vectors of A_k and Ā, respectively, so that A_k = [A_k^(1), A_k^(2), ..., A_k^(n)] and Ā = [Ā^(1), Ā^(2), ..., Ā^(n)]. Then, the covariance matrix G_vertical can be computed as:

G_vertical = (1/M) Σ_{k=1}^{M} Σ_{j=1}^{n} (A_k^(j) − Ā^(j)) (A_k^(j) − Ā^(j))^T    (3.3)
Having the covariance matrices G_horizontal and G_vertical, we can compute the projection matrices X and Z. This is the same step as in PCA: perform an eigenvalue decomposition and sort the eigenvectors by their eigenvalues. We call X and Z the horizontal and vertical kernels of (2D)^2 PCA, respectively. The forward (2D)^2 PCA projects the m × n block A onto X and Z simultaneously, yielding a q × d transformed matrix C:

C = Z^T · A · X    (3.4)
The inverse transform reconstructs the original matrix A from the transformed matrix:

A = Z · C · X^T    (3.5)
In the proposed method, A denotes the 8 × 8 non-overlapping blocks from the input image, and C denotes the transformed blocks that need to be quantized. Both the horizontal and vertical kernels (X and Z) are 8 × 8. It has been proven that the optimal projection matrix X_opt is composed of the orthonormal eigenvectors X_1, ..., X_d of G_horizontal corresponding to decreasing eigenvalues. Similarly, the optimal projection matrix Z_opt comprises the orthonormal eigenvectors Z_1, ..., Z_q of G_vertical corresponding to decreasing eigenvalues. According to this property, the transformed coefficients of each block after (2D)^2 PCA follow the same trend as DCT coefficients: the coefficient energy decreases from top to bottom and from left to right, so the coefficients still benefit from run-length coding after zigzag scanning.
We train (2D)^2 PCA for every image during encoding to obtain transform kernels with better energy compaction than DCT. Therefore, the overheads are the horizontal and vertical covariance matrices, G_horizontal and G_vertical.
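The kernel training and the forward/inverse transforms of Eqs. (3.1)–(3.5) can be sketched in a few lines of numpy; the function names below (train_2d2pca, etc.) are illustrative, not the actual implementation.

```python
import numpy as np

def train_2d2pca(blocks, d=None, q=None):
    """Train (2D)^2 PCA kernels from blocks of shape (M, m, n); a sketch of Eqs. (3.1)-(3.3)."""
    M, m, n = blocks.shape
    d = d or n
    q = q or m
    centered = blocks - blocks.mean(axis=0)                  # A_k - Abar
    g_h = np.einsum('kij,kil->jl', centered, centered) / M   # Eq. (3.2), n x n
    g_v = np.einsum('kij,klj->il', centered, centered) / M   # Eq. (3.3), m x m

    def top_eigvecs(g, num):
        w, v = np.linalg.eigh(g)                             # ascending eigenvalues
        return v[:, np.argsort(w)[::-1][:num]]               # keep `num` leading eigenvectors

    X = top_eigvecs(g_h, d)                                  # horizontal kernel, n x d
    Z = top_eigvecs(g_v, q)                                  # vertical kernel, m x q
    return X, Z

def forward_2d2pca(A, X, Z):
    return Z.T @ A @ X                                       # Eq. (3.4)

def inverse_2d2pca(C, X, Z):
    return Z @ C @ X.T                                       # Eq. (3.5)
```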
3.3 Quantization Based on Human Visual System
We follow the modulation transfer function applied in [29] and do some modifications to generate our
quantization matrix for coefficients from (2D)
2P CA.
Table 3.1: A proposed 8 × 8 quantization matrix of one image in Kodak dataset, based on Human Visual
System
16 16 16 16 17 18 22 24
16 16 16 16 17 18 22 24
16 16 16 16 17 19 23 25
16 16 16 17 20 22 26 29
17 17 17 20 29 34 41 45
18 18 19 22 34 44 56 62
22 22 23 26 42 56 79 91
24 24 25 29 46 63 92 108
The MTF in [29] is given as a nonlinear point transform:

H(u, v) = 2.2 (0.192 + 0.144 f̂(u, v)) exp(−(0.114 f̂(u, v))^1.1)  if f̂(u, v) > f;  1.0 otherwise    (3.6)

where f̂(u, v) is the radial spatial frequency in cycles/degree, and f is the frequency of 8 cycles/degree at which the exponential peaks.
First, the discrete horizontal and vertical frequencies, f(u) and f(v), are computed as:

f(u) = R(u) / (Δ · 2N),  for u = 1, 2, ..., N    (3.7)

f(v) = C(v) / (Δ · 2N),  for v = 1, 2, ..., N    (3.8)

where the dot pitch Δ is about 0.25 mm on a high-resolution computer display, and N is the number of frequencies.
In [127], R and C are the numbers of sign changes in each row and column of the Hadamard transform matrix. In the proposed method, R and C are computed as the sum of the absolute slope values at each sign change in each column of the vertical and horizontal kernels. Given the vertical kernel Z ∈ R^{N×N} and the horizontal kernel X ∈ R^{N×N}, R and C are computed as:

R(u) = a Σ_{i=0}^{N−1} |Z(i, u) − Z(i + 1, u)| × sc(Z(i, u))    (3.9)

C(v) = a Σ_{i=0}^{N−1} |X(i, v) − X(i + 1, v)| × sc(X(i, v))    (3.10)

where a is a factor that scales the proposed horizontal and vertical frequencies to the range [0, N]. The reason is that the range of horizontal and vertical frequencies used in the DCT or Hadamard domain is also [0, N]; by using the same range, the proposed quantization matrix has the same coefficient level as JPEG, which performs well on the HVS. When N = 8, a is set to 1.4 according to our experiments.
Here sc(K(i, j)) denotes the sign change between K(i, j) and K(i + 1, j):

sc(K(i, j)) = 1 if K(i, j) × K(i + 1, j) < 0;  0 otherwise    (3.11)
To compute f̂(u, v), the discrete horizontal and vertical frequencies are converted to radial frequencies and scaled for a viewing distance, dis, in millimeters:

f(u, v) = [π / (180 arcsin(1/√(1 + dis²)))] √(f(u)² + f(v)²)    (3.12)

dis is set to 512 mm because the appropriate viewing distance is four times the height.
Then, these frequencies are normalized by an angle-dependent function, s(θ(u, v)), to account for variations in the visual MTF:

f̂(u, v) = f(u, v) / s(θ(u, v))    (3.13)

where s(θ(u, v)) is computed by:

s(θ(u, v)) = ((1 − ω)/2) cos(4θ(u, v)) + (1 + ω)/2    (3.14)

and

θ(u, v) = arctan(f(u) / f(v))    (3.15)
where ω is set to 0.7 [102].
Finally, the human visual frequency weight matrix H(u, v) is calculated using Eq. 3.6. After dividing by H(u, v), the coefficients in a transformed block become equally important to the HVS, so a uniform scalar quantizer can quantize them. The quantization table is computed as:

Q(u, v) = Round[q / H(u, v)]    (3.16)

where q is the step size of the uniform scalar quantizer.
Figure 3.3: L2 norm of differences between kernels learned with different QFs.
Table 3.2: Evaluation results of the optimal inverse kernel on test images in the Kodak, DIV2K, and MBT datasets using YCbCr as input, where training QFs are set to 50, 70, and 90.
Dataset  Metric      QF=50     QF=70     QF=90
Kodak    PSNR (dB)   +0.1930   +0.2189   +0.2057
Kodak    SSIM        +0.0029   +0.0025   +0.0012
DIV2K    PSNR (dB)   +0.1229   +0.1488   +0.1144
DIV2K    SSIM        +0.0024   +0.0020   +0.0008
MBT      PSNR (dB)   +0.1603   +0.2537   +0.3038
MBT      SSIM        +0.0025   +0.0022   +0.0005
Table 3.1 shows an 8 × 8 quantization table generated by the proposed method for one image in the Kodak dataset. The proposed table adapts to the horizontal and vertical PCA kernels and performs better than the quantization matrix used in JPEG.
Considering that the proposed quantization matrix can be computed from the horizontal and vertical kernels generated in Section 3.2, there is no extra overhead in the quantization part.
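For concreteness, the sketch below assembles the quantization matrix from the kernels of Section 3.2 following Eqs. (3.6)–(3.16). It is a minimal sketch only: the function name hvs_quant_matrix, the argument defaults, and the use of f_peak for the 8 cycles/degree peak frequency are illustrative rather than the exact implementation.

```python
import numpy as np

def hvs_quant_matrix(X, Z, q_step, N=8, delta=0.25, dis=512.0, omega=0.7, f_peak=8.0, a=1.4):
    """HVS-based quantization matrix from the horizontal (X) and vertical (Z) kernels (sketch)."""
    def sign_change(K, col):
        # Eq. (3.11): 1 where adjacent entries in a kernel column change sign
        return (K[:-1, col] * K[1:, col] < 0).astype(float)

    R = np.array([a * np.sum(np.abs(np.diff(Z[:, u])) * sign_change(Z, u)) for u in range(N)])  # Eq. (3.9)
    C = np.array([a * np.sum(np.abs(np.diff(X[:, v])) * sign_change(X, v)) for v in range(N)])  # Eq. (3.10)
    fu = R / (delta * 2 * N)                                                                    # Eq. (3.7)
    fv = C / (delta * 2 * N)                                                                    # Eq. (3.8)
    fu2, fv2 = np.meshgrid(fu, fv, indexing='ij')
    f = np.pi / (180.0 * np.arcsin(1.0 / np.sqrt(1.0 + dis ** 2))) * np.sqrt(fu2 ** 2 + fv2 ** 2)  # Eq. (3.12)
    theta = np.arctan2(fu2, fv2)                                                                # Eq. (3.15)
    s = (1 - omega) / 2 * np.cos(4 * theta) + (1 + omega) / 2                                   # Eq. (3.14)
    f_hat = f / s                                                                               # Eq. (3.13)
    H = np.where(f_hat > f_peak,
                 2.2 * (0.192 + 0.144 * f_hat) * np.exp(-(0.114 * f_hat) ** 1.1),
                 1.0)                                                                           # Eq. (3.6)
    return np.round(q_step / H).astype(int)                                                     # Eq. (3.16)
```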
3.4 Machine Learning Inverse PCA Transform
To keep the learner's model size small, we use linear regression to find the optimal inverse kernel, which models the quantization error and yields better decoding. To match the two-stage PCA we
Figure 3.4: PSNR v. Bit rate curve comparison between JPEG and the results from our proposed method
on Kodak image dataset
use, we split the learning process into two parts to learn the vertical and horizontal optimal inverse transform kernels, which significantly reduces the number of weights and increases the speed compared with one-stage learning.
The learning process happens during encoding, when the raw image is available. The first stage is a regression between the de-quantized image blocks B^dQ and the coefficients after the vertical transform B^{vertical-T} (without quantization) to find the optimal horizontal inverse matrix. Both B^dQ and B^{vertical-T} contain N 8 × 8 image blocks. Then, for each pair B_i^{vertical-T} ∈ B^{vertical-T}, B_i^dQ ∈ B^dQ, we can form the following equation:

B_i^{vertical-T} = B_i^dQ · X*    (3.17)

By solving these N equations, we can find the optimal horizontal inverse kernel X*.
The second stage performs a linear regression between B_i^dQ · X*, B_i^dQ ∈ B^dQ, and the input image blocks B_i:

B_i = Z* · (B_i^dQ · X*)    (3.18)

The two optimal inverse transform kernels X* and Z* have shape 8 × 8. Each time we encode an image, we find the pair of corresponding optimal inverse kernels using data from the Y channel and save them in the bitstream
Table 3.3: BD-PSNR and BD-Rate between JPEG and our proposed framework using YCbCr and PQR as compressor input, computed from the data in Figure 3.4.
Color space   BD-PSNR (dB)   BD-Rate (%)
YCbCr         0.5013         -8.4795
PQR           0.5738         -9.5713
(a) Image from Kodak (b) Zoom-in (c) JPEG (d) Ours
Figure 3.5: JPEG vs. Ours @QF = 70.
as overhead. Considering the Cb and Cr channels, we keep the transform and optimal inverse transform
kernel the same as the Y channel, which helps reduce the overheads and complexities.
Besides, we also extend the optimal-inverse idea to the color space transform, since images are encoded in the PQR/YCbCr color space, while in most cases we view them in the RGB color space. After quantization, the distribution of the color values changes slightly. If we simply invert the forward PQR/YCbCr kernel, the result is acceptable but not optimal. Therefore, we compute a 3 × 3 optimal inverse color space transform kernel for each image and save it as overhead.
In total, we save the horizontal and vertical covariance matrices (which are symmetric), the horizontal (8 × 8) and vertical (8 × 8) optimal inverse (2D)^2 PCA transform kernels, and a 3 × 3 optimal inverse color transform kernel as overhead. Even though we use 16 bits to save each value, the overall bit rate increment on a Kodak image (512 × 768 × 3) is less than 0.007 bpp, which is acceptable.
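The two-stage regression of Eqs. (3.17)–(3.18) can be sketched with a least-squares solver; the function below is illustrative (its name, the fixed 8 × 8 block size, and the use of numpy's lstsq are assumptions, not the exact implementation).

```python
import numpy as np

def learn_inverse_kernels(deq_blocks, vt_blocks, orig_blocks):
    """Two-stage linear regression for the optimal inverse kernels (sketch of Eqs. 3.17-3.18).

    deq_blocks:  (N, 8, 8) de-quantized transform blocks B^dQ
    vt_blocks:   (N, 8, 8) coefficients after the vertical transform only (no quantization)
    orig_blocks: (N, 8, 8) original image blocks
    """
    # Stage 1: B_i^{vertical-T} = B_i^{dQ} X*  ->  stack block rows and solve in least squares.
    A = deq_blocks.reshape(-1, 8)
    b = vt_blocks.reshape(-1, 8)
    X_star, *_ = np.linalg.lstsq(A, b, rcond=None)          # 8 x 8 optimal horizontal inverse

    # Stage 2: B_i = Z* (B_i^{dQ} X*)  ->  transpose the relation and solve for Z*^T.
    M = deq_blocks @ X_star                                  # (N, 8, 8)
    A2 = M.transpose(0, 2, 1).reshape(-1, 8)
    b2 = orig_blocks.transpose(0, 2, 1).reshape(-1, 8)
    Zt, *_ = np.linalg.lstsq(A2, b2, rcond=None)
    Z_star = Zt.T                                            # 8 x 8 optimal vertical inverse
    return X_star, Z_star
```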
3.5 Experiment
In the experiment parts, we will use Kodak [66], Multiband Texture (MBT) [2] and DIV2K [6] datasets.
(a) Image from DIV2K (b) Zoom-in (c) JPEG (d) Ours
Figure 3.6: JPEG vs. Ours @QF = 70.
(a) Image from MBT (b) Zoom-in (c) JPEG (d) Ours
Figure 3.7: JPEG vs. Ours @QF = 70.
PSNR, SSIM [138], and the Bjontegaard metric [17] are used to evaluate the performance. We implement our ideas based on libjpeg [73] and use it as the baseline.
Color Space Influence To verify the importance of the PQR color space, we keep the new framework fixed and try YCbCr and PQR as inputs to the compressor. Table 3.3 shows that PQR outperforms YCbCr in both BD-PSNR and BD-Rate.
Optimal (2D)^2 PCA Inverse Transform To test the effectiveness of the machine-learned optimal kernel, especially the optimal inverse for (2D)^2 PCA, we show in Table 3.2 the average PSNR and SSIM gains offered by the proposed optimal inverse, trained with fixed QF values, over the standard inverse transform.
(a) Kodak PSNR versus QF (b) Kodak SSIM versus QF
(c) DIV2K PSNR versus QF (d) DIV2K SSIM versus QF
(e) MBT PSNR versus QF (f) MBT SSIM versus QF
Figure 3.8: Comparison of quality of decoded test images using the standard IDCT in JPEG and the proposed
optimal IDCT for the Kodak, DIV2K, and MBT datasets.
(a) PSNR versus QF (b) SSIM versus QF
Figure 3.9: Comparison of quality of decoded images using the standard IDCT and the proposed IDCT
trained by the Kodak dataset yet tested on the DIV2K dataset.
Mismatch of QF We examine the effect of QF mismatch on kernel learning; Figure 3.3 shows the L2 norm of the differences between kernels learned with different QFs. The figure shows that the kernel learned at a certain QF is close to those computed at neighboring QF values, except for very small QF values (say, less than 20). When QF is very small, the quantized coefficients contain many zeros, especially in the high-frequency region, which makes the linear regression poor. The inverse kernel derived from these quantized coefficients contains more zeros in each column, which makes it different from the others; it can still be used when re-processing/editing the images.
Texture Level’s Affection As shown in Table 3.2, MBT has a higher PSNR gain than Kodak and
DIV2K. It is well known that texture images contain high-frequency components. When QF is smaller,
these components are quantized to zero, making it challenging to learn a suitable kernel. Yet, when QF
is larger, high-frequency components are retained, and the learned kernel can compensate for quantization
errors better for a more considerable PSNR gain on the whole dataset.
Impact of Image Content. Another phenomenon of interest is the relationship between image content
and the performance of the proposed IDCT method. On the one hand, we would like to argue that the
learned IDCT kernel is generally applicable. It is not too sensitive to image content. To demonstrate this
point, we use ten images from the Kodak dataset to train the IDCT kernel with QF=70 and then apply it to
all photos in the DIV2K dataset. This kernel offers a PSNR gain of 0.24dB over the standard IDCT kernel.
On the other hand, it is still advantageous if the training and testing image content match well, as reflected by the larger PSNR gain of MBT over Kodak and DIV2K in Table 3.2 discussed above.
Overall Performance From Table 3.3 and Figure 3.4, we see that the proposed method outperforms JPEG by a large margin, which shows the power of PCA when it is used in both the color space conversion and the block transform, together with the optimal inverse kernel.
Zoomed-in regions of three representative images are shown in Figs. 3.5, 3.6, and 3.7 for visual inspection.
Our proposed method achieves better color reconstruction and better edge, smooth, and texture regions in these examples. After JPEG compression, the red turns grey, which strongly degrades the visual experience.
On the other hand, our method can preserve color much better than JPEG. Edge boundaries suffer less from
the Gibbs phenomenon due to the learned kernel. Similarly, texture regions are better preserved, and the
smooth areas close to edge boundaries are smoother using the learned kernel. All quantization artifacts
decrease significantly, and the overall encoding and decoding complexity remains reasonable.
Chapter 4
GIC-v1: Green Image Codec With Multi-Grid Representation
And Multi-Block-Size Vector Quantization
This chapter is organized as follows. The multi-grid block-size vector quantization (MGBVQ) method (also referred to as the first version of green image coding, GIC-v1) is introduced. It consists of two parts: the multi-grid representation (Section 4.1) and multi-block-size VQ (Section 4.2). Other tools, such as adaptive codebook selection and large-block-size wrapping, are introduced in Section 4.3 and Section 4.4. Experimental results at 256 × 256 and 512 × 512 resolution are shown in Section 4.5.
4.1 Multi-Grid (MG) Representation
The multi-grid representation / Laplacian pyramid [18] has been around for a long time; it performs well in various tasks such as image fusion [35, 130], recognition/classification [115], and image enhancement [33]. SIFT [30, 85, 142] uses the Laplacian pyramid to generate multiple representations and derive features that cover different scales.
As shown in [53], the multi-grid representation plays an important role in de-correlation, especially for long-range correlations (whose range is much larger than the coding block size). A similar multi-scale idea has been used in various deep learning tasks such as image processing [103], super-resolution [70], and semantic segmentation [42]. In traditional coding standards, images are split into blocks and de-correlated within the blocks (with reference pixels taken only from neighboring pixels). Even though they do a spectacular job of achieving excellent RD gain through the design of many inter/intra prediction modes, different spatial-to-
Figure 4.1: Forward framework (a) to get the multi-grid representation, where the Lanczos interpolation is
used to downsample an input image into different spatial resolutions. Backward framework (b) to encode
the multi-grid representation.
spectrum transforms, multiple transform block sizes, careful parameter fine-tuning, etc., they fail to further reduce the correlation among different blocks.
Starting from JPEG, predictive coding such as differential pulse code modulation (DPCM) [99] has been used to reduce the long-range correlations of DC components among blocks, and it is adopted in many later standards. In DPCM, the DC value is predicted from the previously decoded value, and only the difference between the predicted and true values is encoded. This greatly reduces the dynamic range of the DC coefficient. A secondary Hadamard transform [108] is introduced in H.264 [88] for the DCs after 16 × 16 intra-prediction to further de-correlate them. However, this method is only applied to the DC component;
the inter-block correlation among low-frequency ACs is not considered, and they are coded independently. Even though later intra-prediction helps to reduce these kinds of correlations, it is far from satisfactory.
In VVC, the idea of history-based motion vector prediction [150] can be explained as trying to enlarge the
motion vector search range to explore the long-range correlations among different frames. However, there still exists redundant information caused by the correlations among different blocks in intra frames.
To further remove long-range correlations, the multi-grid representation is the most feasible solution that does not increase the complexity of the codec dramatically. Long-range correlations can be captured at a small scale, medium-range correlations are handled at medium resolution, and short-range correlations are encoded at the input scale.
There exists some work, such as H.264/SVC (Scalable Video Coding) [114], which is built on a Laplacian pyramid to introduce progressive decoding into the bitstream by splitting it into several subset bitstreams. It encodes the lower-resolution, lower-frame-rate, or lower-quality content (base layer) and the corresponding residual (enhancement layer) separately. When coding the residual/enhancement layer, it treats it as another image and encodes it with the normal codec with minor modifications. This non-optimized design of enhancement-layer coding leads to a performance degradation compared with H.264/AVC [141].
The implicit multi-grid structure in deep-learning-based codecs also partially explains why they work so well: with a fully convolutional or recurrent structure and pooling/aggregation layers, the input image is transformed into a representation with a very small spatial resolution, which helps to further reduce the spatial redundancy. As Figure 4.3 shows, even though there is a 512-pixel gap between two pixels, the correlation remains large. This shows the importance of further exploiting long-range correlations. GIC-v1 [133, 135] is inspired by the multi-grid idea that decomposes images into a hierarchical representation. Our basic idea is sketched below. First, GIC down-samples an input image into several spatial resolutions (Figure 4.2) from fine-to-coarse grids and computes image residuals between two adjacent grids. Then, it encodes the coarsest content, interpolates content from coarse-to-fine grids, encodes residuals, and adds residuals to the interpolated images for reconstruction. All coding steps are implemented by vector quantization, while all interpolation steps are conducted by the
Figure 4.2: Multi-grid Representation. Images are split into a down-sampled image (bottom row) and the
residual during the down-sampling process (top row).
Lanczos interpolation [34]. This choice of interpolation, in contrast to traditional Laplacian/Gaussian pyramids [18], enables us to send more energy to the small grids. This is important to our coding framework because the same number of bits spent in a small grid provides more gain than spending them in larger grids. With more energy in smaller grids, we can achieve better coding performance.
With the help of the multi-grid representation, long-range correlations can be handled by small-resolution
representations, while short-range correlations are represented by large grids.
To encode/decode images of spatial resolution 2^N × 2^N, we build a sequence of grids, Gn, of spatial resolution 2^n × 2^n, where n = 1, ..., N, to represent them as shown in Figure 4.1b. To achieve this goal, we begin with an image, I_N, at the finest grid G_N. We decompose it into a smooth component and a residual component; the latter is the difference between the original signal and the smooth component. For convenience, the smooth and residual components are called the DC and AC components in the rest of this paper. Since the DC component is smooth, we downsample it using the Lanczos operator [72] and represent it in grid G_{N-1}. Mathematically, this can be written as

I_{N-1} = D(I_N),    (4.1)
Figure 4.3: Long-range correlation among the luma component: correlation among pixels at positions (i, j), (i + 512, j), (i, j + 512), and (i + 512, j + 512).
where D denotes the downsampling operation from G_N to G_{N-1}. Similarly, we can upsample image I_{N-1} to Ĩ_N via

Ĩ_N = U(I_{N-1}) = U[D(I_N)],    (4.2)

where the upsampling operation, U, is also implemented by the Lanczos operator. Then, the AC component can be computed by

AC_N = I_N − Ĩ_N = I_N − U[D(I_N)].    (4.3)
The above decomposition process can be repeated for IN−1, IN−2, up to I2.
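A minimal sketch of this decomposition/reconstruction is given below, using Pillow's Lanczos resampler as a stand-in for the Lanczos operator of [34, 72]; the function names are illustrative.

```python
import numpy as np
from PIL import Image

def lanczos_resize(img, size):
    """Resize a single-channel float image with Lanczos resampling (Pillow as a stand-in)."""
    return np.asarray(Image.fromarray(img.astype(np.float32), mode='F')
                      .resize(size, Image.LANCZOS), dtype=np.float32)

def multigrid_decompose(img, num_levels):
    """Sketch of Eqs. (4.1)-(4.3): returns the coarsest DC image and the AC residuals
    ordered from coarse to fine."""
    dcs, acs = [img.astype(np.float32)], []
    for _ in range(num_levels):
        fine = dcs[-1]
        h, w = fine.shape
        coarse = lanczos_resize(fine, (w // 2, h // 2))     # I_{n-1} = D(I_n)
        up = lanczos_resize(coarse, (w, h))                 # U[D(I_n)]
        acs.append(fine - up)                               # AC_n = I_n - U[D(I_n)]
        dcs.append(coarse)
    return dcs[-1], acs[::-1]

def multigrid_reconstruct(dc, acs):
    """Inverse: upsample and add the (decoded) AC residual at each grid, coarse to fine."""
    img = dc
    for ac in acs:
        h, w = ac.shape
        img = lanczos_resize(img, (w, h)) + ac
    return img
```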
It is worthwhile to comment that the majority of the existing image/video coding standards adopt the
single-grid representation. The difference between smooth and textured/edged regions is handled by block
partitioning with different block sizes. Yet, there are low-frequency components in textured/edged regions,
and partitioning these regions into smaller blocks has a negative impact on exploiting long-range pixel
correlations, which can be overcome by the MG representation.
Another advantage of the MG representation is its effectiveness in RD performance. Suppose that grid Gn pays the cost of one bit per pixel (bpp) to reduce a certain amount of mean-squared error (MSE) denoted by ΔMSE. The cost becomes 4^(n−N) bpp at grid G_N since the MSE reduction is shared by 4^(N−n) pixels. For example, with N = 8 and n = 6, one bpp spent at G6 costs only 4^(−2) = 1/16 bpp at the full-resolution grid G8.
(a) G6 RD curve (b) G8 RD curve
Figure 4.4: The RD curves for grids G6 and G8, which are benchmarked with one-stage, two-stage, and
three-stage VQ designs.
Table 4.1: Parameter setting for 256 × 256 images, where we specify the number of codewords and the
number of spectral components in codebook Cn,m by (#codeword, #components). For C8,3, we have N ∈
{8, 16, 32, 64, 128} which gives the 5 points in Figure 4.12a
G8 G7 G6 G5 G4 G3
C∗,8 (64,150) - - - - -
C∗,7 (128,150) (64,150) - - - -
C∗,6 (512,150) (128,150) (64,100) - - -
C∗,5 (512,50) (512,50) (128,40) (64,40) - -
C∗,4 (512,30) (512,30) (512,20) (128,20) (64,20) -
C∗,3 (N,-) (512,20) (512,12) (512,12) (32,12) (64,12)
C∗,2 - (128,-) (512,-) (512,-) (64,-) (32,-)
4.2 Multi-Block-Size Vector Quantization
The block diagram is shown in Figure 4.5. We use VQ to encode the AC components in grids Gn, n = 2, ..., N. Our proposed VQ scheme has several salient features compared with traditional VQ. Namely, multiple codebooks of different block sizes (or codeword sizes1) are trained at each grid. The codebooks are denoted by Cn,m, where the subscripts n and m indicate the index of grid Gn and the codeword dimension (2^m × 2^m), respectively. We have m ≤ n since the codeword dimension is upper-bounded by the grid dimension. We use |Cn,m| to denote the number of codewords in Cn,m.
At grid Gn, we begin with codebook Cn,n of the largest codeword size, where the codeword size is the
same as the grid size. As a result, the AC component can be coded by one codeword from the codebook
Cn,n. After that, we compute the residual between the original and the Cn,n-quantized AC images, partition
1We use codeword sizes and block sizes interchangeably below.
Figure 4.5: Framework of MBVQ. The input residual at grid Gn is coded by 2^n × 2^n VQ; its residual is passed to 2^{n−1} × 2^{n−1} VQ, and so on down to 2^k × 2^k VQ, whose residual is the final residual. The decoded outputs of all block sizes are summed to form the reconstruction.
it into four non-overlapping regions, and encode each region with a codeword from codebook Cn,n−1. This
process is repeated until the codeword size is sufficiently small with respect to the target grid.
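The cascade can be sketched as follows. This is a simplified spatial-domain illustration in which codebooks are assumed to be pre-trained arrays; the spatial-to-spectral transform, early termination, and entropy coding described later are omitted.

```python
import numpy as np

def blockify(img, bs):
    """Split an H x W array into non-overlapping bs x bs blocks, each flattened to a vector."""
    h, w = img.shape
    return img.reshape(h // bs, bs, w // bs, bs).swapaxes(1, 2).reshape(-1, bs * bs)

def unblockify(blocks, h, w, bs):
    return blocks.reshape(h // bs, w // bs, bs, bs).swapaxes(1, 2).reshape(h, w)

def mbvq_encode(ac, codebooks):
    """Multi-block-size VQ at one grid (sketch). `codebooks` maps block size (largest
    first, e.g. {32: ..., 16: ..., 8: ...}) to a (K, bs*bs) array of codewords."""
    residual = ac.astype(np.float32).copy()
    h, w = ac.shape
    indices = {}
    for bs, cb in codebooks.items():
        vecs = blockify(residual, bs)
        # nearest codeword per block (squared-error search)
        dist = ((vecs[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dist.argmin(axis=1)
        indices[bs] = idx
        quantized = unblockify(cb[idx], h, w, bs)
        residual -= quantized          # pass the residual to the next, smaller block size
    return indices, residual
```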
Before proceeding, it is essential to answer the following several questions:
• Q1: How to justify this design?
• Q2: How to train VQ codebooks of large codeword sizes? How to encode an input AC image with
trained VQ codebooks at Grid Gn?
• Q3: How to determine codebook size |Cn,m|?
• Q4: When to switch from codebook Cn,m to codebook Cn,m−1?
VQ learns the distribution of samples in a certain space, denoted by R^L, and finds a set of representative ones as codewords. The error between a sample and its representative has to be normalized by the dimension L. If there is an upper bound on the average error (i.e., the MSE), a larger L value is favored. On the other
Figure 4.6: Rate distortion curve for every single grid/hop on Y channel, red star means the switching point.
Hop0 represents Grid8, Hop1 represents Grid7.
hand, there is also a lower bound on the average error since the codebook size cannot be too large (otherwise the RD performance worsens). To keep good RD performance, we should switch from codebook Cn,m to codebook
Cn,m−1. Actually, Q1, Q3, and Q4 can be answered rigorously using the RD analysis. This will be presented
in an extended version of this paper. Some RD results of codebooks of different codeword sizes are shown
in the experimental section.
To answer Q2, we transform an image block of size 2^m × 2^m from the spatial domain to the spectral domain via the two-directional two-dimensional PCA described in Section 3.2. We split a large PCA into two parts to reduce the number of parameters needed, and all transform parameters are learned from the data. We train codebook Cn,n based on the clustering of K components in the spectral domain. After quantization, we can transform the quantized codeword back to the spatial domain via the inverse channel-wise Saab transform; this can be implemented as a look-up table since it has to be done only once. Then, we compute the AC residual by subtracting the quantized AC value from the original AC value.
Figure 4.7: Overall RD curve, summarizing the switching points in Fig. 4.6.
4.3 Adaptive Codebook Selection
We notice that for images with smooth content, it is costly and not useful to use a codebook with a large number of codewords, as in our previous version, while for other images, more codewords can lead to a huge boost in performance compared with small codebooks.
Based on this observation, we propose an adaptive codebook selection mechanism that selects the optimal codebook from several available codebooks for each image. Multiple codebooks with different numbers of codewords are trained for the blocks that satisfy the energy check in the previous section.
Let’s take the example of 4 codebooks C1, C2, C3, C4. The number of codewords |Ci
| follows:
|C1| < |C2| < |C3| < |C4| (4.4)
For the same level VQ from one single image, we can have 5 modes to encode these blocks: skip mode and
use one of the codebooks from {C1, C2, C3, C4}.
Figure 4.8: Performance comparison of adaptive selection against a fixed codebook using the training residual after Grid-8 16 × 16 blocks. Four codebooks with sizes {32, 64, 128, 256} are used. The green points are computed from the mean ΔPSNR and ΔMSE across different images when using the same threshold for the ΔPSNR/ΔMSE ratio. The blue points are the results when all images use the same codebook.
The decision of which mode to use is based on the ratio

ΔPSNR / ΔBPP    (4.5)

using Algorithm 1.
The main idea of this adaptive selection is to maximize the efficiency of every single bit so that our result has a better RD slope, as Figure 4.8 shows.
Algorithm 1 Codebook Selection
TH ← hyper-parameter for the threshold
Available codebook set C ← {C1, C2, C3, ..., Ck}
Y ← reference representation
iY ← decoded representation up to this VQ
prevPSNR ← calPSNR(Y, iY)
prevBPP ← 0
for codebook Ci in C do
    Encode the residual using Ci
    curBPP ← bitrate for entropy coding the labels
    iR ← decoded residual
    curPSNR ← calPSNR(Y, iY + iR)
    ΔPSNR ← curPSNR − prevPSNR
    ΔBPP ← curBPP − prevBPP
    if ΔPSNR / ΔBPP ≤ TH then
        if i == 1 then
            return skip mode
        else
            return encode using codebook Ci−1
        end if
    else
        prevPSNR ← curPSNR
        prevBPP ← curBPP
    end if
end for
return encode using codebook Ck
Figure 4.9: Framework of large block size wrapping.
4.4 Large Block Size Wrapping
When we perform VQ, even though we apply a spatial-to-spectrum transform and truncate to a subset of all the components before starting the VQ process, the matrix multiplication in the transform operation remains costly, especially for blocks of size 32 × 32 and larger.
To reduce the complexity, we wrap the input VQ blocks into a smaller resolution. For example, as Figure 4.9 shows, given a 32 × 32 block to be vector quantized, we first downsample it to a smaller resolution such as 16 × 16 before starting the spatial-to-spectrum transform. This greatly reduces the complexity while retaining the performance.
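A minimal sketch of the wrap/de-wrap operations is shown below; the text does not pin down the resampling filter used for wrapping, so Lanczos is assumed here for consistency with the rest of the pipeline, and the function names are illustrative.

```python
import numpy as np
from PIL import Image

def wrap_block(block, target=16):
    """Downsample a large VQ block (e.g. 32x32) to a smaller resolution before the
    spatial-to-spectrum transform and VQ (sketch)."""
    return np.asarray(Image.fromarray(block.astype(np.float32), mode='F')
                      .resize((target, target), Image.LANCZOS), dtype=np.float32)

def dewrap_block(block, original=32):
    """De-wrap (upsample) the decoded block back to the original VQ block size."""
    return np.asarray(Image.fromarray(block.astype(np.float32), mode='F')
                      .resize((original, original), Image.LANCZOS), dtype=np.float32)
```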
The major reason is that, during the wrapping process, what is lost is mainly high-frequency components that would be truncated after the spatial-to-spectrum transform anyway. In that case, we do not lose much of the low-frequency content we want to encode. At decoding time, after the wrapped blocks are decoded, a de-wrap operation (upsampling) restores the block to the original VQ block size. The smaller block size also helps reduce the model size we need to save. A simple Huffman coder is adopted as the entropy coder for the VQ indices: the codeword indices of each codebook Cn,m are coded using a Huffman table learned from training data. It is simple and low in complexity.
It is advantageous to adopt different codebooks in different spatial regions at finer grids. For example,
we can switch to codebooks of smaller block sizes in complex regions while staying with codebooks of larger
blocks in simple regions. This is a rate control problem. We implement rate control using a quad-tree.
If a region is smooth, there is no need to switch to codebooks of smaller code sizes. This is called early
termination. Early termination helps reduce the number of bits a lot in finer grids. For example, without
early termination, the coding of Lena of resolution 256×256 at G8 with C8,8, which has 64 codewords, needs
0.09bpp. Early termination with entropy coding can reduce it to 0.0246bpp.
In order to utilize the correlation between previously encoded content and the content to be encoded, we propose to use guidance content that is available at both the encoder and decoder sides to split the residual to be encoded into different sub-groups based on the existing correlations. Since all the sub-codebook indices are derived from this guidance, there is no additional signaling cost.
(a) H.264-444
21.38dB@0.2447bpp
(b) JPEG Progressive
22.21dB@0.2355bpp
(c) BPG-444
28.84dB@0.2520bpp
(d) Our Method
26.29dB@0.2452bpp
Figure 4.10: Evaluation results for Lena of resolution 256 × 256.
(a) H.264-444
25.79dB@0.1176bpp
(b) JPEG Progressive
21.29dB@0.1176bpp
(c) BPG-444
29.72dB@0.1191bpp
(d) Our Method
27.18dB@0.1194bpp
Figure 4.11: Evaluation results for Lena of resolution 512 × 512.
To further reduce the model size and speed up the training process, we propose to share the codebooks among different VQs. Only one set of codebooks is needed for each block size (one set means that, for adaptive codebook selection, we have several codebooks with different numbers of codewords for one block size). Besides, after performing the large-block-size wrapping, we only need to save one set of codebooks for these large blocks; otherwise, it would be costly to save the large-block-size codebooks.
4.5 Experiments
In the experiments, we use images from the CLIC training set [28] and Holopix50k [54] as training ones,
and use Lena and CLIC test images as test ones. Except for Lena, which is of resolution 512 × 512, we crop
images of resolution 1024×1024 with stride 512 from the original high-resolution images to serve as training
and test images. To verify the progressive characteristics, we down-sample images of resolution 1024 × 1024
to 512 × 512 and 256 × 256 using the Lanczos interpolation. There are 186 test images in total.
(a) Scale: 256 × 256 (b) Scale: 512 × 512
Figure 4.12: Evaluation results on the average of 186 test images.
Table 4.2: Bit rate distribution as a function of the block size at each grid for Lena (256 × 256,
26.19dB@0.2286bpp), where QT stands for the cost of saving the quad-tree at each grid. Only DC in
G2 is coded. Other numbers in the first column show the block size and ”-” means either no bit needed or
not applicable.
G8 G7 G6 G5 G4 G3 G2
QT 4.0e-3 4.8e-3 1.3e-3 3.2e-4 7.6e-5 1.5e-5 -
DC - - - - - - 2.4e-4
C∗,8 1.5e-5 - - - - - -
C∗,7 9.1e-5 1.5e-5 - - - - -
C∗,6 7.6e-4 1.5e-4 1.5e-5 - - - -
C∗,5 5.1e-3 1.2e-3 1.6e-5 7.6e-5 - - -
C∗,4 0.0213 7.2e-3 1.7e-3 3.1e-4 6.1e-5 - -
C∗,3 0.0246 0.029 8.2e-3 2.1e-3 3.2e-4 7.6e-5 -
C∗,2 - 0.073 0.0322 8.6e-3 1.5e-3 3.2e-4 -
We select 3400 images from the training set to train the VQ codebooks. They are trained with the faiss KMeans method [62], which provides a multi-threaded implementation. Figure 4.4 benchmarks the multiple codebooks Cn,m at grids Gn in terms of the number of codewords and the number of kept spectral components in the final VQ training stage, which leads to the parameter settings shown in Table 4.1. The RD curve for every single grid is shown in Figure 4.6. We select the optimal switching points to obtain the most convex RD curve. The overall RD curve is shown in Figure 4.7.
Figs. 4.10 and 4.11 compare decoded images from several standards and our MGBVQ method at the
same bit rate of resolutions 256 × 256 and 512 × 512, respectively. Figs. 4.12a and 4.12b compare the
averaged RD performance for 186 test images of resolutions 256 × 256 and 512 × 512, respectively. We see
from these figures that MGBVQ achieves remarkable performance under a very simple framework without
45
any post-processing. More textures are retained by our framework since traditional transform coding schemes tend to discard high frequencies to produce more zeros, especially when the bit rate is low, whereas in our framework, high frequencies are captured by larger-grid VQ and smaller-block VQ when necessary.
The distribution of bits spent in coding at each grid is summarized in Table 4.2. We see from the table
that the coding of DC and VQ indices in coarse grids accounts for a small percentage of the total bit rate.
The great majority of bits are spent in the finest grid and smallest block sizes, say, the coding of indices of
codebooks C8,3, C7,2, C7,3 and C6,2.
Chapter 5
GIC-v2: Green Image Codec With Channel-Wise Transform And
Dimension Progressive Vector Quantization
Figure 5.1: GIC-v2 framework
5.1 Introduction
After introducing GIC-v1 in Chapter 4, we found some drawbacks that prevent further improvement of the framework in both RD gain and complexity. (1) Large vector dimension, which makes it difficult to train an effective codebook and enlarges the model size. (2) No complete RD optimization in the framework. (3) Over-fitting: to boost the performance, we need to increase the codebook size dramatically, which leads to over-fitting.
Figure 5.2: Channel wise transform
To solve these problems, we propose an improved version: Saab transform, content-adaptive, dimension-progressive vector quantization (SaCaDip VQ). It uses the channel-wise Saab transform (Section 5.2) to reduce the transform complexity and the dimension of each vector, and dimension-progressive VQ (Section 5.3) to reduce the complexity of codebook training. Section 5.6 uses the color transform idea proposed in Section 3.1 to achieve better energy compaction. An RD optimization method based on the Lagrange multiplier is introduced in Section 5.8. In Section 5.9, experimental results of the improved framework are shown, and we extend the results to 1024 × 1024 resolution. The new design reduces the complexity by 10X while maintaining performance.
5.2 Channel-Wise Transform
In our previous work, we processed each VQ's input block in the spatial domain, which is easy to implement and straightforward to understand. It performs well when the VQ block size is small because of the low dimension of the codewords. However, when handling a large block size, for example, a 32 × 32 block (1024
Figure 5.3: RD curves for different block sizes. The largest block provides the best RD slope, while the smaller blocks can give more MSE reduction.
codeword dimensions), it is inefficient to train a VQ codebook because of the number of training samples and the required training time. On the other hand, these large block sizes help explore the long-range correlations and can provide much better RD gain than small blocks, as Figure 5.3 shows, so it is impossible to neglect them. However, large blocks cannot significantly reduce the MSE unless we train a large codebook. In that case, the best solution is to cascade the different block sizes, starting from the large blocks, so as to enjoy the best part of the RD curve of each block size. The channel-wise transform helps us partition the images further into more bands. Compared with the traditional multi-grid representation or sub-band VQ [58, 95], we obtain a much finer partition, which enables better control of each band. The residual propagation in the channel-wise framework lets us encode the low frequencies (the important components) multiple times, unlike sub-band VQ, which splits the components and handles each frequency band separately.
To speed up the codebook training for large blocks, we propose to perform a spatial-to-spectrum transform to reduce the dimensions. The covariance matrix is needed to compute the transform kernel, which costs a lot for a large block such as 32 × 32. Moreover, even though the transform helps to compact the energy, we cannot reduce the dimension much; otherwise, the codebook or the decoded image quality degrades dramatically. For example, to keep 90% of the total energy of a 16 × 16 block, we typically need to retain
Table 5.1: Energy compaction after performing the spatial to spectrum on G5 8 × 8 × 3 and 16 × 16 × 3
blocks using 512 codewords
8 × 8 × 3
#dim 10 20 30 70 140 192
Energy 0.5102 0.6896 0.7957 0.9518 0.9952 1.0
Bit/Index 8.292 8.254 8.231 8.231 8.255 8.229
MSE 456.61 445.45 444.82 444.05 443.63 442.82
Time (sec) 167.6 215.8 262.19 487.9 829.7 1081.8
16 × 16 × 3
#dim 10 30 70 140
Energy 0.2553 0.4742 0.6861 0.8463
Bit/Index 8.556 8.5728.373 8.469 8.373
MSE 579.1 565.92 565.40 564.17
Time (sec) 164.9 257.1 472.6 799.72
80 to 90 kernels out of 256, as Table 5.1 illustrates. We train the codebook using KMeans, a non-convex optimization that suffers greatly if the vector dimension is too high [67]. This brings another problem for codebook training in such high dimensions: due to the sparsity of high-dimensional space, the codewords tend to occur in sub-spaces rather than filling the entire space. Existing ideas such as entropy-weighted KMeans [61], which perform sub-space clustering in high-dimensional space, are not efficient enough for our codebook training.
Besides the above problems with the transform and dimension reduction, increasing the codebook size becomes a problem in our VQ, which conflicts with our lightweight purpose. A large codebook size can boost the performance, as Table 5.2 shows, but it has the following side effects:
• Long training time: a single codebook may take several days to train
• Data hungry: with more codewords, we need to add more and more training vectors
• Over-fitting: if the training is not sufficient, over-fitting will result in many redundant codewords that
will never be used during the test phase
• Large model size: the cost of saving large codebooks is huge, especially when the number of codewords
is large.
These drawbacks motivated us to modify our framework to reduce the training complexity and the model
size further.
Proposed in [21], the channel-wise Saab transform has shown its power in efficient energy compaction. Characterized by its light weight and low complexity, it has become a widely used solution for feature extraction in multiple green learning [69] tasks such as point cloud classification [151] and face recognition [112, 113]. We utilize this channel-wise transform in our green image coding for efficient de-correlation and spectrum representation.
The idea of the channel-wise transform is shown in Figure 5.2. It consists of a sequence of small data-driven transforms. In most cases, we perform a 2 × 2 transform on each channel if its energy exceeds some threshold (except for the first transform, where we perform a 2 × 2 × 3 transform to de-correlate the input channels). At each level, we split the coefficients into two parts: one retained at the current level and another sent to the deeper level. We regard the components retained at the current level (small energy) as the local correlation within the current receptive field.
With this layer-by-layer transform, we can decompose a large transform into several small transforms and limit the coefficient dimension at each layer. For example, to get a 32 × 32 transform, if we perform a single PCA by reshaping the 2D block into 1D, its kernel is a 1024 × 1024 matrix; if we perform 2D separable PCA, we perform a horizontal and a vertical transform separately, each with a 32 × 32 kernel. For our channel-wise transform, each kernel is only 2 × 2, and we typically need 5 to 10 of these transforms, depending on the threshold we set.
Besides, it is crucial to note that the channel-wise transform has built-in aggregation/multi-grid properties. Starting from the input image, the spatial resolution is reduced until it reaches 1 × 1 (the receptive field is the same as the input size). With its help, we can decouple the input image into several levels. The shallowest level has the smallest receptive field, which de-correlates the short-range correlations. On the other hand, the deepest level has the largest receptive field (the same as the input size), targeting the long-range correlations. This saves us the effort of computing a separate multi-grid representation.
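One stage of this transform can be sketched as below. This is a simplified PCA-based stand-in for the channel-wise Saab kernels: the joint 2 × 2 × 3 first-stage transform, the Saab bias/DC handling, and the per-level bookkeeping are omitted, and the threshold handling is illustrative.

```python
import numpy as np

def cw_stage(channels, energy_thresh=0.3):
    """One stage of a channel-wise 2x2 transform (PCA stand-in, sketch).

    channels: list of (H, W) arrays forwarded from the previous level.
    Returns coefficient maps retained at this level and maps forwarded to the next
    (deeper) level, each of spatial size H/2 x W/2."""
    retained, forwarded = [], []
    for ch in channels:
        h, w = ch.shape
        patches = ch.reshape(h // 2, 2, w // 2, 2).swapaxes(1, 2).reshape(-1, 4)
        mean = patches.mean(axis=0)
        cov = np.cov((patches - mean).T)
        eigval, eigvec = np.linalg.eigh(cov)
        kernels = eigvec[:, np.argsort(eigval)[::-1]]        # four data-driven 2x2 kernels
        coeffs = (patches - mean) @ kernels                   # (num_patches, 4)
        energy = (coeffs ** 2).mean(axis=0)
        frac = energy / energy.sum()
        for k in range(4):
            cmap = coeffs[:, k].reshape(h // 2, w // 2)
            # high-energy components are sent deeper; low-energy ones stay at this level
            (forwarded if frac[k] >= energy_thresh else retained).append(cmap)
    return retained, forwarded
```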
Compared with JPEG 2000, which also splits the image into different spectrum bands, our proposed channel-wise transform performs much better than the DWT for the following reasons. (1) Data-driven: our kernels are derived from the training data, which provides better energy compaction than the fixed DWT kernels. (2) More levels of transformation: the DWT performs a 2 × 2
Table 5.2: Relationship among codebook size, performance, model size, and training time on G3 4 × 4 blocks (16 dimensions, 3,517,440 vectors in total). MSE represents the VQ quantization error in G3, and Bit/Index represents the average cost of each VQ index after Huffman coding. #Par represents the number of parameters/coefficients.
#Codewords MSE Bit/Index #Par Training Time (sec)
512 613.08 9.864 32768 121.1
1024 546.86 9.841 65536 482.4
2048 490.91 10.830 131072 1879.5
4096 442.97 11.788 262144 6214.3
8192 403.06 12.825 524288 12397.9
16384 366.51 13.880 1048576 24786.1
transform but only two to three levels, so the low-frequency band still has a large spatial resolution and long-range correlations remain. In our case, we perform a series of transforms until the spatial resolution of the final level becomes 1 × 1, which compacts the correlations better.
VQ starts from the deepest level/largest receptive field to encode the coefficients, and the quantization residual is transformed back to the previous level (to avoid information leakage) along with the coefficients retained at that level to perform another VQ. In this way, the global content, which has a more significant impact on image quality, is refined multiple times.
5.3 Dimension Progressive Vector Quantization
After we derive the coefficients at each level from the channel-wise transform, we need to use VQ to encode them. Our previous framework used the most straightforward solution: combining all the vector dimensions and training a codebook with many codewords (e.g., 8000, 16000, or even 32000 codewords). As the number of codewords increases, the cost of training increases a lot, and the codebook is more likely to over-fit the training distribution and generate codewords that are never used in the testing stage, which degrades the performance on test sets. To solve this, we must either collect more training samples and use a machine with huge memory to train the codebook, or develop a distributed training algorithm. Besides, the training time of the VQ depends on the vector dimension, the number of codewords, and the number
training time of the VQ will depend on the vector dimension, the number of codewords, and the number
of training samples. We may need several days to get a trained codebook. Another critical weakness that
motivates us to change is the codebook/model size, which will be saved as overhead in both the encoder and
decoder. The large model size makes our idea less competitive and less green.
Table 5.3: The residual’s standard deviation after performing VQ (use all dimensions) with different numbers
of codewords
dim init 2 4 8 16 32 64
0 32.75 23.40 16.54 14.43 12.50 10.06 8.86
1 20.75 20.75 18.94 16.98 13.00 10.47 9.05
2 20.14 20.15 19.89 15.48 12.66 10.85 9.11
3 10.81 10.80 10.80 10.81 10.80 10.64 9.33
Traditional multi-stage VQ [38, 87] solves this problem by cascading several VQs, each with a small number of codewords. The solution is feasible, but it involves more parameters (the number of codewords in each VQ), and the performance degrades greatly as the number of stages increases. This brings another question, namely how many stages to use, and adds more burden to the RD optimization.
The above weaknesses motivate us to develop a new idea, called dimension-progressive VQ (DPVQ), applied after converting the block into the spectrum domain. The coefficients in different dimensions have different standard deviations (STD). The low-frequency components have more energy and dominate when performing VQ. If the number of codewords is small, KMeans automatically reduces the large-STD components before touching the other components, as Table 5.3 shows. The first dimension has the largest STD, so when only two codewords exist, the quantization residual is reduced only in the first dimension. Not until the first and second dimensions have similar STDs does the second dimension start to take effect. Even though the codewords are trained in 4 dimensions, some dimensions do not contribute much information: the last dimension's STD does not change until there are 32 codewords and the first three dimensions have similar STDs. From that point on, adding more codewords reduces the STD of all dimensions at the same speed. Compared with the multi-stage VQ mentioned above, our DPVQ makes it easier to select the parameters and further reduces the model size because we perform VQ on the entire vector only at the last stage.
Based on this observation, we can perform small-codebook VQ on the first several components to reduce their STD to a level similar to the later dimensions before adding the small-STD dimensions. This leads to the idea of dimension-progressive VQ. The following steps describe the idea of DPVQ, shown in Figure 5.4:
Step 1: Perform VQ on the first c1 dimensions of the input vector to reduce the residual's STD to a level similar to the later dimensions.
Figure 5.4: Dimension Progressive VQ
Step 2: Perform VQ on the first c2 dimensions. Notice that the first c1 dimensions are the quantization residuals from the previous step.
Step 3: Repeat Step 2 until all the dimensions are used.
For each intermediate VQ, our goal is to reduce each dimension's STD to a level similar to the later dimensions that have not yet undergone any VQ. This goal makes it easy to select the number of codewords. Typically, we can use a relatively small number of codewords n_0, n_1, ..., n_{i−1} to achieve the target STD for the prefix dimensions d_0, d_1, ..., d_{i−1}, and we may add more codewords n_i for the last VQ to capture more energy.
With these multiple VQs in dimension-progressive VQ, we can approach the performance of a single-codebook VQ with Π_{k=0}^{i} n_k codewords at roughly 10x less complexity in both time and model size.
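A minimal sketch of DPVQ training and encoding is given below, using scikit-learn's KMeans as a stand-in for the faiss implementation used in the dissertation; the prefix lengths and codeword counts in the docstring mirror the illustrative setting of Table 5.4.

```python
import numpy as np
from sklearn.cluster import KMeans

def dpvq_train(vectors, dims, codewords):
    """Dimension-progressive VQ training (sketch).

    vectors:   (N, D) spectral coefficients, ordered by decreasing energy.
    dims:      growing prefix lengths, e.g. (1, 2, 3, 4).
    codewords: codewords per stage, e.g. (2, 4, 4, 1024), cf. Table 5.4.
    Each stage quantizes only the first dims[s] dimensions of the running residual,
    so the large-STD components are tamed before the full-dimension VQ at the end."""
    residual = vectors.astype(np.float64).copy()
    stages = []
    for d, k in zip(dims, codewords):
        km = KMeans(n_clusters=k, n_init=4).fit(residual[:, :d])
        stages.append((d, km.cluster_centers_))
        residual[:, :d] -= km.cluster_centers_[km.labels_]   # propagate the residual
    return stages

def dpvq_encode(vec, stages):
    """Encode one vector: returns the per-stage codeword indices."""
    residual = vec.astype(np.float64).copy()
    indices = []
    for d, cb in stages:
        idx = int(np.argmin(((residual[:d] - cb) ** 2).sum(axis=1)))
        indices.append(idx)
        residual[:d] -= cb[idx]
    return indices
```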
5.4 Skip Scheme
It is inefficient to perform VQ on all the blocks in a large grid for small block sizes. Particularly for smooth blocks, adding more VQ stages does not provide good RD gain (the RD slope is very flat for these blocks). In that case, a skip scheme is essential to avoid these inefficient costs and keep us competitive in RD gain.
Figure 5.5: MSE saturation for different types of blocks. We split the input into four groups, and each curve represents VQ with a different number of codewords in one group. The MSE is computed on samples from all groups, but only the current group has distortion while the others have zero distortion.
The idea of skipping is simple. After encoding the blocks, we check each block's mean-square-error reduction. We confirm the current-level VQ for a block only if its MSE reduction is larger than some pre-defined threshold; otherwise, we skip the current VQ for that block. In most cases, a block with large energy has a higher probability of enjoying a large MSE reduction than a block with small energy.
Using this idea, we can easily guarantee a good RD gain, while the detailed threshold selection is determined later by the rate-distortion optimization.
5.5 Content Adaptive Codebook
When dealing with the codebook training, using the same codebook to quantize the coefficients from smooth,
texture, and edge regions is improper. Based on this observation, we propose to use the content adaptive
codebook to encode the different types of blocks.
Figure 5.6: Context model design for VQ skip flag coding. The flag at the gray position is previously
encoded.
The idea is first to classify the vectors into different groups; then, for each group, we train a codebook using only the assigned samples. In the current framework, we use KMeans to split the groups based on the energy of each vector dimension.
Considering the signaling, we signal the group index of the corresponding codebook and the VQ codeword indices. By classifying different content and designing the codewords correspondingly, we can reduce the number of codewords required for each codebook. We can use even fewer codewords for smooth regions to achieve better RD gain because the MSE reduction of these blocks saturates quickly, as Figure 5.5 shows. If we instead used a single codebook with many codewords, these blocks would be skipped by the skip scheme with high probability, preventing us from reaching high-bitrate results.
5.6 Color Channel Processing
Instead of grouping the three RGB channels to perform VQ as we previously did, which further increases the vector dimension, we use PCA to perform a color transform [1, 132] that de-correlates RGB into the P, Q, R channels and process them separately to reduce the complexity and increase flexibility.
After the PCA color transform, our primary focus is the P channel, where we spend over 70% of the bit rate budget since it contains the majority of the energy. The Q and R channels are encoded only at large-receptive-field VQs with a small number of codewords.
Figure 5.7: Percentage of non-zero skip flags vs. bits per image for one level. The bitrate can be linearly estimated from the percentage of non-zero flags.
5.7 Entropy Coding
The entropy coding includes two parts: VQ index coding and VQ skip flag coding.
The VQ skip flags are encoded by CABAC; we design the context model based on the parent flag (the skip flag in the previous VQ) and the neighboring skip flags. As Figure 5.6 shows, suppose we encode the flag at position X; we use the four neighbors A, B, C, D that have been previously encoded. If the parent skip flag is available (in our model, the parent map typically has a quarter of the size of the current skip-flag map), we also use the corresponding parent skip flag E. Each flag is binary, so we have 16 or 32 context models. Each model has a corresponding binary arithmetic coder to encode the assigned bins. This achieves very good performance because of the neighbor information used.
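A sketch of the context index derivation is shown below. The exact neighbor positions A–D in Figure 5.6 are not fully specified in the text, so left/top/top-left/top-right are assumed here, and the binary arithmetic coder itself is omitted.

```python
def skip_flag_context(flags, parent_flags, i, j):
    """Context index for the skip flag at (i, j) from four previously decoded neighbors
    and the co-located parent flag (sketch). `flags` and `parent_flags` are 2D 0/1 arrays;
    `parent_flags` may be None when no parent map exists."""
    def get(f, r, c):
        if f is None or r < 0 or c < 0 or r >= f.shape[0] or c >= f.shape[1]:
            return 0
        return int(f[r, c])

    a = get(flags, i, j - 1)       # left      (assumed position A)
    b = get(flags, i - 1, j)       # top       (assumed position B)
    c = get(flags, i - 1, j - 1)   # top-left  (assumed position C)
    d = get(flags, i - 1, j + 1)   # top-right (assumed position D)
    ctx = (a << 3) | (b << 2) | (c << 1) | d          # 16 contexts from the neighbors
    if parent_flags is not None:                      # quarter-resolution parent map
        e = get(parent_flags, i // 2, j // 2)         # parent flag E
        ctx |= e << 4                                 # 32 contexts when the parent exists
    return ctx
```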
VQ index coding, on the other hand, is slightly different from skip flag coding because VQ indices exist only at positions where the skip flag is False. The VQ indices are therefore very sparse in the spatial domain, making it difficult to find useful neighbors. To keep the design simple, we use Huffman coding with global statistics to encode them. The label histogram/Huffman table is derived from the training set and then applied to the test images.
Table 5.4: Standard deviation change of each dimension in the grid-3 (G3) P channel; the level-2 coefficients have a receptive field of 8 × 8 (testing set). VQ(k, d) represents a VQ with k codewords of dimension d.
Dimension DC AC1 AC2 AC3
initial 557.2 224.3 215.6 134.9
VQ(2,1) 349.4 - - -
VQ(4,2) 182.2 176.7 - -
VQ(4,3) 135.1 126.7 155.7 -
VQ(1024,4) 31.9 33.7 33.0 28.4
5.8 Rate Distortion Optimization
After introducing the skip scheme, we need to decide a suitable skip threshold for each level of VQ. If the threshold is too large, we skip too many blocks; this can give a reasonable slope, but we cannot reduce the total mean square error much. On the other hand, if we skip very few blocks, even though we can reduce the MSE a lot, the RD curve does not look good because many blocks provide only a tiny MSE reduction.
Our rate-distortion optimization strategy follows the mainstream RD approach of traditional codecs [119] and the basic loss function of DL-based compression:

J = D + λ · R    (5.1)

where J is the cost, D is the distortion (in our case, the MSE), λ is the Lagrange multiplier that controls the importance of R, and R is the bit rate. We compare different settings and select the one that gives the smallest cost.
For each threshold TH_i, we can derive the corresponding D_i and R_i; then, we can compute a cost J_i with the given λ. Among the different threshold options, we select the one that gives the smallest cost; that threshold is used for the actual encoding.
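This threshold search amounts to a simple Lagrangian comparison, as sketched below; the candidate (threshold, distortion, rate) triples are assumed to have been measured or estimated as described in this section, and the function name is illustrative.

```python
def select_skip_threshold(candidates, lam):
    """Pick the skip threshold minimizing J = D + lambda * R (Eq. 5.1, sketch).

    candidates: iterable of (threshold, distortion, rate) triples, where distortion is
    the MSE and rate is the estimated bpp obtained when encoding with that threshold."""
    best_th, best_cost = None, float('inf')
    for th, dist, rate in candidates:
        cost = dist + lam * rate          # J = D + lambda * R
        if cost < best_cost:
            best_th, best_cost = th, cost
    return best_th
```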
Running CABAC with multiple contexts to compress the VQ skip flags during the search is time-consuming and costly. To speed up the RD process, we estimate the bit rate of the skip flags from the percentage of zeros. As Figure 5.7 shows, the number of bits required is linear in the number of non-zero flags within the reasonable operating range.
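The sketch below illustrates this threshold search under the cost of Eq. (5.1); the per-block statistics, the candidate threshold list, and the linear skip-flag bit estimate are assumptions used only for illustration.

def select_skip_threshold(mse_reductions, bits_per_index, lam,
                          candidate_thresholds, flag_bits_per_coded_block=0.3):
    # Pick the skip threshold that minimizes J = D + lam * R.
    # mse_reductions            : per-block MSE reduction if the block is VQ-coded
    # bits_per_index            : average bits to signal one VQ index (e.g., Huffman)
    # flag_bits_per_coded_block : slope of the linear skip-flag bit estimate (Fig. 5.7)
    # The constant initial distortion is omitted, so D is replaced by the negative
    # total MSE reduction of the coded blocks.
    best_th, best_cost = None, float("inf")
    for th in candidate_thresholds:
        coded = [d for d in mse_reductions if d > th]
        rate = len(coded) * (bits_per_index + flag_bits_per_coded_block)
        cost = -sum(coded) + lam * rate
        if cost < best_cost:
            best_th, best_cost = th, cost
    return best_th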
Figure 5.8: Performance of our framework (PSNR vs. BPP) against BPG-444, JPG-444, and WebP. (a) 256 × 256. (b) 1024 × 1024.
Figure 5.9: Evaluation results on our testing set at 256 × 256 resolution. The first row shows BPG's results: (a) MSE=212 @ 117 bytes, (b) MSE=622.73 @ 132 bytes. The second row shows our results: (c) MSE=199.87 @ 119 bytes, (d) MSE=603.03 @ 143 bytes.
5.9 Experiment
In the experiments, we use images from the CLIC training set [28] and Holopix50k [54] as training ones and
use Lena and CLIC test images as test ones. Except for Lena, which is of resolution 512 × 512, we crop
images of resolution 1024×1024 with stride 512 from the original high-resolution images to serve as training
and test images. We down-sample images of resolution 1024 × 1024 to 512 × 512 and 256 × 256 using the
Lanczos interpolation to verify the progressive characteristics. There are 186 test images in total. WebP results are generated by libwebp-0.4.1-mac-10.8, JPEG results use libjpeg release 6b, and BPG image encoder version 0.9.8 is used for BPG compression.
Table 5.5 shows the number of kernels sent to the next level at different energy thresholds. We can control the number of kernels forwarded to the next level using an energy percentage, or directly specify the number
Table 5.5: Energy threshold (percent of total energy) vs. #components sent to the next level.
Level 30% 50% 60% 80% 98%
0 1 1 2 3 4
1 1 2 3 7 14
2 1 3 5 17 50
3 1 4 8 43 183
Table 5.6: Number of parameters (transform kernels and codebooks) and decoding complexity (the number of multiplications/additions required to decode one pixel without any optimization) of our proposed framework.
Model #Par Decoding Complexity
Ours 91472 294
of kernels to be sent to the next level. For both settings, dimensions with larger energy will have higher
priority when sent to the next level. We classify those large energy components as having larger long-range
correlations.
Table 5.4 shows the standard deviation (STD) reduction in each dimension on the testing set when using the proposed dimension-progressive VQ. With a series of small codebooks, we can reduce the STD to a level similar to that of a single large codebook. For the setting used in the table, we only need 4118 bytes to store the cluster centroids (assuming each value needs one byte), yet it has a similar effect to a codebook with 32768 codewords, which requires 131072 bytes of storage. This shows that the proposed scheme reduces the computational complexity and the model size while retaining the performance.
We select a suitable λ for the RD process through multiple experiments. During the RD process, the bits for the skip flags are estimated with linear functions to save time, while the bits for the VQ indices are obtained by actually encoding them, since Huffman coding is straightforward and fast.
For the VQ skip flag coding, the context model based on the decoded neighbors and parent provides about 6% additional compression compared with coding without a context model when around 20% of the flags are non-zero. This shows the power of our context model.
The VQ indices are encoded with Huffman coding, which gives only a 5%–10% compression gain over fixed-length coding, which is not good enough. Besides the limitations of Huffman coding itself, the nearly uniform label distribution after the skip scheme contributes to this poor performance: the most frequently used codewords correspond to smooth blocks, which are skipped most of the time.
Thanks to the framework, we can easily achieve variable bitrate/progressive decoding by truncating the bit stream to discard the VQs, starting from the large-grid, small-block-size ones. This addresses the problem that a single DL-based model cannot achieve multiple bitrates.
Figure 5.8 shows the performance plots of our codec against BPG-444 at 256 × 256 and 1024 × 1024 resolutions. The very-low-bit-rate part of BPG's results is obtained by downsampling the raw image to a small scale before encoding it with BPG. Figure 5.9 shows examples of our decoded images against BPG-444 at similar bitrates. Table 5.6 shows our overall model size and complexity; we achieve low complexity even without any optimization.
Chapter 6
GIC-v3: Green Image Codec With Rate Control
6.1 Introduction
Good rate control schemes are critical to image/video coding performance. Many rate control methods have
been developed in the last four decades and applied to various image/video coding standards, such as JPEG,
MPEG-2, H.264/AVC, H.265/HEVC, and H.266/VVC. GIC adopts a multi-grid image representation and
the VQ coding, which are entirely different from the single-grid image representation and SQ coding. Its
rate control is nontrivial. It cannot be easily generalized from rate control methods for the traditional image
coding standards, say, JPEG and H.264/AVC Intra. Furthermore, this important problem has not yet been thoroughly covered in earlier chapters. The lack of systematic RDO is also one reason why VQ-based coding is not yet comparable with SQ-based codecs. Traditional VQ-based image coding [43, 52, 63] seeks to design better codebooks or to increase the coding speed; no systematic RDO has been proposed for VQ that is comparable to the elaborate RDO for SQ in the current standards. However, to achieve comparable performance against SQ-based image coding, we need systematic RDO to help us select the best coding parameters.
In this chapter, we study the design of rate control schemes for the proposed GIC method from three
perspectives.
• Rate control tools for VQ
Rate control for the SQ-based image coding method is determined by the quality factor and the
quantization matrix. Different block contents will yield different rate-distortion (R-D) tradeoffs. These
tradeoff curves are typically modeled by mathematical R-D models. A Lagrangian multiplier method can maximize the coded image quality for a given bit budget. We need to develop new tools suitable for the VQ-based image coding method.
• Rate control for multiple grids jointly
We use the coding of residuals at each grid as a basic unit and allocate bits to all grids jointly. To implement this idea, we develop an empirical R-D model for each grid and exhaustively search for the set of optimal Lagrangian multipliers at a few selected bit rates. Then, for a given target bit rate, we find its two adjacent bit rates and use an interpolation scheme to obtain the corresponding set of optimal Lagrangian multipliers.
• Embedded rate control
It is difficult to offer insights into the joint rate control scheme. We adopt the embedded rate control
idea to make the rate control scheme more transparent. Consider the two-grid example - image coding
at the coarse grid and residual coding at the fine grid. We assign bits to the coarser grid alone until
its R-D slope is worse than the R-D slope of the fine grid. Then, we assign bits to the fine grid alone
afterward. This means that the R-D curve of the coarse grid becomes saturated after the switching
point, and we will not revisit the coding of the coarse grid. Although this assumption may not be true,
it greatly simplifies the rate control problem since we only need to focus on one grid at a time (i.e.,
rate control without intergrid interaction). Another advantage is that such a rate control scheme can
generate an embedded bit stream, which is a nice functionality.
The rest of this chapter is organized as follows. Section 6.2 discusses several rate control tools. The
joint rate control (JRC) scheme is presented in Section 6.3. The embedded rate control (ERC) scheme is investigated in Section 6.4. Experimental results of both JRC and ERC are shown in Section 6.5.
6.2 Rate Control Tools of Vector Quantization
For VQ, there are two ways to control the bit rate: one is to perform VQ on only part of the blocks; the other is to use codebooks with different numbers of codewords. We realize both approaches in our framework. The first method is realized by the skip mechanism, and the second by adaptive codebook selection.
6.2.1 Skip and Variable Bitrate
We cannot obtain a variable bitrate with the previous framework because the number of blocks and codewords is fixed. This setting is suboptimal, especially for smooth images, where good RD gain is hard to obtain. Even though more codewords partition the vector space more finely, the cost of each VQ index also increases, and many blocks spend a lot of bits without reducing the MSE much. The skip scheme is therefore essential to avoid these inefficient costs and keep us competitive.
The idea of skip is simple. After we encode the blocks, we check the MSE reduction for each block. We perform the VQ on a block only if its MSE reduction is larger than a pre-defined threshold; otherwise, we skip the current VQ for this block. In most cases, a block with large energy has a higher probability of enjoying a large MSE reduction than a block with small energy.
With this idea, we can guarantee a good RD gain, while the detailed skip threshold selection is determined by the rate-distortion optimization described later.
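A minimal sketch of this per-block skip decision is given below; the block layout, the codeword search, and the threshold value are assumptions used only for illustration.

import numpy as np

def vq_encode_with_skip(blocks, codebook, skip_threshold):
    # Encode residual blocks with VQ, skipping blocks whose MSE reduction is too small.
    # blocks   : (N, d) array of vectorized residual blocks
    # codebook : (K, d) array of codewords
    # Returns (skip_flags, indices); indices[i] is None when block i is skipped.
    skip_flags, indices = [], []
    for x in blocks:
        dists = np.sum((codebook - x) ** 2, axis=1)    # squared error per codeword
        k = int(np.argmin(dists))
        err_before = float(np.sum(x ** 2))             # error if the residual is left as-is
        err_after = float(dists[k])                    # error after replacing it by codeword k
        if err_before - err_after > skip_threshold:
            skip_flags.append(0)                       # 0 = coded
            indices.append(k)
        else:
            skip_flags.append(1)                       # 1 = skipped
            indices.append(None)
    return skip_flags, indices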
6.2.2 Adaptive Codebook Selection
The adaptive codebook selection provides different candidate codebooks to encode one image. We use one
codebook to encode one level of VQ for each image to reduce the cost of signaling the codebook index.
Fig.6.1 shows that, with adaptive codebook selection alone, we can achieve a better performance curve on
the testing set than when we use a fixed codebook. In this case, the smooth image can select a codebook
with a small number of codewords, and a complex image can select a codebook with many codewords to
achieve a similar slope controlled by λ.
Figure 6.1: Performance of adaptive codebook selection.
Then, the problem becomes how we can obtain these codebooks with different numbers of codewords. The simplest solution is to train multiple codebooks using the same training data; however, this is time-consuming and results in a huge model size. The traditional tree-structured VQ [140] also provides candidate codebooks with different numbers of codewords. In both cases, however, all codebooks cover the entire vector space; as the number of codewords decreases, they become coarser and coarser and are skipped in most cases. These rarely used codewords increase the entropy and reduce the RD gain.
To solve these problems, we propose the idea of codeword filtering. We first train a parent codebook with many codewords. Then, for each codeword, we compute a score that measures both its MSE reduction ability and its probability of being used. The MSE reduction criterion discards codewords close to the zero vector, because they cannot bring much RD gain and will mostly be skipped. The usage probability criterion filters out outlier/overfitting codewords: such codewords may give very good MSE reduction for a few training vectors, but they are very unlikely to be used in encoding.
We then select the codewords with the top K0, K1, K2, . . . highest scores to form the candidate codebooks. This removes codewords that cannot provide good RD gain and improves the codebook quality. The model size is also reduced: we only need to store the parent codebook and the sorting index to reconstruct all the candidates.
With candidate codebooks from codeword filtering, a small codebook does not cover the entire vector space; instead, it covers only a subspace with finer resolution. This helps to ensure that it still provides a sound MSE reduction.
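The sketch below illustrates one way codeword filtering could be realized; the scoring function (codeword energy times usage frequency) and the candidate sizes are assumptions, not the exact formulation used in GIC.

import numpy as np

def filter_codewords(parent_codebook, train_vectors, sizes=(64, 256, 1024)):
    # Build nested candidate codebooks by scoring and ranking parent codewords.
    # parent_codebook : (K, d) array trained with many codewords
    # train_vectors   : (N, d) array of training residual vectors
    # sizes           : candidate codebook sizes K0 < K1 < ...
    # The score combines codeword energy (penalizing near-zero codewords) and usage
    # frequency (penalizing rarely used, overfitted codewords).
    d2 = ((train_vectors[:, None, :] - parent_codebook[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)
    usage = np.bincount(assign, minlength=len(parent_codebook)) / len(train_vectors)
    energy = (parent_codebook ** 2).sum(axis=1)        # distance from the zero vector
    score = energy * usage                             # simple combined score (assumption)
    order = np.argsort(-score)                         # sorting index to be stored
    candidates = [order[:k] for k in sizes]            # nested lists of codeword indices
    return order, candidates

# Reconstruction at decoding time: codebook_k = parent_codebook[order[:k]]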
6.2.3 Global and Local Vector Quantization
Figure 6.2: Global and local VQ
Previously, we used the same codebook to handle the inputs (global VQ). However, this will mix the
texture and smooth blocks, resulting in sub-optimal performance because of the huge differences in RD
performance of these block types.
To further improve the RD performance, we propose combining global and local VQ (local content-adaptive VQ); the idea is shown in Fig. 6.2. The global VQ, which shares one codebook among all input blocks, encodes the large-receptive-field vectors (similar to the previous framework). The small-receptive-field VQs are encoded with different codebooks selected according to the block type. Using the same codebook for all of them would not be efficient, since smooth blocks have very different RD behavior from complex blocks; we need different RD models/codebooks for different block types. On the other hand, we must be careful about the signaling cost for classifying these block types. Considering the signaling cost of the group index, we cannot afford a group index for every small-receptive-field VQ (even with the skip scheme). We therefore apply the local VQ after the global VQ to trade off signaling cost against MSE reduction. After performing the global VQ, we may stop at a VQ with a receptive field of 16 × 16 or 32 × 32. Then we apply local VQs with receptive fields 8 × 8, 4 × 4, etc. We assign a group index to each local VQ based on the global VQ's receptive field; i.e., all local VQs within the receptive field of one global VQ share the same group index. For each group, we can find a suitable codebook (different codebooks may vary in the number of codewords) to encode it and achieve the best performance.
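For illustration, a minimal sketch of the group-index assignment is given below; the receptive-field sizes and the flattening convention are hypothetical.

def assign_group_index(local_row, local_col, local_size=8, global_size=32,
                       groups_per_row=None):
    # Map a local-VQ block to the group index of its enclosing global-VQ block.
    # local_row/local_col : position of the local block, in units of local_size pixels
    # local_size          : receptive field of the local VQ (e.g., 8)
    # global_size         : receptive field of the last global VQ (e.g., 32)
    # groups_per_row      : number of global blocks per row (to flatten to a 1D index)
    ratio = global_size // local_size                  # local blocks per global block side
    g_row, g_col = local_row // ratio, local_col // ratio
    if groups_per_row is None:
        return (g_row, g_col)                          # 2D group index
    return g_row * groups_per_row + g_col              # flattened group index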
One test result of this idea is shown in Fig. 6.3. The signaling cost increases in the early grids when both local and global VQ are used, making it inferior to the global-VQ-only setting there. As the grid becomes finer, however, the local adaptation improves the performance considerably.
Figure 6.3: Performance of global and local VQ
Figure 6.4: Context model design for VQ skip flag coding. The flags at the gray positions have been previously encoded.
Figure 6.5: Entropy with and without the context model for the labels to be encoded at different skip thresholds. (a) Receptive field 8 × 8 with 1024 codewords; (b) receptive field 4 × 4 with 1024 codewords.
6.2.4 Entropy Coding
The entropy coding consists of two parts: VQ indices coding and VQ skip flag coding.
The VQ skip flag is encoded using CABAC. We designed the context model based on the parent (skip
flag in the previous VQ) and the neighboring skip flags. As shown in Fig.6.4, when encoding the flag at
position X, we use the two neighbors {B, D} that have been previously encoded. If the parent skip flag
is available (typically a quarter size of the current skip flag in our model), we also use the corresponding
parent skip flag {E}. Each flag is binary, resulting in 4 (8 if the parent flag is available) context models.
Each model has a corresponding binary arithmetic coder to encode the assigned bits. This approach achieves
excellent performance due to the use of neighboring information.
We encoded the VQ indices (not labeled as skip) in the previous framework using Huffman coding. While
it is a simple solution and provides reasonable performance, it has limitations. Observing that the VQ skip
flag can be encoded effectively using context-based binary arithmetic coding (CABAC), we also decided to
apply a similar context model [143] to the VQ indices. However, three questions make this problem different
from the previous one.
Q1: The VQ indices will be missing in some places due to the skip scheme. How do we
handle the missing places?
We propose to use the neighbor's neighbor to build the context model when the immediate neighbor is missing, which increases the hit rate of a useful context (one in which no context position is missing). We check B1 (D1) when B (D) is unavailable; if B1 (D1) is also unavailable, we further check the B2 (D2) position. If none of these positions is available, we use an additional symbol to represent a skipped VQ index.
Q2: We cannot afford to build so many context models due to the large dynamic range of
VQ indices. How do we reduce the number of possible context models?
For VQ indices, we typically have more than 256 codewords. If we build the context model directly on two neighbors, it could result in (256 + 1)^2 possible context models (with one additional symbol for skipped indices), which is impractical due to the lack of training data and memory. To address this, we group the codewords and use group indices when building the context model. Similar codewords can be grouped together, efficiently reducing the number of context models. For example, grouping the codewords into 15 groups and adding one group to represent missing indices results in 16 categories, yielding (15 + 1)^2 = 256 possible context models.
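A minimal sketch of how the context index for a VQ index could be derived under these two answers is shown below; the exact neighbor offsets for B/B1/B2 and D/D1/D2 and the codeword-to-group mapping are assumptions for illustration.

def index_context(indices, groups, r, c, n_groups=15):
    # Context index for the VQ index at (r, c), built from the B and D neighbor groups.
    # indices : 2D list of decoded VQ indices; skipped positions hold None
    # groups  : mapping codeword index -> group id in [0, n_groups)
    # B is above and D is to the left (Fig. 6.4); B1/B2 (D1/D2) are the fallback
    # positions checked when B (D) was skipped (assumed offsets).
    missing_group = n_groups                           # extra group for "all skipped"

    def group_at(candidates):
        for (i, j) in candidates:
            if 0 <= i < len(indices) and 0 <= j < len(indices[0]):
                idx = indices[i][j]
                if idx is not None:                    # found a coded neighbor
                    return groups[idx]
        return missing_group

    g_b = group_at([(r - 1, c), (r - 2, c), (r - 3, c)])   # B, B1, B2
    g_d = group_at([(r, c - 1), (r, c - 2), (r, c - 3)])   # D, D1, D2
    return g_b * (n_groups + 1) + g_d                  # (15 + 1)^2 = 256 contexts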
Q3: VQ indices are not binary. We need to find a mapping function to use the binary
arithmetic coder. Which entropy coder/mapping function should we choose?
We need mapping functions to convert the VQ indices to binary strings to use BAC in each context
model. However, finding the optimal mapping function or combination of functions can be time-consuming.
Alternatively, we can use multi-symbol arithmetic coding to encode the VQ indices with the same context
model, but this would significantly increase the complexity. Based on these considerations, we decided to use Huffman coding to encode the indices within each context model; it is reasonably efficient and offers lower complexity.
After solving these three problems, we built the entropy coder for the VQ indices. Fig. 6.5 and Table 6.1 show the entropy coder's performance with and without the context model.
Table 6.1: Performance of context model based entropy coding on (win × win, n-codewords) VQ.
(4 × 4, 256) VQ (8 × 8, 1024) VQ
Skip Percentage (%) 80 38 20 84 39 23
Entropy 5.87 7.25 7.45 7.49 9.26 9.40
Entropy with context 5.23 6.75 6.85 6.96 8.79 8.81
Context Arithmetic 5.28 6.84 7.03 7.09 9.10 9.31
Context Huffman 5.34 6.98 7.05 7.14 9.13 9.32
6.3 Joint Rate Control over Multi-Grids
6.3.1 Problem Formulation
Our rate-distortion optimization strategy aligns with the mainstream approach in both traditional codecs' RD methodology [119] and the fundamental loss function of DL-based compression:
J(λ) = D_8(λ_3, · · · , λ_8) + Σ_k λ_k R_k(λ_3, · · · , λ_k),   (6.1)
where J is the RD cost, D_8 is the distortion (in our case, MSE), λ_k is the Lagrange multiplier for grid k, and R_k is the bit rate of grid k. Each grid's distortion and bitrate are determined by the λ values of the current and previous grids.
For each candidate threshold TH_i and candidate codebook, we derive the corresponding D_i and R_i, and compute a cost J_i with the given λ. Among the different options, we select the one that gives the smallest cost for the actual encoding.
Using CABAC with multiple contexts to compress the VQ skip flags is time-consuming. To expedite the RD process, we estimate the bit rate of the skip flags from the percentage of zeros; the number of bits required is linear in the number of non-zero flags within a reasonable range. Similarly, we use the entropy of the labels to be encoded to estimate the length of the bitstream produced by context-based Huffman coding for the VQ indices.
Since the cost J is convex with respect to the skip threshold, we can terminate the search early once the minimum cost (local minimum) has been found.
In summary, our overall RD optimization selects the codebook and the corresponding skip threshold that give the smallest cost.
6.3.2 Search of Lagrangian Multipliers
With the rate-distortion optimization (RDO) methods mentioned above, we can control the performance of
each grid with a single parameter, λ. Although our framework inherently supports progressive encoding and
decoding, utilizing the progressive mode to generate the overall performance is not optimal. This is because
we can always find other λ settings that may outperform the partial bitstream. This raises the problem of
achieving all points on the convex curve when using different λ settings.
To address this issue, we first run a full search on the validation dataset. From the full-search results, we obtain a convex curve called the standard RD curve. Each point on the convex curve is associated with one λ per grid; we represent these λ values as a vector, v_λ = [λ_3, λ_4, ..., λ_8]. We then build a lookup table mapping bitrate to the corresponding λ vector. To reach an intermediate bitrate point, we interpolate between neighboring entries to create a new λ vector, v' = interpolate(v_1, bpp_1, v_2, bpp_2); the interpolation can be any method. Leveraging λ-vector interpolation eliminates the need to store numerous parameters in the lookup table while still achieving a continuous bitrate.
Fig. 6.6 illustrates the idea, where the ’*’ points contribute to the convex curve, and the corresponding
lambda settings are labeled next to them. The points are sparse, prompting lambda vector interpolation to
fill the gaps. Simple linear interpolation is employed to generate the new lambda vector. The figure shows
that the resulting point slightly outperforms our convex curve (a result of the sparsity of the convex curve),
indicating that this lambda vector interpolation works well in finding parameters.
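A minimal sketch of the λ-vector interpolation from a bitrate-indexed lookup table is given below; the simple linear interpolation follows the description above, while the data structure and example entries are assumptions.

import numpy as np

def interpolate_lambda_vector(lut, target_bpp):
    # Linear λ-vector interpolation from a bitrate-indexed lookup table.
    # lut        : list of (bpp, lambda_vector) pairs on the standard RD curve,
    #              sorted by bpp; each lambda_vector is [λ3, λ4, ..., λ8]
    # target_bpp : desired intermediate bitrate
    bpps = [b for b, _ in lut]
    if target_bpp <= bpps[0]:
        return np.asarray(lut[0][1], dtype=float)
    if target_bpp >= bpps[-1]:
        return np.asarray(lut[-1][1], dtype=float)
    hi = next(i for i, b in enumerate(bpps) if b >= target_bpp)   # adjacent entries
    (b1, v1), (b2, v2) = lut[hi - 1], lut[hi]
    t = (target_bpp - b1) / (b2 - b1)
    return (1.0 - t) * np.asarray(v1, dtype=float) + t * np.asarray(v2, dtype=float)

# Hypothetical usage:
# lut = [(0.02, [1e4, 3e4, 7e4, 7e4]), (0.05, [1e4, 3e4, 2e4, 1e4])]
# v_new = interpolate_lambda_vector(lut, 0.03)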
Figure 6.6: Lambda vector interpolation based on the lambda vectors labeled next to the star points on the convex (PSNR vs. BPP) curve. Three interpolated lambda vectors are shown in the plot; these interpolated vectors provide results on the convex curve.
6.3.3 Determination of Target Bit Rate
The next problem is how to achieve the target bitrate on an arbitrary input. The problem can be summarized as

λ* = [λ_3, · · · , λ_8] = arg min D_8(λ_3, · · · , λ_8),   (6.2)

subject to

Σ_k R_k(λ_3, · · · , λ_k) ≤ R_T.   (6.3)
We proposed two approaches to address this question. The first one is a two-pass approach based on the
standard RD convex curve derived from the full search.
We first need to derive a standard performance convex curve, which is obtained through a full search on the validation set. The distortion D can be estimated from R and two parameters α_s and β_s through the same formula used for RD estimation in traditional codecs:

D_s = α_s R_s^{β_s}.   (6.4)
When given an input image, we will have a different convex curve, but we cannot afford to run another full search to obtain it (or the λ vectors in the lookup table). However, we can estimate its RD behavior using the same formula with different parameters, which can be found through two test encodings:

D_a = α_a R_T^{β_a}.   (6.5)
Taking the partial derivative of both cost functions with respect to R, we have

∂J_s/∂R_s = ∂D_s/∂R_s + λ_s = α_s β_s R_s^{β_s−1} + λ_s = 0,   (6.6)

∂J_a/∂R_T = ∂D_a/∂R_T + λ_a = α_a β_a R_T^{β_a−1} + λ_a = 0.   (6.7)
This yields

λ_s = −α_s β_s R_s^{β_s−1},   (6.8)

λ_a = −α_a β_a R_T^{β_a−1}.   (6.9)
To find the mapping from the standard convex curve to the convex curve from the input, or in other words,
to determine what bitrate the same lambda vector in the lookup table can achieve on the new input, we can
set λ_s = λ_a. Thus,

−α_s β_s R_s^{β_s−1} = −α_a β_a R_T^{β_a−1}.   (6.10)
Then, we can get

R_s = (α_a β_a / (α_s β_s)) · R_T^{(β_a−1)/(β_s−1)}.   (6.11)
Consequently, we can map the target bitrate R_T of the input image to the standard convex curve. Given the target bitrate R_T on the input image, we first convert it to the bitrate R_s on the standard performance curve; the corresponding λ vector/λ setting can then be found from the lookup table using R_s.
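A minimal sketch of this two-pass mapping is shown below, assuming the power-law models of Eqs. (6.4)-(6.5) are fitted with scipy.optimize.curve_fit (an assumption for illustration); the sample (R, D) points are hypothetical.

import numpy as np
from scipy.optimize import curve_fit

def power_law(r, alpha, beta):
    # D = alpha * R^beta, the model of Eqs. (6.4) and (6.5).
    return alpha * np.power(r, beta)

def map_target_bitrate(std_rates, std_dists, test_rates, test_dists, target_bpp):
    # Map the input image's target bitrate R_T onto the standard convex curve (Eq. 6.11).
    # std_rates/std_dists   : (R, D) samples of the standard curve (validation set)
    # test_rates/test_dists : (R, D) samples from two test encodings of the input image
    # Returns R_s, used to look up the λ vector in the table.
    (alpha_s, beta_s), _ = curve_fit(power_law, std_rates, std_dists, p0=(1.0, -0.5))
    (alpha_a, beta_a), _ = curve_fit(power_law, test_rates, test_dists, p0=(1.0, -0.5))
    coeff = (alpha_a * beta_a) / (alpha_s * beta_s)
    return coeff * target_bpp ** ((beta_a - 1.0) / (beta_s - 1.0))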
6.4 Embedded Rate Control
6.4.1 Inter-Grid Embedded Rate Control
We further extend the model to a mathematical form to enhance our understanding of the parameter settings
among different grids. To simplify, we consider only the relationship between two grids. Suppose the task
is to decide the parameters in Grid-7 and Grid-8 while all previous parameters are pre-determined. In this
case, we need to model the relationship between rate (R), distortion (D), and skip threshold (T H). Inspired
by the RDO for H.264/SVC [23, 57, 82, 83, 120], we extend their RDO method to our VQ based framework.
The R − D relationship within one grid differs from the convex curve we derived. The form αRβ does
not fit the curve well. Instead, we found that the expression
α
x + β
+ γ
provides a better prediction for the R − D relationship as Fig.6.7 shown. The same equation can be applied
to the T H − R relationship.
Then, we can derive the R-D and TH-R relationships in Grid-7. That leads to

D_7 = f_{R7-D7}(R_7) = α_7 / (R_7 + β_7) + γ_7,   (6.12)

R_7 = f_{TH7-R7}(TH_7) = a_7 / (TH_7 + b_7) + c_7.   (6.13)
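For illustration, the a / (x + b) + c model of Eqs. (6.12)-(6.13) can be fitted to measured (R, D) and (TH, R) samples as sketched below; the sample values and the use of scipy.optimize.curve_fit are assumptions.

import numpy as np
from scipy.optimize import curve_fit

def hyperbolic(x, a, b, c):
    # The a / (x + b) + c form shared by the R-D and TH-R models.
    return a / (x + b) + c

# Hypothetical Grid-7 measurements: (bpp, MSE) pairs and (threshold, bpp) pairs.
r7 = np.array([0.01, 0.03, 0.06, 0.10, 0.20, 0.40])
d7 = np.array([220.0, 180.0, 150.0, 130.0, 105.0, 85.0])
th7 = np.array([100.0, 300.0, 600.0, 1000.0, 2000.0, 3000.0])
r_of_th7 = np.array([0.10, 0.06, 0.04, 0.03, 0.015, 0.01])

rd_params, _ = curve_fit(hyperbolic, r7, d7, p0=(10.0, 0.1, 50.0))
tr_params, _ = curve_fit(hyperbolic, th7, r_of_th7, p0=(50.0, 400.0, 0.005))

print(hyperbolic(0.08, *rd_params))    # predicted MSE at 0.08 bpp
print(hyperbolic(800.0, *tr_params))   # predicted bpp at skip threshold 800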
Figure 6.7: Curve fitting results for the R-D (MSE vs. BPP) and TH-R (BPP vs. TH) relationships of Grid-7 and Grid-8, using the form a/(x + b) + c.
Figure 6.8: Grid-8's R-D curve (MSE vs. BPP) when Grid-7 has different parameters. The starting point of each curve has been shifted to (0, 0).
Grid-7’s performance will decide Grid-8’s RD curve’s starting point. Another crucial assumption is that
the R− D curve in Grid-8 will not change its shape a lot, given different parameter settings in Grid-7, which
can be proved by Fig.6.8. Regarding R8, which represents the bits spent in Grid-8, we can simplify it as
independent of Grid-7 and only related to the parameter settings in Grid-8.
These assumptions allow us to fit Grid-8's R-D curve for any starting point/Grid-7 parameters. Suppose the starting point of Grid-8's R-D curve is (p, q). We can shift the curve left by p and down by q so that it starts at (0, 0):

D_8^{ori} = α_8 / (R_8 + p + β_8) − q + γ_8.   (6.14)
Then, any D_8 can be written as a function of R_8 and R_7:

D_8 = f_{R8-D8}(R_7, R_8) = α_8 / (R_8 − R_7 + p + β_8) + D_7 − q + γ_8,   (6.15)

R_8 = f_{TH8-R8}(TH_8) = a_8 / (TH_8 + b_8) + c_8.   (6.16)
Table 6.2: Standard deviation change in each dimension of the grid-3 (G3) P channel; the level-2 coefficients have a receptive field of 8 × 8 (testing set). VQ(k,d) denotes a VQ with k codewords of dimension d.
Dimension DC AC1 AC2 AC3
initial 557.2 224.3 215.6 134.9
VQ(2,1) 349.4 - - -
VQ(4,2) 182.2 176.7 - -
VQ(4,3) 135.1 126.7 155.7 -
VQ(1024,4) 31.9 33.7 33.0 28.4
We have two Lagrange multipliers in this case; we need to split R_T into R_{T,7} and R_{T,8}, with

R_T = R_{T,7} + R_{T,8}.

The cost function becomes

J = D_8 + λ_7 (R_7 − R_{T,7}) + λ_8 (R_8 − R_{T,8}).   (6.17)
To calculate the values of the λ's and TH's, we need to solve the following equations:

∂J/∂TH_7 = ∂D_8/∂TH_7 + λ_7 · ∂R_7/∂TH_7 = 0,   (6.18)

∂J/∂TH_8 = ∂D_8/∂TH_8 + λ_8 · ∂R_8/∂TH_8 = 0,   (6.19)

∂J/∂λ_7 = R_7 − R_{T,7} = 0,   (6.20)

∂J/∂λ_8 = R_8 − R_{T,8} = 0.   (6.21)
The remaining problem is how to split R_T into R_{T,7} and R_{T,8}. One simple solution is to split the bitrate budget according to the energy percentage of the resolution error in each grid. With this relationship, we can solve these equations and obtain the answers.
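A minimal sketch of this embedded allocation is given below: the budget is split by energy share, and each grid's fitted TH-R model is inverted to find the skip threshold that meets its share (i.e., the constraints of Eqs. (6.20)-(6.21)). The fitted parameters and energies are hypothetical.

from scipy.optimize import brentq

def split_budget(r_target, grid7_energy, grid8_energy):
    # Split the total bit budget between Grid-7 and Grid-8 by residual energy share.
    w7 = grid7_energy / (grid7_energy + grid8_energy)
    return r_target * w7, r_target * (1.0 - w7)

def threshold_for_rate(a, b, c, r_budget, th_lo=1.0, th_hi=5000.0):
    # Invert the fitted TH-R model R(TH) = a / (TH + b) + c so that R equals the budget,
    # enforcing R_7 = R_{T,7} and R_8 = R_{T,8}.
    return brentq(lambda th: a / (th + b) + c - r_budget, th_lo, th_hi)

# Hypothetical fitted TH-R parameters and grid energies.
r7_budget, r8_budget = split_budget(0.10, grid7_energy=4.0e5, grid8_energy=6.0e5)
th7 = threshold_for_rate(a=50.0, b=400.0, c=0.005, r_budget=r7_budget)
th8 = threshold_for_rate(a=80.0, b=300.0, c=0.004, r_budget=r8_budget)
print(th7, th8)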
6.5 Experiments
In the experiments, we use images from the CLIC training set [28] and Holopix50k [54] as the training dataset, and the Kodak dataset for testing.
Table 6.3: Energy threshold (percent of total energy) vs. #components sent to the next level for a 16 × 16 block.
Level 30% 50% 60% 80% 98%
0 1 1 2 3 4
1 1 2 3 7 14
2 1 3 5 17 50
3 1 4 8 43 183
We first partition the training and testing images into 256 × 256 blocks. In the encoding preparation step, we repeatedly apply 2 × 2 downsampling to the blocks down to Grid-3 with 8 × 8 resolution. During the encoding stage, a 2 × 2 channel-wise transform is applied. The first VQ in Grid-3 is applied to the spectral coefficients of the 8 × 8 images; all later VQs are performed on spectral residuals.
Table 6.3 shows the number of kernels sent to the next level at different energy thresholds. We can control the number of kernels forwarded to the next level using an energy percentage, or directly specify the number of kernels to be sent. For both settings, dimensions with larger energy have higher priority. We send 1 to 3 channels to the next level each time.
Table 6.2 shows the STD reduction in each dimension on the testing set when using the proposed dimension-progressive VQ. With a series of small codebooks, we can reduce the STD to a level similar to that of a single large codebook. For the setting used in the table, we only need 4118 bytes to store the cluster centroids (assuming each value needs one byte), yet it has a similar effect to a codebook with 32768 codewords, which requires 131072 bytes of storage. This shows that the scheme successfully reduces the computational complexity and the model size.
To obtain the standard RD curve and λ-vector settings, we run a grid search on the testing set. The final overall performance curve is generated by computing the convex hull of all the performance points, and we record the corresponding λ vectors on the convex curve to form the λ-vector lookup table. During the RD process, the bits for the skip flags are estimated with linear functions to save time, while the bits for the VQ indices are obtained by actually encoding them, since Huffman coding is straightforward and fast. This speeds up the RD search considerably.
For the entropy coder, the context model for the VQ skip flags based on the decoded neighbors and parent provides about 6% additional compression compared with coding without a context model when around 20% of the flags are non-zero. The VQ indices are encoded with Huffman coding, which gives a 5%–10% compression gain.
Table 6.4: Number of parameters (transform kernels and codebooks) and decoding complexity (the number of
multiplication/addition required to decode one pixel without any optimization) of our proposed framework.
Model KFlops #Par (byte)
Ours 0.3 141174
H.264 0.344 -
BPG - 377858
VAE [14] 124.23 1050883
Besides, our framework can easily achieve variable bitrate/progressive decoding by truncating the bit stream to discard the VQs, starting from the large-grid, small-block-size ones.
Fig. 6.9 shows the performance plots of our codec against BPG, H.264, WebP, JPEG, and several end-to-end learned codecs.
Figure 6.9: Performance (PSNR vs. BPP) of our GIC-v3 on the Kodak dataset, compared with BPG, H.264, WebP, progressive JPEG, Ballé 2017, and Theis 2017.
Table 6.4 shows our overall model size and complexity.
Chapter 7
Conclusion and Future Work
7.1 Conclusion
Besides the introduction and background, this dissertation consists of four main chapters. First, a data-driven color/spatial transform that achieves better energy compaction and models the quantization error in the inverse transform kernel was studied in Chapter 3. GIC-v1, which verifies the feasibility of VQ-based
coding, was examined in Chapter 4. Then, GIC-v2, which reduces the training and testing complexity by 10X
while maintaining the coding performance, was detailed in Chapter 5. Detailed rate distortion optimizations
are added to form the GIC-v3 in Chapter 6. Based on the preliminary experimental results, we see the
potential of the proposed GIC method. The green image codec (GIC) is still under development. A few
building components need to be further optimized. They will be mentioned in Sec. 7.2. Furthermore, the
generalization of GIC to green video coding (GVC) will be elaborated in Sec. 7.3.
7.2 Further Improvement On Green Image Codec (GIC)
Channel-wise Transform
The transform kernel is learned from all the training images during the training phase in our current
pipeline. Even though it is optimized for the whole training set, it may be suboptimal for some images/content. To improve the energy compaction, we can introduce a content-dependent channel-wise transform: a series of transform kernels is made available, and for each content we select the most suitable kernels to transform the images.
The integer transform and integer codebook should be used as well. Currently, we perform all the
operations in float, which is both time and memory costly. We can convert these float kernels to integer
kernels and use them in the integer domain to further reduce the coding complexity and the model size while
the performance would not be affected much.
To further reduce the kernel size and complexity, the 2D separable transform mentioned in Chapter 3.2
should be used by splitting the transform into vertical and horizontal transforms.
We currently use the same transform kernel to perform forward and inverse transforms on decoded images.
However, as [134] notices, the distribution of the decoded coefficients will alter after quantization. The
original kernel may not be optimal for decoded coefficients. We can add some compensation when performing
the inverse transform. The idea of using some linear regression to find an optimal inverse transform kernel
can be applied to this scenario. The kernel is computed in the training phase so that it does not increase the encoding/decoding complexity or the model size.
Besides, we can combine the color transform with the channel-wise transform for further improvement.
To achieve the goal, we must perform a 2 × 2 × 3 transform in the first stage to compact the energy. Then,
all latter levels will remain the same as the previous proposal. For the first stage, we can transform different
channels to compact the energy of color channels.
Rate-Distortion Optimization
In our current model, we pre-train a fixed-size codebook for each level of VQ, and the RDO is only performed over different skip thresholds. However, we can re-organize the VQ codebook into a tree structure [140] to obtain a series of codebooks with varying numbers of codewords (one level of tree nodes can be regarded as a codebook). We can then perform RDO over both the VQ codebook size and the skip threshold. This can provide a better RD gain, because smooth blocks are skipped in the current framework; to reach a high bit rate, however, these smooth blocks must also be quantized. For smooth content, the MSE reduction saturates at a very early point, which means that a small codebook can provide a better RD gain than a large one.
Besides, to speed up the RD search process, it is essential to incorporate more RD estimation methods to narrow the search space and to perform accurate RD cost computation only on the most promising candidates, thereby reducing the encoding time.
Bit rate allocation among the three color channels should also be improved. We can assign bits based on the energy of each channel to further enhance the performance and avoid giving too many bits to less important channels.
We currently use MSE/PSNR as the quality measure. We could change it to a metric closer to the human visual system and bring an attention module into our codec.
Entropy Coding
In our current framework, Huffman codes encode the VQ indices at each level. However, there are better solutions, since Huffman coding is based on global statistics and block coding. This means that the code table may not be optimal for the input image; we can still observe many consecutive ones/zeros in the bitstream.
The correlations among VQ indices are not considered in the current framework, even though these indices are sparse metadata in the spatial domain. For example, if the neighbors are all frequently used codewords, the current VQ index is likely to be a frequently used codeword as well. Using arithmetic coding with a carefully designed context, we can utilize the neighbors or even the parent-level information to encode the VQ indices better.
A better context model design for the VQ skip flag can increase the compression ratio further; in particular, a better way to handle the boundary cases would improve the performance. One ongoing experiment shows that if we split the boundary cases and build more context models, we can reduce the bit rate by 2–5 percent.
Parameter fine-tuning for entropy coding is important. We currently use the state transition table from
previous works and use the same initialization for different context models. A new state transition table is
needed to better align with the data we need to compress. The initial states for different context models
must be found through more experiments.
Post Processing
Post-processing is essential in modern codecs, as it helps to reduce coding artifacts. We have yet
to include any post-processing filters in our framework. Considering that we use vector quantization and
transforms, it is necessary to introduce de-blocking filters to GIC to remove coding artifacts. Apart from
the traditional post-processing filters, we may apply some machine learning-based post-processing filters to
better compensate for errors.
Larger Resolution
Currently, our results are limited to 1024 × 1024 resolution. With the help of the channel-wise transform, we can easily extend to larger resolutions by adding several more 2 × 2 transforms. We further need to verify our framework on 4K/8K images, which are the mainstream resolutions today. At such high resolutions, the advantage of our framework in exploiting long-range correlations will become even more apparent.
Parallel Coding
Due to time limitations, our current code base has not yet been optimized. For example, our codec can be parallelized to reduce the run time: all transforms in the same layer of the channel-wise transform can be computed together, as can the VQ coding for the same level of transformed coefficients. The decoding of the VQ indices (once the VQ skip flags are decoded, there is no dependency in the decoding order) can also be modified to run in parallel. This can significantly reduce the coding time.
7.3 Extension to Green Video Code (GVC)
As video content accounts for over 70 percent of internet traffic, we will extend GIC to GVC. Our GIC is designed to be lightweight and green, so it is well suited to video coding. Besides, thanks to the characteristics of the channel-wise transform, we can better cope with 4K and even 8K videos by exploiting long-range correlations, achieving superior performance with low complexity.
The simplest extension from image coding to video coding is to use GIC to replace the intra-coding part while keeping the other tools the same. This approach is straightforward and allows us to compare our work with different codecs.
What’s more, replacing the intra-coding will never be our ultimate goal. We need further to extend our
GIC to a new GVC. As shown in [156], the features/coefficients generated from the channel-wise transform
are powerful enough to track objects. We can utilize these features to reduce the complexity of motion vector
search.
Ideas like the 3D channel-wise transform may be explored as well. We can group consecutive frames
to perform the transform. Based on the characteristics of the channel-wise transform, correlations among
different frames and within the same frame can be reduced.
References
[1] Arash Abadpour and Shohreh Kasaei. An efficient pca-based color transfer method. Journal of Visual
Communication and Image Representation, 18(1):15–34, 2007.
[2] Safia Abdelmounaime and He Dong-Chen. New brodatz-based image databases for grayscale color and
multiband texture analysis. ISRN Machine Vision, 2013, 2013.
[3] Maleen Abeydeera, Manupa Karunaratne, Geethan Karunaratne, Kalana De Silva, and Ajith Pasqual.
4k real-time hevc decoder on an fpga. IEEE Transactions on Circuits and Systems for Video Technology, 26(1):236–249, 2015.
[4] Eirikur Agustsson, Fabian Mentzer, Michael Tschannen, Lukas Cavigelli, Radu Timofte, Luca Benini,
and Luc V Gool. Soft-to-hard vector quantization for end-to-end learning compressible representations.
Advances in neural information processing systems, 30, 2017.
[5] Eirikur Agustsson and Lucas Theis. Universally quantized neural compression. In H. Larochelle,
M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing
Systems, volume 33, pages 12367–12376. Curran Associates, Inc., 2020.
[6] Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge on single image super-resolution: Dataset
and study. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops,
July 2017.
[7] Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool. Generative adversarial networks for extreme learned image compression. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages 221–231, 2019.
[8] Mohiuddin Ahmed, Raihan Seraj, and Syed Mohammed Shamsul Islam. The k-means algorithm: A
comprehensive survey and performance evaluation. Electronics, 9(8):1295, 2020.
[9] Nasir Ahmed, T Natarajan, and Kamisetty R Rao. Discrete cosine transform. IEEE transactions on
Computers, 100(1):90–93, 1974.
[10] Hiroaki Akutsu, Akifumi Suzuki, Zhisheng Zhong, and Kiyoharu Aizawa. Ultra low bitrate learned
image compression by selective detail decoding. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition Workshops, pages 118–119, 2020.
[11] Umar Albalawi, Saraju P Mohanty, and Elias Kougianos. A hardware architecture for better portable
graphics (bpg) compression encoder. In 2015 IEEE International Symposium on Nanoelectronic and
Information Systems, pages 291–296. IEEE, 2015.
[12] Ahmed Ben Atitallah, Patrice Kadionik, Fahmi Ghozzi, Patrice Nouel, Nouri Masmoudi, and
Ph Marchegay. Optimization and implementation on fpga of the dct/idct algorithm. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, volume 3, pages III–III.
IEEE, 2006.
[13] Johannes Ballé, Valero Laparra, and Eero P Simoncelli. Density modeling of images using a generalized
normalization transformation. arXiv preprint arXiv:1511.06281, 2015.
[14] Johannes Ballé, Valero Laparra, and Eero P Simoncelli. End-to-end optimized image compression.
arXiv preprint arXiv:1611.01704, 2016.
[15] Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang, and Nick Johnston. Variational
image compression with a scale hyperprior. arXiv preprint arXiv:1802.01436, 2018.
[16] Fabrice Bellard. libbpg. https://github.com/mirrorer/libbpg.
[17] Gisle Bjontegaard. Calculation of average psnr differences between rd-curves. VCEG-M33, 2001.
[18] Peter J Burt and Edward H Adelson. The laplacian pyramid as a compact image code. In Readings
in computer vision, pages 671–679. Elsevier, 1987.
[19] Li-Heng Chen, Christos G Bampis, Zhi Li, Andrey Norkin, and Alan C Bovik. Proxiqa: A proxy
approach to perceptual optimization of learned image compression. IEEE Transactions on Image
Processing, 30:360–373, 2020.
[20] Yue Chen, Debargha Murherjee, Jingning Han, Adrian Grange, Yaowu Xu, Zoe Liu, Sarah Parker,
Cheng Chen, Hui Su, Urvang Joshi, et al. An overview of core coding tools in the av1 video codec. In
2018 Picture Coding Symposium (PCS), pages 41–45. IEEE, 2018.
[21] Yueru Chen, Mozhdeh Rouhsedaghat, Suya You, Raghuveer Rao, and C-C Jay Kuo. Pixelhop++:
A small successive-subspace-learning-based (ssl-based) model for image classification. In 2020 IEEE
International Conference on Image Processing (ICIP), pages 3294–3298. IEEE, 2020.
[22] Zhengxue Cheng, Heming Sun, Masaru Takeuchi, and Jiro Katto. Deep convolutional autoencoderbased lossy image compression. In 2018 Picture Coding Symposium (PCS), pages 253–257. IEEE,
2018.
[23] Yongjin Cho. Dependent RD modeling for H. 264/SVC bit allocation. PhD thesis, University of
Southern California, 2010.
[24] B Choi, YK Wang, MM Hannuksela, Y Lim, and A Murtaza. Information technology–coded representation of immersive media (mpeg-i)–part 2: Omnidirectional media format. ISO/IEC, pages 23090–2,
2017.
[25] Philip A Chou, Tom Lookabaugh, and Robert M Gray. Entropy-constrained vector quantization. IEEE
Transactions on acoustics, speech, and signal processing, 37(1):31–42, 1989.
[26] Clifford Clausen and Harry Wechsler. Color image compression using pca and backpropagation learning. pattern recognition, 33(9):1555–1560, 2000.
[27] Pamela C Cosman, Robert M Gray, and Martin Vetterli. Vector quantization of image subbands: A
survey. IEEE Transactions on image processing, 5(2):202–225, 1996.
[28] CVPR. CLIC 2020: Challenge on Learned Image Compression.
[29] Scott Daly. Subroutine for the generation of a two dimensional human visual contrast sensitivity
function. Eastman Kodak, Rochester, NY, Tech. Rep. Y, 233203:1987, 1987.
[30] Lowe David. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60:91–110, 2004.
[31] Antonin Descampe, François-Olivier Devaux, Gaël Rouvroy, Jean-Didier Legat, Jean-Jacques Quisquater, and Benoît Macq. A flexible hardware jpeg 2000 decoder for digital cinema. IEEE Transactions on Circuits and Systems for Video Technology, 16(11):1397–1410, 2006.
[32] Dandan Ding, Guangyao Chen, Debargha Mukherjee, Urvang Joshi, and Yue Chen. A cnn-based
in-loop filtering approach for av1 video codec. In 2019 Picture Coding Symposium (PCS), pages 1–5.
IEEE, 2019.
[33] Sabine Dippel, Martin Stahl, Rafael Wiemker, and Thomas Blaffert. Multiscale contrast enhancement
for radiographies: Laplacian pyramid versus fast wavelet transform. IEEE Transactions on medical
imaging, 21(4):343–353, 2002.
[34] Randal Douc and Olivier Cappé. Comparison of resampling schemes for particle filtering. In Ispa
2005. proceedings of the 4th international symposium on image and signal processing and analysis,
2005., pages 64–69. IEEE, 2005.
[35] Jiao Du, Weisheng Li, Bin Xiao, and Qamar Nawaz. Union laplacian pyramid with multiple features
for medical image fusion. Neurocomputing, 194:326–339, 2016.
[36] W-C Du and W-J Hsu. Adaptive data hiding based on vq compressed images. IEE Proceedings-Vision,
Image and Signal Processing, 150(4):233–238, 2003.
[37] Karen Egiazarian, Jaakko Astola, Nikolay Ponomarenko, Vladimir Lukin, Federica Battisti, and Marco
Carli. New full-reference quality metrics based on hvs. In Proceedings of the second international
workshop on video processing and quality metrics, volume 4, 2006.
[38] S Esakkirajan, T Veerakumar, V Senthil Murugan, and Pl Navaneethan. Image compression using
multiwavelet and multi-stage vector quantization. International Journal of Signal Processing, 4(4):246–
253, 2008.
[39] Chih-Ming Fu, Elena Alshina, Alexander Alshin, Yu-Wen Huang, Ching-Yeh Chen, Chia-Yang Tsai,
Chih-Wei Hsu, Shaw-Min Lei, Jeong-Hoon Park, and Woo-Jin Han. Sample adaptive offset in the hevc
standard. IEEE Transactions on Circuits and Systems for Video technology, 22(12):1755–1764, 2012.
[40] Allen Gersho and Robert M Gray. Vector quantization and signal compression, volume 159. Springer
Science & Business Media, 2012.
[41] Allen Gersho and Bhaskar Ramamurthi. Image coding using vector quantization. In ICASSP’82.
IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 7, pages 428–431.
IEEE, 1982.
[42] Golnaz Ghiasi and Charless C Fowlkes. Laplacian pyramid reconstruction and refinement for semantic segmentation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The
Netherlands, October 11-14, 2016, Proceedings, Part III 14, pages 519–534. Springer, 2016.
[43] Morris Goldberg, Paul Boucher, and Seymour Shlien. Image compression using adaptive vector quantization. IEEE Transactions on Communications, 34(2):180–187, 1986.
[44] Google. Webp. https://developers.google.com/speed/webp.
[45] Robert Gray. Vector quantization. IEEE Assp Magazine, 1(2):4–29, 1984.
[46] Dan Grois, Tung Nguyen, and Detlev Marpe. Performance comparison of av1, jem, vp9, and hevc
encoders. In Applications of Digital Image Processing XL, volume 10396, page 103960L. International
Society for Optics and Photonics, 2018.
[47] Onur G Guleryuz, Philip A Chou, Hugues Hoppe, Danhang Tang, Ruofei Du, Philip Davidson, and
Sean Fanello. Sandwiched image compression: Wrapping neural networks around a standard codec.
In 2021 IEEE International Conference on Image Processing (ICIP), pages 3757–3761. IEEE, 2021.
[48] Tiansheng Guo, Jing Wang, Ze Cui, Yihui Feng, Yunying Ge, and Bo Bai. Variable rate image
compression with content adaptive optimization. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition Workshops, pages 122–123, 2020.
[49] Zongyu Guo, Yaojun Wu, Runsen Feng, Zhizheng Zhang, and Zhibo Chen. 3-d context entropy model
for improved practical image compression. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition Workshops, pages 116–117, 2020.
[50] Jürgen Herre. Temporal noise shaping, quantization and coding methods in perceptual audio coding:
A tutorial introduction. In Audio Engineering Society Conference: 17th International Conference:
High-Quality Audio Coding. Audio Engineering Society, 1999.
[51] Alain Hore and Djemel Ziou. Image quality metrics: Psnr vs. ssim. In 2010 20th international
conference on pattern recognition, pages 2366–2369. IEEE, 2010.
[52] Ming-Huwi Horng. Vector quantization using the firefly algorithm for image compression. Expert
Systems with Applications, 39(1):1078–1091, 2012.
[53] Yueyu Hu, Wenhan Yang, and Jiaying Liu. Coarse-to-fine hyper-prior modeling for learned image
compression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages
11013–11020, 2020.
[54] Yiwen Hua, Puneet Kohli, Pritish Uplavikar, Anand Ravi, Saravana Gunaseelan, Jason Orozco,
and Edward Li. Holopix50k: A large-scale in-the-wild stereo image dataset. arXiv preprint
arXiv:2003.11172, 2020.
[55] Ken Huffman. Profile: David a. huffman. Scientific American, pages 54–58, 1991.
[56] Noor A Ibraheem, Mokhtar M Hasan, Rafiqul Z Khan, and Pramod K Mishra. Understanding color
models: a review. ARPN Journal of science and technology, 2(3):265–275, 2012.
[57] Parul Jadhav and Shirish Kshirsagar. Independent rate control scheme for spatial scalable layer in h.
264svc. In Fifth International Conference on Advances in Recent Technologies in Communication and
Computing (ARTCom 2013), pages 52–58. IET, 2013.
[58] Euee S Jang and Nasser M Nasrabadi. Subband coding with multistage vq for wireless image communication. IEEE transactions on circuits and systems for video technology, 5(3):247–253, 1995.
[59] Feng Jiang, Wen Tao, Shaohui Liu, Jie Ren, Xun Guo, and Debin Zhao. An end-to-end compression
framework based on convolutional neural networks. IEEE Transactions on Circuits and Systems for
Video Technology, 28(10):3007–3018, 2017.
[60] Hong Jiang. The intel® quick sync video technology in the 2nd-generation intel core processor family.
In 2011 IEEE Hot Chips 23 Symposium (HCS), pages 1–23. IEEE, 2011.
[61] Liping Jing, Michael K Ng, and Joshua Zhexue Huang. An entropy weighting k-means algorithm
for subspace clustering of high-dimensional sparse data. IEEE Transactions on knowledge and data
engineering, 19(8):1026–1041, 2007.
[62] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. IEEE
Transactions on Big Data, 2019.
[63] Nicolaos B Karayiannis and P-I Pai. Fuzzy vector quantization algorithms and their application in
image compression. IEEE Transactions on Image Processing, 4(9):1193–1201, 1995.
[64] Ioannis Katsavounidis, C-CJ Kuo, and Zhen Zhang. Fast tree-structured nearest neighbor encoding
for vector quantization. IEEE Transactions on Image Processing, 5(2):398–404, 1996.
[65] Diederik P Kingma and Max Welling. An introduction to variational autoencoders. arXiv preprint
arXiv:1906.02691, 2019.
[66] Kodak. Kodak images.
[67] Trupti M Kodinariya and Prashant R Makwana. Review on determining number of cluster in k-means
clustering. International Journal, 1(6):90–95, 2013.
[68] Jan Kufa and Tomas Kratochvil. Software and hardware hevc encoding. In 2017 International Conference on Systems, Signals and Image Processing (IWSSIP), pages 1–5. IEEE, 2017.
[69] C-C Jay Kuo and Azad M Madni. Green learning: Introduction, examples and outlook. Journal of
Visual Communication and Image Representation, page 103685, 2022.
[70] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep laplacian pyramid
networks for fast and accurate super-resolution. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 624–632, 2017.
[71] Jani Lainema, Frank Bossen, Woo-Jin Han, Junghye Min, and Kemal Ugur. Intra coding of the hevc
standard. IEEE transactions on circuits and systems for video technology, 22(12):1792–1801, 2012.
[72] Cornelius Lanczos. Evaluation of noisy data. Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis, 1(1):76–85, 1964.
[73] Thomas G. Lane. libjpeg.
[74] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
[75] Hl Lee and D Lee. A gain-shape vector quantizer for image coding. In ICASSP’86. IEEE International
Conference on Acoustics, Speech, and Signal Processing, volume 11, pages 141–144. IEEE, 1986.
[76] Jooyoung Lee, Seunghyun Cho, and Seung-Kwon Beack. Context-adaptive entropy model for end-toend optimized image compression. arXiv preprint arXiv:1809.10452, 2018.
[77] Wei-Cheng Lee and Hsueh-Ming Hang. A hybrid image codec with learned residual coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages
138–139, 2020.
[78] Chen Li, Li Song, Rong Xie, and Wenjun Zhang. Cnn based post-processing to improve hevc. In 2017
IEEE International Conference on Image Processing (ICIP), pages 4577–4580. IEEE, 2017.
[79] Ming Li. A better color space conversion based on learned variances for image compression. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops,
pages 0–0, 2019.
[80] Muyang Li, Ji Lin, Yaoyao Ding, Zhijian Liu, Jun-Yan Zhu, and Song Han. Gan compression: Efficient architectures for interactive conditional gans. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 5284–5294, 2020.
[81] Tianyi Li, Mai Xu, Runzhi Tang, Ying Chen, and Qunliang Xing. Deepqtmt: A deep learning approach
for fast qtmt-based cu partition of intra-mode vvc. IEEE Transactions on Image Processing, 30:5377–
5390, 2021.
[82] Jiaying Liu, Yongjin Cho, and Zongming Guo. Single pass dependent bit allocation for h. 264 temporal
scalability. In 2012 19th IEEE International Conference on Image Processing, pages 705–708. IEEE,
2012.
[83] Jiaying Liu, Yongjin Cho, Zongming Guo, and Jay Kuo. Bit allocation for spatial scalability coding
of h. 264/svc with dependent rate-distortion analysis. IEEE Transactions on Circuits and Systems for
Video Technology, 20(7):967–981, 2010.
[84] Zhenyu Liu, Xianyu Yu, Yuan Gao, Shaolin Chen, Xiangyang Ji, and Dongsheng Wang. Cu partition
mode decision for hevc hardwired intra encoder using convolution neural network. IEEE Transactions
on Image Processing, 25(11):5088–5103, 2016.
[85] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of
computer vision, 60:91–110, 2004.
[86] Ming Lu, Peiyao Guo, Huiqing Shi, Chuntong Cao, and Zhan Ma. Transformer-based image compression. arXiv preprint arXiv:2111.06707, 2021.
[87] Zhe-Ming Lu, Dian-Guo Xu, and Sheng-He Sun. Multipurpose image watermarking algorithm based
on multistage vector quantization. IEEE Transactions on Image Processing, 14(6):822–831, 2005.
[88] Henrique S Malvar, Antti Hallapuro, Marta Karczewicz, and Louis Kerofsky. Low-complexity transform and quantization in h. 264/avc. IEEE Transactions on circuits and systems for video technology,
13(7):598–603, 2003.
[89] James Mannos and David Sakrison. The effects of a visual fidelity criterion of the encoding of images.
IEEE transactions on Information Theory, 20(4):525–536, 1974.
[90] Michael W Marcellin, Michael J Gormish, Ali Bilgin, and Martin P Boliek. An overview of jpeg-2000.
In Proceedings DCC 2000. Data Compression Conference, pages 523–541. IEEE, 2000.
[91] Detlev Marpe, Heiko Schwarz, and Thomas Wiegand. Context-based adaptive binary arithmetic coding
in the h. 264/avc video compression standard. IEEE Transactions on circuits and systems for video
technology, 13(7):620–636, 2003.
[92] Fabian Mentzer, Eirikur Agustsson, Michael Tschannen, Radu Timofte, and Luc Van Gool. Conditional probability models for deep image compression. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 4394–4402, 2018.
[93] Steinar Midtskogen and Jean-Marc Valin. The av1 constrained directional enhancement filter (cdef).
In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages
1193–1197. IEEE, 2018.
[94] David Minnen, Johannes Ballé, and George D Toderici. Joint autoregressive and hierarchical priors
for learned image compression. Advances in neural information processing systems, 31, 2018.
[95] Nader Mohsenian and Nasser M Nasrabadi. Edge-based subband vq techniques for images and video.
IEEE transactions on circuits and systems for video technology, 4(1):53–67, 1994.
[96] Nasser M Nasrabadi and Robert A King. Image coding using vector quantization: A review. IEEE
Transactions on communications, 36(8):957–971, 1988.
[97] Pauline C Ng and Steven Henikoff. Sift: Predicting amino acid changes that affect protein function.
Nucleic acids research, 31(13):3812–3814, 2003.
[98] Andrey Norkin, Gisle Bjontegaard, Arild Fuldseth, Matthias Narroschke, Masaru Ikeda, Kenneth
Andersson, Minhua Zhou, and Geert Van der Auwera. Hevc deblocking filter. IEEE Transactions on
Circuits and Systems for Video Technology, 22(12):1746–1754, 2012.
[99] JB O’Neal Jr. Predictive quantizing systems (differential pulse code modulation) for the transmission
of television signals. Bell System Technical Journal, 45(5):689–721, 1966.
[100] Davis Pan. A tutorial on mpeg/audio compression. IEEE multimedia, 2(2):60–74, 1995.
[101] I-Ming Pao and Ming-Ting Sun. Modeling dct coefficients for fast video encoding. IEEE Transactions
on Circuits and Systems for Video Technology, 9(4):608–616, 1999.
[102] Thrasyvoulos N Pappas, Jan P Allebach, and David Neuhoff. Model-based digital halftoning. IEEE
Signal processing magazine, 20(4):14–27, 2003.
[103] Sylvain Paris, Samuel W Hasinoff, and Jan Kautz. Local laplacian filters: Edge-aware image processing
with a laplacian pyramid. ACM Trans. Graph., 30(4):68, 2011.
[104] Sang-hyo Park, Kiho Choi, and Euee S Jang. Zero coefficient-aware fast butterfly-based inverse discrete
cosine transform algorithm. IET Image Processing, 10(2):89–100, 2016.
[105] Woon-Sung Park and Munchurl Kim. Cnn-based in-loop filtering for coding efficiency improvement.
In 2016 IEEE 12th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), pages
1–5. IEEE, 2016.
[106] Giuseppe Patanè and Marco Russo. The enhanced lbg algorithm. Neural networks, 14(9):1219–1237,
2001.
[107] Soo-Chang Pei and Li-Heng Chen. Image quality assessment using human visual dog model fused with
random forest. IEEE Transactions on Image Processing, 24(11):3282–3292, 2015.
[108] William K Pratt, Julius Kane, and Harry C Andrews. Hadamard transform image coding. Proceedings
of the IEEE, 57(1):58–68, 1969.
[109] Bhaskar Ramamurthi and Allen Gersho. Classified vector quantization of images. IEEE Transactions
on communications, 34(11):1105–1115, 1986.
[110] Reza Rassool. Vmaf reproducibility: Validating a perceptual practical video quality metric. In 2017
IEEE international symposium on broadband multimedia systems and broadcasting (BMSB), pages 1–2.
IEEE, 2017.
[111] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with
vq-vae-2. Advances in neural information processing systems, 32, 2019.
[112] Mozhdeh Rouhsedaghat, Yifan Wang, Xiou Ge, Shuowen Hu, Suya You, and C-C Jay Kuo. Facehop: A
light-weight low-resolution face gender classification method. In International Conference on Pattern
Recognition, pages 169–183. Springer, 2021.
[113] Mozhdeh Rouhsedaghat, Yifan Wang, Shuowen Hu, Suya You, and C-C Jay Kuo. Low-resolution face
recognition in resource-constrained environments. Pattern Recognition Letters, 149:193–199, 2021.
[114] Heiko Schwarz, Detlev Marpe, and Thomas Wiegand. Overview of the scalable video coding extension of the h. 264/avc standard. IEEE Transactions on circuits and systems for video technology,
17(9):1103–1120, 2007.
[115] Ling Shao, Xiantong Zhen, Dacheng Tao, and Xuelong Li. Spatio-temporal laplacian pyramid coding
for action recognition. IEEE Transactions on Cybernetics, 44(6):817–827, 2013.
[116] Panu Sjövall, Vili Viitamäki, Jarno Vanne, Timo D Hämäläinen, and Ari Kulmala. Fpga-powered
4k120p hevc intra encoder. In 2018 IEEE International Symposium on Circuits and Systems (ISCAS),
pages 1–5. IEEE, 2018.
[117] Hui Su, Mingliang Chen, Alexander Bokov, Debargha Mukherjee, Yunqing Wang, and Yue Chen.
Machine learning accelerated transform search for av1. In 2019 Picture Coding Symposium (PCS),
pages 1–5. IEEE, 2019.
[118] Gary J Sullivan, Jens-Rainer Ohm, Woo-Jin Han, and Thomas Wiegand. Overview of the high efficiency video coding (hevc) standard. IEEE Transactions on circuits and systems for video technology,
22(12):1649–1668, 2012.
[119] Gary J Sullivan and Thomas Wiegand. Rate-distortion optimization for video compression. IEEE
signal processing magazine, 15(6):74–90, 1998.
[120] Yih Han Tan, Wei Siong Lee, and Jo Yew Tham. Complexity control and computational resource
allocation during h. 264/svc encoding. In Proceedings of the 17th ACM international conference on
Multimedia, pages 897–900, 2009.
[121] Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. Lossy image compression with
compressive autoencoders. arXiv preprint arXiv:1703.00395, 2017.
[122] George Toderici, Sean M O’Malley, Sung Jin Hwang, Damien Vincent, David Minnen, Shumeet Baluja,
Michele Covell, and Rahul Sukthankar. Variable rate image compression with recurrent neural networks. arXiv preprint arXiv:1511.06085, 2015.
[123] George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and
Michele Covell. Full resolution image compression with recurrent neural networks. In Proceedings of
the IEEE conference on Computer Vision and Pattern Recognition, pages 5306–5314, 2017.
[124] Luc Trudeau, Nathan Egge, and David Barr. Predicting chroma from luma in av1. In 2018 Data
Compression Conference, pages 374–382. IEEE, 2018.
[125] Jean-Marc Valin, Nathan E Egge, Thomas Daede, Timothy B Terriberry, and Christopher Montgomery.
Daala: A perceptually-driven still picture codec. In 2016 IEEE International Conference on Image
Processing (ICIP), pages 76–80. IEEE, 2016.
[126] Jean-Marc Valin and Timothy B Terriberry. Perceptual vector quantization for video coding. In Visual
Information Processing and Communication VI, volume 9410, page 941009. International Society for
Optics and Photonics, 2015.
[127] K Veeraswamy, S Srinivaskumar, and BN Chatterji. Designing quantization table for hadamard transform based on human visual system for image compression. ICGST-GVIP Journal, 7(3):31–38, 2007.
[128] Gregory K Wallace. The jpeg still picture compression standard. IEEE transactions on consumer
electronics, 38(1):xviii–xxxiv, 1992.
[129] Ching-Yang Wang, Shiuh-Ming Lee, and Long-Wen Chang. Designing jpeg quantization tables based
on human visual system. Signal Processing: Image Communication, 16(5):501–506, 2001.
[130] Wencheng Wang and Faliang Chang. A multi-focus image fusion method based on laplacian pyramid.
J. Comput., 6(12):2559–2566, 2011.
[131] Ye-Kui Wang, Robert Skupin, Miska M Hannuksela, Sachin Deshpande, Virginie Drugeon, Rickard
Sjöberg, Byeongdoo Choi, Vadim Seregin, Yago Sanchez, Jill M Boyce, et al. The high-level syntax
of the versatile video coding (vvc) standard. IEEE Transactions on Circuits and Systems for Video
Technology, 31(10):3779–3800, 2021.
[132] Yifan Wang, Zhanxuan Mei, Ioannis Katsavounidis, and C-C Jay Kuo. Dcst: a data-driven
color/spatial transform-based image coding method. In Applications of Digital Image Processing XLIV,
volume 11842, pages 195–204. SPIE, 2021.
[133] Yifan Wang, Zhanxuan Mei, Ioannis Katsavounidis, and C-C Jay Kuo. Lightweight image codec via
multi-grid multi-block-size vector quantization (mgbvq). arXiv preprint arXiv:2209.12139, 2022.
[134] Yifan Wang, Zhanxuan Mei, Chia-Yang Tsai, Ioannis Katsavounidis, and C-C Jay Kuo. A machine learning approach to optimal inverse discrete cosine transform (idct) design. arXiv preprint
arXiv:2102.00502, 2021.
[135] Yifan Wang, Zhanxuan Mei, Qingyang Zhou, Ioannis Katsavounidis, and C-C Jay Kuo. Green image
codec: a lightweight learning-based image coding method. In Applications of Digital Image Processing
XLV, volume 12226, pages 70–75. SPIE, 2022.
[136] Zhou Wang, Alan C Bovik, and Ligang Lu. Why is image quality assessment so difficult? In 2002
IEEE International conference on acoustics, speech, and signal processing, volume 4, pages IV–3313.
IEEE, 2002.
[137] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from
error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
[138] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from
error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
[139] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality
assessment. In The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003,
volume 2, pages 1398–1402. IEEE, 2003.
[140] Li-Yi Wei and Marc Levoy. Fast texture synthesis using tree-structured vector quantization. In
Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages
479–488, 2000.
[141] Thomas Wiegand, Gary J Sullivan, Gisle Bjontegaard, and Ajay Luthra. Overview of the h. 264/avc
video coding standard. IEEE Transactions on circuits and systems for video technology, 13(7):560–576,
2003.
[142] Jian Wu, Zhiming Cui, Victor S Sheng, Pengpeng Zhao, Dongliang Su, and Shengrong Gong. A
comparative study of sift and its variants. Measurement science review, 13(3):122–131, 2013.
[143] Xiaolin Wu, Jiang Wen, and Wing Hung Wong. Conditional entropy coding of vq indexes for image
compression. IEEE transactions on image processing, 8(8):1005–1013, 1999.
[144] Cheng-Hsing Yang and Yi-Cheng Lin. Reversible data hiding of a vq index table based on referred
counts. Journal of Visual Communication and Image Representation, 20(6):399–407, 2009.
[145] Jian Yang, David Zhang, Alejandro F Frangi, and Jing-yu Yang. Two-dimensional pca: a new approach
to appearance-based face representation and recognition. IEEE transactions on pattern analysis and
machine intelligence, 26(1):131–137, 2004.
[146] Jiayu Yang, Chunhui Yang, Yi Ma, Shiyi Liu, and Ronggang Wang. Learned low bit-rate image
compression with adversarial mechanism. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition Workshops, pages 140–141, 2020.
[147] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv
preprint arXiv:1409.2329, 2014.
[148] Daoqiang Zhang and Zhi-Hua Zhou. (2d) 2pca: Two-directional two-dimensional pca for efficient face
representation and recognition. Neurocomputing, 69(1-3):224–231, 2005.
[149] Fan Zhang, Chen Feng, and David R Bull. Enhancing vvc through cnn-based post-processing. In 2020
IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2020.
[150] Li Zhang, Kai Zhang, Hongbin Liu, Hsiao Chiang Chuang, Yue Wang, Jizheng Xu, Pengwei Zhao,
and Dingkun Hong. History-based motion vector prediction in versatile video coding. In 2019 Data
Compression Conference (DCC), pages 43–52. IEEE, 2019.
[151] Min Zhang, Yifan Wang, Pranav Kadam, Shan Liu, and C-C Jay Kuo. Pointhop++: A lightweight
learning model on point sets for 3d classification. In 2020 IEEE International Conference on Image
Processing (ICIP), pages 3319–3323. IEEE, 2020.
[152] Ximin Zhang, Shan Liu, and Shawmin Lei. Intra mode coding in hevc standard. In 2012 Visual
Communications and Image Processing, pages 1–6. IEEE, 2012.
[153] Xingyu Zhang, Christophe Gisquet, Edouard Francois, Feng Zou, and Oscar C Au. Chroma intra prediction based on inter-channel correlation for hevc. IEEE Transactions on Image Processing,
23(1):274–286, 2013.
[154] David Y Zhao, Jonas Samuelsson, and Mattias Nilsson. Gmm-based entropy-constrained vector
quantization. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), volume 4, pages IV–1097. IEEE, 2007.
[155] Lei Zhou, Zhenhong Sun, Xiangji Wu, and Junmin Wu. End-to-end optimized image compression with
attention mechanism. In CVPR workshops, page 0, 2019.
[156] Zhiruo Zhou, Hongyu Fu, Suya You, and C. C. Jay Kuo. Gusot: Green and unsupervised single object
tracking for long video sequences, 2022.
Abstract
Image compression techniques play an increasingly important role in our daily lives, as we can easily take pictures with our cameras or phones and view millions of images online. Compression is required to store more photos on a single device, and it comes in two forms: lossless and lossy. In lossless compression, no information is discarded during encoding, so the reconstruction is perfect; however, lossless algorithms typically reduce the file size by only about a factor of two. In contrast, lossy compression discards some information to achieve a much larger compression ratio, and the discarded information is chosen to be perceptually insignificant so that viewers are unlikely to notice its absence.
In the traditional compression field, many well-designed codecs, such as JPEG, JPEG 2000, WebP, and BPG, have been proposed to compress images efficiently into a small size. They are standardized and widely used in daily life. Many coding tools, such as intra prediction and coding-tree-unit partitioning, have been adopted into these standards to improve the rate-distortion (RD) gain. These tools, however, increase the burden of rate-distortion optimization: coding complexity grows far faster than the additional RD gain it delivers. VVC intra coding recently reported state-of-the-art performance, but at unprecedented complexity.
A newly emerging approach is the deep-learning-based codec, which achieves impressive performance compared with traditional codecs. The variational-autoencoder-based framework is the most popular deep-learning-based compression algorithm; it transforms input images into a latent representation. Non-linear transforms, end-to-end optimization, attention mechanisms, and other techniques have been introduced to further enhance coding results. On the other hand, complexity is a major issue: millions of operations are required to decode a single pixel.
To reduce complexity and model size while maintaining performance, this dissertation proposes the green image codec (GIC), which attempts to merge the advantages of traditional codecs and deep-learning-based codecs. The multi-grid representation, regarded as a key factor behind the success of deep-learning-based codecs, serves as the foundation of GIC; we realize it through Lanczos interpolation and a channel-wise transform. On this foundation, we propose a series of coding tools such as dimension-progressive vector quantization and the PQR color space. Rate-distortion optimization is redesigned for the multi-grid framework, and a VQ-specialized RDO based on lambda vector interpolation is proposed. We also adopt tools from traditional coding standards, including CABAC, Huffman coding, and the RDO used in scalable video coding. The resulting framework yields a codec with low complexity and a small decoder size, with moderate performance relative to other codecs.